This topic provides answers to some frequently asked questions about Data Lake Analytics (DLA).

What is DLA?

DLA is a next-generation big data solution that uses a compute-storage separation architecture. DLA can archive messages and database data and build data warehouses in real time. The supported databases include relational databases, PolarDB, and NoSQL databases. In addition, DLA provides the serverless Spark and Presto engines to meet the requirements of online interactive queries, stream processing, batch processing, and machine learning. DLA is a competitive solution for migrating traditional Hadoop workloads to the cloud.

In traditional data analytics solutions, you must first purchase an analytics instance that provides both computing and storage resources. If the computing or storage resources are exhausted, you must add servers to meet your business requirements. In addition, you must pay for all the computing resources even if you do not use them.

DLA is a data analytics engine that uses a serverless architecture. You do not need to purchase or manage servers for data analytics, and upgrades are completely transparent. DLA is elastically scalable: you pay for resources based on actual usage and can scale out storage resources on demand. If no analytics tasks are running, you pay only for the storage resources that are used.

The error message "Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout' on the server." appears. What do I do?

This error message appears when you use DLA to read data from a MySQL database. The default value of the net_write_timeout parameter is small. As a result, DLA may be disconnected from the MySQL database before the data is completely read. To resolve this issue, increase the value of the net_write_timeout parameter on the MySQL database.
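The following statement is a minimal sketch of how to increase the timeout on the MySQL server. The value 600 (seconds) is only an example; choose a value that fits your workload. For an ApsaraDB RDS instance, you may also be able to change this parameter in the console:
SET GLOBAL net_write_timeout = 600;  -- example value, in seconds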

The error message "You have no right to access this object because of bucket acl" appears. What do I do?

This error message appears when you create a table in DLA and write the table data to Object Storage Service (OSS). By default, DLA is not allowed to write data to an OSS table. You must grant the related permissions to DLA before you write data to OSS. For more information about how to write data to OSS, see Write data to a partitioned table.

An SQL statement is executed to synchronize data from HDFS to OSS. The number of synchronized data records queried by using DLA is 110,000, but that queried by using a self-managed Spark cluster is 190,000. Why?

In most cases, this issue is not caused by the DLA engines. Check whether the data was correctly synchronized from HDFS to OSS.

Does the serverless Presto engine of DLA support gzip-compressed data?

Yes, the serverless Presto engine of DLA supports gzip-compressed data.

Can the serverless Presto engine of DLA process compressed files and uncompressed files in a directory at the same time?

Yes, the serverless Presto engine of DLA can process compressed files and uncompressed files in a directory at the same time.

Why is the data read from a gzip-compressed table garbled?

The following statement is used to create a gzip-compressed table:
CREATE EXTERNAL TABLE data (
rmid string,
tag string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION 'oss://xxx-data/20190110/'
TBLPROPERTIES (
'compression.type'='gzip'
);

This issue occurs because 'compression.type'='gzip' takes effect only when DLA writes data to the table. When DLA reads data from the table, it determines the compression method based on the file name extension. However, the files in the preceding directory do not have file name extensions, so DLA reads the gzip-compressed data as plain text and returns garbled characters.

Do SQL statements of DLA automatically and recursively traverse all subdirectories and files in a directory?

Yes, SQL statements of DLA automatically and recursively traverse all subdirectories and files in a directory.

How do I use SQL statements to create a MaxCompute schema in DLA?

For more information, see https://yq.aliyun.com/articles/705889.

When I use DataX to export data from DLA, which reader can I use?

You can use MySQL Reader because DLA is compatible with the MySQL protocol.

Does DLA support the IN (SELECT UID FROM **) statement?

Yes, DLA supports the IN (SELECT UID FROM **) statement. However, the IN (SELECT UID FROM **) statement is not as efficient as a JOIN statement. We recommend that you use a JOIN statement when possible.
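The following statements are a minimal sketch of such a rewrite. The tables orders and users and the column uid are hypothetical:
-- IN subquery form
SELECT * FROM orders WHERE uid IN (SELECT uid FROM users);
-- Equivalent JOIN form, which is usually more efficient
SELECT o.* FROM orders o JOIN (SELECT DISTINCT uid FROM users) u ON o.uid = u.uid;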

Why does the following error message appear when the data type of a column in a MySQL database is inconsistent with the data type of the column in DLA?

because '3.00199E9' in column '4' is outside valid range for the datatype INTEGER.

This issue occurs because the column type in the MySQL database is UNSIGNED INT, but the corresponding column type in DLA is INT. The value 3.00199E9 (about 3 billion) is valid for UNSIGNED INT but exceeds the maximum value of the signed INT type (2,147,483,647).

Why am I unable to load the partitions of a table by using the MSCK REPAIR TABLE statement?

You can load partitions of a table only when the subdirectories that are mapped to the partitions are named in the format of key=value, for example, year=2018 in oss://abc/yyy/action/year=2018/.
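The following statement is a minimal sketch, assuming a table named action whose partition directories follow the key=value format shown above:
-- Directories such as oss://abc/yyy/action/year=2018/ are loaded as partitions.
MSCK REPAIR TABLE action;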

How do I specify multiple hints in an SQL statement of DLA?

Sample statement:

/*+ cluster=public , max-running-time-sec=50 */
/*+ run-async=true, result-meta-included=true */
select * from t1

If you specify multiple hints in an SQL statement, separate the hints with commas (,), for example, /*+ xxx=aaa, yyy=bbbb */.

In which scenarios do I need to add LIMIT 500 to SQL statements for data queries in DLA?

To ensure that pages are stably displayed in the DLA console, you can add LIMIT 500 to the SQL statements that you execute in the console. If a query returns more than 500 results, we recommend that you write the result data to OSS and download it from OSS when you need to view the data.
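The following statement is a minimal example, assuming a hypothetical table named tbl:
SELECT * FROM tbl LIMIT 500;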

In what situations is the size of data that is scanned by the serverless Presto engine of DLA free of charge?

A data scan is free of charge when data is scanned by using Data Definition Language (DDL) statements or the size of the scanned data is less than 0.1 KB.

When I execute an SQL statement in DLA to query data by using the credentials of a RAM user, why does the operation fail?

You cannot use the credentials of a RAM user to query data of your Alibaba Cloud account. For example, if you execute the SHOW QUERY_TASK WHERE ID = 'xxxx' statement, no results are returned.

How do I view the number of partitions in a table?

You can execute a SELECT statement to query the number of partitions in a table. Sample statement: select count(distinct <partition_key>) from tbl;

How do I obtain the result of asynchronous execution of SQL statements in the DLA console?

In the left-side navigation pane of the DLA console, choose Serverless Presto > Execute. On the Execute page, enter your SQL statements in the code editor and click Async Execute(F9). Then, click Result Set to obtain the value of ASYNC_TASK_ID. After that, enter SHOW QUERY_TASK WHERE ID = 'Value of ASYNC_TASK_ID' in the code editor and click Sync Execute(F8). Then, click Result Set to obtain the value of the result_file_oss_file field. This field indicates the OSS directory where the result data is saved.
Note The value of the result_file_oss_file field is returned only after the SHOW QUERY_TASK WHERE ID = 'Value of ASYNC_TASK_ID' statement is executed.
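The following statements are a minimal sketch of this workflow. The table name your_table and the task ID are placeholders:
SELECT * FROM your_table;                            -- submit with Async Execute(F9) to obtain ASYNC_TASK_ID
SHOW QUERY_TASK WHERE ID = 'Value of ASYNC_TASK_ID'; -- submit with Sync Execute(F8) and check result_file_oss_file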

The error message "Error Code:User.Unauthorized" appears in the DLA console. What do I do?

Check whether you used the credentials of a RAM user to log on to the DLA console. If so, make sure that the AliyunDLAReadOnlyAccess policy is attached to the RAM user. If the policy is not attached, use your Alibaba Cloud account to attach it to the RAM user. For more information, see Manage RAM users.

Where do I use the credentials of a RAM user to execute SQL statements of DLA?

You can use the credentials of a RAM user to execute SQL statements of DLA in the Data Management (DMS) console. A new solution will be provided in the future.

All the data that DLA reads from a CSV file is garbled after I execute SQL statements of DLA to create a table for this file. What do I do?

You can run the file <filepath> command in Linux to view the encoding format of the CSV file and then use LazySimpleSerDe to create the table. If the encoding format of the CSV file is ISO-8859, set the serialization.encoding parameter to gbk for the table that you want to create. For more information, see CSV.
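The following statement is a minimal sketch, assuming a hypothetical two-column CSV file that is stored in the oss://your-bucket/csv-data/ directory:
CREATE EXTERNAL TABLE csv_tbl (
    id string,
    name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
    'field.delim' = ',',
    'serialization.encoding' = 'gbk'
)
STORED AS TEXTFILE LOCATION 'oss://your-bucket/csv-data/';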

What do I do when the error message "Null IO Stream" appears?

This error message appears due to one of the following reasons:
  • The storage class of your bucket is Archive.
  • The files that are queried by the SQL statements have been mistakenly deleted by another extract, transform, load (ETL) task.

The error message "Query exceeded distributed user memory limit of 2.00TB" or "Query execeeded per-node user memory limit of xxGB" appears when I execute SQL statements of DLA. What do I do?

DLA and similar Presto-based services, such as AWS Athena, have memory limits. For more information, see https://forums.aws.amazon.com/message.jspa?messageID=894737.

What do I do when the error message "Unexpected end of input stream" appears?

This issue occurs due to data file errors. You can provide your query ID to technical support to obtain, from logs, the directory of the data file in which the error occurs. Then, download the file and check its integrity. The check method varies based on the compression format of the file. For example, if a file is compressed in the gzip format, check whether it can be properly decompressed by using the gunzip command.

When I scan data by using SQL statements of DLA, the size of ORC data that is scanned is greater than expected. What do I do?

Column-oriented tables are more efficient for analytical queries than row-oriented tables because an analytical query typically reads only a few columns of a table. For example, when the serverless Presto engine of DLA scans an underlying table, it only needs to scan the required columns. To benefit from this, the data in the underlying tables must be stored in a column-oriented format. For example, you can execute the following statement to query a single column of data:
SELECT col1
FROM tbl1
WHERE col2 = 'hello'
In practice, if an analytical query is performed on a small or medium-sized table, or on a table with a large number of columns, the size of the scanned data may still be greater than expected. This is because the ORC_TINY_STRIPE_THRESHOLD session property, which is defined by using dataSizeSessionProperty, specifies whether the serverless Presto engine reads individual stripes or the entire file.
dataSizeSessionProperty(
        ORC_TINY_STRIPE_THRESHOLD,
        "ORC: Threshold below which an ORC stripe or file will read in its entirety",
        hiveClientConfig.getOrcTinyStripeThreshold(),
        false)

The default value of the ORC_TINY_STRIPE_THRESHOLD property is 8 MB. If each stripe in a file is smaller than 8 MB, the serverless Presto engine reads the entire file instead of individual stripes, because reading such small stripes separately is not cost-effective.

DLA is unable to parse the DDL statements that I use to create a table in the JSON format. Why?

DDL statements cannot be parsed due to one of the following reasons:
  • A space exists in BusinessPeriodType.
  • The data type of EnterpriseId is INT rather than STRING.

The error message "Access denied for user" appears when I connect DLA to a database during one-click data warehousing. What do I do?

Check whether the username or password that is used to log on to the database is changed. If the username or password is changed, log on to the DLA console. In the left-side navigation pane, click Metadata management. On the Metadata management page, find your one-click data warehousing task and click Library table details in the Actions column. On the page that appears, click Configuration and then Update to modify the configurations of the task.

The error message "because Could not create connection to database server. Attempted reconnect 3 times. Giving up" appears when I connect DLA to an ApsaraDB RDS database during one-click data warehousing. What do I do?

The error message appears when the ApsaraDB RDS database from which DLA synchronizes data is heavily loaded. We recommend that you perform one of the following operations:
  • Adjust the time at which data is synchronized for the one-click data warehousing task so that the synchronization does not add read and write pressure to the instance during peak hours.
  • Adjust the advanced parameters to reduce the parallelism of requests. For example, if you set the connections-per-job parameter to 10 and the total-allowed-connections parameter to 30, each table can be read by a maximum of 10 requests in parallel, and the data of a maximum of three tables can be read in parallel. You can also upgrade the specifications of your ApsaraDB RDS instance.

What do I do if the following error message appears when I connect DLA to a database during one-click data warehousing: "because Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes"

This issue occurs because the one-click data warehousing task that you run has a high load. You can reconnect DLA to the database a few minutes later. If this issue persists, contact DLA technical support.

When I create a data warehouse by merging databases or run a one-click data warehousing task, the error message "No such instance with ID: rm-xxxxxx" appears. What do I do?

This issue occurs when the ApsaraDB RDS instance rm-xxxxxx releases the connections of the one-click data warehousing task. In this case, the task returns an error message similar to the following: "because Communications link failure. The last packet successfully received from the server was 900,120 milliseconds ago. The last packet sent successfully to the server was 900,120 milliseconds ago.), 23 out of 110 jobs are finished." This issue occurs due to one of the following reasons:
  • The serverless Presto engine of DLA cannot read data from a view for a long period of time.
  • The size of the data read by the serverless Presto engine is large, and the ApsaraDB RDS database fails to restart or does not respond.
To address this issue, we recommend that you delete the view and reduce the parallelism of requests.

How do I execute SQL statements of DLA to adjust the parallelism of queries that are sent over a JDBC connection?

When you use DLA to scan online data, such as data in an ApsaraDB RDS database, the default parallelism of queries that are sent over a JDBC connection may not be suitable if the ApsaraDB RDS instance uses small specifications. In this case, you can add the hint /*+jdbc-scan-splits=2*/ to a statement. This way, two parallel queries are sent over the JDBC connection when you scan the data of a table.

mysql> /*+jdbc-scan-splits=2*/select count(*) from customer;
+----------+
| count(*) |
+----------+
| 15000002 |
+----------+

In the preceding statement, customer indicates a table in the ApsaraDB RDS for MySQL database. DLA uses two parallel queries over the JDBC connection to scan the data of the customer table.

Usage notes:
  • The table that you want to scan must have an auto-increment primary key. Otherwise, you cannot add a hint to adjust the parallelism of queries that are sent over a JDBC connection.

    If a table has an auto-increment primary key, you can split a query into multiple subqueries based on the auto-increment primary key. This increases the parallelism of queries that are sent over a JDBC connection.

    -- Original SQL statement
    select * from customer;
    -- SQL statements with subqueries
    select * from customer where id >= 0 and id <= 10000;
    select * from customer where id > 10000 and id <= 20000;
  • The maximum parallelism of queries that are sent over a JDBC connection is 500, and the default parallelism is 1. If you increase the parallelism, use a small value for databases that use small specifications.

How do I execute SQL statements to submit jobs to a specified VC?

After you create a virtual cluster (VC), for example, the dladw VC, you can use one of the following methods to submit jobs to the VC.
  • Add a hint to the SQL statement that you want to execute. Sample statement:
    /*+cluster=dladw*/ SELECT * FROM tableName LIMIT 20;
  • Configure the dladw VC as the default VC. For more information, see the "How do I manually configure a default VC for the serverless Presto engine of DLA?" section of this topic. After you configure the default VC, you can directly execute a SELECT statement to submit a job to the dladw VC. Sample statement:
    SELECT * FROM tableName LIMIT 20;
Note Only one default VC can be configured. After you configure the dladw VC as the default VC, the configuration takes effect immediately, and the previous configuration automatically becomes ineffective. If the default VC is not configured, the public VC is used as the default VC.

How do I configure a VC ID to execute SQL statements for a specified VC?

A VC ID uniquely identifies a VC. If you want to use a business intelligence (BI) tool to connect to DLA and submit a job to a specified VC but the BI tool does not support hints in SQL statements, you can append @VC ID to the username in the SQL statement that you want to execute.

For example, if the username is oa_101302231286633637523c and the VC ID is a2df, you can change the username to oa_101302231286633637523c@a2df. This way, the SQL statement is automatically executed for the VC with an ID of a2df.

How do I manually configure a default VC for the serverless Presto engine of DLA?

To manually configure a default VC, you can log on to the DLA console. In the left-side navigation pane, click System settings. On the page that appears, select the default VC from the SQL default cluster drop-down list. This way, all query jobs are automatically submitted to the default VC.

When I create a database in DLA, the error message "Generate vpc reverse access mapping ip & mapping port failed, please check your vpc_id(vpc-xxxx) and instance_id(rm-xxx)!" appears. What do I do?

This issue occurs due to one of the following reasons:
  • The VPC ID and instance ID that you entered are incorrect.
  • The instance ID that appears in the DLA console is incorrect if data migration has been performed on the ApsaraDB RDS database. In this case, add USE_VPC_CLOUD_INSTANCE_ID = 'true' to the statement that is used to create a schema. Sample statement:
    CREATE SCHEMA `test_db` WITH DBPROPERTIES
    (
      CATALOG = 'mysql',
      LOCATION = 'jdbc:mysql://rm-xxx.mysql.rds.aliyuncs.com:3306/dla_test',
      USER='dla_test',
      PASSWORD='xxxxx',
      INSTANCE_ID = 'rm-xxx',
      VPC_ID = 'vpc-xxxx',
      USE_VPC_CLOUD_INSTANCE_ID = 'true'
    ); 

Can I execute an SQL statement to query the number of files in an OSS table?

Yes, you can execute a SELECT statement to query the number of files in an OSS table. Sample statement:
select count(distinct `$path`) from tbl;
tbl indicates the name of the OSS table. Replace tbl with the actual table name.

How do I use an SQL statement of DLA to enable the big query feature for a VC?

If a data query consumes large amounts of memory resources and the query always fails, you can enable the big query feature for the VC where the query is run. To enable this feature, you can add a hint to the query statement. Sample statement:
/*+big_query=true*/ insert into table1 SELECT * FROM table2;
Note
  • The big query feature is supported for DLA that uses the pay-per-CU billing method. If DLA uses the pay-per-byte billing method, this feature is not supported.
  • After this feature is enabled, you can query up to 10 TB of TPC-H data when DLA uses 64 compute units (CUs). One CU equals one CPU core and 4 GB of memory.
  • The catalog of the schema in DLA must be Hive. Otherwise, the following error message appears:
    big_query only support hive catalog ...

When I use DLA to write data to a table in Tablestore, the error message "Code: OTSParameterInvalid, Message: Invalid update row request: missing cells in request" appears. What do I do?

This error occurs because the value of attribute_columns is empty for some records. You must manually filter out the records whose attribute_columns value is empty before you write the data to the table.
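The following statement is a minimal sketch of such a filter. The table names ots_table and src_table and the columns pk and col1 are hypothetical, and col1 represents an attribute column; adjust the condition to match how emptiness is represented in your data:
INSERT INTO ots_table
SELECT pk, col1 FROM src_table WHERE col1 IS NOT NULL;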