
Data Lake Analytics - Deprecated: FAQ related to the serverless Presto engine

Last Updated: Jul 22, 2021

This topic provides answers to some frequently asked questions about the serverless Presto engine of Data Lake Analytics (DLA).

High-frequency issues

What are the scenarios in which I am not charged for the public cluster?

You are not charged for DDL statements executed in the public cluster. If other statements are executed in the public cluster, you are charged based on the pay-per-byte billing type. For more information about the pay-per-byte billing type, see Pay-per-byte.

What is the computing power of the public cluster?

The public cluster is a cluster that is shared by all users. The computing power provided for each user approximately equals the computing power provided by a cluster with 10 CPU cores.

Why do the SELECT statements that are executed in the DLA console use the LIMIT 500 clause? Can I cancel this limit?

Cause: If the number of rows returned by the SELECT statement is excessively large, the page of the DLA console may close unexpectedly. The LIMIT 500 clause is added to the SELECT statement to ensure a stable page display of the DLA console.

Solution: To cancel this limit, you can use a MySQL client to connect to DLA.

How do I use the credentials of a RAM user to execute SQL statements of DLA?

You must use an Alibaba Cloud account to execute SQL statements in the DLA console. If you want to use a RAM user, you must first create a DLA sub-account for the RAM user in the DLA console. Then, use the username and password of the DLA sub-account to execute SQL statements in the Data Management (DMS) console.

What is asynchronous execution? How do I use asynchronous execution?

Asynchronous execution is used for SQL statements that are related to extract, transform, load (ETL) tasks, such as INSERT and SELECT statements. The execution of these statements is time-consuming. If you use synchronous execution, the connection between the client and the server is occupied for the entire execution, and the query fails if a network exception occurs. If you use asynchronous execution, an ID is returned immediately after you submit an SQL statement. You can then use the SHOW QUERY_TASK statement to view the status of the query.

Syntax for asynchronous execution:

/*+ run-async=true*/ SELECT * FROM tbl;

Syntax for querying the status of asynchronous execution:

SHOW QUERY_TASK WHERE ID = '.....'

How do I obtain the asynchronous execution results of SQL statements in the DLA console?

In the left-side navigation pane of the DLA console, choose Serverless Presto > Execute. On the Execute page, enter your SQL statements in the code editor and click Async Execute(F9). Click Result Set to obtain the value of ASYNC_TASK_ID. After that, enter SHOW QUERY_TASK WHERE ID = 'Value of ASYNC_TASK_ID' in the code editor and click Sync Execute(F8). Then, click Result Set to obtain the value of the result_file_oss_file field. This field indicates the OSS directory in which the result data is stored.
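
The following statements show this workflow end to end. The task ID is a placeholder for the ASYNC_TASK_ID value that your query returns:

/*+ run-async=true */ SELECT * FROM tbl;
SHOW QUERY_TASK WHERE ID = '<Value of ASYNC_TASK_ID>';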

Notice

The value of the result_file_oss_file field is returned only after you execute the SHOW QUERY_TASK WHERE ID = 'Value of ASYNC_TASK_ID' statement.

How do I specify multiple hints in an SQL statement of DLA?

Sample statement:

/*+ cluster=public,run-async=true */ 
SELECT * FROM tbl

Issues related to virtual clusters

How do I configure a default virtual cluster for the serverless Presto engine of DLA?

To enable the serverless Presto engine of DLA to automatically submit all query jobs to a specified cluster, go to the DLA console and click Overview in the left-side navigation pane. On the Overview page, click Options in the upper-right corner. On the Options Management page, select a cluster from the SQL default cluster drop-down list and click OK.

How do I specify a default virtual cluster in a statement?

You can append @<Cluster ID> to the username when you execute the SQL statement. For example, if your DLA username is dla_user and the cluster ID is vc-12345 (placeholder values), use dla_user@vc-12345 as the username.

How do I submit SQL statements to a specified virtual cluster?

After you create a virtual cluster, for example, a cluster named dladw, you can use one of the following methods to submit SQL statements to the cluster:

  • Add a hint to the SQL statement that you want to execute. Sample statement: /*+cluster=dladw*/SELECT * FROM tbl;

  • Append @<Cluster ID> to the username in the SQL statement that you want to execute.

Note

You can view the cluster ID on the cluster details page.

  • Set the default cluster to dladw.

Issues related to metadata and DDL

Why am I unable to load partitions to a table by using the MSCK REPAIR TABLE command?

You can load partitions to a table only if the names of the subdirectories that are mapped to the partitions are in the XXX=XXX format, for example, year=2018 in oss://abc/yyy/action/year=2018/.
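
For example, assuming a table named action whose LOCATION points to oss://abc/yyy/action/ (the directory from the preceding example), the following statement loads the partitions:

MSCK REPAIR TABLE action;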

When I create a database in DLA, the error message "Generate vpc reverse access mapping ip & mapping port failed, please check your vpc_id(vpc-xxxx) and instance_id(rm-xxx)!" appears. What do I do?

  • Cause: The VPC ID or instance ID that you entered is incorrect.
    Solution: Enter the correct VPC ID and instance ID.
  • Cause: Data migration has been performed on your RDS instance. The instance ID that appears in the DLA console is incorrect.

    Solution: Add USE_VPC_CLOUD_INSTANCE_ID = 'true' to the statement that is used to create the database schema. Sample statement:

    CREATE SCHEMA `test_db` WITH DBPROPERTIES (
      CATALOG = 'mysql',
      LOCATION = 'jdbc:mysql://rm-xxx.mysql.rds.aliyuncs.com:3306/dla_test',
      USER = 'dla_test',
      PASSWORD = 'xxxxx',
      INSTANCE_ID = 'rm-xxx',
      VPC_ID = 'vpc-xxxx',
      USE_VPC_CLOUD_INSTANCE_ID = 'true'
    );

How do I view the number of partitions in a table?

You can use a statement with the following syntax to view the number of partitions in a table:

select count(distinct <partition_key>) from tbl;

Issues related to OSS and Hive

How do I use an SQL statement to query the number of files in an OSS table?

You can query the number of files in an OSS table by using the following statement:

select count(distinct `$path`) from tbl;
Notice

In the preceding SQL statement, tbl indicates the name of the OSS table. Replace tbl with the actual table name.

What do I do if the data that is queried from a CSV file is garbled characters?

Cause: The encoding of the CSV file does not match the encoding that the table uses when the data is read. You can run the file command in Linux to view the encoding of the CSV file. Tables that are created with LazySimpleSerDe read data as UTF-8 by default.

Solution: If the file command shows that the encoding of the CSV file is ISO-8859, the file is most likely GBK-encoded. In this case, set serialization.encoding to gbk for the table.
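
A minimal sketch of such a table definition, assuming a comma-delimited file. The table name, columns, and OSS path are placeholders:

CREATE EXTERNAL TABLE csv_tbl (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  'serialization.encoding' = 'gbk'
)
STORED AS TEXTFILE
LOCATION 'oss://your-bucket/path/to/csv/';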

How do I reduce the number of output files when I use INSERT SQL statements in DLA?

You can use hints to adjust the values of the following parameters:

  • table_write_worker_count: the parallelism of the tasks that are run to insert data into files.

  • task_writer_count: the parallelism of the data write operations that are performed by each task.

The number of output files is calculated by using the following formula: table_write_worker_count × task_writer_count. If you need to reduce the number of output files, we recommend that you set the table_write_worker_count parameter to 10 and the task_writer_count parameter to 2.
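
The following sample statement applies these recommended values. The table names are placeholders, and this sketch assumes that both parameters are set in the same hint syntax as the other hints in this topic:

/*+ table_write_worker_count=10,task_writer_count=2 */
INSERT INTO target_tbl SELECT * FROM source_tbl;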

Note
  • The number of output files is difficult to control because the number is affected by a variety of factors, such as the cluster size and actual data distribution.

  • You can configure the number of output files only for DLA that uses the pay-per-CU billing type.

Does the serverless Presto engine of DLA support gzip-compressed data?

Yes, the serverless Presto engine of DLA supports gzip-compressed data. It also supports the Snappy compression algorithm.

Can the serverless Presto engine of DLA process compressed files and uncompressed files in a directory at the same time?

Yes, the serverless Presto engine of DLA can process compressed files and uncompressed files in a directory at the same time.

Can a table of DLA automatically and recursively read all subdirectories and files in the directory that stores the table?

Yes, a table of DLA can automatically and recursively read all subdirectories and files in the directory that stores the table.

When I execute an SQL statement to synchronize data from HDFS to OSS, the number of synchronized data records queried by using DLA is 110,000 and the number of synchronized data records queried by using a self-managed cluster is 190,000. Why?

Cause: An issue occurs when the data is synchronized to OSS.
Solution: You can check whether data is synchronized from HDFS to OSS as expected.

Performance-related issues

The error message "Query exceeded distributed user memory limit of 2.00TB/Query exceeded per-node user memory limit of xxGB" appears when I execute SQL statements of DLA to query data. What do I do?

In the serverless Presto engine of DLA, the memory that is allocated to a query request on a single node and the total memory of a cluster are limited. You can use one of the following methods to resolve this issue:

  • Optimize joins to reduce the memory usage. Place the table with a large amount of data on the left side of the join and the table with a small amount of data on the right side, because the engine builds the right table in memory. An example follows this list.

  • Partition data.

  • Split your query into sub-queries.
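
For example, the following sketch places the large table on the left and the small table on the right. The table names are placeholders:

SELECT o.id, c.name
FROM orders_large o       -- large table on the left
JOIN customers_small c    -- small table on the right, built in memory
ON o.customer_id = c.id;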

The serverless Presto engine of DLA scans the entire ORC file instead of the columns that are specified in the SQL statement. Why?

An analytical query extracts only a few columns of data from a table. Therefore, an execution engine, such as the serverless Presto engine, needs to scan only the data in the required columns. To achieve this purpose, make sure that the data in the underlying tables is stored in a columnar format, such as ORC. For example, you can execute the following statement to query a single column of data:

SELECT col1 FROM tbl1 WHERE col2 = 'hello';

In practice, if an analytical query is performed on a small or medium-sized table, or on a table that has a large number of columns, the size of the data that the serverless Presto engine scans may still be large. This is because the ORC_TINY_STRIPE_THRESHOLD parameter specifies whether the serverless Presto engine reads individual columns or the entire file. Sample code:

dataSizeSessionProperty(
    ORC_TINY_STRIPE_THRESHOLD,
    "ORC: Threshold below which an ORC stripe or file will read in its entirety",
    hiveClientConfig.getOrcTinyStripeThreshold(),
    false);

The default value of the ORC_TINY_STRIPE_THRESHOLD parameter is 8 MB. If the size of each stripe in a table is less than 8 MB, the serverless Presto engine reads the entire file instead of reading each stripe separately, because reading such small stripes individually is not cost-effective.

How do I enable the big query feature?

If a data query consumes large amounts of memory resources and the query always fails, you can enable the big query feature for the virtual cluster on which the query is run. To enable this feature, you can add a hint to the query statement. Sample statement:

/*+ big_query=true */ INSERT INTO table1 SELECT * FROM table2;
Note
  • The big query feature is supported only for DLA that uses the pay-per-CU billing type.

  • The big query feature limits the size of data that you want to query and the complexity of SQL statements. If the issue persists after you enable this feature, contact DLA technical support engineers.

  • The catalog of the schema in DLA must be Hive. Otherwise, the following error message appears: "big_query only support hive catalog ..."

How do I adjust the parallelism of data queries for RDS data sources?

When you use DLA to scan online data, such as data in an RDS database, the default parallelism of data queries that are sent over a Java Database Connectivity (JDBC) connection may not suit the specifications of your RDS instance. In this case, you can add the hint /*+jdbc-scan-splits=2*/ to a statement, as shown in the following example. This way, DLA sends two queries over a JDBC connection when it scans the data of a table.
mysql> /*+jdbc-scan-splits=2*/select count(*) from customer;
+----------+
| count(*) |
+----------+
| 15000002 |
+----------+

In the preceding statement, customer indicates the customer table in the ApsaraDB RDS for MySQL database. DLA uses two data queries that are sent over a JDBC connection to scan the data of the customer table in the ApsaraDB RDS for MySQL database.

Notice
  • The table that you want to scan must have an auto-increment primary key. Otherwise, you cannot add a hint to adjust the parallelism of data queries that are sent over a JDBC connection. If a table has an auto-increment primary key, you can split a query into multiple subqueries based on the auto-increment primary key. This way, the parallelism of data queries that are sent over a JDBC connection increases. Sample statements:

-- Original SQL statement
select * from customer;
-- SQL statements for the subqueries
select * from customer where id >= 0 and id < 10000;
select * from customer where id >= 10000 and id < 20000;
  • The maximum parallelism of data queries that are sent over a JDBC connection is 500, and the default parallelism is 1. If you increase the parallelism, use a small value for databases that run on small specifications so that the queries do not overload the database.

Other issues

What do I do if the error message "because '3.00199E9' in column '4' is outside valid range for the datatype INTEGER" appears?

Cause: The values in the underlying data exceed the range of the INT data type to which the column is mapped in DLA.

Solution: Map the column to the BIGINT data type instead.
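
For example, if the data is stored in an OSS table, you can declare the affected column as BIGINT. The table name, column name, and path are placeholders:

CREATE EXTERNAL TABLE sample_tbl (
  col4 BIGINT  -- values such as 3.00199E9 exceed the INT range
)
STORED AS TEXTFILE
LOCATION 'oss://your-bucket/path/to/data/';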

The error message "Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout' on the server." appears. What do I do?

Cause: The default value of the net_write_timeout parameter is small. When you use DLA to read data from a MySQL database, DLA may be disconnected from the MySQL database even though the data has not been completely read.

Solution: Increase the value of the net_write_timeout parameter.
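
For example, you can increase the value on the MySQL server. The value of 600 seconds is only an illustrative choice:

SET GLOBAL net_write_timeout = 600;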

When I use DLA to write data to a table in Tablestore, the error message "Code: OTSParameterInvalid, Message: Invalid update row request: missing cells in request" appears. What do I do?

Cause: The value of attribute_columns is empty for some records.

Solution: You must manually filter out the records whose attribute_columns is empty.
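
A minimal sketch of such a filter, with placeholder table and column names:

INSERT INTO ots_table
SELECT pk, attr1, attr2
FROM source_tbl
WHERE attr1 IS NOT NULL OR attr2 IS NOT NULL;  -- skip records whose attribute columns are all empty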