By default, Data Lake Analytics (DLA) synchronously executes SQL statements to query data. The client waits until the server returns the query results. However, if a large amount of data is involved, the execution of some statements for extract, transform, load (ETL) jobs, such as INSERT INTO and SELECT FROM, takes an extended period of time. In this situation, the server does not return any data to the client for a long time, and the connection between the client and DLA may be terminated. The asynchronous execution feature is introduced to resolve this issue. After you asynchronously execute an SQL statement, the server immediately returns an ID. You can then query the execution result of the SQL statement based on this ID.

Syntax

​mysql -h<end_point_host> -P<end_point_port>  -u<user_name> -p<password> -c -A
……………………
mysql> use public_dataset_tpch_1x_text;
Database changed
mysql> /*+ run-async=true */select count(*) from customer;
+--------------------------------+
| ASYNC_TASK_ID                  |
+--------------------------------+
| q202006161430hz89aab7ef0245339 |
+--------------------------------+
1 row in set (0.04 sec)

In MySQL 5.7.7 and earlier versions, the MySQL client deletes comments in SQL statements that are sent to the server. The comments include optimizer hints. If you use a version earlier than MySQL 5.7.7, you must use the -c option to add comments to the SQL statements.

General class

run-async

  • Description: specifies whether to asynchronously execute SQL statements.
  • Valid values: true and false. Default value: false.

max-running-time-sec

  • Description: specifies the maximum query duration, in seconds.
  • Valid values: 1 to 21600.

You can use this hint to control the SQL execution duration. If the SQL execution duration exceeds the value of this hint, the SQL statement is automatically terminated by the system.

join-distribution-type

  • Description: specifies the data distribution type that is used when tables are joined.
  • Valid values: AUTOMATIC, BROADCAST, and PARTITIONED.
  • Default value: AUTOMATIC.

If the right table is small and can be stored in memory, you can set this hint to BROADCAST to improve query performance. If you set this hint to PARTITIONED, large tables can be joined. However, the performance of joining small tables is low.

Cluster

  • Description: specifies the resource pool to which SQL query requests are sent.
  • Valid values: -.

OSS

insert-overwrite-ignore-conflict

  • Description: specifies whether to forcibly overwrite an existing path when you execute the INSERT OVERWRITE statement. If you set this hint to false, only the existing path that is recorded in metadata is overwritten. If you set this hint to true, the existing path is overwritten.
  • Valid values: true and false.
  • Default value: false.

JDBC

jdbc-scan-splits

  • Description: specifies the number of splits to create when you query the JDBC data source.
  • Valid values: 1 to 500.
  • Default value: 1.

jdbc-split-column

  • Description: specifies the column based on which a specified table is split when you query the JDBC data source. The column must be an INT or BIGINT column or a CHAR or VARCHAR indexing column. Otherwise, this hint is ignored.
  • Valid values: the name of a column.
  • By default, this parameter is left empty.

ots-query-version

  • Description: specifies whether to start the V2 protocol to query data.
  • Valid values: 1 and 2.
  • Default value: 2.

ots-filter-version

  • Description: specifies whether to start the filter of the V2 protocol to query data.
  • Valid values: 1 and 2.
  • Default value: 2.

ots-split-unit-mb

  • Description: specifies the default splitting unit.
  • Valid values: 1 and 2.
  • Default value: 2.

ots-fetch-size

  • Description: specifies the number of rows of data returned by Tablestore for each request.
  • Valid values: 100 and 100000.
  • Default value: 10000.

ots-start-version

  • Description: specifies the start version of the data. The value of this hint is a timestamp, in milliseconds. This hint is used to quickly filter out a large amount of multi-version data.
  • Default value: -1.

ots-end-version

  • Description: specifies the end version of the data. The value of this hint is a timestamp, in milliseconds. This hint is used to quickly filter out a large amount of multi-version data.
  • Default value: -1.

ots-split-optimize

  • Description: specifies whether to start the optimization logic such as regrouping the splits.
  • Valid values: true and false.
  • Default value: false.

ots-split-size-ratio

  • Description: specifies the proportion of splits after the optimization logic such as regrouping the splits is started.
  • Valid values: 0.0001 to 1.0.
  • Default value: 0.5.

ots-partition-prune

  • Description: specifies whether to cut away unneeded splits to improve performance.
  • Valid values: true and false.
  • Default value: true.

ots-insert-as-update

  • Description: specifies whether to replace the INSERT statement with the UPDATE statement.
  • Valid values: true and false.
  • Default value: false.

ots-loose-cast

  • Description: specifies whether loose casting is allowed. For example, cast LONG to DOUBLE or DOUBLE to LONG.
  • Valid values: true and false.
  • Default value: false.

ots-index-first

  • Description: specifies whether to perform an index-first query on a table that meets the specified conditions.
  • Valid values:
    • auto: DLA searches for indexes that are related to the table. If an index that meets the conditions is found, the index-first query is forcibly performed.
    • custom: DLA automatically searches for indexes that meet the conditions based on tables that you select. In tbl1, the specified database name does not need to be displayed because a database specified in the use xxx statement is already bound to the current connection. In the following example, only tbl1 and tbl2 are indexed.
​/*+ ots-index-first=[tbl1, dla_schema2.tbl2, ...] */ select * from tbl1 
join dla_schema2.tbl2 join dla_schema3.tbl3 where ...​
  • threshold: An index-first query is performed based on the data volume and the value of the threshold parameter. You can set the threshold parameter to a specific number of rows or a proportion. After you set the threshold parameter, if the data volume is less than the value of the parameter, the index-first query is performed. For example, if you set the threshold parameter to 200, the index-first query is performed only when the number of rows that meet the condition specified in the WHERE clause does not exceed 200. If you set the threshold parameter to 5%, the index-first query is performed only when the matching proportion in the WHERE clause does not exceed 5%. DLA can call the count interface of Tablestore to check the data volume.
​/*+ ots-index-first=threshold:200 */ select * from tbl1 where ...
/*+ ots-index-first=threshold:5% */ select * from tbl1 where ...​
  • Default value: No index is used.

ots-index-parallel-scan-mode

  • Description: specifies whether to use the parallel scan feature provided by Tablestore.
  • Valid values: true and false.
  • Default value: false.

For more information, visit https://developer.aliyun.com/article/776638?groupCode=datalakanalytics.

Notifications

After an SQL statement is executed, DLA asynchronously sends a notification. The following content introduces the hints that correspond to the two notification channels of DLA.

Message Queue

​/*+ run-async=true, mq-notify-by=ons, mq-topic=${A topic in Message Queue}, 
mq-producer-id=${The ID of a Message Queue group}, mq-endpoint=${Your Message Queue endpoint that is in the same region as DLA} */​
  • mq-notify-by: the name of the notification channel
  • mq-topic: the topic in Message Queue
  • mq-producer-id: the ID of a Message Queue group
  • mq-endpoint: the Message Queue endpoint

MNS

/*+ run_async=true, mq-notify-by=mns, mq-queue=${Your MNS queue} */
  • mq-notify-by: the name of the notification channel
  • mq-queue: the MNS queue

Export asynchronous query results

When you use asynchronous execution, the execution results are exported to OSS and are stored in the CSV format by default. DLA provides some hints to change the format in which query results are stored.

​/*+ force-persist-result=true, result-file-format=csv, result-col-del=[,], 
result-row-del=\r\n, result-meta-included=true */ 
select * from tbl1 ...​

If data is asynchronously exported to OSS, a single-threaded execution is used to write data to OSS. If the data volume is large, the execution requires a long period of time. If a large amount of data is exported, we recommend that you create an OSS table and then write data to this table.

  • result-file-format: the format of the result file, which can be CSV or JSON.
  • result-col-del: the column delimiter, which takes effect only for CSV files.
  • result-row-del: the row delimiter, which takes effect only for CSV files.
  • result-meta-included: specifies whether to generate the column name in the first row. Default value: false.
  • result-oss-location: the OSS path in which the result file is stored.

A number of customers want to use commas (,) rather than vertical bars (|) as delimiters. This is because commas (,) have special semantics and need to be escaped by [,], for example, result-col-del=[,], result-row-del=\r\n.

If you want to save a synchronous query result to OSS, you can use the following hint:

  • force-persist-result=true