By default, Data Lake Analytics (DLA) synchronously executes SQL statements to query data. The client waits until the server returns the query results. However, the execution of some SQL statements, such as INSERT INTO and SELECT FROM is time-consuming. These SQL statements involve extracting, transforming, and loading large amounts of data. In this situation, the server does not return any data to the client for a long time, and the connection between the client and DLA may be closed. The asynchronous execution function is introduced to resolve this issue. After you asynchronously execute an SQL statement, the server immediately returns an ID. You can then query the execution result of the SQL statement based on this ID.

​mysql -h<end_point_host> -P<end_point_port>  -u<user_name> -p<password> -c -A
……………………
mysql> use public_dataset_tpch_1x_text;
Database changed
mysql> /*+ run-async=true */select count(*) from customer;
+--------------------------------+
| ASYNC_TASK_ID                  |
+--------------------------------+
| q202006161430hz89aab7ef0245339 |
+--------------------------------+
1 row in set (0.04 sec)​

If you use a version earlier than MySQL 5.7.7, the MySQL client deletes comments in SQL statements that are sent to the server. The comments include optimizer hints. In this situation, use the -c option to add comments to the SQL statements.

General class

run-async

  • Description: specifies whether to asynchronously execute SQL statements.
  • Valid values: true and false. Default value: false.

max-running-time-sec

  • Description: specifies the maximum query time, in seconds.
  • Valid values: 1 to 21600.

You can use this hint to control the SQL execution time. If the SQL execution time exceeds the value of this hint, the SQL statement is automatically terminated by the system.

join-distribution-type

  • Description: specifies the data distribution type that is used when tables are joined.
  • Valid values: AUTOMATIC, BROADCAST, and PARTITIONED.
  • Default value: PARTITIONED.

If the right table is small and can be stored in memory, you can set this hint to BROADCAST to improve query performance. If you set this hint to PARTITIONED, the joining of larger tables is supported. However, the performance of joining small tables is low.

pool-selector

  • Description: specifies the resource pool to which SQL queries are sent.
  • Valid values: the name of a resource pool.

OSS

insert-overwrite-ignore-conflict

  • Description: specifies whether to forcibly overwrite an existing path when you execute the INSERT OVERWRITE statement. If you set this hint to false, only the existing path that is recorded in metadata is overwritten. If you set this hint to true, the existing path is overwritten.
  • Valid values: true and false.
  • Default value: false.

JDBC

jdbc-scan-splits

  • Description: specifies the number of splits to create when you query the JDBC data source.
  • Valid values: 1 to 500.
  • Default value: 1.

jdbc-split-column

  • Description: specifies the column based on which to split a specified table when you query the JDBC data source. The column must be an INT or BIGINT column or a CHAR or VARCHAR indexing column. Otherwise, this hint is ignored.
  • Valid values: the name of a column.
  • This parameter is empty by default.

OTS

ots-query-version

  • Description: specifies whether to start the V2 protocol to query data.
  • Valid values: 1 and 2.
  • Default value: 2.

ots-filter-version

  • Description: specifies whether to start the filter of the V2 protocol to query data.
  • Valid values: 1 and 2.
  • Default value: 2.

ots-split-unit-mb

  • Description: specifies the default splitting unit.
  • Valid values: 1 and 2.
  • Default value: 2.

ots-fetch-size

  • Description:
  • Valid values:
  • Default value:

ots-start-version

  • Description: specifies the start version of the data. The value of this hint is a timestamp, in milliseconds. This hint is used to quickly filter out a large amount of multi-version data.
  • Default value: -1.

ots-end-version

  • Description: specifies the end version of the data. The value of this hint is a timestamp, in milliseconds. This hint is used to quickly filter out a large amount of multi-version data.
  • Default value: -1.

ots-split-optimize

  • Description: specifies whether to start the optimization logic such as regrouping the splits.
  • Valid values: true and false.
  • Default value: false.

ots-split-size-ratio

  • Description: specifies the number of splits after the optimization logic such as regrouping the splits is started.
  • Valid values: 0.0001 to 1.0.
  • Default value: 0.5.

ots-partition-prune

  • Description: specifies whether to cut away unneeded splits to improve performance.
  • Valid values: true and false.
  • Default value: true.

ots-insert-as-update

  • Description: specifies whether to replace the INSERT statement with the UPDATE statement.
  • Valid values: true and false.
  • Default value: false.

ots-loose-cast

  • Description: specifies whether loose rate casting is allowed. For example, cast LONG to DOUBLE or DOUBLE to LONG.
  • Valid values: true and false.
  • Default value: false.

ots-index-first

  • Description: specifies whether to perform index-first query on a table that meets the specified conditions.
  • Valid values:
    • auto: DLA searches for indexes that are related to the table. If an index that meets the conditions is found, the index-first query is forcibly performed.
    • custom: DLA automatically searches for indexes that meet the conditions based on tables that you select. In tbl1, the specified database name does not need to be displayed because a database, such as use xxx is already bound to the current connection. In the following example, only tbl1 and tbl2 are indexed.
​/*+ ots-index-first=[tbl1, dla_schema2.tbl2, ...] */ select * from tbl1 
join dla_schema2.tbl2 join dla_schema3.tbl3 where ...​
  • threshold: Index-first query is performed based on the data volume and the value of the threshold parameter. You can set the threshold parameter to a specific number of rows or a proportion. After you set the threshold parameter, if the data volume is less than the value of the parameter, the index-first query is performed. For example, if you set the threshold parameter to 200, the index-first query is performed only when the number of rows that meet the condition specified in the WHERE clause does not exceed 200. If you set the threshold parameter to 5%, the index-first query is performed only when the matching proportion in the WHERE clause does not exceed 5%. DLA can call the count interface of Tablestore to check the data volume.
​/*+ ots-index-first=threshold:200 */ select * from tbl1 where ...
/*+ ots-index-first=threshold:5% */ select * from tbl1 where ...​
  • Default value: No index is used.

Notifications

After an SQL statement is executed, DLA asynchronously sends a notification. The following content introduces the hints that correspond to the two notification channels of DLA.

ONS

/*+ run-async=true, mq-notify-by=ons, mq-topic=${A topic in MQ}, 
mq-producer-id=${The group ID of an MQ group}, mq-endpoint=${Your MQ endpoint that is of the same region as DLA} */
  • mq-notify-by: the name of the notification channel
  • mq-topic: the topic in MQ
  • mq-producer-id: the ID of an MQ group
  • mq-endpoint: the MQ endpoint

MNS

/*+ run_async=true, mq-notify-by=mns, mq-queue=${An MNS queue} */
  • mq-notify-by: the name of the notification channel
  • mq-queue: the MNS queue

Export asynchronous query results

When you use asynchronous execution, the execution results are exported to OSS and are stored in the CSV format by default. DLA provides some hints to change the format in which query results are stored.

​/*+ force-persist-result=true, result-file-format=csv, result-col-del=[,], 
result-row-del=\r\n, result-meta-included=true */ 
select * from tbl1 ...​

If data is asynchronously exported to OSS, a single-threaded execution is used to write data to OSS. If the data volume is large, the execution is time consuming. If large amounts of data is exported, we recommend that you create an OSS table and then write data to this table.

  • result-file-format: the format of the result file, which can be CSV or JSON.
  • result-col-del: the column delimiter, which takes effect only for CSV files.
  • result-row-del: the row delimiter, which takes effect only for CSV files.
  • result-meta-included: specifies whether to generate the column name in the first row. Default value: false.
  • result-oss-location: the OSS path in which the result file is stored.

A number of customers want to use commas (,) rather than vertical bars (|) as delimiters. However, commas (,) have special semantics and need to be escaped by [,], for example, result-col-del=[,], result-row-del=\r\n.

If you want to save a synchronous query result to OSS, you can use the following hint:

  • force-persist-result=true