If you use Data Lake Analytics (DLA) to query data from Tablestore, you can add hints to SQL statements. This allows you to optimize query performance by using one of the following methods:
Shard tables
DLA is a distributed computing system that shards each table in Tablestore based on specific rules. The shards do not overlap with each other. The computing layer concurrently executes these shards to improve the concurrency of computing. You can use the following parameters to shard tables:
-
ots-split-unit-mb
The
ots-split-unit-mb
parameter specifies the size of each shard. Valid values: 1 to 102400. Default value: 10.Notes
-
If this parameter is set to a too small value for a table that contains large amounts of data, a large number of shards are generated. As a result, timeouts frequently occur because Tablestore cannot promptly execute these shards. This affects the stability of data writing. If the table you want to access contains more than 10 GB of data, we recommend that you set this parameter to 256. If the table contains more than 100 GB of data, we recommend that you set this parameter to a value greater than 2048.
-
If this parameter is set to a too large value for a table that contains very small amounts of data, DLA may not concurrently execute the shards, which causes a long latency. In this situation, decrease the value of this parameter.
-
-
ots-split-optimize
The
ots-split-optimize
parameter specifies whether to enable the function to automatically optimize sharding. If this parameter is set to true, this function is enabled. Default value: false.In some cases, SQL statements automatically generate a large number of shards. As a result, timeouts occur because Tablestore cannot promptly execute these shards. In addition, the shard list generated by Tablestore is mostly sequential. Adjacent shards are stored on the same host. As a result, hotspot problems may occur when DLA accesses Tablestore.
DLA allows you to prevent these problems by taking measures, such as automatically dispersing shards, merging adjacent shards, and distributing shards at random.
-
ots-split-size-ratio
The
ots-split-size-ratio
parameter specifies the proportion that is used to control the total number of shards after adjacent shards are merged. You can specify this parameter only afterots-split-optimize
is set to true. Valid values: 0.0001‒1.0000.If you have 100,000 shards and set the
ots-split-size-ratio
parameter to 0.3, the system merges adjacent shards and reduces the total number of shards to about 30,000. The total number of shards after merging is based on the service capacity of Tablestore.
Adjust statements
-
ots-insert-as-update
The
ots-insert-as-update
parameter specifies whether to replace the INSERT statement with the UPDATE statement. Valid values: true and false. Default value: false.Tablestore SDKs provide the
RowPutChange
andRowUpdateChange
operations to update data. This update does not change the values of the primary keys. You can call theRowPutChange
operation to update data by overwriting an entire row based on primary keys. You can also call theRowUpdateChange
operation to update data in some non-primary key columns based on primary keys. DLA does not support the UPDATE statement. You can only use the INSERT statement and hints to update the data stored in Tablestore. -
ots-loose-cast
The
ots-loose-cast
parameter specifies whether to use loose type casting in SQL statements, for example, whether to convert the data type LONG to DOUBLE or DOUBLE to LONG. Valid values: true and false. Default value: false.DLA runs a strong-typed language. If the data type of a field in Tablestore is BIGINT and DLA defines the field type as VARCHAR or STRING, DLA returns an error when SQL statements are executed. In this case, you can set the
ots-loose-cast
parameter to true to enable loose type casting. After loose type casting is enabled, DLA uses this function to automatically convert the data type. For example,"123"
of the STRING type is converted to123
of the INTEGER type.
Use search indexes
The ots-index-first
parameter specifies whether to query tables by using search indexes first. If this
parameter is set to auto, DLA queries tables by using search indexes first.
In most cases, DLA queries and analyzes data from the base table of Tablestore. If you enable search indexes from which you want to query data, DLA uses search indexes whenever possible.
Search indexes of DLA differ from strong consistent indexes such as the secondary indexes of MySQL in the following ways:
-
Search indexes of DLA are based on asynchronous replication and not strongly consistent with the base table.
-
Search indexes of DLA are based on inverted indexes rather than B-Tree indexes.
After search indexes are enabled, DLA compares indexes with what you query to determine whether search indexes can be used.
Control the number of concurrent requests on a single DLA node
The task-concurrency
parameter specifies the number of concurrent requests on a single DLA node. Valid
values: integers from 1 to 32. Default value: 32.
A DLA virtual cluster (VC) has multiple nodes, and data is obtained from Tablestore
in highly concurrent requests. If a large number of requests time out or are retried
on Tablestore, the number of requests on DLA becomes too large. This affects the stability
of the DLA VC. For security purposes, if the execution of SQL statements fails due
to timeouts that occur on Tablestore, you must control the concurrency of requests
on DLA, for example, set the task-concurrency
parameter to 2.