This topic describes how todetermine the optimal number of values in the IN clause of a query when you perform the query in Distributed Relational Database Service (DRDS).

Features

In actual scenarios, you often need to use some constant metrics as the query conditions to perform IN queries. For each IN query, the field values in the IN clause are the values of the partition key. For example, in an e-commerce scenario, all orders are recorded in a table named Order. This table is partitioned by order ID. A buyer often queries the details about specific orders based on the order list. If the buyer has two orders, the number of values in the IN clause of the query is two. Theoretically, the query is routed to two shards. The following code provides an example of the SQL statement:

SELECT * FROM ORDER WHERE ORDER_ID IN (id1,id2)

The number of values in an IN clause increases as the number of the buyer's orders increases. An IN query may need to be routed to all the shards. This increases the response time (RT) of the query. The following figure shows the relationship among the number of values in the IN clause, the RT, and the number of shards to scan.

25337301

To minimize the scan workload and reduce the number of SQL statements caused by increases in the number of values within an IN clause, DRDS V5.4.8-16069335 and later support dynamic partition pruning based on the number of values in the IN clause.

If the Order table in the preceding example has 128 shards and the number of values in the IN clause is 128, you need to run the following query:

SELECT * FROM ORDER WHERE ORDER_ID IN (id1,id2,id3....id128)

In versions earlier than DRDS V5.4.8-16069335, the IN query can be routed to up to 128 shards if each value resides in a different shard. The IN clause in each physical query statement that is sent to ApsaraDB RDS for MySQL or PolarDB for MySQL contains 128 values because the values are not pruned. This increases the workload of executing SQL statements. The following code provides an example of the SQL statements:

SELECT * FROM ORDER WHERE ORDER_ID_1 IN (id1,id2,id3....id128);
SELECT * FROM ORDER WHERE ORDER_ID_2 IN (id1,id2,id3....id128);
SELECT * FROM ORDER WHERE ORDER_ID_3 IN (id1,id2,id3....id128);
.....
SELECT * FROM ORDER WHERE ORDER_ID_128 IN (id1,id2,id3....id128);

In DRDS V5.4.8-16069335 or later, DRDS calculates the shards to which the IN query is to be routed at the computing layer. Before each physical query statement is sent to ApsaraDB RDS for MySQL or PolarDB for MySQL, DRDS uses the dynamic partition pruning feature to ensure that the IN clause in the SQL statement of the query contains only the values stored in the shard. This increases the throughput and decreases the RT for IN queries. The following code provides an example of the SQL statements:

SELECT * FROM ORDER WHERE ORDER_ID_1 IN (id1);
SELECT * FROM ORDER WHERE ORDER_ID_2 IN (id2,id12);
SELECT * FROM ORDER WHERE ORDER_ID_3 IN (id3,id4,id5);
.....
SELECT * FROM ORDER WHERE ORDER_ID_32 IN (id100....id128);

In addition, DRDS uses the single-node parallel execution feature to execute the IN query on different shards in parallel. For example, if the values in the IN clause of a query are distributed across 32 shards, the degree of parallelism for each query is equal to the number of cores of your node. If your node in DRDS has 16 cores, the default degree of parallelism is 16. In this case, the IN query routed to 32 shards is completed in 2 batches.

Alibaba Cloud provides the following recommendations on determining the number of values in the IN clause of a query based on its experience:
  • Specify a number that is significantly smaller than the number of shards so that not all shards need to be scanned each time a query is performed.
  • Make sure that the number of values does not increase due to business development. This prevents the query performance from being compromised because of business development.
  • To maintain low RT and a high throughput, specify a number that ranges from 8 to 32.

If the preceding requirements are met, linear scalability can be performed for your business to handle concurrent IN queries, with nearly no obvious increase in the RT. For example, DRDS in which a 16-core node is deployed supports the concurrent execution of 10,000 IN queries. After another 16-core node is added, DRDS supports the concurrent execution of 20,000 IN queries.

25337302

Comparison test

This test aims to help you determine the optimal number of values in the IN clause of a query so that you can maintain low RT and a high throughput. In the test environment, two nodes and a table that has 64 shards are used. The specification for each of the nodes is 16 cores and 64 GB memory. Each of the table shard contains millions of data records. The test result shows the RT and throughput changes as the number of values in the IN clause and the number of concurrent queries increase. In DRDS V5.4.8-16069335 and later, the dynamic column pruning feature is optimized to process IN queries. The unnecessary values in the IN clause of each SQL statements in your query are pruned. The following figures show the test results.

  1. The following figure shows the RT changes as the number of concurrent queries and the number of values in the IN clause increase, after the dynamic column pruning feature is enabled. 25337303
  2. The following figure shows the throughput changes as the number of concurrent queries and the number of values in the IN clause increase, after the dynamic column pruning feature is enabled. 25337304
  3. The following figure shows the RT changes as the number of concurrent queries and the number of values in the IN clause increase, after the dynamic column pruning feature is disabled. 25337305
  4. The following figure shows the throughput changes as the number of concurrent queries and the number of values in the IN clause increase, after the dynamic column pruning feature is disabled. 25337306
From the preceding figures, you can draw the following conclusions:
  • When the number of values in the IN clause is 8 to 32, you can maintain low RT and a high throughput. In addition, the degree of parallelism is almost identical to the default degree of parallelism of the single-node parallel execution feature provided by DRDS. The default degree of parallelism is equal to the number of the CPU cores of a node.
  • After the dynamic column pruning feature provided by DRDS V5.4.8-16069335 or later is enabled, the RT for IN queries is decreased and the throughput for IN queries is increased. Therefore, we recommend that you upgrade your DRDS to V5.4.8 or later so that you can use the dynamic partition pruning feature.