IN Queries with Constant Values in PolarDB-X

Scenario

In practice, you often need to run IN queries on constant values, and the IN column is frequently the partition key. Take an e-commerce scenario with two tables, a buyer table and an order table. The order table records the details of each order and is hash-partitioned by order ID; the buyer table stores each buyer ID together with its associated order IDs. A buyer often needs to look up all the orders they have placed. A common approach is to first query the buyer table to obtain all of the buyer's order IDs, and then query the order table for the details of those orders. Assuming the order table has 4 partitions and buyer A's order IDs are 1, 2, 4, 5, and 9, this produces the following SQL: an IN query with 5 values.

Logical SQL: select * from order where order_id in (1, 2, 4, 5, 9);

Execution mode

Unlike MySQL, the PolarDB-X architecture is split into a compute layer and a storage layer. The DN (data node) in the storage layer stores all the data. For more background, refer to the column Introduction to PolarDB-X.

After receiving this logical SQL, how does a compute node pull data from the DN nodes and perform the computation?

Naive execution mode

In the naive mode, all IN values are included in the physical SQL that is sent to every partition, as shown below. This naive implementation brings two problems. First, as the number of partitions in the order table grows, the number of partitions that must be scanned grows with it, even though the number of partitions actually holding the buyer's orders does not. Second, as the number of IN values grows, the physical SQL degrades from an index scan into a full-table scan when executed on the DN. These two effects together cause RT to rise sharply.
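The naive fan-out can be sketched as follows. This is a toy illustration, not PolarDB-X code; the table and column names come from the article's example, and the 1-based partition numbering is an assumption.

```python
def naive_fanout(in_values, num_partitions=4):
    """Naive mode: every partition receives the full IN list,
    regardless of whether it can contain any matching row."""
    in_list = ", ".join(map(str, in_values))
    sql = f"select * from order where order_id in ({in_list})"
    # One identical physical SQL per partition (partitions numbered 1..N here).
    return {p: sql for p in range(1, num_partitions + 1)}

plans = naive_fanout([1, 2, 4, 5, 9])
# 4 physical SQLs, each carrying all 5 IN values
```

Every partition is scanned even though buyer A's orders live in only three of them, which is exactly the waste the next section removes.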

Implementation based on dynamic pruning

From the discussion above, we want to prune away the partitions that cannot contain the required data. Since the IN column is the partition key, we can apply the partitioning algorithm to compute the partition of each IN value, as shown below.

hash(1) = 1, hash(2) = 2, hash(4) = 4, hash(5) = 1, hash(9) = 1

In this example, we clearly do not need to send any physical SQL to partition 3. Furthermore, since partition 1 can only contain the rows with order IDs 1, 5, and 9, and cannot contain order IDs 2 or 4, we can also trim the IN values carried by each partition's physical SQL.
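The pruning step can be sketched like this. The partition hash used here is a stand-in chosen only to reproduce the article's example values (1-based partition numbers); PolarDB-X's real hash function differs.

```python
from collections import defaultdict

def prune(in_values, num_partitions=4):
    """Group IN values by target partition; a partition that receives
    no value needs no physical SQL at all."""
    # Stand-in hash matching the article's example: hash(1)=1, hash(5)=1, ...
    part = lambda v: ((v - 1) % num_partitions) + 1
    groups = defaultdict(list)
    for v in in_values:
        groups[part(v)].append(v)
    return dict(groups)

print(prune([1, 2, 4, 5, 9]))
# {1: [1, 5, 9], 2: [2], 4: [4]} -- partition 3 is pruned entirely
```

Each partition's physical SQL now carries only the IN values that can actually hash to it.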

Now that we know the query involves three partitions, how should the three physical SQLs be issued; in other words, with what parallelism do we send the SQLs and wait for the results? There are two extremes. One is fully serial, which is extremely inefficient. The other is fully parallel: enough threads are spawned to send physical SQL to every involved partition at once. The problem with full parallelism is that when the number of partitions is large, it puts great pressure on the system and causes a large number of thread context switches. We therefore set the parallelism of physical SQL, by default, to the number of CPU cores on the machine, and issue the physical SQLs in a sliding-window fashion.
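The sliding-window dispatch described above can be sketched with a bounded thread pool; this is an illustration under the article's description, not PolarDB-X's actual scheduler.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def execute_physical_sqls(physical_sqls, run_sql, parallelism=None):
    """At most `parallelism` SQLs are in flight at once; as one finishes,
    the next slides into the freed slot (sliding-window style)."""
    # Default parallelism: one in-flight SQL per CPU core, as the article states.
    parallelism = parallelism or os.cpu_count()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map preserves input order while executing concurrently.
        return list(pool.map(run_sql, physical_sqls))

# Toy stand-in for sending one SQL to a DN and collecting the rows back.
results = execute_physical_sqls(
    [1, 2, 4], lambda p: f"rows from partition {p}", parallelism=2)
```

With parallelism 2 and three partitions, two SQLs go out immediately and the third is dispatched as soon as a slot frees up.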

In addition, we do not want the physical SQL sent to each partition to carry too many IN values; otherwise, the DN is likely to perform a full-table scan instead of the expected index scan. Therefore, when there are many IN values, we split them and send them in batches. The execution process is shown in the following figure.

Continuing the example, suppose our CN has only two cores and each physical SQL may carry at most two IN values (in practice the limit is not set this small). We will then issue four physical SQLs in total, as shown below, in two batches; that is, the compute node and the data nodes need two network round trips. By contrast, if we only had to issue two physical SQLs, all of them could go out in a single batch, requiring only one round trip. Therefore, when RT matters, we recommend keeping the number of IN values small enough that the number of partitions involved does not exceed the number of CPU cores and each partition receives only one physical SQL.
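The batch arithmetic in this paragraph can be checked with a short sketch; the function name and the per-partition value map are illustrative, following the pruned example above.

```python
import math

def plan_batches(values_by_partition, max_in, parallelism):
    """Split each partition's IN values into chunks of at most `max_in`
    (one physical SQL per chunk), then count the round trips needed when
    at most `parallelism` SQLs run concurrently."""
    sqls = []
    for part, vals in sorted(values_by_partition.items()):
        for i in range(0, len(vals), max_in):
            sqls.append((part, vals[i:i + max_in]))
    round_trips = math.ceil(len(sqls) / parallelism)
    return sqls, round_trips

sqls, round_trips = plan_batches(
    {1: [1, 5, 9], 2: [2], 4: [4]}, max_in=2, parallelism=2)
# 4 physical SQLs: (1, [1, 5]), (1, [9]), (2, [2]), (4, [4]); 2 round trips
```

Partition 1 needs two SQLs because its three values exceed the two-value cap, giving four SQLs overall and hence two batches at parallelism 2.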

Challenges from a variable number of IN values

To speed up SQL execution, we cache the execution plans of parameterized SQL. The number of IN values in business queries often varies, which can fill the plan cache with variants of the same SQL that differ only in the number of IN values, evicting the cached plans of other SQL. The fix is simple: we replace the entire IN list with a single question mark, so all variants share one plan-cache entry.

Test

On two nodes of 16C64G specification, we tested a hash-partitioned table with 64 partitions, each holding millions of rows, under different numbers of IN values and different concurrency levels. The test results are as follows.
