Community Blog About Database Kernel | PolarDB Optimizer Query Transformation: IN-List Transformation

About Database Kernel | PolarDB Optimizer Query Transformation: IN-List Transformation

This article focuses on the IN-list transformation of PolarDB, which showcases the query rewriting capabilities of PolarDB.

By Jiang Chen (Tiantai)

The database query optimizer is like the brain of the entire system. The efficiency of executing an SQL statement can vary significantly based on different optimization decisions. Therefore, the optimizer is one of the core components and competitive advantages of a database system. Alibaba Cloud PolarDB for MySQL is a cloud-native database that can handles various user scenarios and loads, helping enterprises go online and increase their data value. To enhance its capabilities, we have focused on improving the query transformation functionality of PolarDB for MySQL.

This series of articles will focus on the query transformation capability of PolarDB for MySQL and introduce some of our accumulated work in this direction.

This is the fourth article in the PolarDB optimizer query transformation series. The first three articles describe the following topics:

1. Join elimination

2. Window functions

3. Join condition pushdown


As an HTAP database, PolarDB for MySQL has conducted extensive research on optimizing complex SQL queries. Early users had simple SQL statements, and the standalone capabilities of MySQL were limited. However, as business data grows and scenarios become more complex, there is a growing need for powerful databases to meet statistical and reporting requirements.

PolarDB has made in-depth efforts in terms of parallelism, query transformation, and optimizer. The main objective of these efforts is to enable users to execute complex queries faster. This article focuses on the IN-list transformation of PolarDB, which showcases the query rewriting capabilities of PolarDB. Below is an example of a common slow SQL query, specifically involving the in-function operation with multiple constants.

select        sum(l_extendedprice) / 7.0 as avg_yearly
            l_partkey in (
9628136,19958441,10528766,.......); There are thousands of constant values in the in# list.

SQL statements are commonly used to filter queries on a single table and then perform aggregation. The actual execution of these queries can be time-consuming, which is because the IN-List contains thousands of constant values.

Native MySQL

The native MySQL execution plan is as follows:

| EXPLAIN                                                                                           |
| -> Aggregate: sum(lineitem.L_EXTENDEDPRICE)
    -> Filter: (lineitem.L_PARTKEY in (9628136,19958441,10528766,....) (cost=60858714.81 rows=297355930)
        -> Table scan on lineitem  (cost=60858714.81 rows=594711859)

The execution process involves linearly scanning 590 million lineitem data and checking whether they exist in the IN-List one by one. This operation is performed by the Item_func_in operator, which handles a relatively large number of in-set elements. For testing purposes, we used 100,000 constant values. However, this operator takes a significant amount of time to evaluate, requiring approximately 375 seconds to complete the test.

Let's take a look at the execution logic of the Item_func_in code:

● First, it determines if a binary search is possible. If so, it converts the IN-List into an ordered array.

● If an ordered array is generated, it attempts a binary search during execution.

● Otherwise, it linearly scans the data to check whether the left expression is equal to any item in the IN-List.

From this evaluation logic, we can see that it is already quite efficient, and there is little room for optimization for this operator. The main issue lies in the high number of iterations in the outer loop. If the number of iterations in the outer loop can be reduced, the latency can also be reduced.


The solution of PolarDB is to transform the SQL query and transform the IN-List into a materialized table and add the join list to the table. The transformation process is as follows:

Step 1: Convert the preceding SQL statement to an in subquery

select ... from lineitem where l_partkey in (...)
select ... from lineitem where l_partkey in 
 (select dt._col_1 from (values (9628136),(19958441),...) dt)

Step 2: SubQuery Unnest - Eliminate subqueries

Subqueries are already uncorrelated, and through SU technology, subqueries can be eliminated and transformed into semi-join. As the materialized table is deduplicated, and the join column is not empty, it can be transformed into an inner-join.

SQL will continue to be rewritten as:

select ... from lineitem, (values (9628136),(19958441),...) dt) where l_partkey = dt._col_1

The following benefits can be obtained through this transformation:

● Eliminating the need for filtering one by one because the MySQL actuator follows a volcano model. Adding a filter operator introduces an extra layer of virtual function calls.

● Utilizing the join buffer in JOIN operations, allowing for joining batch by batch. This is a key advantage of transforming into the join list.

● Exploiting numerous optimizations related to join operations, such as join order and access path. This enables the selection of the most optimal execution plan.

The final plan is as follows:

| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                 |
| -> Aggregate: sum(lineitem.L_EXTENDEDPRICE)
    -> Nested loop inner join
        -> Table scan on dt
            -> Materialize with deduplication
                -> scan on in-list: 100000 rows
        -> Index lookup on lineitem using LINEITEM_FK2 (L_PARTKEY=dt._col_1), with index condition: (lineitem.L_PARTKEY = dt._col_1)  (cost=7.34 rows=29)

The amount of data in the materialized table is small. As an external table, inner-join successfully uses the lineitem index. You only need to scan 100,000 materialized table records and then use the LINEITEM_FK2 index to join. The entire SQL statement takes only 32 seconds to execute.

Test Effect

After optimization, the PolarDB IN-List performance of the TPCH 100 GB dataset is improved by 11.5 times compared with that of the native mode. In addition, as PolarDB supports parallel queries, it is improved by more than 100 times in 32 parallelism mode.



In theory, after PolarDB converts an IN-list to a Join-list, the following improvements can be achieved:

● Constants in the IN-List are materialized and deduplicated, and the cardinality may be greatly reduced, which depends on duplicate values.

● After the IN-List is eliminated, it becomes a materialized table. After participating in Join-List, there are more access path choices, including the selection of better indexes and various join methods such as hash join or nest loop join.

To complete the IN-List conversion, other tasks need to be carried out, such as adapting to the prepared statement protocol and parallel query protocol. In the future, we will introduce more content on query rewriting. We appreciate your attention.

0 1 0
Share on


358 posts | 49 followers

You may also like