About Database Kernel | PolarDB IMCI Streamlines Complex Subqueries

By Yukun Liang

1. Architecture of the IMCI Query Optimizer

In-memory column index (IMCI) is a technology that is used to accelerate complex SQL queries in PolarDB for MySQL. It improves real-time data analysis performance by one to two orders of magnitude through the addition of columnar indexes to the PolarDB storage engine and the use of parallel vectorized SQL operators. This allows customers to run transaction processing and complex analysis workloads simultaneously in a single PolarDB cluster, simplifying the business architecture and reducing usage costs.

IMCI retains 100% compatibility with MySQL syntax, allowing transparent acceleration of complex queries without modifying the syntax. This is achieved through the unique architecture of the SQL parser and optimizer of PolarDB.

After an SQL statement is processed by the parser, a LogicalPlan is generated. Then, the row-based optimizer generates a PhysicalPlan and calculates the execution cost. For SQL statements whose cost exceeds the configured threshold, the IMCI executor performs rule optimization and cost optimization again. IMCI operators do not support direct subqueries, so the key process of the second step includes subquery unnesting and join reordering. This article mainly explains the key technology of subquery unnesting in IMCI.

2. Role of Correlated Subqueries

Correlated subqueries are a feature widely used in various business scenarios. Without optimizations like indexing, the fundamental execution of correlated subqueries resembles that of a nested loop. Under queries involving large datasets, this execution method becomes excessively complex and often unacceptable for users. As a result, unnesting subqueries within the optimizer is essential to rewrite them into subqueries without correlated elements, followed by the application of more efficient join algorithms to enhance query performance. Since IMCI generally does not rely on indexes to execute queries, the efficiency of processing nested loop join-style correlated subqueries is too slow for customer acceptance. Therefore, in PolarDB-IMCI, we have developed a set of unnesting rules for subqueries, effectively transforming the majority of subqueries and significantly speeding up the execution of correlated subqueries in IMCI.

3. Introduction to Correlated Subqueries: An Example

The following SQL statement is a classic example of a correlated subquery:

SELECT COUNT(*) FROM t1
WHERE 
  t1.c1 > (SELECT MAX(t2.c2) 
             FROM t2 
             WHERE t1.c1 = t2.c1); -- subquery

In the preceding SQL statement, the condition in this subquery is t1.c1 = t2.c1, among which t1.c1 references the value of the outer query. The original query plan of this query is as follows:

Due to the existence of this filter with associated items in the lower left corner, for each row in t1, we have to substitute it into the query on the right and execute it in a similar way to nested loop join. In other words, for each row in t1, we have to execute the query on the right again. If t1 and t2 have large amounts of data, and the join algorithm used here is nested loop join, the queries will take too long.

Some articles on SQL statement optimization may mention that for the preceding SQL statement, we can rewrite it as follows to accelerate queries:

SELECT COUNT(*) 
FROM t1, 
   (SELECT t2.c1 as c1, MAX(t2.c2) as max_c2
      FROM t2 
      GROUP BY t2.c1) AS tmp
WHERE 
  t1.c1 = tmp.c1 AND t1.c1 > tmp.max_c2;

After rewriting, the associated items in the original subquery are extracted and the correlated subquery disappears. Now the query plan is as follows:

As you can see, the original nested loop join disappears, and we don't have to execute subqueries over and over again as before, which is inefficient.

Comparing the new query plan which has unnested subqueries with the original one, it can be found that this rewriting only does one simple thing: pulling the associated filter up to a location where it can directly obtain the associated item, as shown in the following figure.

The filter with the associated item disappears after this operation, and the nested loop join becomes a hash join due to the added equivalence conditions. Let's extend this idea further. Can all subqueries do this? The answer is yes.

4. Algorithm for Subquery Unnesting

In the query plan mentioned earlier, the nested loop join, referred to as the dependent join (also known as apply in SQL Server). The main differences between a nested loop join and a regular join are:

• A nested loop join is an inner join with no predicate.
• Its outer plan references the lines that are output by the inner plan, such as the t1.c1 = t2.c1.

As previously mentioned, the fundamental idea behind unnesting is to continuously push the associated item upwards until it crosses the dependent join, effectively eliminating it. In the following sections, we will achieve the goal of removing any correlated subqueries through the combination of multiple rules. Subsequently, we will categorize the rules based on the root node of the outer plan of the dependent join. If we can handle any type of root node, then we can eliminate the associated items of the query by repeatedly applying the rules.

4.1 Pushdown Rules

The following is an evident rule, where F(T) ∩ A(D) = ∅ indicates that T does not reference any result of D. This rule means that if T does not reference any result of D (meaning no association), this is a normal join.

Another rule is a general one with no usage restrictions and is primarily used to enhance execution efficiency.

In this context, D represents the distinct result of all columns referenced by T₂ in T₁, known as MagicSet in SQL Server. The benefit here is that for the query on the left side of the equation, we must execute the T₂ subquery for every row of T₁, whereas for the query on the right side, we only execute the T₂ subquery for each row of D. As the result set of D is guaranteed to be smaller than that of T₂, this method can speed up the execution of correlated subqueries. MagicSet will also be utilized in later sections about subquery unnesting.

Filters and Projects

If the root node of the outer plan is a filter, we can apply the following rules to pull this filter to the join.

It is the inverse operation of ordinary predicate pushdown operations.

That is also the case for projects. A(D) represents all the columns output by D, and we only need to combine them with column A to be output by the project.

Group by

This method can be used in the SQL syntax of PostgreSQL if the root node of the outer plan is a group by clause. The aggregate remains unchanged, and the union of the grouping column and all columns of D is taken. (It is to add the unique key of D to the group by column).

For the case of A={∅}, that is, scalar aggregation, the situation is different. First, the pushed-down inner join should be changed to the outer join. Second, some aggregate functions need to be rewritten, such as the following SQL:

SELECT c_custkey, (
    SELECT COUNT(*)
    FROM ORDERS
    WHERE o_custkey = c_custkey
) AS count_orders
FROM CUSTOMER;

If you do not make any changes to the SQL after the transformation, the SQL becomes:

SELECT c_custkey, COUNT(*)
FROM CUSTOMER LEFT JOIN ORDERS ON o_custkey = c_custkey
GROUP BY c_custkey;

If a user does not have any orders, SQL1 should return [c_custkey_1, 0]. But in SQL2, LEFT JOIN will first return a row of [c_custkey_1, NULL], and then the aggregate function will return [c_custkey_1, 1], which is inconsistent with the result before the transformation.

The reason for this result is that the transformed aggregation cannot distinguish whether NULL comes from the ORDERS table or is generated by LEFT JOIN. Therefore, we need to rewrite the aggregate function to distinguish the two types of NULL. The change is also very simple. You just need to change COUNT(*) into COUNT(not_null_column). For example, the correct rewrite of this SQL is as follows:

SELECT c_custkey, COUNT(o_orderkey) -- Replace it with the primary key of the orders table (not null).
FROM CUSTOMER LEFT JOIN ORDERS ON o_custkey = c_custkey
GROUP BY c_custkey;

Join

If the root node of the outer plan is a join, there are different unnesting methods according to the type of join. The following is the simplest inner join:

F( T₂) ∩ A(D) = ∅ means that T₂ does not reference the columns in D. This rule actually plays the role of join reordering. By rearranging the join order, D is first joined with the subquery with the associated items so as to carry out the next unnesting.

The method for outers and semi-joins is similar.

There are some other pushdown rules, which are not repeated here due to space limitations. If you are interested, please refer to the original paper. [2]

4.2 Rule Running Process

Let's take a simplified version of TPC-H Q17 as an example:

select
  COUNT(*)
from
  lineitem l1
where
  l_quantity < (
    select
      avg(l_quantity) as l_avg
    from
      lineitem l2
    where
      l1.l_partkey = l2.l_partkey
  );

The plan for the query that has not been unnested is as follows:

After we apply the rules mentioned earlier to this query plan:

In this example, we apply two rules to pull the group by and filter to the top of the join, and the associated items are eliminated, thus completing the unnesting of the subquery.

4.3 Some Exceptions

All of the preceding rules are aimed at the case where the dependent join is an inner join. However, this is not always the case in users' SQL statements. Here is an example of a simplified SQL statement of a user:

SELECT
  c1,
  (SELECT t2.c2 FROM t2 LEFT JOIN t3 ON t2.c1 = t3.c1 AND t3.c3 = t1.c3 GROUP BY t2.c2)
FROM t1;

According to the semantics of scalar subqueries, we may convert an execution plan of the following shape:

This plan is different from other queries in this article in one way: t1 and the correlated subquery are connected by the outer join instead of the inner join.But all of the preceding rules are for the inner join. Here IMCI uses a very "mathematical" way to deal with this situation: It converts the semi join and the left join to the preceding inner join. IMCI converts these two joins to inner joins in the following way.

The following figure shows the case of an outer join. The rules for semi and anti-semi joins are almost the same as those for the outer join.

The MagicSet mentioned earlier is used here to accelerate queries. Subq X can also be used directly to complete the corresponding conversion. After the conversion, the previous outer join is no longer dependent. Instead, it is changed to the following inner join. Then, we can process this generated subquery by using the preceding conversion rules to remove all associations in the query.

Combined with these rules, IMCI can solve almost all subqueries that are common in user scenarios. The following is an example of a correlated subquery:

select 
  * 
from 
  t1 left join t2 on 
    (t1.a + t2.a = 
       select 
         a 
       from 
         t3 
       where 
         t1.a = t3.a and t2.b = t3.b)

For this SQL, we first list the initial execution plan and pull up the filter on the t3 table.

Here, we pull up the filter and the Max1Row check together into the left join to generate a left single join. It is to perform the check while joining: For each row of t2, at most one row in t3 can match it. The left outer apply is then converted to an inner apply.

Note that the predicate of the left join is changed from t1.a + t2.a = t3.a to t1.c = X1.c, which is equivalent to t1 natural join MagicSet(X1). Then, we use the rule that contains left join for the apply to push down the magic set:

5. Future Work

5.1 Cost-based Subquery Unnesting

For some patterns, there may be more than one subquery unnesting mode, such as the following SQL statements.

SELECT c_custkey
FROM customer
WHERE 1000000 <
      (SELECT SUM(o_totalprice)
       FROM orders
       WHERE o_custkey = c_custkey)

There are two ways to unnest subqueries. The first way is to do the group by before the join.

SELECT c_custkey
FROM customer,
     (SELECT o_custkey, SUM(o_totalprice)
      FROM orders
      GROUP BY o_custkey
      HAVING 1000000 < SUM(o_totalprice)) AS agg_result
WHERE c_custkey = agg_result.o_custkey

The other is to do the join before the group by.

SELECT c_custkey
FROM customer LEFT JOIN orders ON c_custkey = o_custkey
GROUP BY c_custkey
HAVING 1000000 < SUM(o_totalprice);

The performance of these two different algorithms varies greatly under different data volumes. Currently, IMCI always chooses the latter, that is, join before group by for subquery unnesting. As the left join may generate a large amount of data, the efficiency might be low in some cases. Later, IMCI will introduce this optimization into cost-based query optimization, and select the better one from the two algorithms based on the query cost.

5.2 Select Subquery Unnesting as Needed

As we said at the beginning, the motivation for IMCI to directly introduce subquery unnesting is as follows:

At present, IMCI barely uses indexes to execute queries. In this scenario, the processing efficiency of nested loop join-style correlated subqueries is too low to be accepted by customers.

However, if nested loop join also becomes very fast, do we still need to unnest all subqueries? For example:

SELECT * 
FROM customer 
WHERE (SELECT COUNT(*) FROM orders WHERE o_custkey = c_custkey) > 1

If this subquery can use the secondary index established on o_custkey, this nested loop join can actually be completed quickly, and we don't have to go through all the difficulties to eliminate it! In fact, index joins are a special kind of correlated subquery whose query cost is low. When IMCI supports using indexes to accelerate queries in the future, we can directly execute some queries as correlated subqueries, or even construct correlated subqueries to accelerate queries.

5.3 Subquery Unnesting Outside the Relational Algebra Framework

All the queries mentioned above can be expressed by relational algebra, but SQL often contains some operations that cannot be expressed by relational algebra, such as the order by and limit. In the future, IMCI will continue to expand the function of subquery unnesting so that all correlated subqueries can be efficiently executed on IMCI.

References

[1] Mostafa Elhemali, César A. Galindo-Legaria, Torsten Grabs, and Milind M. Joshi. 2007. Execution strategies for SQL subqueries.
[2] Neumann, Thomas; Kemper, Alfons (2015): Unnesting Arbitrary Queries.

Community

About Database Kernel | PolarDB IMCI Streamlines Complex Subqueries

1. Architecture of the IMCI Query Optimizer

2. Role of Correlated Subqueries

3. Introduction to Correlated Subqueries: An Example

4. Algorithm for Subquery Unnesting

4.1 Pushdown Rules

Filters and Projects

Group by

Join

4.2 Rule Running Process

4.3 Some Exceptions

5. Future Work

5.1 Cost-based Subquery Unnesting

5.2 Select Subquery Unnesting as Needed

5.3 Subquery Unnesting Outside the Relational Algebra Framework

References

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL

Database for FinTech Solution