About Database Kernel | PolarDB Optimizer Query Transformation - Join Elimination

By Chen Liang (Yaoling)

Introduction

The database query optimizer is like the brain of the entire system. The efficiency of executing an SQL statement can vary significantly based on different optimization decisions. Therefore, the optimizer is one of the core components and competitive advantages of a database system. Alibaba Cloud PolarDB for MySQL is a cloud-native database that can handles various user scenarios and loads, helping enterprises go online and increase their data value. To enhance its capabilities, we have focused on improving the query transformation functionality of PolarDB for MySQL.

This article series will focus on the query transformation capability of PolarDB for MySQL and introduce some of our accumulated work in this direction.

PolarDB Optimizer Query Transformation - Window Functions

The Concept of Query Optimization

The concept of query transformation is simple: transforming one form of query into another equivalent but more efficient form based on the equivalent transformation rules of relational algebra. This transformation ensures the correctness of query results and improves execution efficiency.

The optimizer can perform many transformations, with hundreds of common rules if each transformation is considered a rewrite rule. Some of these transformations, known as heuristic transformations, always make the query more efficient, while others depend on cost.

This article will describe a heuristic query transformation implemented by PolarDB called join elimination.

This feature is available in PolarDB for MySQL 8.0.2.2.9.

Join Elimination

Join is the most common and time-consuming operator in SQL statements. It splices relations (tables) based on specified connection columns in the join condition and outputs them to the upper operator. Join has a complexity of M * N, higher than scan or aggregation operations. Therefore, implementing a good join algorithm and determining the best execution order are core challenges for every database system.

Can we eliminate unnecessary join operations from the beginning based on specific semantics in SQL queries? For example, consider the following simple case:

create table t1 (id int, PRIMARY KEY(id));
create table t2 (id int, t1_id int, 
                 constraint `t1_id_fk` foreign key (`t1_id`) references `t1` (`id`)
                 );

select t2.id from t2 join t1 on t2.t1_id = t1.id;

In this case, t2's t1_id column is the foreign key of t1's id column. Hence, for each row in t2, there is only one corresponding row in t1 that can be joined. Additionally, the query doesn't need the data from the t1 table. Therefore, the query can be simplified as follows:

select t2.id from t2;

This avoids accessing the t1 table and performing join operations on a large number of rows, resulting in significant performance improvement.

PolarDB Implementation

The optimizer of PolarDB, which is based on MySQL, initially lacked join elimination capability. During online operations, some customers noticed a significant decrease in query performance after migrating from MariaDB. Upon investigation, it was discovered that MariaDB had a join elimination capability. By changing the customer's query from three tables to a single table using left join, the execution time was greatly reduced.

To address this issue, we studied the implementation of join elimination in MariaDB. While the principle was insightful, its implementation only covered limited scenarios. It lacked sufficient support for simple cases and complex nested scenarios (semi-join). As a result, we developed our own implementation based on MySQL 8.0 codebase.

Principles

The basic idea is not complicated. Consider the following join.

outer_table LEFT JOIN (inner_table) ON condition(outer_table,inner_table)

Since it is a left join, the data of the outer table will not be lost. To eliminate the left join safely, two conditions must be satisfied:

For any row in the outer table, there is only one matching row in the inner table based on the join condition.
Except for left join operations, there is no need to reference the data in the inner table.

How can we ensure this uniqueness? The first idea is to use a unique or primary key index. If the join column of the inner table is the unique index column, it naturally guarantees the output of a single row.

However, real queries are not always so simple. Let's consider the following scenarios:

What if the unique index contains multiple columns?
How many tables does the inner table contain?
What if the inner table contains a new left join?
Is the inner table itself a derived table?

It seems that simple join elimination becomes complex. But by decomposing it gradually and solving each subproblem, we can determine the eliminability:

The entire left join can be eliminated only when all tables are guaranteed to output a unique row inside a left join (a single table or multiple tables).
For each single table on the inner side, if all columns involved in the join condition are a superset of a unique index, the table can be guaranteed to output a row.
If the inner side contains left join again, you must first recurse deeply to determine whether the inner left join can be eliminated, and then return to examine the outer layer.
If the inner layer contains the derived table and the derived table contains the group by, the group by column is the primary key column of the derived table from the perspective of the outer layer.

By following the above ideas and addressing each subproblem, we can determine the eliminability of various complex scenarios.

Look at the effects of the above questions:

create table t1 (a int);
create table t2 (a int primary key, b int);
create table t3 (a int primary key, b int);

1. How many tables does the inner table contain? 
select t1.a from t1 left join (t2 join t3) on t2.a=t1.a and t3.a=t1.a;
=>
select t1.a from t1;

2. What if the inner table contains a new left join? 
select t1.* from t1 left join (t2 left join t3 on t3.a=t2.b) on t2.a=t1.a;
=>
select t1.* from t1;

3. Is the inner table itself a derived table?
select t1.* from t1 left join
(
  select t2.b as v2b, count(*) as v2c
  from t2 left join t3 on t3.a=t2.b
  group by t2.b
) v2
on v2.v2b=t1.a;
=>
select t1.* from t1;

There are many other scenarios, but as long as you follow the several judgment principles mentioned above, you can eliminate them one by one.

Comparison With PostgreSQL

PostgreSQL has always been proud of its computing capabilities. Its statistical information and cost model are excellent. Coupled with a complete join ordering algorithm, it can generate high-quality execution plans.

However, when it comes to query transformation, PostgreSQL's performance is relatively mediocre. It also implements the left join elimination function, which follows a similar basic idea as PolarDB - determining based on the uniqueness of the inner table. However, the scenarios it supports are much simpler:

It only supports a base relation or a subquery on the inner side of the left join.

This severely limits the scope of the join elimination application, but the good thing is that when the inner side is a subquery (derived table), it not only supports the uniqueness check based on the group by column, but also has corresponding judgment mechanisms for distinct and set operations. For specific implementation, see:

query_is_distinct_for(Query *query, List *colnos, List *opids)

This function has a simple idea and implementation.

Comparison With MariaDB

We also compared with MariaDB in various scenarios and found that MariaDB's strategy is rough or unreasonable in supporting some scenarios:

The semi-join treatment

In MariaDB, if a subquery is in the ON condition of left join, it is directly prevented from being transformed to semi-join. The purpose is to keep it as a predicate condition, thus avoiding complicated processing of semi-join in the join elimination logic. However, there is an obvious problem: if this left join itself cannot be eliminated, wouldn't semi-join be useless?

The preceding schema of t0/t1/t2/t3 is still used:

EXPLAIN SELECT count(*)
FROM t1
  LEFT JOIN (t2
    LEFT JOIN t3
    ON t2.b = t3.b
      AND EXISTS (
        SELECT t0.a
        FROM t0
        WHERE t0.a = t3.b
      )
) ON t1.a = t2.a;

Here, due to the join condition t2.b = t3.b and t3.b is not the only key, this query cannot eliminate the join. MariaDB's execution plan is as follows:

+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
| id   | select_type        | table | type   | possible_keys | key     | key_len | ref       | rows | Extra       |
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
|    1 | PRIMARY            | t1    | ALL    | NULL          | NULL    | NULL    | NULL      | 4    |             |
|    1 | PRIMARY            | t2    | eq_ref | PRIMARY       | PRIMARY | 4       | test.t1.a | 1    | Using where |
|    1 | PRIMARY            | t3    | ALL    | NULL          | NULL    | NULL    | NULL      | 2    | Using where |
|    2 | DEPENDENT SUBQUERY | t0    | ALL    | NULL          | NULL    | NULL    | NULL      | 4    | Using where |
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------------+

You can see that the related subquery of t0 is executed in the most primitive way. If the t0 table has a large amount of data, the performance will be poor, but PolarDB still support it:

+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
| id | select_type  | table       | partitions | type   | possible_keys       | key                 | key_len | ref          | rows | filtered | Extra       |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
|  1 | SIMPLE       | t1          | NULL       | ALL    | NULL                | NULL                | NULL    | NULL         |    4 |   100.00 | NULL        |
|  1 | SIMPLE       | t2          | NULL       | ALL    | PRIMARY             | NULL                | NULL    | NULL         |    2 |   100.00 | Using where |
|  1 | SIMPLE       | t3          | NULL       | ALL    | NULL                | NULL                | NULL    | NULL         |    2 |   100.00 | Using where |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 5       | je_test.t3.b |    1 |   100.00 | Using where |
|  2 | MATERIALIZED | t0          | NULL       | ALL    | NULL                | NULL                | NULL    | NULL         |    4 |   100.00 | NULL        |
+----+--------------+-------------+------------+--------+------------------

T0 is still implemented by semi-join MATERIALIZATION, and the efficiency is much higher.

The sequence left join treatment

Even in some simple scenarios, MariaDB's treatment is incomplete:

EXPLAIN SELECT count(*)
FROM t1
  LEFT JOIN t2 ON t1.a = t2.a
LEFT JOIN t3 ON t2.b = t3.a;

Limited by algorithms and implementation, MariaDB's effects are as follows:

+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
| id   | select_type | table | type   | possible_keys | key     | key_len | ref       | rows | Extra       |
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
|    1 | SIMPLE      | t1    | ALL    | NULL          | NULL    | NULL    | NULL      | 4    |             |
|    1 | SIMPLE      | t2    | eq_ref | PRIMARY       | PRIMARY | 4       | test.t1.a | 1    | Using where |
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------------+

In such a simple query, it is obvious that t2 and t3 can be eliminated, but MariaDB can only handle join elimination of t3.

PolarDB is implemented more thoroughly:

+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |   100.00 | NULL  |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+

Performance Improvement

The join elimination is a heuristic transformation that must generate benefits. According to the amount of data in the table and the access methods, the performance differences may vary greatly, but in general, the performance will be greatly improved. Here is a comparison between the actual queries before and after the elimination of an online customer:

SELECT count(*)
FROM `shop_customer` `sc`
  LEFT JOIN `car` `ca` ON `sc`.`car_id` = `ca`.`id`
  LEFT JOIN `company` `co` ON `sc`.`company_id` = `co`.`id`;

=>
SELECT count(*) FROM `shop_customer` `sc`;

Obviously, the final query can be transformed into a single-table query. Prior to elimination, the execution time was 7.5 seconds, but after elimination, it reduced to 0.1 seconds, resulting in a 75-fold improvement.

There are numerous examples where the impact is even more significant, especially when the inner layer contains multiple nested tables. Generally speaking, joins often incur substantial overhead, and eliminating them can significantly reduce this overhead.

Summary

Currently, we are continuously adding more advanced query transformation capabilities to MySQL. This is because the native support for transformations in MySQL is quite limited. It is a gradual process that requires accumulating materials from online customer scenarios and actual needs.

Throughout this process, we have encountered three fundamental problems:

The original transformations in MySQL were added in a somewhat arbitrary manner. The coupling between MySQL and the original processing flow makes it challenging to add new transformations.
The timing of transformation execution is not reasonable. There is a certain sequence dependency between different transformations, but this sequence is not effectively utilized.
Some transformations may not necessarily bring benefits, and decisions need to be made based on statistics and costs. Unfortunately, MySQL does not support such decision-making. Transformation is transformation, regardless of its effectiveness.

To address these three fundamental problems, the team is diligently working on refactoring MySQL's query transformation process. This is a highly important and complex task, which includes:

Decoupling the original resolve and transform processes, treating transformation as a separate step. Each transformation follows a "rule" unit and is completed independently. This allows for continuous expansion by adding new rules.
Utilizing a rule-based abstraction and an enumeration framework to determine the order of different transformations (rules). Through iterative iterations, the interdependent rules can be executed in a more optimal order.
Implementing the reentrant and reusable capability of the CBO module, which enables decisions on whether to perform a transformation based on its cost.

Free Tier & Special Offers of Alibaba Cloud Database Products: https://www.alibabacloud.com/product/databases

Community

About Database Kernel | PolarDB Optimizer Query Transformation - Join Elimination

Introduction

The Concept of Query Optimization

Join Elimination

PolarDB Implementation

Principles

Comparison With PostgreSQL

Comparison With MariaDB

Performance Improvement

Summary

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL

Database for FinTech Solution