By Chen Liang (Yaoling)
The database query optimizer is like the brain of the entire system. The efficiency of executing an SQL statement can vary significantly based on different optimization decisions. Therefore, the optimizer is one of the core components and competitive advantages of a database system. Alibaba Cloud PolarDB for MySQL is a cloud-native database that can handles various user scenarios and loads, helping enterprises go online and increase their data value. To enhance its capabilities, we have focused on improving the query transformation functionality of PolarDB for MySQL.
This article series will focus on the query transformation capability of PolarDB for MySQL and introduce some of our accumulated work in this direction.
The concept of query transformation is simple: transforming one form of query into another equivalent but more efficient form based on the equivalent transformation rules of relational algebra. This transformation ensures the correctness of query results and improves execution efficiency.
The optimizer can perform many transformations, with hundreds of common rules if each transformation is considered a rewrite rule. Some of these transformations, known as heuristic transformations, always make the query more efficient, while others depend on cost.
This article will describe a heuristic query transformation implemented by PolarDB called join elimination.
This feature is available in PolarDB for MySQL 8.0.2.2.9.
Join is the most common and time-consuming operator in SQL statements. It splices relations (tables) based on specified connection columns in the join condition and outputs them to the upper operator. Join has a complexity of M * N, higher than scan or aggregation operations. Therefore, implementing a good join algorithm and determining the best execution order are core challenges for every database system.
Can we eliminate unnecessary join operations from the beginning based on specific semantics in SQL queries? For example, consider the following simple case:
create table t1 (id int, PRIMARY KEY(id));
create table t2 (id int, t1_id int,
constraint `t1_id_fk` foreign key (`t1_id`) references `t1` (`id`)
);
select t2.id from t2 join t1 on t2.t1_id = t1.id;
In this case, t2's t1_id column is the foreign key of t1's id column. Hence, for each row in t2, there is only one corresponding row in t1 that can be joined. Additionally, the query doesn't need the data from the t1 table. Therefore, the query can be simplified as follows:
select t2.id from t2;
This avoids accessing the t1 table and performing join operations on a large number of rows, resulting in significant performance improvement.
The optimizer of PolarDB, which is based on MySQL, initially lacked join elimination capability. During online operations, some customers noticed a significant decrease in query performance after migrating from MariaDB. Upon investigation, it was discovered that MariaDB had a join elimination capability. By changing the customer's query from three tables to a single table using left join, the execution time was greatly reduced.
To address this issue, we studied the implementation of join elimination in MariaDB. While the principle was insightful, its implementation only covered limited scenarios. It lacked sufficient support for simple cases and complex nested scenarios (semi-join). As a result, we developed our own implementation based on MySQL 8.0 codebase.
The basic idea is not complicated. Consider the following join.
outer_table LEFT JOIN (inner_table) ON condition(outer_table,inner_table)
Since it is a left join, the data of the outer table will not be lost. To eliminate the left join safely, two conditions must be satisfied:
How can we ensure this uniqueness? The first idea is to use a unique or primary key index. If the join column of the inner table is the unique index column, it naturally guarantees the output of a single row.
However, real queries are not always so simple. Let's consider the following scenarios:
It seems that simple join elimination becomes complex. But by decomposing it gradually and solving each subproblem, we can determine the eliminability:
By following the above ideas and addressing each subproblem, we can determine the eliminability of various complex scenarios.
Look at the effects of the above questions:
create table t1 (a int);
create table t2 (a int primary key, b int);
create table t3 (a int primary key, b int);
1. How many tables does the inner table contain?
select t1.a from t1 left join (t2 join t3) on t2.a=t1.a and t3.a=t1.a;
=>
select t1.a from t1;
2. What if the inner table contains a new left join?
select t1.* from t1 left join (t2 left join t3 on t3.a=t2.b) on t2.a=t1.a;
=>
select t1.* from t1;
3. Is the inner table itself a derived table?
select t1.* from t1 left join
(
select t2.b as v2b, count(*) as v2c
from t2 left join t3 on t3.a=t2.b
group by t2.b
) v2
on v2.v2b=t1.a;
=>
select t1.* from t1;
There are many other scenarios, but as long as you follow the several judgment principles mentioned above, you can eliminate them one by one.
PostgreSQL has always been proud of its computing capabilities. Its statistical information and cost model are excellent. Coupled with a complete join ordering algorithm, it can generate high-quality execution plans.
However, when it comes to query transformation, PostgreSQL's performance is relatively mediocre. It also implements the left join elimination function, which follows a similar basic idea as PolarDB - determining based on the uniqueness of the inner table. However, the scenarios it supports are much simpler:
This severely limits the scope of the join elimination application, but the good thing is that when the inner side is a subquery (derived table), it not only supports the uniqueness check based on the group by column, but also has corresponding judgment mechanisms for distinct and set operations. For specific implementation, see:
query_is_distinct_for(Query *query, List *colnos, List *opids)
This function has a simple idea and implementation.
We also compared with MariaDB in various scenarios and found that MariaDB's strategy is rough or unreasonable in supporting some scenarios:
In MariaDB, if a subquery is in the ON condition of left join, it is directly prevented from being transformed to semi-join. The purpose is to keep it as a predicate condition, thus avoiding complicated processing of semi-join in the join elimination logic. However, there is an obvious problem: if this left join itself cannot be eliminated, wouldn't semi-join be useless?
The preceding schema of t0/t1/t2/t3 is still used:
EXPLAIN SELECT count(*)
FROM t1
LEFT JOIN (t2
LEFT JOIN t3
ON t2.b = t3.b
AND EXISTS (
SELECT t0.a
FROM t0
WHERE t0.a = t3.b
)
) ON t1.a = t2.a;
Here, due to the join condition t2.b = t3.b and t3.b is not the only key, this query cannot eliminate the join. MariaDB's execution plan is as follows:
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
| 1 | PRIMARY | t1 | ALL | NULL | NULL | NULL | NULL | 4 | |
| 1 | PRIMARY | t2 | eq_ref | PRIMARY | PRIMARY | 4 | test.t1.a | 1 | Using where |
| 1 | PRIMARY | t3 | ALL | NULL | NULL | NULL | NULL | 2 | Using where |
| 2 | DEPENDENT SUBQUERY | t0 | ALL | NULL | NULL | NULL | NULL | 4 | Using where |
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
You can see that the related subquery of t0 is executed in the most primitive way. If the t0 table has a large amount of data, the performance will be poor, but PolarDB still support it:
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-------------+------------+--------+---------------------+---------------------+---------+--------------+------+----------+-------------+
| 1 | SIMPLE | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 4 | 100.00 | NULL |
| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 2 | 100.00 | Using where |
| 1 | SIMPLE | t3 | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | Using where |
| 1 | SIMPLE | <subquery2> | NULL | eq_ref | <auto_distinct_key> | <auto_distinct_key> | 5 | je_test.t3.b | 1 | 100.00 | Using where |
| 2 | MATERIALIZED | t0 | NULL | ALL | NULL | NULL | NULL | NULL | 4 | 100.00 | NULL |
+----+--------------+-------------+------------+--------+------------------
T0 is still implemented by semi-join MATERIALIZATION, and the efficiency is much higher.
Even in some simple scenarios, MariaDB's treatment is incomplete:
EXPLAIN SELECT count(*)
FROM t1
LEFT JOIN t2 ON t1.a = t2.a
LEFT JOIN t3 ON t2.b = t3.a;
Limited by algorithms and implementation, MariaDB's effects are as follows:
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
| 1 | SIMPLE | t1 | ALL | NULL | NULL | NULL | NULL | 4 | |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | test.t1.a | 1 | Using where |
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------------+
In such a simple query, it is obvious that t2 and t3 can be eliminated, but MariaDB can only handle join elimination of t3.
PolarDB is implemented more thoroughly:
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| 1 | SIMPLE | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 4 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
The join elimination is a heuristic transformation that must generate benefits. According to the amount of data in the table and the access methods, the performance differences may vary greatly, but in general, the performance will be greatly improved. Here is a comparison between the actual queries before and after the elimination of an online customer:
SELECT count(*)
FROM `shop_customer` `sc`
LEFT JOIN `car` `ca` ON `sc`.`car_id` = `ca`.`id`
LEFT JOIN `company` `co` ON `sc`.`company_id` = `co`.`id`;
=>
SELECT count(*) FROM `shop_customer` `sc`;
Obviously, the final query can be transformed into a single-table query. Prior to elimination, the execution time was 7.5 seconds, but after elimination, it reduced to 0.1 seconds, resulting in a 75-fold improvement.
There are numerous examples where the impact is even more significant, especially when the inner layer contains multiple nested tables. Generally speaking, joins often incur substantial overhead, and eliminating them can significantly reduce this overhead.
Currently, we are continuously adding more advanced query transformation capabilities to MySQL. This is because the native support for transformations in MySQL is quite limited. It is a gradual process that requires accumulating materials from online customer scenarios and actual needs.
Throughout this process, we have encountered three fundamental problems:
To address these three fundamental problems, the team is diligently working on refactoring MySQL's query transformation process. This is a highly important and complex task, which includes:
Free Tier & Special Offers of Alibaba Cloud Database Products: https://www.alibabacloud.com/product/databases
About Database Kernel | PolarDB Optimizer Query Transformation - Window Functions
About Database Kernel | PolarDB Optimizer Query Transformation - Join Condition Pushdown
ApsaraDB - October 20, 2023
ApsaraDB - October 27, 2023
ApsaraDB - October 30, 2023
ApsaraDB - August 7, 2023
ApsaraDB - July 11, 2023
OceanBase - August 25, 2022
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAlibaba Cloud PolarDB for Xscale (PolarDB-X) is a cloud-native high-performance distributed database service independently developed by Alibaba Cloud.
Learn MoreAlibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreLeverage cloud-native database solutions dedicated for FinTech.
Learn MoreMore Posts by ApsaraDB