By Yukun Liang (Responsible for the Research and Development of PolarDB In-Memory Column Index Optimizer)
During a database query, the optimizer accepts the query inputs by a user, performs a series of equivalent transformations on the query, and selects the optimal plan from the equivalent execution plans by estimating the cardinality and cost in the query. Since the execution plan has a great impact on performance, the query optimizer is a critical component in all database systems. The following figure shows a typical query optimizer framework:
Generally, a query optimizer uses the following three typical components to complete query optimization:
Among the problems related to query optimization, the most widely studied is the join order of query plans. Since join order can affect query performance, and it depends on the capabilities of the three components above at the same time, it is not too much to say it is the most critical problem in query optimization. This article uses the join reorder of Polar DB MySQL In-Memory Column Index (IMCI) to describe the query optimization process of PolarDB HTAP.
Different database systems use their unique methods to solve join order problems. Although the general framework is shown in the figure above, the specific implementations are different. They can be divided into the following two types:
PostgreSQL and MySQL use a kind of bottom-up optimizer that divides query optimization into the following three steps:
Such a framework is similar to a System-R optimizer. It has the advantage that the search process is intuitive and easy to understand. At the same time, the width of the search is limited by the way of splitting problems, which can ensure that optimization is less time-consuming. However, there are some problems with such a framework.
Different from bottom-up optimization, top-down query optimization adopts the idea of dynamic programming. This kind of optimizer adopts the framework of cascade optimizer. The advantage is that query transformation, cost estimation, and pruning are combined, and the whole query optimization is regarded as a whole. The optimization process is to continuously generate new equivalent execution plan fragments through rules transformation and prune after computing the generated execution plans. This framework overcomes the shortcomings of bottom-up optimization and makes optimizer development and expansion more convenient. In recent years, newly developed analytical database systems have adopted this framework. Please refer to [1] for more details.
PolarDB for MySQL is initially oriented to OLTP scenarios. The core goal of its optimizer is to achieve optimal performance under some simple or fixed-pattern queries. PolarDB for MySQL optimizer performs a large number of optimizations related to its execution model to achieve this (such as executing and eliminating subqueries in the query optimization phase and eliminating order by clauses based on index order). These optimizations have a good effect on query execution but make PolarDB for MySQL optimizer bound to its execution model and storage format.
However, under HTAP load, many complex queries need to be accelerated by row storage + column index (replica) + execution layer optimized for column storage. Under such conditions, a large number of design assumptions in the original PolarDB for MySQL optimizer are broken. At the same time, due to the coupling of the original optimizer with execution models and storage, simple modifications can hardly make the optimizer adapt to query optimization capabilities that HTAP loads. For databases that have to deal with different storage, execution models, and different data models, an optimizer that supports HTAP load scenarios should do the following:
The native optimizer of PolarDB for MySQL has a clear optimization process. The query optimization process is listed below:
1. Apply some rule-based optimization. The rules here usually make plans better. Therefore, cost computing is not involved here. For example:
a) Convert partial outer join to inner join
b) Equivalent Derivation: c1 = 5 and c1 = c2
can be converted to c1 = 5 and c2 = 5
.
c) Partition Pruning: If the query only falls on some partitions, other partitions will not be opened, which can reduce the amount of data scanned.
d) Subquery Elimination: Some subqueries that only return one row will be executed in advance and replaced with the results, which can simplify plans.
2. Cost-Based Join Reorder: As mentioned above, compute cost through the greedy + enumeration algorithm and available indexes.
3. Follow-Up Optimization: Determine the access method of the table and optimize the order by and distinct based on the indexes used
The optimization process is clear and good enough in the row-based iterator execution mode. However, after PolarDB adds a column index, this optimization system exposes the following problems:
PolarDB IMCI adds a new optimization process to the original optimizer framework to solve the performance defects of the native optimizer under HTAP workload. This optimizes the original optimizer framework to meet the requirements of the original row-based OLTP, complex queries, column storage formats, and vectorized parallel execution operators. On the whole, we use rule-based rewriting combined with cost-based query optimization to obtain better execution plans.
In addition to the existing rules of the PolarDB for MySQL optimizer, we have added some new rules. These rules have two main functions: make execution plans meet the requirements of the vectorized executor and optimize the efficiency of the column executor to execute queries. For example:
SELECT COUNT(DISTINCT c1) FROM t;
-- will be converted to
SELECT COUNT(c1) FROM (SELECT c1 FROM t GROUP BY c1);
Through this conversion, the design of the column executor can be simplified, and more functions can be supported with a streamlined implementation.
t1.c2 LIKE '%cat%' AND t1.c1 = 5
, because of the lower execution cost and selection rate of t1.c1 = 5
, executing t1.c1 = 5
first can significantly reduce the heavier LIKE
operations and improve the execution efficiency of queries.This step also includes some rules that should be based on cost evaluation but can make queries better in most cases, such as predicate pushdown. By applying these rules in this step, even though the subsequent optimization and execution layers get standardized input, the advance of predicate pushdown operations reduces the cost-based query optimization space and improves efficiency.
After the rule-based query plan rewriting process, the query plan is optimized through cost-based optimization. In IMCI, we use a cascade-like optimizer framework for query optimization, which can help avoid the local optimization problems that may exist in query optimization. In this framework, we use the three functional modules of the traditional optimizer: plan enumeration, cardinality estimation, and cost model. Here, we take the join order problem as an example to describe the internal details and workflow of each module.
This is the component used by the optimizer to enumerate plans. After a user's SQL statement undergoes a series of steps (such as parse and binding), an initial query plan is generated and input to the optimizer. The optimizer performs various equivalent transformations on the input query plan through rules to generate new equivalent query plans. In IMCI, these rules used to rewrite plans are divided into the following two categories:
In this module, the execution plan will generate a large number of equivalent plans through these rewriting rules, and the best execution plan will be selected through cost comparison. IMCI uses the cascade optimizer framework, which saves common intermediate results through data structures (such as Memo, Group, and GroupExpr), reducing a large number of redundant intermediate results. However, queries containing a large number of joins may still generate a large number of equivalent plans. For a join formed by n tables, there are at least 2 ^ n possible join orders. Without considering the Cartesian product, the relationship between the number of join plans and the enumeration time is shown in the following figure:
In the worst case, the optimization time of a join reorder containing 12 tables exceeds ten seconds. For common star queries, the optimization time exceeds ten seconds for 20 tables. This may be unacceptable for some queries containing a large number of tables. Therefore, when we want to select the best join order, we must cope with the following three problems:
The method of each system is different for the enumeration of join order. PostgreSQL uses an algorithm based on dynamic programming enumeration + heuristic algorithms. When the number of tables is small (<12), PostgreSQL uses a DPsize algorithm based on dynamic programming ideas. When the number of tables is large, the genetic algorithm (GEQO) is used as a heuristic algorithm to compute the join order. Greenplum is the same as the PostgreSQL algorithm when the number of tables is small. When the number of tables is large, it abandons GEQO and uses a greedy algorithm as the heuristic algorithm. MySQL uses an exhaustive + greedy algorithm. Due to the poor efficiency of the exhaustive algorithm, the enumeration depth is controlled by the parameter optimize_search_depth = k
. Each time, MySQL enumerates the k table join with the smallest cost and determines the prefix of the current join according to the result (the greedy idea is used here). This process is repeated until the number of remaining tables is less than k, and the join order of the remaining k tables is determined by the last exhaustive.
In the enumeration phase of IMCI, we mainly solve problems 1 and 2 above. For problem 1, IMCI currently adopts the same DPhyp algorithm as DuckDB and Hyper. Compared with the DPsize algorithm used by GPORCA and PostgreSQL, this algorithm has the main advantage that it is more efficient. DPsize algorithm is an enum-and-test algorithm, which can generate a large number of invalid plans during enumeration. In contrast, the DPhyp algorithm constructs a graph according to join conditions and enumerates plans according to the graph. This decreases invalid enumerations and improves. For different kinds of joins of n tables, the complexity of the algorithm used for join reorder in each system is listed below:
As shown in the table, the complexity of the DPhyp algorithm is optimal among the algorithms that do not limit the enumeration plan shape (MySQL only generates the left deep tree). The following figure shows the efficiency comparison between DPhyp and DPsize enumeration:
For problem 2, currently, we retain the join order generated by the row-based optimizer (which means the join plans of its own are regarded as the product of the heuristic algorithm). In the case of a large number of tables, IMCI only enumerates some join plans. On this basis, we add the join order generated by the native optimizer (equivalent to partially extending the join plans through the DP algorithm) and select the best plan from them. In addition, by incorporating the DPhyp algorithm into the cascade optimizer, we use the cascade optimizer framework to avoid local optimization that may occur due to separate consideration of joins.
In the enumeration section above, the query optimizer enumerates a large number of plans. The next step is to compute the cost of each equivalent query plan and select the best execution plan from it. The existing optimizers usually follow the following two steps to achieve this:
Cardinality computing depends on two modules: the collection and computing of statistics and the logic of computing the number of input and output rows.
Collection of Statistics
IMCI has developed a statistics module for the new optimizer to optimize queries in the absence of sufficient secondary indexes. IMCI estimates the number of output/output rows of operators by collecting the following information from the table.
COUNT(DISTINCT col)
)Construction of Statistics
In order to collect these statistics, we compute the number of rows to be sampled based on the amount of data in the table. The number of rows to be sampled is determined by the following formula given in [3]:
Where n is the size of the table, k is the number of buckets in the histogram, the confidence interval of relative error is f, and the confidence level is γ. After the optimizer computes the number of sampled rows based on the appropriate constants, we build a histogram using the input rows to compute the proportion of NULL values. Then, the optimizer computes the cardinality of the column based on the sampled data, which is related to the sampling method. There are many formulas available for the estimation of the cardinality on the column, each of which makes different assumptions based on the data. For example, in the following formulas, we make D as the estimated result, n as the number of sampled rows, q as the number of sample rates, fi as the number of values that happen to occur i times in the sample, and d as the cardinality of the sample.
In IMCI, we compute the sample rate based on the number of sampled rows and select the appropriate computing formula to compute the number of distinct values based on the size of the sample rate, keeping the error as minimal as possible.
Another problem is that these formulas usually require uniform random sampling. However, in column storage, data is usually not stored in pages. In IMCI, 64K rows are usually compressed and stored together. Then, to read a row, we need to read the entire data block corresponding to the row from the disk and decompress it. The read amplification problem is serious.
In order to solve this problem, we choose to sample according to data blocks to reduce read amplification and use the COLLAPSE algorithm mentioned in [4] to correct the computing formula of distinct values. There is a general idea. If multiple identical values are appearing in the same block, they would be computed as appearing only once because the estimation formula above largely depends on f1. This approach corrects the estimation error based on uniform random sampling by reducing the impact of data locality.
How to Make Number Estimation
The only operators that need to use statistical information to estimate the number of rows are filter, join, and groupby. The output of other operators is usually only related to the number of input rows (such as sorting). Through histograms, we can effectively estimate the selection rate of predicates like a > 10 AND a < 100
, and we can compute the selection rate of equal join and range join through histogram join. The cardinality on the column can help us estimate the number of groups output by group by and assist us in estimating the selection rate of equivalent predicates and equal join when some histograms are unavailable, such asSelectivity(a = 1) = 1 / COUNT(DISTINCT a)
. Unique key and foreign key can help us correct the results when computing equal join, such as the following SQL:
SELECT COUNT(*)
FROM t1, t2
WHERE t1.a = t2.a;
When t1.a is the unique key on t1 and t1.a is the foreign key of t2.a, we can get additional information: for each row of t2, there must be a row in t1 that matches. Therefore, we can accurately get the number of output rows of this join and the size of the t2 table.
Selection Rate Computing of Multi-Predicate Combination
In PostgreSQL, if a predicate is composed of conditions of multiple columns:
-PAY TIME for Beijing Time
SELECT COUNT(*)
FROM PAY_INFO
WHERE NATION = 'CHINA' AND
PAY_TIME BETWEEN '2008-12-20 18:00:00' AND '2008-12-20 21:00:00'
Let’s assume the table stores the payment data of users in China, Germany, and the United States, where the payment data of the three countries account for 1/3, and the transaction time is evenly distributed within 0-24 o'clock. According to PostgreSQL's algorithm, the selection rate of this predicate is SEL(a and b) = SEL(a) * SEL(b) = 1/3 * 1/4 = 1 / 12
, but there is a problem that needs to be considered: the different countries' consumption from different people is often in a specific period. As it happens to be the closing time after 6 PM in China, the selection rate of this query should only be slightly less than 1/3. This example illustrates that there is likely to be an association between the data in different columns in real-world data. If estimated according to mutually independent assumptions, the results are likely to be different from the actual results.
In IMCI, we assume data columns are associated. When we compute the selection rate between multiple predicates, we use the exponential backoff algorithm. We sort the selection rate of each predicate and then compute the selection rate according to the following formula:
This algorithm does not affect benchmark-generated datasets (such as TPCH) but can effectively reduce estimation errors in datasets based on real-world data.
We have tested the performance of IMCI on/off query optimization on the TPCH 1TB dataset, as shown in the following figure:
It can be seen that for multi-table joins (such as Q8 and Q9), after IMCI query optimization is enabled, the execution efficiency is significantly increased because a better query plan is selected. In the workload of user scenarios, the query effect of enabling and disabling the query optimization function is compared, as shown in the following figure:
Before query optimization is enabled:
If query optimization is not enabled, the join is performed based on
It can be seen that the join (orange part) of tables d and e in the figure process a large amount of data, and more than 60% of the whole query is used to process these large table joins.
After query optimization is enabled:
The join order becomes:
The large table join of d and e is eliminated, and the execution time is reduced by more than 50%.
In the process of solving join reorder, the join order obtained from the row-based optimizer is currently regarded as the result of the heuristic algorithm. As mentioned above, the execution plan may not be suitable for column execution models, so the subsequent IMCI will improve the heuristic algorithm when there are too many tables. The plan is to use the search space linearization technique in 2 to reduce search space and improve the possibility of finding a better plan on the premise of ensuring the quality of the plan.
On the other hand, IMCI doesn't have many rewrite rules, which are mainly used to deal with join reorder and a small number of rewrites for groupby. In the future, IMCI will introduce more rewrite rules for execution plans to expand search space and increase the possibility of finding better execution plans.
IMCI has a limited ability to use row-based indexes. After the executor is combined with the determinant executor to execute SQL through hybrid execution plans, the optimizer needs to implement a large number of new functions, such as:
At present, the statistical information of IMCI is mainly constructed based on sampling. Since the sampling ratio is more or less an error, we can consider drawing on the idea of stream computing to increase the accuracy of statistical information. When data is inserted, a sufficiently lightweight stream system is used to compute statistical information.
As such, it can be considered that secondary indexes in 5.2 are a heavy stream system, and the commonly used HyperLogLog is a good example of a light stream system. Umbra database uses HyperLogLog to maintain the number of distinct values in each column in real-time. HyperLogLog (a data structure that supports streaming approximate computing) is collectively called Sketch. Currently, there are many related studies. For example, we can find many encapsulated sketches in Apache datasketches, some of which can be used to replace histograms for range query estimation and distinct value estimation.
[1] G. Graefe et al., “The Cascades Framework for Query Optimization," IEEE Data Eng. Bull., vol. 18, no. 3, pp. 337-348, 1995.
[2] Thomas Neumann and Bernhard Radke. 2018. Adaptive Optimization of Very Large Join Queries. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 677-692. https://doi.org/10.1145/3183713.3183733
[3] Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. 1998. Random sampling for histogram construction: how much is enough?SIGMOD Rec. 27, 2 (June 1998), 436-447. https://doi.org/10.1145/276305.276343
[4] Surajit Chaudhuri, Gautam Das, and Utkarsh Srivastava. 2004. Effective use of block-level sampling in statistics estimation. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data (SIGMOD '04). Association for Computing Machinery, New York, NY, USA, 287-298. https://doi.org/10.1145/1007568.1007602
[5] PostgreSQL Optimizer Methodology https://www.youtube.com/watch?v=XA3SBgcZwtE
[6] Apache datasketches https://datasketches.apache.org/
About Database Kernel | What Happens to Distributed Databases When Two Machines Are Hung Up?
ApsaraDB - August 4, 2023
ApsaraDB - August 8, 2023
ApsaraDB - June 7, 2022
ApsaraDB - November 21, 2023
ApsaraDB - January 6, 2023
ApsaraDB - April 27, 2023
Alibaba Cloud PolarDB is a cloud-native relational database service that decouples computing resources from storage resources
Learn MoreDesigned to address database challenges such as ultra-high concurrency, massive data storage, and large table performance bottlenecks.
Learn MoreA cloud-native database management platform that allows you to manage on-premises databases in the same way as in Alibaba Cloud.
Learn MoreLeverage cloud-native database solutions dedicated for FinTech.
Learn MoreMore Posts by ApsaraDB