PolarDB-X Optimizer Core Technology: Execution Plan Management

This article discusses the importance of SQL Plan Management in maintaining stable database performance by optimizing the execution plan and managing multiple plans for a single SQL query.

Background

The execution plan in the database essentially describes the process of completing a request by consuming a certain amount of resources. The task of the optimizer is to build an optimization model based on various optimization rules, and then estimate the execution plan with the lowest consumption through statistical information.

The quality of the execution plan will directly affect the performance of the database, so to ensure the stability of the execution plan is to ensure the stability of the database performance.

SQL Plan Management is to save one or more execution plans for each SQL query and select only one execution plan from this known set of execution plans when executing the query. The case where there is only one execution plan (for example, the user manually specifies one plan) is trivial, and this article only discusses the case where one SQL query corresponds to multiple execution plans.

In modern database systems, there are more optimization rules, more complex optimizer structures, and more CPU resources are consumed to generate an execution plan. Therefore, the caching of execution plans is one of the key points to improve performance.

In complex scenarios, even if the original execution plan is reused, the execution performance will be severely degraded due to parameter changes.

In summary, the optimizer's execution plan management must strike a balance between the following two objectives:

• Minimize the number of optimizations

• The cost of the actual execution plan is as close as possible to that of the optimized execution plan

SPM at First Glance

Before we expand on the technical principles, let's take a quick look at what an SPM is with a simple example. For the following query:

SELECT COUNT(*)
FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey
where l_shipdate > '1995-03-31'
  and l_shipdate < '1996-04-15'
and p_type = 'STANDARD ANODIZED STEEL'

Get the SQL template after parameterization:

SELECT COUNT(*)
FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey
where l_shipdate > ?
  and l_shipdate < ?
  and p_type = ?

The equivalence conditions of l_shipdate and p_type are different, so it is difficult to accurately determine the relationship between the amount of data on different sides of the JOIN. For example:

• When the amount of data is large, you should use HashJoin and use the side with the larger amount of data as the probe.

• If the amount of data on the lineitem side is small, you should use LookupJoin (BKAJoin) to look up the part table.

If all three situations occur in the system, the SPM generates three plans.

When we execute a specific query, SPM will automatically select the best one to execute (Pay attention to the Source: SPM_PQO in the figure)

Optimize Once and Optimize Always

The parameterized SQL statement is used as the KEY and the execution plan is cached as the Value. The execution plan obtained by the parameterized SQL statement is based on the first time. This is called Optimize Once.

The advantage of Optimize Once is that it ensures the absolute stability of the execution plan. The disadvantage is that the execution plan cannot be adjusted in time according to subsequent parameters and data changes.

If no cache is used at all, and each request generates an execution plan based on the current parameter list, it is called Optimize Always.

Optimize Always generates a new execution plan for each new request, so it is sensitive to changes in parameters and data and can adjust a better execution plan in time. However, the disadvantage is the inevitable overhead of execution plan generation and lower stability.

Choose Plan By Cost

Comparison with Optimize Once/Always

If Optimize Once/Always is too extreme, then Choose Plan By Cost is making a compromise. Choose Plan By Cost will cache the execution plan, but for each parameterized SQL statement, multiple execution plans will be cached. When obtaining the execution plan, select the one with the lowest cost, and the cost will be calculated based on the parameters and data status of the current request.

Take a very common SQL connection between two tables as an example. T1 table and T2 table are connected by NAME, and the two tables have their own expressions (EXPR) for filtering. The expressions of the two tables have different filtering properties due to the different parameters of each request, and the amount of data involved in the join of the two tables is also different. We first ignore the order of filtering data and connection, and simplify it to a problem of conditional filtering first and then connection. Even so, there are two options for connection algorithms in general databases: hash join and lookup join.

SELECT * FROM T1 JOIN T2 ON T1.NAME=T2.NAME WHERE T1.EXPR1 AND T2.EXPR2

Hash join is divided into build side and probe side, and the cost is basically equal to R + S, which is generally suitable for queries of large data volume.

The lookup join algorithm is generally indexed on the lookup side. Assuming that the index height is H, the cost is approximately equal to R + R*H. When the amount of data on the drive side is small, the lookup performance is optimal.

When the amount of data filtered out by T1 or T2 is small on one side and the table on the other side has indexes, the lookup algorithm is preferable. When the amount of data filtered out by T1 and T2 tables is large, the hash join algorithm is preferable. However, the amount of data filtered by the two tables changes with the different parameters in the SQL expression T1.EXPR1 AND T2.EXPR2. If the Optimize Once strategy is adopted, it can encounter a large increase in RT due to a certain request parameter change.

In real practice, CBO also needs to determine the scan plan of each table, the order of joins, etc. If the Optimize Always policy is adopted, each request has to go through the optimizer, resulting in unnecessary optimization overhead.

Specific Ideas

If only one execution plan is not enough to deal with all situations, cache more. The most basic idea of Choose Plan By Cost is to change the 1: 1 cache relationship in Optimize Once to 1: n. When multiple execution plans are selected, the cost of all execution plans is calculated based on the parameters of the current request, and the minimum cost is selected. The cost of cost calculation is negligible compared to the cost of entering the optimizer once again.

Compared with Optimize Once, the most important thing about Choose Plan By Cost is adding the evolution logic to cache multiple execution plans for the same stmt. Taking Oracle's SPM as an example, it is mainly divided into three processes: capture, selection, and evolution of execution plans:

When receiving the initial request, Oracle will mark this SQL statement as tracked. When receiving the marked request again, it will generate a baseline for it.

Baseline: a collection of all execution plans for a parameterized SQL statement

The above is the plan selection process. The key point is that when a new plan enters the baseline, it will be in the unaccepted state first, and SPM will preferentially select the plan in the accepted state.

Plans in the unaccepted state will eventually change to the accepted state or be discarded in the evolution process.

At present, many databases use this method to manage execution plans, such as Postgre/Oracle 11g.

Parametric Query Optimize (PQO)

The distribution of data is different, so sometimes even if the original execution plan is directly reused in a query request, the execution performance will be severely degraded due to parameter changes. For example:

SELECT * FROM FRESHMEN WHERE AGE = ?X OR AGE = ?Y

If USER has a non-aggregate index of age, there are two AccesssPaths of this query to choose from: directly scan the primary table, or scan the index and query back to the primary table. At this time, the optimizer will estimate the amount of data scanned according to the filtering performance of the age condition. If the amount is less, the index can be used. If the amount reaches a certain threshold, the optimizer will use the primary table directly.

These are two completely different plans:

Then the cache of the execution plan will have the following two problems:

• If the parameters are not cached, subsequent parameters are not guaranteed to have similar filtering characteristics as when the execution plan was generated at the beginning. There will be a loss in performance.

• If the parameters are cached, the execution plan can only take effect for this SQL statement, which is too rigid.

To avoid the above two problems, the third way is proposed, which classifies the parameter features and selects the same execution plan for the parameters with similar features. That is, query optimization based on parameter space: Parametric Query Optimize.

So how to choose features? You know that there are various types of parameters, and conditions will contain various combinations of expressions. In complex scenarios, feature extraction based only on the value of the parameter itself is ineffective.

Let's look back at the example just now. The key point in choosing the execution plan is the filtering of the conditions. What matters is not the size of the age range, but how much data are in this range, which is the selectivity of the conditions.

If the parameter values are used directly to build the parameter list, the selection area of the execution plan can easily become fragmented. But with selectivity, you will generally get a consistent regional distribution of execution plans.

More importantly, pure parameter space is easily affected by data changes, while selectivity is more stable.

The cost calculation of the execution plan starts from the initial TableScan. The amount of data to be processed is derived from the selectivity calculated by the expression, and the expression calculation requires the specific parameters in SQL statements.

Selectivity: generally used to describe the filtering performance of an expression on a data entity.

Comparison with Choose Plan By Cost

When a request parameter change causes great changes to the selectivity of TableScan, the cost of the original execution plan may also change dramatically. Therefore, it is difficult for an execution plan to fully cope with all the possibilities of parameter lists in SQL statements.

In Choose Plan By Cost, new execution plans are continuously introduced into SPM through evolutionary logic, and then execution plans are selected through the cost. It sounds like it can solve the problem of parameter change, but it has the following two main disadvantages:

• The logic of evolution cannot cover all user parameter lists

• Execution plan overall cost calculation is inaccurate

Therefore, Parametric Query Optimize (PQO) came into being. It selects the execution plan based on the parameter space. It calculates the selectivity of each table involved in the request and then builds the applicable boundaries of an execution plan based on the multidimensional selectivity information.

• The selection of the execution plan is more certain. The selectivity of TableScan is the most accurate in the cost estimation process of the execution plan compared with other links. As stated before, the cost estimation of the execution plan starts from TableScan, and then the cost of other compute nodes is calculated layer by layer according to the output cardinality. Among them, the selectivity calculated in the initial TableScan based on statistical information is the most accurate.

• The parameter space does not change essentially due to statistical information updates. The boundaries built between execution plans with selectivity reflect the optimization model in the optimizer's ideal to some extent. This model has nothing to do with statistical information or data changes. For example, even if an execution plan is built based on wrong statistical information, when the information becomes accurate, the execution plan and its boundary model built with selectivity are still meaningful.

Disadvantages of POQ

POQ is more dependent on the accuracy of statistical information than other solutions. In addition, there are two difficulties in implementation:

• Bucket mechanism: In the online solution, whether the division of the execution plan area is placed before or after execution is a question worth thinking about. Before execution, the execution plan management module needs to invade more optimizer cost models, and after execution, it needs a set of accurate feedback mechanisms.

• Region boundaries: In POQ, different region boundaries should be designed based on different cost tendencies. This is also a more complex issue worthy of study. This topic is not expanded here.

Practice and Summary

Analysis

PolarDB-X is an HTAP database. In practice, it often faces a large number of TP requests and is mixed with many AP online analysis. When designing execution plan management, it is most important to adopt different strategies for different workloads.

At the same time, the separation architecture of the PolarDB-X CN/DN (DN can also perform complex queries based on local data) determines the complexity of the optimizer optimization process. If too many optimization requests are generated for AP requests, a large amount of CPU resources will be occupied, which will affect the TP request throughput.

For pure TP-type point checking, there is not much room for change in the execution plan. It is unnecessary to estimate the code or sketch the space of selectivity before execution. The most suitable execution plan strategy for this kind of traffic is Optimize Once.

For TP traffic with some changes in the processing result set, such as the previous example, Choose Plan By Cost or PQO should be used to handle it. The best processing method is the feedback mechanism. Only when the executor feedbacks that the cost of the request is significantly different from the previous one, will the strategy be changed from the Optimize Once policy to PQO.

PQO is the best solution for online AP requests, and the feedback mechanism can also avoid generating too many buckets.

For very complex AP requests, the execution time of such SQL statements is much higher than the optimization time, so you can use the Optimize Always policy.

Overall Components

The PolarDB-X SPM is mainly based on three components: PLAN CACHE, BASELINE, PQO.

• PLAN CACHE: 1:1 stores STMT to the plan information, which can be directly obtained to the plan in KV mode based on parameterized SQL statements.

• BASELINE: 1:N stores STMT to the plan information. Plans are classified into Accepted and Unaccepted categories. After receiving the request, the plan with the lowest cost will be selected according to the current parameter list.

• PQO: records the correspondence between the selectivity information and plan, and calculates the corresponding plan range based on the selectivity information of the current request.

Update Strategy for Execution Plans

No matter what kind of SQL statements, with the change of data layer, statistical data, or table schema, a set of update mechanisms are needed to continuously introduce new execution plans. Optimize Once component adopts the method of regular update, PQO (by selectivity) is based on statistical information update, and By Cost adopts the method of synchronous deepening, similar to Oracle's method.

Implementation

Let's review the example at the beginning of the article to see what really happens inside SPM.

SELECT COUNT(*)
FROM part p JOIN lineitem l ON p.p_partkey = l.l_partkey
where l_shipdate > '1995-03-31'
  and l_shipdate < '1996-04-15'
  and p_type = 'STANDARD ANODIZED STEEL'

For the above SQL statements, since p_type is not obviously skewed on the part table, the amount of data in the part table will not change greatly with the change of conditions. However, the condition l_shipdate of the lineitem table is a range query, so the filtering performance will be greatly influenced by the change of conditions.

Finally, according to the change of l_shipdate, the optimizer can choose three different execution plans, as shown in the following figure:

The Execution Plan enters SPM_PQO

For an SQL template, before execution, the source of all execution plans is PLAN_CACHE. The first execution will not enter the Baseline, but the second execution will.

As shown in the following figure, the source of the execution plan is PLAN_CACHE when you do the EXPLAIN operation before the execution. After one execution, when you use the EXPLAIN again, the source becomes SPM_NEW_BUILD.

After you execute again, the source changes to SPM_PQO. This indicates that you have entered the PQO module. From now on, the source of EXPLAIN is SPM_PQO.

Note that the source of the execution plan that enters PQO for the first time is PLAN_CACHE, and the conditions for generating the execution plan in PLAN_CACHE are:

l_shipdate > '1995-03-31' and l_shipdate <'1996-04-01'

Therefore, in the above execution process, no matter how the parameters are changed, the first execution plan is used.

Generate a New Execution Plan Based on Feedback in PQO

When the number of rows processed during SQL execution does not meet expectations, the PQO module divides regions according to the selectivity and generates different execution plans in different regions.

When the conditions of l_shipdate > '1995-03-31' and l_shipdate <'1995-04-01' are used twice, and then the SQL statements with the same conditions are explained, the second execution plan is generated.

At this point, in the SPM view, PQO for this SQL statement is currently divided into two areas. The biggest difference of SELECTIVITY_SPACE lies in the lineitem table filtering. Therefore, the biggest difference between the two execution plans is the choice of the HASH JOIN build side.

The same operation can be used to construct the execution plan of BKA JOIN in PQO. The final SPM view is as follows:

The results of EXPLAIN in various conditions are as follows. For clearer comparison, EXPLAIN simple is used. Note that the conditions in the following screenshot are not exactly the same as the previous settings. As long as they are similar in the selectivity area, you can select the nearest execution plan.

References

• A. Hulgeri and S. Sudarshan, “Parametric Query Optimization for Linear and Piecewise Linear Cost Functions,” Proc. 28th Int’l Conf. Very Large Data Bases (VLDB), 2002.
• Pedro Bizarro, Nicolas Bruno, and David J. DeWitt, "Progressive Parametric Query Optimization" IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 4, APRIL 2009
• "SQL Plan Management in Oracle Database 11g" An Oracle White Paper November 2010

Community

PolarDB-X Optimizer Core Technology: Execution Plan Management

Background

SPM at First Glance

Optimize Once and Optimize Always

Choose Plan By Cost

Comparison with Optimize Once/Always

Specific Ideas

Parametric Query Optimize (PQO)

Comparison with Choose Plan By Cost

Disadvantages of POQ

Practice and Summary

Analysis

Overall Components

Update Strategy for Execution Plans

Implementation

The Execution Plan enters SPM_PQO

Generate a New Execution Plan Based on Feedback in PQO

References

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for Xscale

PolarDB for PostgreSQL

PolarDB for MySQL

ApsaraDB RDS for SQL Server