PolarDB-X provides a rule-based optimizer and a cost-based optimizer to help you optimize logical execution plans for SQL queries to generate physical execution plans. The rule-based optimizer is called SQL Rewrite. The cost-based optimizer is called Plan Enumerator. This topic describes how the query optimizers work and also describes the relational algebra operators that are used in SQL queries.
- The syntax parser parses the SQL text to an abstract syntax tree (AST).
- The syntax tree is converted to a logical execution plan based on relational algebra operators.
- The optimizers optimize the logical execution plan to generate a physical execution plan.
- The executor executes the physical execution plan to retrieve the query result and returns the query result to the client.
Relational algebra operators
In a database system, an SQL query is usually represented as a tree that consists of relational algebra operators. The following types of operators can be used in an SQL query:
- Project operators are used to describe the SELECT operation, including the functions that are used to perform the SELECT operation.
- Filter operators are used to describe WHERE conditions.
- Join operators are used to describe JOIN conditions. The physical operators that correspond to join operators include HashJoin, BKAJoin, Nested-Loop Join, and SortMergeJoin.
- Aggregate operators are used to describe GROUP BY conditions and aggregate functions. The physical operators that correspond to aggregate operators include HashAgg and SortAgg.
- Sort operators are used to describe the ORDER BY and LIMIT conditions. The physical operators that correspond to sort operators include TopN and MemSort.
For example, the following code provides the SQL statement that is executed to perform an SQL query:
SELECT l_orderkey, sum(l_extendedprice *(1 - l_discount)) AS revenue FROM CUSTOMER, ORDERS, LINEITEM WHERE c_mktsegment = 'AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < '1995-03-13' and l_shipdate > '1995-03-13' GROUP BY l_orderkey;
After you execute the EXPLAIN statement to query the execution plan in PolarDB-X, the following information is returned:
HashAgg(group="l_orderkey", revenue="SUM(*)") HashJoin(condition="o_custkey = c_custkey", type="inner") Gather(concurrent=true) LogicalView(tables="ORDERS_[0-7],LINEITEM_[0-7]", shardCount=8, sql="SELECT `ORDERS`.`o_custkey`, `LINEITEM`.`l_orderkey`, (`LINEITEM`.`l_extendedprice` * (? - `LINEITEM`.`l_discount`)) AS `x` FROM `ORDERS` AS `ORDERS` INNER JOIN `LINEITEM` AS `LINEITEM` ON (((`ORDERS`.`o_orderkey` = `LINEITEM`.`l_orderkey`) AND (`ORDERS`.`o_orderdate` < ?)) AND (`LINEITEM`.`l_shipdate` > ?))") Gather(concurrent=true) LogicalView(tables="CUSTOMER_[0-7]", shardCount=8, sql="SELECT `c_custkey` FROM `CUSTOMER` AS `CUSTOMER` WHERE (`c_mktsegment` = ?)")
The following figure shows the execution plan by using an AST.
When an SQL statement is processed by SQL Rewrite, the logical execution plan of the SQL statement is rewritten into another logical execution plan. In SQL Rewrite, multiple heuristic rules are applied to the SQL statement for optimization. This way, SQL Rewrite is used as a rule-based optimizer.
When an SQL statement is processed in SQL Rewrite, the following operations are performed:
- Subquery unnesting. Nested subqueries are converted to operations that use semijoin
operators or similar operators to facilitate subsequent optimizations, such as pushing
down operators to ApsaraDB RDS for MySQL or selecting an algorithm for execution in
PolarDB-X. In the following example, an IN subquery is converted to an operation that
uses a semijoin operator and then to an operation that uses a physical hash semijoin
operator for execution in PolarDB-X:
> explain select id from t1 where id in (select id from t2 where t2.name = 'hello'); SemiHashJoin(condition="id = id", type="semi") Gather(concurrent=true) LogicalView(tables="t1", shardCount=2, sql="SELECT `id` FROM `t1` AS `t1`") Gather(concurrent=true) LogicalView(tables="t2_[0-3]", shardCount=4, sql="SELECT `id` FROM `t2` AS `t2` WHERE (`name` = ?)")
- Operator pushdown
Operators are pushed down. Operator pushdown is critical. PolarDB-X provides the following built-in optimization rules for operator pushdown.
Optimization rule Description Predicate pushdown or column pruning Pushes filter and project operators to ApsaraDB RDS for MySQL for execution and filters out rows and columns that are not required. JOIN Clustering Re-sorts and clusters join operations based on the sharding method and the equality condition of the shard key to facilitate subsequent pushdown of join operations. Join pushdown Pushes down join operations that meet specific conditions to ApsaraDB RDS for MySQL for execution. Aggregate pushdown Splits each aggregate operation into operations that use the FinalAgg and LocalAgg operators and then pushes down the LocalAgg operator to ApsaraDB RDS for MySQL. Sort pushdown Splits each sort operation into operations that use the MergeSort and LocalSort operator and then pushes down the LocalSort operator to ApsaraDB RDS for MySQL.
For more information about query pushdown, see Push down and rewrite queries.
Plan Enumerator provides a final physical execution plan based on the received logical execution plan generated by SQL Rewrite. Plan Enumerator selects the query plan that has the minimum cost from multiple feasible execution plans by using the predefined cost model. The execution plan generated by SQL Rewrite may be better or worse than the original execution plan of the SQL statement. Plan Enumerator compares the costs of the two execution plans by using operators and selects the execution plan that incurs less cost. This way, Plan Enumerator is used a cost-based optimizer.
The following items are the core components of Plan Enumerator:
- Cardinality estimation
- Transform rules
- Cost model
- Plan space search engine
When an SQL statement is processed by Plan Enumerator, the following operations are performed:
- The search engine converts the logical execution plan by using transform rules to construct search space for the physical execution plan.
- Plan Enumerator uses the cost model to estimate the cost of each execution plan in the search space. Then, Plan Enumerator selects the physical execution plan that incurs the lowest cost.
- During the cost estimation, the cardinality estimation component estimates the number of input rows, the selection rate, and other information about each operator based on the statistical information about each table and each column. Then, the estimated information is provided to the cost model of the operator to estimate the cost of each execution plan.