The query optimizer optimizes logical plans to generate physical plans. The main phases include SQL rewrite and plan enumeration.
The following figure shows the execution process after DRDS receives a Structured Query Language (SQL) statement.
- The syntax parser parses SQL text to an abstract syntax tree (AST).
- The syntax tree is converted to a relational algebra-based logical plan.
- The optimizer optimizes the logical plan to generate a physical plan.
- The executor executes this plan to retrieve the query result and returns the query result to the client.
This topic describes how the query optimizer works. The following aspects are included:
- Relational algebra operators
- SQL rewrite (the Rule-Based Optimizer (RBO) phase)
- Query plan enumeration (the Cost-Based Optimizer (CBO) phase)
Relational algebra operators
In a database system, an SQL query is usually represented as a tree that consists of relational algebra operators. The operators in the following scenarios are available:
Project: used to describe the SELECT column in SQL, including function computations.
Filter: used to describe the WHERE conditions in SQL.
- JOIN: used to describe JOIN in SQL. The corresponding physical operators are HashJoin, BKAJoin, Nested-Loop Join, and SortMergeJoin.
- Agg: used to describe Group By and aggregate functions in SQL. The corresponding physical operators are HashAgg and SortAgg.
- Sort: used to describe Order By and Limit in SQL. The corresponding physical operators are TopN and MemSort.
LogicalView: used to describethe SQL statement that DRDS sends to the storage layer MySQL by . The SQL statement may include one or more logical operators.
Gather: represents the operation that gathers data from multiple data streams. This Gather operator usually appears above LogicalView. If parallel execution is enabled, the Gather operator is pulled up in the parallel optimization step.
The following query SQL statement is used as an example. The statement is obtained by modifying TPC-H Query 3.
SELECT l_orderkey, sum(l_extendedprice *(1 - l_discount)) AS revenue FROM CUSTOMER, ORDERS, LINEITEM WHERE c_mktsegment = 'AUTOMOBILE' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < '1995-03-13' and l_shipdate > '1995-03-13' GROUP BY l_orderkey;
Run the following EXPLAIN command to viewthe execution plan of DRDS:
HashAgg(group="l_orderkey", revenue="SUM(*)") HashJoin(condition="o_custkey = c_custkey", type="inner") Gather(concurrent=true) LogicalView(tables="ORDERS_[0-7],LINEITEM_[0-7]", shardCount=8, sql="SELECT `ORDERS`.`o_custkey`, `LINEITEM`.`l_orderkey`, (`LINEITEM`.`l_extendedprice` * (? - `LINEITEM`.`l_discount`)) AS `x` FROM `ORDERS` AS `ORDERS` INNER JOIN `LINEITEM` AS `LINEITEM` ON (((`ORDERS`.`o_orderkey` = `LINEITEM`.`l_orderkey`) AND (`ORDERS`.`o_orderdate` < ?)) AND (`LINEITEM`.`l_shipdate` > ?))") Gather(concurrent=true) LogicalView(tables="CUSTOMER_[0-7]", shardCount=8, sql="SELECT `c_custkey` FROM `CUSTOMER` AS `CUSTOMER` WHERE (`c_mktsegment` = ?)")
The following treemap chart illustrates the execution plan.
SQL rewrite (RBO)
In the SQL rewrite phase, the input is a logical execution plan and the output is a logical execution plan. In this step, some heuristic rules are mainly applied and RBO is used. Therefore, this step is also usually called the RBO phase.
The SQL rewrite step has the following main features:
- Subquery unnesting
Subquery unnesting is to express correlated subqueries that contain correlated items as SemiJoin or similar operators. This facilitates various subsequent optimizations, such as pushing down to the storage layer MySQL orselecting an algorithm for execution at the DRDS layer. In the following example, an IN subquery is converted to a SemiJoin operator and finally converted to a physical SemiHashJoin operator thatis executed by DRDS.
> explain select id from t1 where id in (select id from t2 where t2.name = 'hello'); SemiHashJoin(condition="id = id", type="semi") Gather(concurrent=true) LogicalView(tables="t1", shardCount=2, sql="SELECT `id` FROM `t1` AS `t1`") Gather(concurrent=true) LogicalView(tables="t2_[0-3]", shardCount=4, sql="SELECT `id` FROM `t2` AS `t2` WHERE (`name` = ?)")
- Operator pushdown
Operator pushdown is a critical step.DRDS has the following built-in optimization rules for operator pushdown.
Optimization rule Description Predicate pushdown or column pruning Push down the Filter and Project operators to the storage layer MySQL for execution and filter out unneeded rows and columns. JOIN Clustering Re-sort and cluster JOIN operations by using sharding methods and based on the equality conditions of shard keys. This facilitates subsequent pushdown of JOIN operations. JOIN pushdown Push down JOIN operations that meet the conditions to the storage layer MySQL for execution. Agg pushdown Split Agg operations into the two phases FinalAgg and LocalAgg and push down LocalAgg to the storage layer MySQL. Sort pushdown Split Sort operations into the two phases MergeSort and LocalSort and push down LocalSort to the storage layer MySQL.
For more information about query pushdown, see SQL rewrite and pushdown.
Query plan enumeration (CBO)
The logical execution plan that is generated in the SQL rewrite phase is used as the input of Plan Enumerator. Then, Plan Enumerator generates a final physical execution plan. Plan Enumerator selects the query plan that has the minimum cost from multiple feasible query plans by using the predefined cost model. Different from the SQL rewrite phase, in Plan Enumerator, a better or worse execution plan may be generated based on rules. A better execution plan is selected based on the cost comparison result. Therefore, Plan Enumerator is also called CBO.
The core components of CBO have the following parts:
- Cardinality estimation
- Transform rules
- Cost model
- Plan space search engine
The CBO process consists of the following steps in terms of logic:
- The search engine converts the input logical execution plans by using transform rules to construct search space for physical execution plans.
- Then, the cost model is used to estimate the cost of each execution plan in search space to select the physical execution plan that has the minimum cost.
- The cost estimation process cannot be separated from cardinality estimation. In cardinality estimation, information about each operator, such as the number of input rows and the selection rate, is estimated based on the statistics of each table and each column. Then, the estimated information is provided to the cost model of the operator to estimate the cost of the query plan.