All Products
Search
Document Center

Rule-based query rewrite

Last Updated: Sep 29, 2021

The rule-based query rewrite includes subquery-related rewrite, outer join elimination, condition simplification, and non-Select-Project-Join (SPJ) rewrite.

Subquery related rewrite

The optimizer usually executes subqueries in a nested way, which means that each time a parent query generates a row of data, the optimizer executes a subquery. This method requires multiple subqueries to be executed, resulting in low execution efficiency. To optimize the execution, the nested operation is rewritten to a join, which greatly improves the execution efficiency because of the following benefits:

  • It avoids the repeated execution of the same subquery.

  • The optimizer selects a better join order and better join algorithms based on the statistics.

  • After the join and filter conditions of the subquery are rewritten as the conditions of the parent query, more optimization options are available to the optimizer, such as conditional pushdown.

Frequently applied methods to rewrite a subquery include view merging, subquery expansion, and rewriting ANY/ALL by using MAX/MIN.

View merge

View merge refers to the process of merging a subquery that represents a view into the query that contains the view. The merge provides the optimizer with more options of join order types, access paths, and rewrites, making the selection of a better execution plan much easier.

OceanBase Database supports the merge of SPJ views. In the following example, Q1 is rewritten to Q2:

obclient>CREATE TABLE t1 (c1 INT, c2 INT);
Query OK, 0 rows affected (0.00 sec)

obclient>CREATE TABLE t2 (c1 INT PRIMARY KEY, c2 INT);
Query OK, 0 rows affected (0.00 sec)

obclient>CREATE TABLE t3 (c1 INT PRIMARY KEY, c2 INT);
Query OK, 0 rows affected (0.00 sec)

Q1: 
obclient>SELECT t1.c1, v.c1 
                FROM t1, (SELECT t2.c1, t3.c2 
                FROM t2, t3 
                WHERE t2.c1 = t3.c1) v 
                WHERE t1.c2 = v.c2;
<==>
Q2: 
obclient>SELECT t1.c1, t2.c1 
                FROM t1, t2, t3 
                WHERE t2.c1 = t3.c1 AND t1.c2 = t3.c2;

Before the rewrite, the types of join orders available for query Q1 are:

  • t1, v(t2,t3)

  • t1, v(t3,t2)

  • v(t2,t3), t1

  • v(t3,t2), t1

After the rewrite, the following types of join orders are available:

  • t1, t2, t3

  • t1, t3, t2

  • t2, t1, t3

  • t2, t3, t1

  • t3, t1, t2

  • t3, t2, t1

You can see that the view merge provides more join order options. This means higher flexibility in selecting access paths and rewrite methods for a complex query, allowing the optimizer to generate a better plan.

Subquery unnesting

Subquery unnesting promotes the subquery in the WHERE condition of the parent query and unnests it as a join condition in parallel with the parent query. The rewrite deconstructs the subquery and changes the outer parent query to a multi-table join.

The benefit is that the optimizer takes into account tables in the subquery when it selects access paths, join methods, and sorting methods, thus obtaining a better execution plan. Relevant subquery unnesting expressions include NOT IN, IN, NOT EXIST, EXIST, ANY, and ALL.

You can unnest a subquery by the following methods:

  • Rewrite conditions so that the generated join statement is enabled to return the same rows as the original statement.

  • Unnest it as a semi-join by using Semi Join or Anti Join algorithm.

    In the following example, t2.c2 is changed to a Semi Join statement as it is not unique. The execution plan is then rewritten into:

    obclient>CREATE TABLE t1 (c1 INT, c2 INT);
    Query OK, 0 rows affected (0.17 sec)
    
    obclient>CREATE TABLE t2 (c1 INT PRIMARY KEY, c2 INT);
    Query OK, 0 rows affected (0.01 sec)
    
    obclient>EXPLAIN SELECT * FROM t1 WHERE t1.c1 IN (SELECT t2.c2 FROM t2)\G;
    *************************** 1. row ***************************
    Query Plan: 
    =======================================
    |ID|OPERATOR      |NAME|EST. ROWS|COST|
    ---------------------------------------
    |0 |HASH SEMI JOIN|    |495      |3931|
    |1 | TABLE SCAN   |t1  |1000     |499 |
    |2 | TABLE SCAN   |t2  |1000     |433 |
    =======================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([t1.c1], [t1.c2]), filter(nil),
          equal_conds([t1.c1 = t2.c2]), other_conds(nil)
      1 - output([t1.c1], [t1.c2]), filter(nil),
          access([t1.c1], [t1.c2]), partitions(p0)
      2 - output([t2.c2]), filter(nil),
          access([t2.c2]), partitions(p0)

    After you change the operator in front of a query to NOT IN, you can rewrite the query to apply Anti Join. See the following example:

    obclient>EXPLAIN SELECT * FROM t1 WHERE t1.c1 NOT IN (SELECT t2.c2 FROM t2)\G;
    *************************** 1. row ***************************
    Query Plan:
    ================================================
    |ID|OPERATOR             |NAME|EST. ROWS|COST  |
    ------------------------------------------------
    |0 |NESTED-LOOP ANTI JOIN|    |0        |520245|
    |1 | TABLE SCAN          |t1  |1000     |499   |
    |2 | TABLE SCAN          |t2  |22       |517   |
    ================================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([t1.c1], [t1.c2]), filter(nil),
          conds(nil), nl_params_([t1.c1], [(T_OP_IS, t1.c1, NULL, 0)])
      1 - output([t1.c1], [t1.c2], [(T_OP_IS, t1.c1, NULL, 0)]), filter(nil),
          access([t1.c1], [t1.c2]), partitions(p0)
      2 - output([t2.c2]), filter([(T_OP_OR, ? = t2.c2, ?, (T_OP_IS, t2.c2, NULL, 0))]),
          access([t2.c2]), partitions(p0)
  • Unnest a subquery to an inner join

    For query Q1 in the preceding example, the output of the subquery is unique if t2.c2 is modified into t2.c1, which is the primary key. So, you can rewrite it into an inner join, as shown in the following example:

    Q1: 
    obclient>SELECT * FROM t1 WHERE t1.c1 IN  (SELECT t2.c1 FROM t2)\G;
    <==>
    Q2: 
    obclient>SELECT t1.* FROM t1, t2 WHERE t1.c1 = t2.c1;

    The execution plan of query Q1 is then rewritten into:

    obclient>EXPLAIN SELECT * FROM t1 WHERE t1.c1 IN (SELECT t2.c1 FROM t2)\G;
    *************************** 1. row ***************************
    Query Plan:
     ====================================
    |ID|OPERATOR   |NAME|EST. ROWS|COST|
    ------------------------------------
    |0 |HASH JOIN  |    |1980     |3725|
    |1 | TABLE SCAN|t2  |1000     |411 |
    |2 | TABLE SCAN|t1  |1000     |499 |
    ====================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([t1.c1], [t1.c2]), filter(nil),
          equal_conds([t1.c1 = t2.c1]), other_conds(nil)
      1 - output([t2.c1]), filter(nil),
          access([t2.c1]), partitions(p0)
      2 - output([t1.c1], [t1.c2]), filter(nil),
          access([t1.c1], [t1.c2]), partitions(p0)

    You can rewrite subquery expressions, such as NOT IN, IN, NOT EXIST, EXIST, ANY, and ALL, in the same way.

Rewrite ANY or ALL by using MAX or MIN

For a subquery with an ANY or ALL expressions, if the subquery does not contain the GROUP BY clause, aggregate function, or HAVING clause, the expressions shown in the following example can be equivalently converted by the aggregate function MIN or MAX, where col_item is a separate, non-NULL column:

val > ALL(SELECT col_item ...)  <==> val > ALL(SELECT MAX(col_item) ...);
val >= ALL(SELECT col_item ...) <==> val >= ALL(SELECT MAX(col_item) ...);
val < ALL(SELECT col_item ...)  <==> val < ALL(SELECT MIN(col_item) ...);
val <= ALL(SELECT col_item ...) <==> val <= ALL(SELECT MIN(col_item) ...);
val > ANY(SELECT col_item ...)  <==> val > ANY(SELECT MIN(col_item) ...);
val >= ANY(SELECT col_item ...) <==> val >= ANY(SELECT MIN(col_item) ...);
val < ANY(SELECT col_item ...)  <==> val < ANY(SELECT MAX(col_item) ...);
val <= ANY(SELECT col_item ...) <==> val <= ANY(SELECT MAX(col_item) ...);

After the subquery is converted to contain MAX or MIN, you can further rewrite MAX or MIN to reduce the number of scanning operations on the inner table.Example:

obclient>SELECT c1 FROM t1 WHERE c1 > ANY(SELECT c1 FROM t2);
<==>
obclient>SELECT c1 FROM t1 WHERE c1 > ANY(SELECT MIN(c1) FROM t2);

After the MAX/MIN rewrite, the primary key t2.c1 can be used to directly push LIMIT 1 down to TABLE SCAN to generate the MIN value. Execution plan:

obclient>EXPLAIN SELECT c1 FROM t1 WHERE c1 > ANY(SELECT c1 FROM t2)\G;
*************************** 1. row ***************************
Query Plan:
 ===================================================
|ID|OPERATOR        |NAME          |EST. ROWS|COST|
---------------------------------------------------
|0 |SUBPLAN FILTER  |              |1        |73  |
|1 | TABLE SCAN     |t1            |1        |37  |
|2 | SCALAR GROUP BY|              |1        |37  |
|3 |  SUBPLAN SCAN  |subquery_table|1        |37  |
|4 |   TABLE SCAN   |t2            |1        |36  |
===================================================

Outputs & filters:
-------------------------------------
  0 - output([t1.c1]), filter([t1.c1 > ANY(subquery(1))]),
      exec_params_(nil), onetime_exprs_(nil), init_plan_idxs_([1])
  1 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0)
  2 - output([T_FUN_MIN(subquery_table.c1)]), filter(nil),
      group(nil), agg_func([T_FUN_MIN(subquery_table.c1)])
  3 - output([subquery_table.c1]), filter(nil),
      access([subquery_table.c1])
  4 - output([t2.c1]), filter(nil),
      access([t2.c1]), partitions(p0),
      limit(1), offset(nil)

Outer join elimination

Outer Join operations include Left Outer Join, Right Outer Join, and Full Outer Join. The order of the Outer Join cannot be changed during the operation, the optimizer has limited options of join order. Outer join elimination helps convert the Outer Join into an Inner Join, providing the optimizer more options of join paths.

However, the outer join elimination requires a "Reject-NULL" condition, which is contained in WHERE conditions, and exports FALSE when the inner table generates a NULL value.

Example:

obclient>SELECT t1.c1, t2.c2 FROM t1 LEFT JOIN t2 ON t1.c2 = t2.c2;

This is an outer join, whose output row t2.c2 may be NULL. If you add the condition t2.c2 > 5 and filter by this condition, the non-NULL output of t2.c1, the outer join can be converted to an inner join.

obclient>SELECT t1.c1, t2.c2 FROM t1 LEFT JOIN t2 ON t1.c2 = t2.c2 WHERE t2.c2 > 5;
<==>
obclient>SELECT t1.c1, t2.c2 FROM t1 LEFT INNER JOIN t2 ON t1.c2 = t2.c2 
            WHERE t2.c2 > 5;

Condition simplification

HAVING condition elimination

The HAVING condition can be merged into the WHERE conditions, with the HAVING condition eliminated. When no AGGREGATE or GROUP BY operator exists in the query, the HAVING condition can be managed collectively with other conditions in WHERE, for the purpose of further optimization.

obclient>SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1 HAVING t1.c2 > 1;
<==>
obclient>SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1 AND t1.c2 > 1;

The rewritten plan is shown in the following example, where the condition t1.c2 > 1 is pushed down to the TABLE SCAN layer.

obclient>EXPLAIN SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1 HAVING t1.c2 > 1\G;
*************************** 1. row ***************************
Query Plan: 
=========================================
|ID|OPERATOR        |NAME|EST. ROWS|COST|
-----------------------------------------
|0 |NESTED-LOOP JOIN|    |1        |59  |
|1 | TABLE SCAN     |t1  |1        |37  |
|2 | TABLE GET      |t2  |1        |36  |
=========================================

Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2], [t2.c1], [t2.c2]), filter(nil),
      conds(nil), nl_params_([t1.c1])
  1 - output([t1.c1], [t1.c2]), filter([t1.c2 > 1]),
      access([t1.c1], [t1.c2]), partitions(p0)
  2 - output([t2.c1], [t2.c2]), filter(nil),
      access([t2.c1], [t2.c2]), partitions(p0)

Equivalent relation deduction

In the process of equivalent relation deduction, new conditional expressions are deduced from the transitivity of comparison operators to reduce the number of rows to be processed or to select a more efficient index.

OceanBase Database supports the deduction of equivalent joins. For example, if a = b AND a > 1 AND b > 1 can be deduced from a = b AND a > 1, and column b is indexed, and b > 1 is rarely selected in the index, then it is possible to significantly improve the performance of accessing the table of column b.

In the following example, the condition t1.c1 = t2.c2 AND t1.c1 > 2is equivalently deduced tot1.c1 = t2.c2 AND t1.c1 > 2 AND t2.c2 > 2. You can see that t2.c2 is pushed down to TABLE SCAN, and the corresponding index for t2.c2 is applied.

obclient>CREATE TABLE t1(c1 INT PRIMARY KEY, c2 INT);
Query OK, 0 rows affected (0.15 sec)

obclient>CREATE TABLE t2(c1 INT PRIMARY KEY, c2 INT, c3 INT, KEY IDX_c2(c2));
Query OK, 0 rows affected (0.10 sec)
/*Run this command in MySQL mode*/

obclient>EXPLAIN EXTENDED_NOADDR SELECT t1.c1, t2.c2 FROM t1, t2 
              WHERE t1.c1 = t2.c2 AND t1.c1 > 2\G;
*************************** 1. row ***************************
Query Plan: 
==========================================
|ID|OPERATOR   |NAME      |EST. ROWS|COST|
------------------------------------------
|0 |MERGE JOIN |          |5        |78  |
|1 | TABLE SCAN|t2(IDX_c2)|5        |37  |
|2 | TABLE SCAN|t1        |3        |37  |
==========================================

Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t2.c2]), filter(nil),
      equal_conds([t1.c1 = t2.c2]), other_conds(nil)
  1 - output([t2.c2]), filter(nil),
      access([t2.c2]), partitions(p0),
      is_index_back=false,
      range_key([t2.c2], [t2.c1]), range(2,MAX ; MAX,MAX),
      range_cond([t2.c2 > 2])
  2 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0),
      is_index_back=false,
      range_key([t1.c1]), range(2 ; MAX),
      range_cond([t1.c1 > 2])

Identically true/false elimination

The following identically true and false conditions can be eliminated:

  • false and expr = identically false

  • true or expr = identically true

In the following example, for the condition WHERE 0 > 1 AND c1 = 3, AND is identically false because of the condition 0 > 1. So, the SQL query is not executed and directly returns, making the execution process faster.

obclient>EXPLAIN EXTENDED_NOADDR SELECT * FROM t1 WHERE 0 > 1 AND c1 = 3\G;
*************************** 1. row ***************************
Query Plan: 
===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |0        |38  |
===================================

Outputs & filters:
-------------------------------------
  0 - output([t1.c1], [t1.c2]), filter([0], [t1.c1 = 3]), startup_filter([0]),
      access([t1.c1], [t1.c2]), partitions(p0),
      is_index_back=false, filter_before_indexback[false,false],
      range_key([t1.__pk_increment], [t1.__pk_cluster_id], [t1.__pk_partition_id]),
      range(MAX,MAX,MAX ; MIN,MIN,MIN)always false

Non-SPJ rewrite

Sorting redundancy elimination

The sorting redundancy elimination refers to the deletion of redundant order items to reduce resource consumption for sorting. Sorting redundancy elimination is required in the following three circumstances:

  • Duplicate columns in the ORDER BY expression list. You can sort the columns after deduplication.

    obclient>SELECT * FROM t1 WHERE c2 = 5 ORDER BY c1, c1, c2, c3 ;
    <==>
    obclient>SELECT * FROM t1 WHERE c2 = 5 ORDER BY c1, c2, c3;
  • A single valued column in both ORDER BY and WHERE. This column can be deleted before sorting.

    obclient>SELECT * FROM t1 WHERE c2 = 5 ORDER BY c1, c2, c3;
    <==>
    obclient>SELECT * FROM t1 WHERE c2 = 5 ORDER BY c1, c3;
  • If a query at the current layer contains ORDER BY but not LIMIT, and it resides in the set operation of the parent query, then ORDER BY can be eliminated. The reason is that the UNION operation on two ordered sets simply leads to an unordered result set. However, if ORDER BY contains LIMIT, which semantically implies the selection of the greatest/least N items, ORDER BY must be retained to avoid semantic errors.

    obclient>(SELECT c1,c2 FROM t1 ORDER BY c1) UNION (SELECT c3,c4 FROM t2 ORDER BY c3);
    <==>
    obclient>(SELECT c1,c2 FROM t1) UNION (SELECT c3,c4 FROM t2);

LIMIT pushdown

LIMIT pushdown rewrite means to push the LIMIT down to the subquery. OceanBase Database V2.2.76 supports pushing down LIMIT to a view (Example 1) and a subquery that contains the UNION operator (Example 2), without changing the semantics. See the following two examples:

Example 1:

obclient>SELECT * FROM (SELECT * FROM t1 ORDER BY c1) a LIMIT 1; 
<==>
obclient>SELECT * FROM (SELECT * FROM t1 ORDER BY c1 LIMIT 1) a LIMIT 1;

Example 2:

obclient>(SELECT c1,c2 FROM t1) UNION ALL (SELECT c3,c4 FROM t2) LIMIT 5;
<==>
obclient>(SELECT c1,c2 FROM t1 LIMIT 5) UNION ALL (SELECT c3,c4 FROM t2 limit 5) LIMIT 5;

DISTINCT elimination

  • DISTINCT can be eliminated, with LIMIT 1 added, if the SELECT statement contains only constants.

    obclient>SELECT DISTINCT 1,2 FROM t1 ;
    <==> 
    obclient>SELECT DISTINCT 1,2 FROM t1 LIMIT 1;
    
    obclient>CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT);
    Query OK, 0 rows affected (0.17 sec)
    
    obclient>EXPLAIN EXTENDED_NOADDR SELECT DISTINCT 1,2 FROM t1\G;
    *************************** 1. row ***************************
    Query Plan: 
    ===================================
    |ID|OPERATOR  |NAME|EST. ROWS|COST|
    -----------------------------------
    |0 |TABLE SCAN|t1  |1        |36  |
    ===================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([1], [2]), filter(nil),
          access([t1.c1]), partitions(p0),
          limit(1), offset(nil),
          is_index_back=false,
          range_key([t1.c1]), range(MIN ; MAX)always true
  • The DISTINCT can also be eliminated if the SELECT statement contains columns with a unique constraint. In the following example, (c1, c2) is the primary key, which ensures the uniqueness of c1, c2, and c3, so that DISTINCT can be eliminated.

    obclient>CREATE TABLE t2(c1 INT, c2 INT, c3 INT, PRIMARY KEY(c1, c2));
    Query OK, 0 rows affected (0.17 sec)
    
    obclient>SELECT DISTINCT c1, c2, c3 FROM t2;
    <==>
    obclient>SELECT c1, c2 c3 FROM t2;
    
    obclient>EXPLAIN SELECT DISTINCT c1, c2, c3 FROM t2\G;
    *************************** 1. row ***************************
    Query Plan: 
    ===================================
    |ID|OPERATOR  |NAME|EST. ROWS|COST|
    -----------------------------------
    |0 |TABLE SCAN|t2  |1000     |455 |
    ===================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([t2.c1], [t2.c2], [t2.c3]), filter(nil),
          access([t2.c1], [t2.c2], [t2.c3]), partitions(p0)

MIN/MAX rewrite

  • When the parameter in the MIN/MAX function is the index prefix column and does not include GROUP BY, the SCALAR aggregate can be rewritten into an index scan for scanning only 1 row, as shown in the following example:

    obclient>CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT, c3 INT, KEY IDX_c2_c3(c2,c3));
    Query OK, 0 rows affected (0.17 sec)
    
    obclient>SELECT MIN(c2) FROM t1;
    <==>
    obclient>SELECT MIN(c2) FROM (SELECT c2 FROM t2 ORDER BY c2 LIMIT 1) AS t;
    
    obclient>EXPLAIN SELECT MIN(c2) FROM t1\G;
    *************************** 1. row ***************************
    Query Plan: 
    ==================================================
    |ID|OPERATOR       |NAME          |EST. ROWS|COST|
    --------------------------------------------------
    |0 |SCALAR GROUP BY|              |1        |37  |
    |1 | SUBPLAN SCAN  |subquery_table|1        |37  |
    |2 |  TABLE SCAN   |t1(idx_c2_c3) |1        |36  |
    ==================================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([T_FUN_MIN(subquery_table.c2)]), filter(nil),
          group(nil), agg_func([T_FUN_MIN(subquery_table.c2)])
      1 - output([subquery_table.c2]), filter(nil),
          access([subquery_table.c2])
      2 - output([t1.c2]), filter([(T_OP_IS_NOT, t1.c2, NULL, 0)]),
          access([t1.c2]), partitions(p0),
          limit(1), offset(nil)
  • If parameters of SELECT MIN/MAX are constants and include GROUP BY, you can rewrite MIN/MAX to a constant to reduce the resource consumption of the MIN/MAX operation.

    obclient>SELECT MAX(1) FROM t1 GROUP BY c1;
    <==>
    obclient>SELECT 1 FROM t1 GROUP BY c1;
    
    obclient>EXPLAIN EXTENDED_NOADDR SELECT MAX(1) FROM t1 GROUP BY c1\G;
    *************************** 1. row ***************************
    Query Plan: 
    ===================================
    |ID|OPERATOR  |NAME|EST. ROWS|COST|
    -----------------------------------
    |0 |TABLE SCAN|t1  |1000     |411 |
    ===================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([1]), filter(nil),
          access([t1.c1]), partitions(p0),
          is_index_back=false,
          range_key([t1.c1]), range(MIN ; MAX)always true
  • If parameters of SELECT MIN/MAX are constants and do not include GROUP BY, you can rewrite the query by referencing the following example and scan 1 row only by using the index.

    obclient>SELECT MAX(1) FROM t1;
    <==> 
    obclient>SELECT MAX(t.a) FROM (SELECT 1 AS a FROM t1 LIMIT 1) t;
    
    obclient>EXPLAIN EXTENDED_NOADDR SELECT MAX(1) FROM t1\G;
    *************************** 1. row ***************************
    Query Plan: 
    ==================================================
    |ID|OPERATOR       |NAME          |EST. ROWS|COST|
    --------------------------------------------------
    |0 |SCALAR GROUP BY|              |1        |37  |
    |1 | SUBPLAN SCAN  |subquery_table|1        |37  |
    |2 |  TABLE SCAN   |t1            |1        |36  |
    ==================================================
    
    Outputs & filters:
    -------------------------------------
      0 - output([T_FUN_MAX(subquery_table.subquery_col_alias)]), filter(nil),
          group(nil), agg_func([T_FUN_MAX(subquery_table.subquery_col_alias)])
      1 - output([subquery_table.subquery_col_alias]), filter(nil),
          access([subquery_table.subquery_col_alias])
      2 - output([1]), filter(nil),
          access([t1.c1]), partitions(p0),
          limit(1), offset(nil),
          is_index_back=false,
          range_key([t1.c1]), range(MIN ; MAX)always true