All Products
Search
Document Center

Rule-based path selection

Last Updated: Jun 18, 2021

This topic introduces the rules for selecting the access path in OceanBase Database.

Currently, OceanBase Database provides pre-rules, or forward rules, and Skyline pruning rules, or reverse rules. The pre-rule is based on strict matching and directly specifies the index to use for a query.

The Skyline pruning rule allows the comparison of two indexes. If one index dominates the other in terms of some defined dimensions, the inferior index is pruned. The remaining indexes are compared based on the cost to determine the best one.

The OceanBase Database optimizer applies pre-rules first to select indexes. If no matching index is found, it applies the Skyline pruning rule to exclude some inferior indexes. The remaining indexes are compared by using the cost model to select the one with the lowest cost.

As shown in the following example, OceanBase Database displays the information about path selection rule in the plan demonstration.

obclient>CREATE TABLE t1(a INT PRIMARY KEY, b INT, c INT, d INT, e INT, 
            UNIQUE INDEX k1(b), INDEX k2(b,c), INDEX k3(c,d));
Query OK, 0 rows affected (0.38 sec)

obclient> EXPLAIN EXTENDED SELECT * FROM t1 WHERE b = 1;
+-----------------------------------------------------------------+
| Query Plan                                                                              |
+-----------------------------------------------------------------+
| =====================================
|ID|OPERATOR  |NAME  |EST. ROWS|COST|
-------------------------------------
|0 |TABLE SCAN|t1(k1)|2        |94  |
=====================================

Outputs & filters:
-------------------------------------
  0 - output([t1.a(0x7f3178058bf0)], [t1.b(0x7f3178058860)], [t1.c(0x7f3178058f80)], [t1.d(0x7f3178059310)], [t1.e(0x7f31780596a0)]), filter(nil),
      access([t1.b(0x7f3178058860)], [t1.a(0x7f3178058bf0)], [t1.c(0x7f3178058f80)], [t1.d(0x7f3178059310)], [t1.e(0x7f31780596a0)]), partitions(p0),
      is_index_back=true,
      range_key([t1.b(0x7f3178058860)], [t1.shadow_pk_0(0x7f31780784b8)]), range(1,MIN ; 1,MAX),  
      range_cond([t1.b(0x7f3178058860) = 1(0x7f31780581d8)])
Optimization Info:
-------------------------------------
t1:optimization_method=rule_based, heuristic_rule=unique_index_with_indexback


obclient> EXPLAIN EXTENDED SELECT * FROM t1 WHERE c < 5 ORDER BY c;
+-----------------------------------------------------------------+
| Query Plan                                                                              |
+-----------------------------------------------------------------+
| ====================================
|ID|OPERATOR   |NAME|EST. ROWS|COST|
------------------------------------
|0 |SORT       |    |200      |1054|
|1 | TABLE SCAN|t1  |200      |666 |
====================================

Outputs & filters:
-------------------------------------
  0 - output([t1.a(0x7f3178059220)], [t1.b(0x7f31780595b0)], [t1.c(0x7f3178058e90)], [t1.d(0x7f3178059940)], [t1.e(0x7f3178059cd0)]), filter(nil), sort_keys([t1.c(0x7f3178058e90), ASC])
  1 - output([t1.c(0x7f3178058e90)], [t1.a(0x7f3178059220)], [t1.b(0x7f31780595b0)], [t1.d(0x7f3178059940)], [t1.e(0x7f3178059cd0)]), filter([t1.c(0x7f3178058e90) < 5(0x7f3178058808)]),
      access([t1.c(0x7f3178058e90)], [t1.a(0x7f3178059220)], [t1.b(0x7f31780595b0)], [t1.d(0x7f3178059940)], [t1.e(0x7f3178059cd0)]), partitions(p0),
      is_index_back=false, filter_before_indexback[false],
      range_key([t1.a(0x7f3178059220)]), range(MIN ; MAX)always true
t1:optimization_method=cost_based, avaiable_index_name[t1,k3], pruned_index_name[k1,k2]

As you can see, optimization_method indicates the detailed rule information. It has the following two formats:

  • optimization_method=rule_based indicates the hit of a pre-rule. It is followed by the rule name. unique_index_with_indexback is the third type of pre-rule, which requires that three rules must be met: "exact match with a unique index + table access by index primary key is required + the number of TABLE ACCESS BY INDEX PRIMARY KEY operations is less than the specified threshold".

  • optimization_method=cost_based indicates that the path selection is based on the cost. This is followed by access paths pruned by Skyline pruning as listed by pruned_index_name, and the remaining access paths as listed by avaiable_index_name.

Pre-rules

At present, pre-rules in OceanBase Database only apply to simple single table scanning. As pre-rules are based on strict matching, an index is applied as soon as it is hit. Therefore, it is used only in limited scenarios to prevent the risks of incorrect selection.

Depending on whether the query criteria cover all index keys and whether the use of the index requires table access by index primary key, OceanBase Database classify pre-rules into the following three priority-based types:

  • Exact match with a unique index + no table access by index primary key is required. The primary key is considered a unique index in this context. If the rule leaves several indexes available, the one with the lowest number of columns is selected.

  • Exact match with a normal index + no table access by index primary key is required. If the rule leaves several indexes available, the one with the lowest number of columns is selected.

  • Exact match with a unique index + table access by index primary key is required + the number of TABLE ACCESS BY INDEX PRIMARY KEY operations is less than the specified threshold. If the rule has several indexes available, the one with the smallest number of TABLE ACCESS BY INDEX PRIMARY KEY operations is selected.

Note that full index match means that equivalent condition is present on every index key, which corresponds to get or multi-get.

In the following example, the Q1 query hits index uk1 (exact match with a unique index + no table access by index primary key is required), and the Q2 query hits Index uk2 (exact match with a unique index + table access by index primary key is required + the number of rows returned by a TABLE ACCESS BY INDEX PRIMARY KEY operation is less than 4 rows).

obclient>CREATE TABLE  test(a INT PRIMARY KEY, b INT, c INT, d INT, e INT, 
              UNIQUE KEY UK1(b,c), UNIQUE KEY UK2(c,d) );
Query OK, 0 rows affected (0.38 sec)

Q1: 
obclient>SELECT b,c FROM test WHERE (b = 1 OR b = 2) AND (c = 1 OR c =2);

Q2: 
obclient>SELECT * FROM test WHERE (c = 1 OR c =2) OR (d = 1 OR d = 2);

Skyline pruning rules

The Skyline operator is a relatively new database operator introduced in 2001. It is not a typical SQL operator. Information scientists have since invested a great deal of effort to study the Skyline operator, from its syntax, semantics to execution.

Skyline literally refers to points in the sky that forms its boundaries. These points make up the set of optimal solutions in a search space. For example, you want to find the cheapest and nearest hotel. Imagine a two-dimensional coordinate system, with the horizontal axis being the price, and the vertical axis being the distance, and every point in the space represents a hotel.

As shown in the following figure, the optimal solution is definitely on the "skyline" of this space, regardless of your selection of price and distance. Assume that point A is not on the skyline, then you can certainly find a point B on the skyline that is better than A in both dimensions. In this scenario, point B is a hotel that is cheaper and closer to you. This is referred to as point B dominating point A. So, one important application of the Skyline rule is when you cannot establish the significance of multiple dimensions, or when these dimensions cannot be comprehensively quantified. If comprehensive quantification is possible, you can simply use an SQL function and ORDER BY to get the job done.

Skyline

Given an object set O, the purpose of the Skyline operation is to find the set of objects that are not dominated by other objects. I object A is not dominated by object B in any dimension, and A dominates B in at least one dimension, you can say A dominates B, In other words, the selection of dimensions and the definition of the domination relationship on each dimension are two major factors in the Skyline operation. Assume that the optimizer can choose from N index paths <idx_1, idx_2, and idx_3...idx_n>, and for query Q, index idx_x dominates index idx_y on the defined dimensions, the optimizer can prune index idx_y off to remove it from the cost comparison at the end.

Definition of dimensions

For Skyline pruning, the following three dimensions are defined for the indexes including the primary key:

  • The necessity for table access by index primary key

  • The existence of interesting order

  • The extraction of query range from index prefixes

The following example is provided for analysis:

 obclient> CREATE TABLE skyline(
        pk INT PRIMARY KEY, a INT, b INT, c INT,
        KEY idx_a_b(a, b),
        KEY idx_b_c(b, c),
        KEY idx_c_a(c, a));
Query OK, 0 rows affected (0.09 sec)
  • The necessity for table access by index primary key in the query:

    /*Primary table access is necessary if index idx_a_b is selected, because idx_a_b does not contain column c*/ 
    obclient>SELECT /*+INDEX(skyline idx_a_b)*/ * FROM skyline;
  • The existence of interesting order:

    /*The ORDER BY clause can be eliminated if index idx_b_c is selected*/ 
    obclient>SELECT pk, b FROM skyline ORDER BY b;
  • The extraction of query range from index prefixes:

    /*You can see that by using index idx_c_a, the range of rows can be quickly located without a full table scan */
    obclient>SELECT pk, b FROM skyline WHERE c > 100 AND c < 2000;

The domination relationship between indexes is defined by these three dimensions. If Index A is not worse than index B in any of the three dimensions and is better than index B in at least one dimension, then index B can be pruned off, because a plan based on index B will not be better than one based on Index A.

  • If table access by index primary key is not required for idx_A, but is required for idx_B, then idx_A dominates idx_B in this dimension.

  • If the interesting order extracted is a vector Va<a1, a2, a3 ...an> for idx_A, and Vb<b1, b2, b3...bm> for idx_ B, and n > m, and ai = bi (i = 1..m), then index idx_A dominates idx_ B in this dimension.

  • If the set of columns for extracting the query range is Sa<a1, a2, a3 ...an> for idx_A, and Sb <b1, b2, b3...bm> for idx_B, and Sa is a super set of Sb, then idx_A dominates idx_B in this dimension.

Table access by index primary key

At a glance, this is a simple dimension. It only checks if the column needed in the query is included in the index. However, there are some cases that are worthy of special mention. For example, if neither the primary table nor the indexed tables have an interesting order, and they cannot be used to extract the query range either, using the primary table is not necessarily the best solution.

obclient>CREATE TABLE t1(
         pk INT PRIMARY KEY, a INT, b INT, c INT, v1 VARCHAR(1000), 
         v2 VARCHAR(1000), v3 VARCHAR(1000), v4 VARCHAR(1000),INDEX idx_a_b(a, b));
Query OK, 0 rows affected (0.09 sec)

obclient>SELECT a, b,c FROM t1 WHERE b = 100;

Index

Index Back

Interesting Order

Query Range

primary

no

no

no

idx_a_b

yes

no

no

The indexed table is slim when compared with the wide primary table. Although the primary table dominates idx_a_b in terms of the three dimensions, the cost of an indexed scan plus a TABLE ACCESS BY INDEX PRIMARY KEY operation is not necessarily higher than a full scan of the primary table. Simply put, an indexed table may have only one macroblock to read, while a primary table may have ten. Because of this, it is necessary to allow for some margin for rules and reflect on specific filter conditions.

Interesting Order

If the optimizer can utilize the underlying order by leveraging the interesting order, it does not need to sort the rows scanned at the underlying layer. This will also eliminate the need for ORDER BY by performing MERGE GROUP BY instead, which optimizes the Pipeline as materialization is not required.

obclient>CREATE TABLE skyline(
        pk INT PRIMARY KEY, v1 INT, v2 INT, v3 INT, v4 INT, v5 INT,
        KEY idx_v1_v3_v5(v1, v3, v5),
        KEY idx_v3_v4(v3, v4));
Query OK, 0 rows affected (0.10 sec)

obclient>CREATE TABLE tmp (c1 INT PRIMARY KEY, c2 INT, c3 INT);
Query OK, 0 rows affected (0.06 sec)

obclient>(SELECT DISTINCT v1, v3 FROM skyline JOIN tmp WHERE skyline.v1 = tmp.c1 
          ORDER BY v1, v3) UNION (SELECT c1, c2 FROM tmp);
p167292

The execution plan shows that ORDER BY is eliminated, while MERGE DISTINCT is applied, and SORT is omitted while performing UNION. The order generated from the underlying TABLE SCAN can be used by the upper-layer operators. In other words, retaining the row order from idx_v1_v3_v5 allows subsequent operators to perform better operations without changing the order. The optimizer can generate a better execution plan only if it recognizes these orders.

Therefore, the Skyline determination made based on the interesting order must take into consideration the order in which the indexes can make the maximum use. For example, the most useful order in the example above is actually v1,v3 rather than just v1. The order (v1, v3) derived from MERGE JOIN can also be used by the MERGE DISTINCT operator, and then the UNISON DISTINCT operator.

Query Range

The extraction of query range allows for the location of a specific macroblock at the bottom layer by simply using the extracted range, thus reducing the I/O resource consumption at the storage layer.

For example, SELECT * FROM t1 WHERE pk < 100 AND pk > 0 can be applied to locate the specific macroblock by using the level-1 index. This makes the query faster, because a more accurate query range enables the database to scan fewer rows.

obclient> CREATE TABLE t1 (
       pk INT PRIMARY KEY, a INT, b INT,c INT,
        KEY idx_b_c(b, c),
        KEY idx_a_b(a, b));
Query OK, 0 rows affected (0.12 sec)

obclient>SELECT b FROM t1 WHERE a = 100 AND b > 2000;

For idx_b_c, query range can be extracted from its index prefix (b), and for idx_a_b, it is (a, b). So, idx_a_b dominates idx_b_c in this dimension.

A comprehensive example

obclient>CREATE TABLE skyline(
        pk INT PRIMARY KEY, v1 INT, v2 INT, v3 INT, v4 INT, v5 INT,
        KEY idx_v1_v3_v5(v1, v3, v5),
        KEY idx_v3_v4(v3, v4));
Query OK, 0 rows affected (0.10 sec)

obclient>CREATE TABLE tmp (c1 INT PRIMARY KEY, c2 INT, c3 INT);
Query OK, 0 rows affected (0.06 sec)

obclient>SELECT MAX(v5) FROM skyline WHERE v1 = 100 AND v3 > 200 GROUP BY v1;

Index

Index Back

Interesting order

Query range

primary

No need

No

No

idx_v1_v3_v5

No need

(v1)

(v1, v3)

idx_v3_v4

Need

No

(v3)

You can see that idx_v1_v3_v5 is not dominated by either the primary key index or idx_v3_v4 in any of the three dimensions. Under the current rule, the primary key index and idx_v3_v4 are pruned off. The validity of Skyline pruning depends on the correct definition of dimensions. Incorrect dimensions may cause improper pruning of indexes, leaving the selection of the best plan an impossible task.