To improve the ability of In-Memory Column Indexes (IMCIs) to process complex queries, the IMCI optimizer combines transformation rules with the statistics of each column in a table to generate efficient execution plans based on costs. This topic describes the IMCI query optimization feature.
How it works
SQL is a declarative query language and does not specifically describe the query plans of an SQL statement. Several query plans may be used to obtain the correct result of an SQL statement. Example:
SELECT * FROM t0, t1, t2, t3 WHERE t0.a = t1.a AND t1.a = t2.a AND t2.a = t3.a AND t3.b = t1.b;
For the preceding SQL statement, you can obtain correct query results by using the following two query plans. Plan A and Plan B are equivalent query plans. As a search framework, the query optimizer searches for equivalent query plans corresponding to the current SQL statement by transforming one query plan to another equivalent query plan. For example,
t1 INNER JOIN t2
and t2 INNER JOIN t1
are equivalent query plans. The query optimizer can turn t1 INNER JOIN t2
into t2 INNER JOIN t1
by transforming equivalent query plans. This transformation by the optimizer is called a query transformation rule.
The query optimizer works in the following way:
Enters the initial query plan obtained by parsing the SQL statement.
Generates equivalent query plans from the initial state query plan by using the query transformation rule.
Selects the query plan with the lowest execution costs from the equivalent query plans by combining statistics and cost models and submits it to the execution layer as the final execution plan.
The query optimization feature relies on statistics to perform cardinality estimation and cost calculation for determining the best query plan. In IMCIs, the statistics of a table include the following items:
Histograms, which show the value distribution of different columns and are mainly used to estimate the value range on a single table and the selection rate of equivalent predicates.
The number of special values in the column, which is mainly used to estimate the number of groups in
Group By
, and can also be used to assist in estimating the selection rate of equivalent predicates.Other constraints, such as whether the column has a unique index, or whether the column has foreign key constraints with other columns.
The query optimizer calculates the costs of each node operator in the query plan based on the following items:
The total number of rows processed by the operator in the query plan. You can use the statistics to estimate the total number of rows.
The complexity of the algorithm used by each operator in the query plan.
The total number of rows processed by the operator in the query plan is a parameter of the algorithm complexity function. The execution costs of the entire query plan is the sum of the operator costs of all nodes. For the two query plans in the preceding figure, if hash join is used as the join execution algorithm, the cost formula is:
Costjoin=Cardinner+Cardouter
The costs of the two execution plans are:
CostA==10000+1+1000+100+10000+10=21111
CostB==10000+1+100+10+1000+10=11121
The execution costs of Plan B are lower. Therefore, the query optimizer selects Plan B as the final execution plan.
Prerequisites
Your PolarDB cluster uses one of the following versions:
A cluster of PolarDB for MySQL 8.0.1 whose revision version is 8.0.1.1.31 or later.
A cluster of PolarDB for MySQL 8.0.2 whose revision version is 8.0.2.2.12 or later.
Limits
The following conditions may bring large errors in the cardinality estimation, and therefore cause the optimizer to select a suboptimal query plan. You can use hints to instruct the optimizer to generate an ideal query plan.
For queries containing predicates, comparison operators are used for querying different columns of a table. Example:
t1.c1>t1.c2
.For queries containing predicates, the operators used in the query statement cannot use statistics for estimation. Examples:
t1.c1 MOD 2=1
andt1.c2 LIKE '%ABC%
.For queries containing predicates, the query statement has expressions and the optimization feature cannot be used for calculation. Example:
t1.c1+t1.c3>100
.The columns involved in the operators in the query statement do not have statistics available for estimating the predicate selection rate. Example:
SELECT a, SUM(b) FROM t1 HAVING SUM(b) > 10
.Multiple predicates are combined by using the
AND
operator. Example:t1.c1>10 AND t1.c3<5
.The query statement contains many nested layers.
The query statement joins many tables. You can modify the value of the
loose_imci_max_enum_join_pairs
parameter to specify the number of joins that the IMCI optimizer can use.
Parameters
You can configure the parameters described in the following table in the PolarDB console to enable and use the IMCI query optimization feature. For more information about how to configure parameters in the PolarDB console, see Specify cluster and node parameters.
Parameter | Description |
loose_imci_optimizer_switch | Specifies whether to enable the IMCI query optimization feature. Valid values:
|
loose_imci_auto_update_statistic | Specifies whether the IMCI query optimizer re-collects statistics when the statistics are not the latest. Valid values:
|
loose_imci_max_enum_join_pairs | The number of equivalent execution plans that the IMCI query optimizer can retrieve when the IMCI and join reorder features are enabled. Valid values: 0 to 4294967295. Default value: 2000. |
Usage
To use the IMCI query optimization feature, you must first collect statistics based on the information collection policy that you select. After the information is collected, enable the IMCI query optimization feature and execute query statements.
Collect statistics.
You can collect statistics based on the following information collection policies:
Periodically execute the
ANALYZE TABLE
statement on databases that contain the tables where the IMCI query optimization feature is used to build latest statistics.(Recommended) For a table where an IMCI is newly created, execute the
ANALYZE TABLE
statement on the read-only node to build initial statistics. Then, set theloose_imci_auto_update_statistic
parameter to ASYNC to automatically update statistics.
Enable the IMCI query optimization feature.
You can set the
loose_imci_optimizer_switch
parameter to ON in the PolarDB console to enable the IMCI query optimization feature.Execute query statements.
Query result comparison
The following example uses TPCH-Q8. The query statement involves multiple tables and contains aggregate functions.
SELECT
o_year,
SUM(
CASE
WHEN nation = 'BRAZIL' THEN volume
ELSE 0
END
) / SUM(volume) AS mkt_share
FROM
(
SELECT
EXTRACT(
year
FROM
o_orderdate
) AS o_year,
l_extendedprice * (1 - l_discount) AS volume,
n2.n_name AS nation
FROM
lineitem,
orders,
part,
supplier,
customer,
nation n1,
nation n2,
region
WHERE
p_partkey = l_partkey
AND s_suppkey = l_suppkey
AND l_orderkey = o_orderkey
AND o_custkey = c_custkey
AND c_nationkey = n1.n_nationkey
AND n1.n_regionkey = r_regionkey
AND r_name = 'AMERICA'
AND s_nationkey = n2.n_nationkey
AND o_orderdate BETWEEN DATE '1995-01-01'
AND DATE '1996-12-31'
AND p_type = 'ECONOMY ANODIZED STEEL'
) AS all_nations
GROUP By
o_year
ORDER BY
o_year;
The following figure shows the query plan when the IMCI query optimization feature is disabled.
The query plan involves multiple joins and generates a large result set. This increases the data rows processed by operators and the processing costs and extends the latency. The 32-core cluster and TPCH SF100 data are used in the test. The query duration is 7,017 ms.
The following figure shows the query plan when the IMCI query optimization feature is enabled.
The query optimizer re-sorts joins to reduce the output data of almost all join operators to millions of rows. This decreases the processing costs for subsequent operators. The 32-core cluster and TPCH SF100 data are used in the test. The query duration is 1,900 ms. The query duration is 73% shorter than that when the IMCI query optimization feature is disabled.