All Products
Search
Document Center

PolarDB:Specify join reorder for the IMCI optimizer

Last Updated:Dec 08, 2023

To improve the ability of In-Memory Column Indexes (IMCIs) to process complex queries, the IMCI optimizer combines transformation rules with the statistics of each column in a table to generate efficient execution plans based on costs. This topic describes the IMCI query optimization feature.

How it works

SQL is a declarative query language and does not specifically describe the query plans of an SQL statement. Several query plans may be used to obtain the correct result of an SQL statement. Example:

SELECT * FROM t0, t1, t2, t3 WHERE t0.a = t1.a AND t1.a = t2.a AND t2.a = t3.a AND t3.b = t1.b;

For the preceding SQL statement, you can obtain correct query results by using the following two query plans. 执行方式Plan A and Plan B are equivalent query plans. As a search framework, the query optimizer searches for equivalent query plans corresponding to the current SQL statement by transforming one query plan to another equivalent query plan. For example, t1 INNER JOIN t2 and t2 INNER JOIN t1 are equivalent query plans. The query optimizer can turn t1 INNER JOIN t2 into t2 INNER JOIN t1 by transforming equivalent query plans. This transformation by the optimizer is called a query transformation rule.

The query optimizer works in the following way:

  1. Enters the initial query plan obtained by parsing the SQL statement.

  2. Generates equivalent query plans from the initial state query plan by using the query transformation rule.

  3. Selects the query plan with the lowest execution costs from the equivalent query plans by combining statistics and cost models and submits it to the execution layer as the final execution plan.

The query optimization feature relies on statistics to perform cardinality estimation and cost calculation for determining the best query plan. In IMCIs, the statistics of a table include the following items:

  • Histograms, which show the value distribution of different columns and are mainly used to estimate the value range on a single table and the selection rate of equivalent predicates.

  • The number of special values in the column, which is mainly used to estimate the number of groups in Group By, and can also be used to assist in estimating the selection rate of equivalent predicates.

  • Other constraints, such as whether the column has a unique index, or whether the column has foreign key constraints with other columns.

The query optimizer calculates the costs of each node operator in the query plan based on the following items:

  • The total number of rows processed by the operator in the query plan. You can use the statistics to estimate the total number of rows.

  • The complexity of the algorithm used by each operator in the query plan.

The total number of rows processed by the operator in the query plan is a parameter of the algorithm complexity function. The execution costs of the entire query plan is the sum of the operator costs of all nodes. For the two query plans in the preceding figure, if hash join is used as the join execution algorithm, the cost formula is:

Costjoin=Cardinner+Cardouter

The costs of the two execution plans are:

CostA==10000+1+1000+100+10000+10=21111

CostB==10000+1+100+10+1000+10=11121

The execution costs of Plan B are lower. Therefore, the query optimizer selects Plan B as the final execution plan.

Prerequisites

Your PolarDB cluster uses one of the following versions:

  • A cluster of PolarDB for MySQL 8.0.1 whose revision version is 8.0.1.1.31 or later.

  • A cluster of PolarDB for MySQL 8.0.2 whose revision version is 8.0.2.2.12 or later.

Limits

The following conditions may bring large errors in the cardinality estimation, and therefore cause the optimizer to select a suboptimal query plan. You can use hints to instruct the optimizer to generate an ideal query plan.

  • For queries containing predicates, comparison operators are used for querying different columns of a table. Example: t1.c1>t1.c2.

  • For queries containing predicates, the operators used in the query statement cannot use statistics for estimation. Examples: t1.c1 MOD 2=1 and t1.c2 LIKE '%ABC%.

  • For queries containing predicates, the query statement has expressions and the optimization feature cannot be used for calculation. Example: t1.c1+t1.c3>100.

  • The columns involved in the operators in the query statement do not have statistics available for estimating the predicate selection rate. Example: SELECT a, SUM(b) FROM t1 HAVING SUM(b) > 10.

  • Multiple predicates are combined by using the AND operator. Example: t1.c1>10 AND t1.c3<5.

  • The query statement contains many nested layers.

  • The query statement joins many tables. You can modify the value of the loose_imci_max_enum_join_pairs parameter to specify the number of joins that the IMCI optimizer can use.

Parameters

You can configure the parameters described in the following table in the PolarDB console to enable and use the IMCI query optimization feature. For more information about how to configure parameters in the PolarDB console, see Specify cluster and node parameters.

Parameter

Description

loose_imci_optimizer_switch

Specifies whether to enable the IMCI query optimization feature. Valid values:

  • use_imci_card_est: specifies whether to enable the cardinality estimation and cost calculation features for IMCIs. Valid values:

    • OFF (default)

    • ON

  • use_imci_join_reorder: specifies whether to enable join reorder for IMCIs. Valid values:

    • OFF (default)

    • ON

    Note

    If the table involved in the query statement does not have statistics or the cardinality estimation and cost calculation features are not enabled for IMCIs, join reorder for IMCIs is not enabled even if this parameter is set to ON.

loose_imci_auto_update_statistic

Specifies whether the IMCI query optimizer re-collects statistics when the statistics are not the latest. Valid values:

  • ASYNC (default): The IMCI query optimizer performs asynchronous sampling and re-collects statistics when the statistics are not the latest.

  • SYNC: The IMCI query optimizer performs synchronous sampling and re-collects statistics when the statistics are not the latest.

  • OFF: The IMCI query optimizer does not re-collect statistics when the statistics are not the latest.

loose_imci_max_enum_join_pairs

The number of equivalent execution plans that the IMCI query optimizer can retrieve when the IMCI and join reorder features are enabled.

Valid values: 0 to 4294967295. Default value: 2000.

Usage

To use the IMCI query optimization feature, you must first collect statistics based on the information collection policy that you select. After the information is collected, enable the IMCI query optimization feature and execute query statements.

  1. Collect statistics.

    You can collect statistics based on the following information collection policies:

    • Periodically execute the ANALYZE TABLE statement on databases that contain the tables where the IMCI query optimization feature is used to build latest statistics.

    • (Recommended) For a table where an IMCI is newly created, execute the ANALYZE TABLE statement on the read-only node to build initial statistics. Then, set the loose_imci_auto_update_statistic parameter to ASYNC to automatically update statistics.

  2. Enable the IMCI query optimization feature.

    You can set the loose_imci_optimizer_switch parameter to ON in the PolarDB console to enable the IMCI query optimization feature.

  3. Execute query statements.

Query result comparison

The following example uses TPCH-Q8. The query statement involves multiple tables and contains aggregate functions.

SELECT
  o_year,
  SUM(
    CASE
      WHEN nation = 'BRAZIL' THEN volume
      ELSE 0
    END
  ) / SUM(volume) AS mkt_share
FROM
  (
    SELECT
      EXTRACT(
        year
        FROM
          o_orderdate
      ) AS o_year,
      l_extendedprice * (1 - l_discount) AS volume,
      n2.n_name AS nation
    FROM
      lineitem,
      orders,
      part,
      supplier,
      customer,
      nation n1,
      nation n2,
      region
    WHERE
      p_partkey = l_partkey
      AND s_suppkey = l_suppkey
      AND l_orderkey = o_orderkey
      AND o_custkey = c_custkey
      AND c_nationkey = n1.n_nationkey
      AND n1.n_regionkey = r_regionkey
      AND r_name = 'AMERICA'
      AND s_nationkey = n2.n_nationkey
      AND o_orderdate BETWEEN DATE '1995-01-01'
      AND DATE '1996-12-31'
      AND p_type = 'ECONOMY ANODIZED STEEL'
  ) AS all_nations
GROUP By
  o_year
ORDER BY
  o_year;
  • The following figure shows the query plan when the IMCI query optimization feature is disabled. 未开启列存索引优化功能The query plan involves multiple joins and generates a large result set. This increases the data rows processed by operators and the processing costs and extends the latency. The 32-core cluster and TPCH SF100 data are used in the test. The query duration is 7,017 ms.

  • The following figure shows the query plan when the IMCI query optimization feature is enabled. 开启列存索引优化功能The query optimizer re-sorts joins to reduce the output data of almost all join operators to millions of rows. This decreases the processing costs for subsequent operators. The 32-core cluster and TPCH SF100 data are used in the test. The query duration is 1,900 ms. The query duration is 73% shorter than that when the IMCI query optimization feature is disabled.