
PolarDB:Use IMCI to accelerate ETL

Last Updated: Mar 28, 2026

ETL from IMCI offloads the SELECT phase of CREATE TABLE ... SELECT and INSERT ... SELECT statements to read-only column store nodes. The primary node receives the results over the internal network and writes them to the destination table, letting you use In-Memory Column Index (IMCI) acceleration without changing your ETL queries.

How it works

The SELECT phase of the statement runs on a read-only column store node. That node streams the result set back to the primary node over the internal network, and the primary node writes the rows to the destination table.

Use cases

Use ETL from IMCI when queries have complex conditions and long execution times but return only small result sets:

  • BI report pre-aggregation: Summarize large fact tables into daily or monthly aggregates using INSERT INTO ... SELECT. Downstream reports query the small result table directly, avoiding repeated full scans on the source data.

  • Intermediate table materialization: Pre-compute and store complex join or aggregation results so that subsequent queries run against a compact table instead of scanning gigabytes of raw data each time.

Avoid ETL from IMCI in the following scenarios:

  • Simple queries: Reading data from remote read-only column store nodes incurs additional overhead for network transmission and result set parsing, which can outweigh any gain and degrade performance.

  • Large result sets: When a query returns a large amount of data, reading it on a remote column store node, transferring it back to the primary node, and writing it to the destination table can be slower than running the statement on the primary node directly.

Prerequisites

Before you begin, ensure that your PolarDB cluster uses one of the following versions:

  • PolarDB for MySQL 8.0.1 with revision version 8.0.1.1.29 or later

  • PolarDB for MySQL 8.0.2 with revision version 8.0.2.2.12 or later

To check your version, see the "Query the engine version" section in Engine versions.

Limitations

ETL from IMCI applies only to the following SQL statements:

  • CREATE TABLE table_name [AS] SELECT ...

  • INSERT ... SELECT ...

Enable ETL from IMCI

Two parameters control the feature:

Parameter | Description | Valid values | Default
etl_from_imci | Whether to read data from read-only column store nodes | ON, OFF, FORCED | OFF
etl_from_imci_compress | Whether to compress files when reading data from read-only column store nodes | ON, OFF | OFF

etl_from_imci values:

  • ON: Routes SELECT to read-only column store nodes.

  • OFF (default): Runs SELECT on the primary node.

  • FORCED: Routes SELECT to read-only column store nodes even if the statement is inside an active transaction.
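Before changing either parameter, you can confirm the current values with standard MySQL syntax. A minimal sketch; SHOW VARIABLES and the SET syntax are standard MySQL, and the parameter names come from the table above:

```sql
-- Check the current settings of both feature parameters.
SHOW VARIABLES LIKE 'etl_from_imci%';

-- Optionally compress the result files transferred from the column store
-- node to the primary node, trading CPU for less network traffic.
SET etl_from_imci_compress = ON;
```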

Set the parameter at any of three levels. Higher levels affect a broader scope; statement-level hints override session and global settings for a single query.

Global level — applies to all CREATE TABLE ... SELECT and INSERT ... SELECT statements in the cluster:

SET GLOBAL etl_from_imci = ON;

Session level — applies only to statements in the current session:

SET etl_from_imci = ON;

Statement level — applies to a single statement using an optimizer hint:

CREATE TABLE t2 SELECT /*+ SET_VAR(etl_from_imci=ON) */ * FROM t1 WHERE 'A' = 'a';

Verify that the feature is active

Check routing with EXPLAIN

Before running a long ETL job, confirm that the SELECT phase will be routed to the column store. Run EXPLAIN on the SELECT portion of your statement and check the output for column store execution indicators.
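A minimal sketch using the transaction_detail table from the example later in this topic; the exact column store indicators in the EXPLAIN output are not specified here and may vary by engine revision:

```sql
-- EXPLAIN only the SELECT portion of the ETL statement, with the same
-- statement-level hint the full statement would carry. Look for IMCI /
-- column store execution indicators in the plan before launching the job.
EXPLAIN SELECT /*+ SET_VAR(etl_from_imci=ON) */
    DATE(ts),
    customer_id,
    SUM(fee)
FROM transaction_detail
GROUP BY DATE(ts), customer_id;
```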

Monitor with SHOW PROCESSLIST

While an ETL statement is running, execute SHOW PROCESSLIST on the primary node. When routing is active, the State column of the ETL session shows ETL FROM IMCI.

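On a busy cluster you can filter the process list instead of scanning it by eye. A sketch using the standard MySQL information_schema.PROCESSLIST view; the State value is the one described above:

```sql
-- List only the sessions currently routed through ETL from IMCI.
SELECT id, user, time, state, info
FROM information_schema.PROCESSLIST
WHERE state = 'ETL FROM IMCI';
```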

End-to-end example

The following example materializes a daily fee summary from a raw transaction table. The SELECT aggregation runs on the read-only column store node; the primary node writes the result to the summary table.

Create the source and destination tables:

CREATE TABLE transaction_detail (
    ts          DATETIME,
    customer_id VARCHAR(20),
    fee         DECIMAL(20, 2)
);

CREATE TABLE daily_summary (
    rec_date    DATE,
    customer_id VARCHAR(20),
    daily_fee   DECIMAL(20, 2)
);

Enable ETL from IMCI for the session and insert aggregated data:

SET etl_from_imci = ON;

INSERT INTO daily_summary (rec_date, customer_id, daily_fee)
SELECT
    DATE(ts),
    customer_id,
    SUM(fee)
FROM transaction_detail
WHERE DATE(ts) > '2024-01-01'
GROUP BY DATE(ts), customer_id;
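If you prefer not to change the session setting, the same aggregation can carry a statement-level hint instead, using the SET_VAR syntax shown earlier in this topic:

```sql
-- Statement-level routing: only this statement reads from the column store,
-- regardless of the session or global etl_from_imci value.
INSERT INTO daily_summary (rec_date, customer_id, daily_fee)
SELECT /*+ SET_VAR(etl_from_imci=ON) */
    DATE(ts),
    customer_id,
    SUM(fee)
FROM transaction_detail
WHERE DATE(ts) > '2024-01-01'
GROUP BY DATE(ts), customer_id;
```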

Query the compact summary table for monthly totals:

SELECT
    MONTH(rec_date),
    customer_id,
    SUM(daily_fee)
FROM daily_summary
GROUP BY MONTH(rec_date), customer_id;

The aggregation in INSERT INTO ... SELECT runs on the column store node and completes in under a second on 10 GB datasets (see Performance comparison). Subsequent monthly rollups run against the small summary table, not the full transaction history.

Performance comparison

The following results use 10 GB of data generated with TPC Benchmark H (TPC-H) to compare query times (in seconds) for complex aggregations.

Query returning 1 row:

SELECT
    SUM(l_extendedprice * l_discount) AS revenue
FROM
    lineitem
WHERE
    l_shipdate >= DATE '1994-01-01'
    AND l_shipdate < DATE '1994-01-01' + INTERVAL '1' YEAR
    AND l_discount BETWEEN .06 - 0.01 AND .06 + 0.01
    AND l_quantity < 24

Scenario | Time (s)
Query on read-only column store node directly | 0.05
CREATE TABLE ... SELECT on primary node, ETL from IMCI disabled | >60
CREATE TABLE ... SELECT on primary node, ETL from IMCI enabled | 0.17
INSERT ... SELECT on primary node, ETL from IMCI disabled | >60
INSERT ... SELECT on primary node, ETL from IMCI enabled | 0.08

Query returning 4 rows:

SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity)                                       AS sum_qty,
    SUM(l_extendedprice)                                  AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity)                                       AS avg_qty,
    AVG(l_extendedprice)                                  AS avg_price,
    AVG(l_discount)                                       AS avg_disc,
    COUNT(*)                                              AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus

Scenario | Time (s)
Query on read-only column store node directly | 0.58
CREATE TABLE ... SELECT on primary node, ETL from IMCI disabled | >60
CREATE TABLE ... SELECT on primary node, ETL from IMCI enabled | 0.64
INSERT ... SELECT on primary node, ETL from IMCI disabled | >60
INSERT ... SELECT on primary node, ETL from IMCI enabled | 0.58

Query returning 27,840 rows:

SELECT
    p_brand,
    p_type,
    p_size,
    COUNT(DISTINCT ps_suppkey) AS supplier_cnt
FROM
    partsupp,
    part
WHERE
    p_partkey = ps_partkey
    AND p_brand <> 'Brand#45'
    AND p_type NOT LIKE 'MEDIUM POLISHED%'
    AND p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
    AND ps_suppkey NOT IN (
        SELECT
            s_suppkey
        FROM
            supplier
        WHERE
            s_comment LIKE '%Customer%Complaints%'
    )
GROUP BY
    p_brand,
    p_type,
    p_size
ORDER BY
    supplier_cnt DESC,
    p_brand,
    p_type,
    p_size

Scenario | Time (s)
Query on read-only column store node directly | 0.55
CREATE TABLE ... SELECT on primary node, ETL from IMCI disabled | >60
CREATE TABLE ... SELECT on primary node, ETL from IMCI enabled | 0.92
INSERT ... SELECT on primary node, ETL from IMCI disabled | >60
INSERT ... SELECT on primary node, ETL from IMCI enabled | 0.82

As the result set grows (1 row to 27,840 rows), the overhead of transferring results from the column store node to the primary node increases. For queries that return very large result sets, evaluate whether the improvement justifies the transfer cost.