
PolarDB:Use IMCI to accelerate ETL

Last Updated: Mar 28, 2026

ETL from IMCI offloads the SELECT phase of CREATE TABLE ... SELECT and INSERT ... SELECT statements to read-only column store nodes. The primary node receives the results over the internal network and writes them to the destination table, letting you use In-Memory Column Index (IMCI) acceleration without changing your ETL queries.

How it works

The SELECT phase of the statement runs on a read-only column store node. That node streams the result set back to the primary node over the internal network, and the primary node writes the rows to the destination table.

Use cases

Use ETL from IMCI when queries have complex conditions and long execution times but return only small result sets:

  • BI report pre-aggregation: Summarize large fact tables into daily or monthly aggregates using INSERT INTO ... SELECT. Downstream reports query the small result table directly, avoiding repeated full scans on the source data.

  • Intermediate table materialization: Pre-compute and store complex join or aggregation results so that subsequent queries run against a compact table instead of scanning gigabytes of raw data each time.

Avoid ETL from IMCI in the following scenarios:

  • Simple queries: Reading data from remote read-only column store nodes incurs additional overhead for network transmission and result set parsing, which can outweigh any gain and degrade performance.

  • Large result sets: When a query returns a large amount of data, reading it on a remote column store node, transferring it back to the primary node, and writing it to the destination table can be slower than running the statement on the primary node directly.

Prerequisites

Before you begin, ensure that your PolarDB cluster uses one of the following versions:

  • PolarDB for MySQL 8.0.1 with revision version 8.0.1.1.29 or later

  • PolarDB for MySQL 8.0.2 with revision version 8.0.2.2.12 or later

To check your version, see the "Query the engine version" section in Engine versions.

Limitations

ETL from IMCI applies only to the following SQL statements:

  • CREATE TABLE table_name [AS] SELECT ...

  • INSERT ... SELECT ...

Enable ETL from IMCI

Two parameters control the feature:

Parameter | Description | Valid values | Default
etl_from_imci | Whether to read data from read-only column store nodes | ON, OFF, FORCED | OFF
etl_from_imci_compress | Whether to compress files when reading data from read-only column store nodes | ON, OFF | OFF

etl_from_imci values:

  • ON: Routes SELECT to read-only column store nodes.

  • OFF (default): Runs SELECT on the primary node.

  • FORCED: Routes SELECT to read-only column store nodes even if the statement is inside an active transaction.
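Before changing either parameter, you can confirm the current values with standard MySQL syntax. A minimal sketch; SHOW VARIABLES and the SET syntax are standard MySQL, and the parameter names come from the table above:

```sql
-- Check the current settings of both feature parameters.
SHOW VARIABLES LIKE 'etl_from_imci%';

-- Optionally compress the result files transferred from the column store
-- node to the primary node, trading CPU for less network traffic.
SET etl_from_imci_compress = ON;
```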

Set the parameter at any of three levels. Higher levels affect a broader scope; statement-level hints override session and global settings for a single query.

Global level — applies to all CREATE TABLE ... SELECT and INSERT ... SELECT statements in the cluster:

SET GLOBAL etl_from_imci = ON;

Session level — applies only to statements in the current session:

SET etl_from_imci = ON;

Statement level — applies to a single statement using an optimizer hint:

CREATE TABLE t2 SELECT /*+ SET_VAR(etl_from_imci=ON) */ * FROM t1 WHERE 'A' = 'a';

Verify that the feature is active

Check routing with EXPLAIN

Before running a long ETL job, confirm that the SELECT phase will be routed to the column store. Run EXPLAIN on the SELECT portion of your statement and check the output for column store execution indicators.
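A minimal sketch using the transaction_detail table from the example later in this topic; the exact column store indicators in the EXPLAIN output are not specified here and may vary by engine revision:

```sql
-- EXPLAIN only the SELECT portion of the ETL statement, with the same
-- statement-level hint the full statement would carry. Look for IMCI /
-- column store execution indicators in the plan before launching the job.
EXPLAIN SELECT /*+ SET_VAR(etl_from_imci=ON) */
    DATE(ts),
    customer_id,
    SUM(fee)
FROM transaction_detail
GROUP BY DATE(ts), customer_id;
```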

Monitor with SHOW PROCESSLIST

While an ETL statement is running, execute SHOW PROCESSLIST on the primary node. When routing is active, the State column of the ETL session shows ETL FROM IMCI.

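On a busy cluster you can filter the process list instead of scanning it by eye. A sketch using the standard MySQL information_schema.PROCESSLIST view; the State value is the one described above:

```sql
-- List only the sessions currently routed through ETL from IMCI.
SELECT id, user, time, state, info
FROM information_schema.PROCESSLIST
WHERE state = 'ETL FROM IMCI';
```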

End-to-end example

The following example materializes a daily fee summary from a raw transaction table. The SELECT aggregation runs on the read-only column store node; the primary node writes the result to the summary table.

Create the source and destination tables:

CREATE TABLE transaction_detail (
    ts          DATETIME,
    customer_id VARCHAR(20),
    fee         DECIMAL(20, 2)
);

CREATE TABLE daily_summary (
    rec_date    DATE,
    customer_id VARCHAR(20),
    daily_fee   DECIMAL(20, 2)
);

Enable ETL from IMCI for the session and insert aggregated data:

SET etl_from_imci = ON;

INSERT INTO daily_summary (rec_date, customer_id, daily_fee)
SELECT
    DATE(ts),
    customer_id,
    SUM(fee)
FROM transaction_detail
WHERE DATE(ts) > '2024-01-01'
GROUP BY DATE(ts), customer_id;
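If you prefer not to change the session setting, the same aggregation can carry a statement-level hint instead, using the SET_VAR syntax shown earlier in this topic:

```sql
-- Statement-level routing: only this statement reads from the column store,
-- regardless of the session or global etl_from_imci value.
INSERT INTO daily_summary (rec_date, customer_id, daily_fee)
SELECT /*+ SET_VAR(etl_from_imci=ON) */
    DATE(ts),
    customer_id,
    SUM(fee)
FROM transaction_detail
WHERE DATE(ts) > '2024-01-01'
GROUP BY DATE(ts), customer_id;
```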

Query the compact summary table for monthly totals:

SELECT
    MONTH(rec_date),
    customer_id,
    SUM(daily_fee)
FROM daily_summary
GROUP BY MONTH(rec_date), customer_id;

The aggregation in INSERT INTO ... SELECT runs on the column store node and completes in under a second on 10 GB datasets (see Performance comparison). Subsequent monthly rollups run against the small summary table, not the full transaction history.

Performance comparison

The following results use 10 GB of data generated with TPC Benchmark H (TPC-H) to compare query times (in seconds) for complex aggregations.

Query returning 1 row:

SELECT
    SUM(l_extendedprice * l_discount) AS revenue
FROM
    lineitem
WHERE
    l_shipdate >= DATE '1994-01-01'
    AND l_shipdate < DATE '1994-01-01' + INTERVAL '1' YEAR
    AND l_discount BETWEEN .06 - 0.01 AND .06 + 0.01
    AND l_quantity < 24

Scenario | Time (s)
Query on read-only column store node directly | 0.05
CREATE TABLE ... SELECT on primary node, ETL from IMCI disabled | >60
CREATE TABLE ... SELECT on primary node, ETL from IMCI enabled | 0.17
INSERT ... SELECT on primary node, ETL from IMCI disabled | >60
INSERT ... SELECT on primary node, ETL from IMCI enabled | 0.08

Query returning 4 rows:

SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity)                                       AS sum_qty,
    SUM(l_extendedprice)                                  AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity)                                       AS avg_qty,
    AVG(l_extendedprice)                                  AS avg_price,
    AVG(l_discount)                                       AS avg_disc,
    COUNT(*)                                              AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus

Scenario | Time (s)
Query on read-only column store node directly | 0.58
CREATE TABLE ... SELECT on primary node, ETL from IMCI disabled | >60
CREATE TABLE ... SELECT on primary node, ETL from IMCI enabled | 0.64
INSERT ... SELECT on primary node, ETL from IMCI disabled | >60
INSERT ... SELECT on primary node, ETL from IMCI enabled | 0.58

Query returning 27,840 rows:

SELECT
    p_brand,
    p_type,
    p_size,
    COUNT(DISTINCT ps_suppkey) AS supplier_cnt
FROM
    partsupp,
    part
WHERE
    p_partkey = ps_partkey
    AND p_brand <> 'Brand#45'
    AND p_type NOT LIKE 'MEDIUM POLISHED%'
    AND p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
    AND ps_suppkey NOT IN (
        SELECT
            s_suppkey
        FROM
            supplier
        WHERE
            s_comment LIKE '%Customer%Complaints%'
    )
GROUP BY
    p_brand,
    p_type,
    p_size
ORDER BY
    supplier_cnt DESC,
    p_brand,
    p_type,
    p_size

Scenario | Time (s)
Query on read-only column store node directly | 0.55
CREATE TABLE ... SELECT on primary node, ETL from IMCI disabled | >60
CREATE TABLE ... SELECT on primary node, ETL from IMCI enabled | 0.92
INSERT ... SELECT on primary node, ETL from IMCI disabled | >60
INSERT ... SELECT on primary node, ETL from IMCI enabled | 0.82

As the result set grows (1 row to 27,840 rows), the overhead of transferring results from the column store node to the primary node increases. For queries that return very large result sets, evaluate whether the improvement justifies the transfer cost.