What is an In-Memory Column Index (IMCI)? - PolarDB - Alibaba Cloud Documentation Center

This topic describes the In-Memory Column Index (IMCI) feature of PolarDB for PostgreSQL. This feature is designed to accelerate complex queries.

Introduction

The In-Memory Column Index (IMCI) feature of PolarDB for PostgreSQL enables you to handle both high-concurrency online transactional processing (OLTP) and complex data analytics (OLAP) in a single system. You no longer need to maintain a separate, expensive, and architecturally complex external system for analytical queries. This simplifies your data architecture, reduces operations and maintenance (O&M) costs, and enables real-time analysis of massive business data.

Compared to the row store engine, IMCI enhances query performance in two key areas: columnstore indexes at the storage layer and vectorized operators at the execution engine layer, also known as the vectorized execution engine. These improvements overcome the limitations of the row store engine in handling complex queries. For example, in a TPC-H performance test that uses a 100 GB dataset on a 32-core, 256 GB cluster, the column store engine of PolarDB for PostgreSQL delivers query performance more than 60 times higher than the row store engine. For more information, see IMCI performance test results.

How it works

Architecture optimization

The IMCI feature of PolarDB for PostgreSQL optimizes both the execution engine and storage layers to improve the processing of complex queries.

Execution engine layer
- Unlike the row store engine, the vectorized execution engine uses CPU SIMD instructions to process data in batches. A single CPU instruction can process multiple data entries in parallel. This reduces function invocation overhead and cache misses.
- The engine implements fully vectorized operators. For example, it vectorizes operators such as Scan, Group By, Order By, Hash Join, Filter, Count, and Sum. This allows the engine to accept batch data inputs and process them using SIMD instructions.
Storage layer
- It uses a column store format that is more suitable for vectorized operators than the heap row store structure.
- The column store format is implemented as an index, which is the In-Memory Column Index (IMCI). IMCIs are similar to B-tree and GiST indexes but differ in storage structure and applicable scenarios. The vectorized execution engine can use IMCIs directly, whereas the row store engine uses B-tree and GiST indexes. A table can contain both IMCIs and other types of indexes to handle different queries. The PolarDB for PostgreSQL optimizer selects the appropriate index based on the cost of the query plan.

As shown in the following figure, you can create a B-tree index on column c2 of table t for point queries (SELECT * FROM t WHERE c2=10) and an IMCI on columns c4 and c5 for statistical queries (SELECT c4, SUM(c5) FROM t GROUP BY c4). The query optimizer determines the most efficient index to use based on the cost of the SQL statement.

Real-time row-column data synchronization

Data in an IMCI is stored in a columnar format in the database. Data is first written to a row-oriented table and then synchronized with the IMCI through the indexing mechanism. This process is called row-column data synchronization. The IMCI feature of PolarDB for PostgreSQL provides an efficient, real-time, and automated row-column synchronization mechanism. You do not need to build extra pipelines for this process or manually refresh the column store data.

The row-column synchronization mechanism parses Write-Ahead Logging (WAL) logs to retrieve modified data, which is then written to the IMCI asynchronously. The impact of this process on the performance and load of the row store is minimal, typically less than 3%. Because the IMCI of PolarDB for PostgreSQL can coexist with the row store engine on the same node, the WAL log parsing process is highly optimized. Despite the asynchronous nature of the row-to-column conversion process, it achieves near real-time synchronization with a latency of a few milliseconds to a few seconds, depending on the write load. For more information about optimizing row-column synchronization performance, see Improve the real-time performance of IMCIs.

Product form

The IMCI feature of PolarDB for PostgreSQL is deployed on all nodes. Therefore, all compute nodes in a cluster have both a row store engine and an IMCI. In this mode, two decisions are made when a SQL statement is executed:

The selection of a compute node to execute the statement.
The selection of an execution engine within that node.

Select a compute node

When a PolarDB cluster has multiple nodes, the system must decide which compute node executes SQL statements related to IMCIs.

All data modification statements, such as DDL and DML statements, are executed on the RW node. The RW node then selects the appropriate execution engine based on specific conditions.
IMCI creation and real-time data synchronization are performed on the RW node.
For all read-only SQL statements, you can configure the database proxy to determine which node is used for execution.

Select an execution engine

On a compute node, an execution engine is selected to execute the statement.

For DDL statements, such as CREATE TABLE and ALTER TABLE, the row store engine is generally used. However, for a CREATE TABLE AS SELECT statement, the system decides whether to use an IMCI based on the complexity of the SELECT subquery.
For DML statements, such as INSERT, UPDATE, and DELETE, the row store engine is used.
For DQL statements, such as SELECT statements, the system decides whether to use an IMCI based on the query cost and parameters. Generally, the higher the query cost of a SELECT statement, the more likely an IMCI is used. If the IMCI fails to execute the SELECT statement, the system falls back to the row store engine for execution.

Key features

High performance
Compared to the row store engine, IMCIs improve SQL query performance by an order of magnitude. For complex queries, they provide a performance boost of over 100 times.
Cost-effective
- You can create IMCIs only for the columns involved in queries. You do not need to convert the entire table to a column store.
- IMCIs use less storage space. For the same column, an IMCI uses only 10% to 50% of the storage space used by the row store, depending on the data type.
Easy to use
Usage is identical to native PostgreSQL.
- IMCIs inherit the index management method of native PostgreSQL and support syntax such as CREATE INDEX and DROP INDEX. No extra syntax is required. For more information, see Enable and use IMCIs.
- IMCIs are highly compatible with PostgreSQL data types and syntax. You do not need to modify existing SQL statements to use IMCIs.
- You can use parameter settings for fine-grained control over which SQL statements use IMCIs. This includes global usage, session-level usage, and specifying IMCI usage for specific SQL statements using hints. For more information, see Enable and use IMCIs.
Real-time IMCI maintenance
- Consistency between row store data and IMCIs is automatically maintained. You do not need to build extra data conversion pipelines or perform manual refresh and synchronization operations.
- New data inserted into the row store is synchronized to the IMCI in milliseconds or seconds. You can adjust the synchronization performance based on different business loads. For more information, see Enable and use IMCIs.
Consistency
Two consistency levels are provided for column store and row store data to meet different business needs.
- Eventual consistency (default): Suitable for queries with high write loads but less stringent requirements for data real-time performance.
- Strong consistency: Returns query results only after the column store data is fully consistent with the row store data. For more information, see Enable and use IMCIs.
Compatibility with various usage methods
- Supports the Prepared Statement syntax.
- Supports acceleration of SELECT statements in transaction blocks.
  Note
  The SELECT statement must be a 'read-before-write' SQL statement within the transaction block.
- Supports partitioned tables and partitioned tables managed by pg_pathman. Partition pruning is also supported. For more information, see Use IMCIs for partitioned tables.
- Supports acceleration of spatio-temporal multi-modal queries.

Scenarios

The IMCI feature of PolarDB for PostgreSQL provides a one-stop hybrid transactional and analytical processing (HTAP) experience and can be used in various business scenarios:

HTAP scenarios: For example, you need to perform many CRUD operations on transaction data daily and also generate real-time statistical reports for the past hour. The IMCI of PolarDB for PostgreSQL not only handles these two types of loads efficiently but also simplifies the system architecture. You do not need to maintain an extra system for real-time statistical OLAP queries.
Various types of slow SQL statements: These include statements in high-concurrency transaction scenarios, such as:
- Full-table aggregations, such as COUNT, SUM, and AVERAGE operations.
- GROUP BY and ORDER BY operations on columns.
- JOIN operations on multiple tables.
- Queries that filter on many columns in an unpredictable order. In this case, a composite index is inflexible and can easily become ineffective. Using an IMCI is a better choice.
Multi-modal and spatio-temporal queries: Extract KEY and VALUE from nested JSON or generate spatio-temporal heatmaps based on geographic grids.
ETL data acceleration scenarios: You can use SQL in PolarDB to implement ETL functions, leveraging the powerful and flexible computing capabilities that PolarDB provides based on IMCIs.

Billing

IMCIs can be executed directly on row store nodes or on added IMCI read-only nodes.

Using IMCIs on row store nodes: Free of charge.
Adding IMCI read-only nodes: This option incurs node fees. IMCI read-only nodes are billed as standard compute nodes. In addition, IMCIs occupy storage space, which incurs corresponding storage space fees.

Note

Using the IMCI feature directly on row store nodes may affect your business. After you add IMCI read-only nodes, you can isolate transactional processing (TP) and analytical processing (AP) workloads on different nodes. This ensures that they do not affect each other. For more information, see business impacts.