IMCI Overview: Columnar Index Architecture for HTAP - PolarDB

This topic describes the In-Memory Column Index (IMCI) feature that helps you better handle complex queries.

Introduction

PolarDB for PostgreSQL IMCI feature lets you handle both high-concurrency online transactional processing (OLTP) and complex data analytics (OLAP) in a single system. You no longer need to maintain a separate, expensive, and architecturally complex external system for analytical queries. This simplifies your data architecture, reduces operations and maintenance (O&M) costs, and enables real-time analysis of massive business data.

Compared with the row-oriented storage engine, the IMCI feature enhances query performance in two key areas: IMCIs at the storage layer and vectorized operators at the execution engine layer, also known as the vectorized execution engine. These improvements effectively overcome the limitations of the row-oriented storage engine in handling complex queries. In a TPC-H performance test that uses a 100 GB dataset and a 32-core, 256 GB cluster, the column-oriented engine of PolarDB for PostgreSQL delivers query performance more than 60 times higher than the row-oriented storage engine. For more information, see IMCI performance test results.

How it works

Architecture optimization

The IMCI feature of PolarDB for PostgreSQL optimizes both the execution engine and storage layers to better process complex queries.

Execution engine layer
- Unlike the row-oriented storage engine, the vectorized execution engine uses CPU SIMD instructions to process data in batches. A single CPU instruction can process multiple rows of data in parallel. This helps reduce the time for function calls and prevents cache miss issues.
- The vectorized engine achieves full vectorization of query operators. For example, it vectorizes operators such as Scan, Group By, Order By, Hash Join, Filter, Count, and Sum. This allows the engine to accept batch data inputs and process them using SIMD instructions.
Storage layer
- It uses a column-oriented storage format, which is more suitable for vectorized operators than the heap row-oriented storage structure.
- The column-oriented storage format is implemented using indexes, which are the IMCIs. IMCIs are similar to B-tree and GiST indexes but have different storage structures and applicable scenarios. IMCIs can be used directly. B-tree and GiST indexes are used by the row-oriented storage engine. A table can contain both IMCIs and other types of indexes to handle various queries. The PolarDB for PostgreSQL optimizer selects the most suitable index based on the query cost.

As shown in the following figure, you can create a B-tree index on column c2 of table t for point queries (SELECT * FROM t WHERE c2=10) and an IMCI on columns c4 and c5 for statistical queries (SELECT c4, SUM(c5) FROM t GROUP BY c4). The query optimizer determines the most efficient index to use based on the cost of the SQL statement.

Real-time row-column synchronization

Data in an IMCI is stored in a columnar format in the database. Data is first written to a row-oriented table and then synchronized to the IMCI using the indexing mechanism. This process is known as row-column synchronization. The IMCI feature of PolarDB for PostgreSQL offers an efficient, real-time, and automated mechanism for row-column synchronization, which eliminates the need for additional pipelines or manual updates of column-oriented data.

The row-column synchronization mechanism parses Write-Ahead Logging (WAL) logs to retrieve modified data, which is then written to the IMCI asynchronously. This process has a minimal impact on the performance and load of the row-oriented storage engine, which is less than 3%. Because the IMCI feature of PolarDB for PostgreSQL can coexist with the row-oriented storage engine on the same node, the WAL log parsing process is optimized. Despite the asynchronous nature of the row-column conversion process, it can still achieve real-time synchronization with a latency of a few milliseconds to a few seconds, depending on the write load. For more information about how to optimize row-column synchronization, see Improve the real-time performance of IMCIs.

Product form

The IMCI feature applies to all nodes deployed in a PolarDB for PostgreSQL cluster. Therefore, all compute nodes in the cluster have both a row-oriented storage engine and an IMCI. In this mode, the system makes the following decisions when it executes a SQL statement:

Select a compute node for execution.
Select an execution engine on the node.

Select a compute node

When the system executes a SQL statement that is related to an IMCI in a PolarDB cluster that contains multiple nodes, it must select a compute node for execution.

All data modification statements, such as Data Definition Language (DDL) and Data Manipulation Language (DML) statements, are executed on the RW node. The RW node then selects the appropriate execution engine based on specific conditions.
The RW node creates the IMCI and performs real-time synchronization for it.
For all read-only SQL statements, you can configure the database proxy to determine which node to use for execution.

Select an execution engine

The compute node selects an execution engine to execute the SQL statement.

For DDL statements such as CREATE TABLE and ALTER TABLE, the row-oriented storage engine is used. However, for the CREATE TABLE AS SELECT statement, the system decides whether to use an IMCI based on the complexity of the SELECT subquery.
For DML statements such as INSERT, UPDATE, and DELETE, the system uses the row-oriented storage engine.
For Data Query Language (DQL) statements such as SELECT, the system decides whether to use an IMCI based on the query cost and specific parameters. Typically, a higher query cost indicates a higher probability of using an IMCI. If the IMCI fails to execute the SELECT statement, the system uses the row-oriented storage engine to execute the statement again.

Key features and advantages

High performance
Compared to the row-oriented storage engine, IMCIs significantly increase SQL query performance. They can accelerate the execution of complex queries more than 100 times faster than the row-oriented storage engine.
Cost-effective
- To optimize queries, you can create IMCIs for related columns instead of the entire table.
- IMCIs occupy less storage space than row-oriented indexes. Depending on the data type of a specific column, an IMCI occupies only 10% to 50% of the storage space that is occupied by a row-oriented index.
Ease of use
The vectorized engine is fully compatible with native PostgreSQL and can be used in the same manner.
- IMCIs can be managed like native PostgreSQL indexes and support statements such as CREATE INDEX and DROP INDEX. No additional statements are required. For more information, see Enable and use IMCIs.
- IMCIs are highly compatible with PostgreSQL data types and syntax. You can use IMCIs for acceleration without the need to modify existing SQL statements.
- You can configure parameters to specify which SQL statements can use IMCIs in a fine-grained manner, such as all SQL statements, SQL statements in a session, or specific SQL statements with hints. For more information, see Enable and use IMCIs.
Real-time maintenance of IMCIs
- Data consistency between row-oriented data and IMCIs is automatically maintained. This eliminates the need to set up conversion or manual synchronization between row-oriented and column-oriented data.
- Data inserted into the row-oriented table is synchronized to the IMCI with a latency of a few milliseconds to a few seconds. You can adjust the data synchronization performance based on business loads. For more information, see Enable and use IMCIs.
Consistency
The following consistency levels for IMCIs and row-oriented data are provided to meet various business needs.
- Eventual consistency (default): This level is suitable for queries that involve heavy write loads but have low real-time performance requirements.
- Strong consistency: This level returns query results after the IMCI data is consistent with the row-oriented data. For more information, see Enable and use IMCIs.
Compatibility with various usage methods
- Supports the Prepared Statement syntax.
- Supports acceleration of SELECT statements within transaction blocks.
  Note
  The SELECT statement must be a read-before-write SQL statement within the transaction block.
- Supports partitioned tables and partitioned tables managed by pg_pathman. Partition pruning is also supported. For more information, see Use IMCIs for partitioned tables.
- Supports acceleration of spatio-temporal multimodal queries.

Common use cases

IMCI provides a one-stop Hybrid Transactional/Analytical Processing (HTAP) experience suitable for various business scenarios:

HTAP scenarios: Simultaneously handle high-volume transactions while running real-time analytical reports on the same data.
Accelerating slow queries: Ideal for speeding up queries that are traditionally slow on a row-store engine, such as:
- Full-table aggregations (COUNT, SUM, AVG).
- Complex GROUP BY and ORDER BY operations.
- Multi-table JOIN operations.
- Queries with dynamic filter conditions where a composite index would be inflexible.
Multimodal and geospatial queries: Efficiently query nested JSON data or perform statistical analysis on geospatial data.
ETL acceleration: Leverage the powerful computational capabilities of IMCI to perform complex data transformations and ETL processes directly within the database.

Billing

IMCIs can be executed directly on row-oriented nodes or on added IMCI read-only nodes.

Using IMCIs on existing nodes: Free of charge.
Adding dedicated read-only nodes for IMCI: Standard compute node fees apply for the additional compute resources. In addition, IMCI will consume additional storage space, which incurs standard storage fees.

Note

Workload isolation

For complete workload isolation, you can add dedicated read-only nodes for IMCI. This ensures that analytical processing (AP) queries do not impact the performance of your transactional processing (TP) workloads. See Enable and use a columnstore index for details.