X-Engine Columnar Tables for PB-Scale Analytics - PolarDB

PolarDB for MySQL provides X-Engine column-oriented tables to reduce storage costs and accelerate analytical queries on large datasets. Column-oriented storage, efficient compression, and parallel computing reduce storage costs to 10% of the original and improve analytical query performance by an order of magnitude—within the MySQL ecosystem, without a separate analytical system.

Architecture

X-Engine column-oriented tables use a compute-storage decoupled architecture. Compute and storage resources scale independently, delivering PB-scale data processing capabilities. The storage layer uses a distributed file system and Object Storage Service (OSS). The compute layer delivers high-performance analytics through the query optimizer, execution operator, and storage engine.

PolarDB overall architecture

PolarDB uses a compute-storage decoupled architecture built on a distributed storage layer. Using the ParallelRaft protocol, it ensures multi-replica consistency and transparently supports OSS. Compute nodes consist of one read/write node (RW) and one or more read-only nodes (RO). PolarDB supports both InnoDB and X-Engine engines. The database proxy acts as a bridge between applications and compute nodes, providing read/write splitting and load balancing.

Column-oriented table architecture

The architecture relies on three core components:

Query optimizer: A built-in cost-based optimizer (CBO) for hybrid row-column storage that automatically selects the optimal execution plan—row store or column store—based on query cost.
Execution operator: Uses column-oriented vectorization and parallel execution to accelerate analytical queries for single-table and multi-table joins through batch processing.
Storage engine: Supports real-time transactional updates. The primary table uses column-oriented storage and delivers fast update capability through the Non-Clustered Index (NCI) component. It marks deleted data using a Delete-mask mechanism, enabling efficient parallel queries without impacting real-time writes. A row-oriented secondary index filters out invalid data to further improve query efficiency.

Key capabilities

High cost-effectiveness: Column-oriented storage, efficient encoding, and high-ratio compression—combined with low-cost storage media such as OSS—reduce storage and processing costs for massive data by up to 90%.
High-performance real-time analytics: Data becomes immediately available for analytical queries after ingestion. Multi-core parallelism, vectorization, and Massively Parallel Processing (MPP) deliver query performance comparable to dedicated analytical databases.
100% MySQL compatibility: Provides a data type system and protocol consistent with MySQL and supports flexible type conversion. Existing applications and tools connect without modification.
Independent elastic scaling: The compute-storage decoupled architecture allows compute nodes and storage capacity to scale independently on demand, handling business peaks and data growth seamlessly.
Wide table support: A single table supports up to 10,000 columns, satisfying high-concurrency write and large wide table storage needs.
Simplified operations: Delivers high-compression storage and high-performance analytics within a single database engine, with full DDL and DML support—no additional technology stack required.

Column-oriented tables are optimized for analytical workloads. Update and delete performance is relatively lower than row-oriented tables, because record location requires the NCI component rather than direct row access. For workloads with high-frequency inserts, deletes, and updates, row-oriented tables (InnoDB) remain the better fit. See Row-oriented vs. column-oriented tables for a full comparison.

Row-oriented vs. column-oriented tables

The following table compares X-Engine row-oriented tables and X-Engine column-oriented tables to help you choose the right table type.

Dimension	Row-oriented table (X-Engine)	Column-oriented table (X-Engine)
Data organization	Stores data contiguously by row, with all columns of a row stored together.	Stores data contiguously by column, with all values of a column stored together.
Compression ratio	Medium. Can compress data to a maximum of 30% of its original size compared to InnoDB.	High. Using column-oriented storage and specialized encoding such as dictionary encoding, can compress data to 10% of its original size compared to InnoDB.
Query performance	High point query performance. Suitable for fast reads of single or a few rows based on primary keys or indexes.	Strong analytical performance. Reads only columns involved in the query, reducing I/O. Combined with vectorized parallel computing, aggregation and analysis performance improves by an order of magnitude compared to row-oriented tables.
Update and delete performance	High. Directly locates and modifies rows.	Relatively lower. Supports real-time updates via the NCI component.
Workload type	Online Transaction Processing (OLTP): high-concurrency insert, delete, update, and query operations.	Online Analytical Processing (OLAP): data archiving, report generation, ad hoc queries, and large-scale aggregation and analysis.

Use cases

Low-cost archiving of massive historical data

Historical data—such as orders, logs, and transaction records—expands rapidly as businesses grow, consuming large amounts of expensive storage. Traditional migration solutions reduce costs but take data offline, preventing direct queries and increasing operational complexity.

With column-oriented tables, migrate cold data or entire historical data tables from InnoDB to X-Engine within the same PolarDB cluster. Archived data stays online and queryable with standard SQL—no complex migration back required. A compression ratio of up to 10:1, combined with low-cost OSS storage, reduces storage costs by 90%.

Building a data warehouse

Dedicated data warehouses typically require expensive hardware, complex extract, transform, and load (ETL) pipelines, and high operations overhead—especially when introducing new technology stacks such as ClickHouse.

PolarStore's massive data storage capabilities let you aggregate data from multiple sources and store it uniformly using X-Engine column-oriented tables. The result: low-cost PB-scale storage and real-time aggregation without T+1 ETL latency, all within the MySQL ecosystem.

Federated query analysis

Enterprise data is often stored in separate locations—some in online databases such as PolarDB, some in open formats such as Parquet and ORC on OSS. Joint analysis across these sources typically requires ETL to import OSS data into the database first.

PolarDB's external table feature eliminates this step. Create external tables for files on OSS directly within PolarDB, then run JOIN operations between local tables (row store or column store) and OSS external tables using standard SQL—no data movement required.

Performance benchmarks

The following data was collected in a specific staging environment to help evaluate the benefits of column-oriented tables.

The TPC-H implementation in this topic is based on TPC-H benchmarking and cannot be compared with officially published TPC-H benchmarking results. The tests do not fully comply with all TPC-H requirements.

Data loading performance

TPC-H dataset: Storage space is 5.5 times smaller than InnoDB. Loading speed is five times faster than InnoDB.

Airline dataset: Storage space is 20 times smaller than InnoDB. Loading speed is 30 times faster than InnoDB.

TPC-H query performance

TPC-H 100 GB dataset: Query performance matches that of ClickHouse, with most queries faster.

Multi-node parallel analysis: On a TPC-H 1 TB dataset, query time improved from 1,400 seconds on a single node to 167 seconds on six nodes.