
PolarDB: Introduction to X-Engine

Last Updated: Mar 28, 2026

X-Engine is an OLTP storage engine for MySQL, developed by the Database Products Business Unit of Alibaba Cloud. It uses a redesigned log-structured merge-tree (LSM tree) architecture to reduce storage costs and handle high-concurrency workloads at scale.

Background

Alibaba Group has been deploying MySQL databases at scale since 2010. As data volumes grew exponentially, two problems became increasingly difficult to solve by adding servers alone:

  • Processing highly concurrent transactions

  • Storing large amounts of data cost-effectively

The root cause was a mismatch between hardware and software. Modern systems run on multi-core CPUs, cache-only memory architecture (COMA), non-uniform memory access (NUMA), and heterogeneous accelerators such as GPUs and field-programmable gate arrays (FPGAs). But database software — B-tree indexing with fixed page sizes, ARIES-based (Algorithms for Recovery and Isolation Exploiting Semantics) transaction recovery, and independent lock managers — was designed for slow spinning disks. According to Turing Award winner Michael Stonebraker in the paper "OLTP Through the Looking Glass, and What We Found There," conventional relational databases spend less than 10% of their time actually processing data. The remaining 90% is spent waiting for locks, managing buffers, and synchronizing logs.

X-Engine was built to close that gap.

The LSM tree trade-off

Every storage engine design must balance three forms of amplification:

| Amplification | Definition | LSM tree | B-tree |
| --- | --- | --- | --- |
| Write amplification | Extra I/O beyond the logical write | Lower | Higher |
| Read amplification | Extra I/O to satisfy a read (multiple levels to merge) | Higher | Lower |
| Space amplification | Extra storage beyond the logical data size | Requires compaction to reclaim | Moderate |

LSM trees trade read amplification for write amplification. X-Engine addresses all three dimensions through tiered storage, a transaction pipeline, fine-grained compaction with data reuse, and multi-layer caching — accepting the read overhead of LSM while minimizing it as much as possible.
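These trade-offs can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes a leveled LSM tree with a fixed fan-out (size ratio) between levels, which is a common but not universal configuration; the formulas are rough standard estimates, not X-Engine's actual figures.

```python
# Toy estimates of the three amplification factors for a leveled LSM tree.
# Assumes a fixed fan-out between levels; numbers are rough approximations.

def lsm_amplification(levels: int, fanout: int) -> dict:
    # Write amplification: with leveled compaction, a record is rewritten
    # roughly `fanout` times at each level as it migrates downward.
    write_amp = levels * fanout
    # Read amplification (worst case): a point lookup may probe every level.
    read_amp = levels
    # Space amplification: obsolete versions linger until compaction; in a
    # leveled tree the overhead is roughly 1/fanout of the largest level.
    space_amp = 1 + 1 / fanout
    return {"write": write_amp, "read": read_amp, "space": space_amp}
```

With 4 levels and a fan-out of 10, the estimate gives write amplification around 40 but space amplification of only about 1.1, which is why LSM trees favor write-heavy, storage-cost-sensitive workloads.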

How X-Engine works

Data organization and tiered storage

X-Engine divides data across storage tiers based on access frequency:

  • Hot data (frequently accessed): stored in memory using lock-free skip lists and append-only data structures

  • Cold data (infrequently accessed): flushed to persistent storage tiers — non-volatile memory (NVM), SSDs, and HDDs — in order of cost and speed

Each persistence level is divided into extents of a fixed size. An extent is a small, complete sorted string table (SSTable) that stores data with a continuous key range. Each extent is further divided into read-only, variable-length data blocks (the equivalent of pages in conventional databases, but immutable).

All extents at each level are indexed by a metadata tree with a B-tree-like structure. The root node is a metadata snapshot — a point-in-time view of all data up to a given log sequence number (LSN). All structures except the active memory table are read-only, which is the foundation for X-Engine's snapshot-level isolation.
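A minimal sketch of this layout follows. The class names (`Extent`, `MetadataSnapshot`) are illustrative, and a sorted list with binary search stands in for the B-tree-like metadata tree; the point is that a snapshot is an immutable index over immutable extents, each covering a contiguous key range.

```python
# Sketch of extents indexed by a metadata snapshot. Assumes each extent
# covers a contiguous key range and all structures are immutable once built.
from bisect import bisect_right

class Extent:
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))   # a small, sorted table
        self.lo, self.hi = min(self.rows), max(self.rows)

class MetadataSnapshot:
    def __init__(self, extents):
        # Extents sorted by start key, like the metadata tree's leaf level.
        self.extents = sorted(extents, key=lambda e: e.lo)
        self.starts = [e.lo for e in self.extents]

    def get(self, key):
        # Binary search locates the one extent whose range may hold the key.
        i = bisect_right(self.starts, key) - 1
        if i >= 0 and self.extents[i].lo <= key <= self.extents[i].hi:
            return self.extents[i].rows.get(key)
        return None
```

Because the snapshot never mutates, any number of readers can traverse it concurrently without locks, which is the property the next sections build on.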

X-Engine architecture
The architecture and optimization techniques are described in the paper "X-Engine: An Optimized Storage Engine for Large-scale E-Commerce Transaction Processing," presented at the 2019 SIGMOD Conference. This was the first time a company from the Chinese mainland published OLTP database engine research at an international academic conference.

Transaction processing

Transaction processing in X-Engine has two phases:

  1. Read and write phase: X-Engine checks for write-write and read-write conflicts. If no conflicts are detected, all modified data is written to a transaction buffer.

  2. Commit phase: Data is written to the write-ahead log (WAL), written to memory tables, committed, and the result is returned.
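The two phases can be sketched as an optimistic concurrency scheme. This is a simplified model, not X-Engine's implementation: per-key version counters stand in for sequence numbers, and the WAL and memory-table steps of the commit phase are collapsed into a single apply loop.

```python
# Hedged sketch of the two-phase transaction flow: writes are buffered,
# conflicts are detected against versions observed at read time, and only
# a conflict-free buffer is applied at commit.

class Store:
    def __init__(self):
        self.data, self.version = {}, {}

class Txn:
    def __init__(self, store):
        self.store, self.buffer, self.read_versions = store, {}, {}

    def read(self, key):
        # Remember the version we saw, for conflict detection at commit.
        self.read_versions[key] = self.store.version.get(key, 0)
        return self.store.data.get(key)

    def write(self, key, value):
        self.buffer[key] = value          # phase 1: buffer the write

    def commit(self):
        # Phase 2: detect read-write / write-write conflicts, then apply.
        for key, seen in self.read_versions.items():
            if self.store.version.get(key, 0) != seen:
                return False              # another txn committed first
        for key, value in self.buffer.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```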

The pipeline: from group commit to parallel stages

Most storage engines use group commit to reduce I/O cost: multiple transactions are batched and their logs are flushed to disk together, combining many small I/O operations into one. However, each batch still stalls at every stage. While its logs are being flushed to disk, for example, the committing threads do nothing but wait.

X-Engine replaces this with a four-stage pipeline:

  1. Copy logs to the log buffer

  2. Flush logs to disk

  3. Write data to memory tables

  4. Commit data

Transaction threads enter the pipeline and freely pick up any available stage. Threads at different stages run concurrently, and no thread waits for another. This eliminates the stall points of group commit and increases transaction throughput by more than 10 times compared with similar engines such as RocksDB.
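The staged structure can be sketched with queues between stages, so that different transactions occupy different stages at the same time. This is a simplified model under stated assumptions: the log flush is a no-op placeholder, the final commit runs on the driver thread, and real X-Engine threads pick up stages dynamically rather than being pinned one per stage.

```python
# Sketch of a four-stage commit pipeline. Each stage hands its transaction
# to the next through a queue, so the stages run concurrently.
import queue, threading

def run_pipeline(txns):
    q1, q2, q3, done = (queue.Queue() for _ in range(4))
    log_buffer, memtable, committed = [], {}, []

    def stage(src, dst, work):
        while True:
            txn = src.get()
            if txn is None:          # shutdown marker: pass it along
                dst.put(None)
                return
            work(txn)
            dst.put(txn)

    workers = [
        # Stage 1: copy logs to the log buffer.
        threading.Thread(target=stage, args=(q1, q2, lambda t: log_buffer.append(t["id"]))),
        # Stage 2: flush logs to disk (placeholder no-op here).
        threading.Thread(target=stage, args=(q2, q3, lambda t: None)),
        # Stage 3: write data to the memory table.
        threading.Thread(target=stage, args=(q3, done, lambda t: memtable.update(t["writes"]))),
    ]
    for w in workers:
        w.start()
    for t in txns:
        q1.put(t)
    q1.put(None)
    while True:                      # Stage 4: commit, on this thread
        t = done.get()
        if t is None:
            break
        committed.append(t["id"])
    for w in workers:
        w.join()
    return committed, memtable
```

Because each queue decouples its two neighboring stages, a transaction flushing its log never blocks another transaction that is still copying into the log buffer.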

Transaction pipeline

Read operations

LSM trees accumulate records with the same primary key across multiple levels, each tagged with a sequence number (SN). A read must find the record with the correct SN according to the transaction's isolation level, which may require searching every level.

X-Engine uses three mechanisms to keep reads fast:

  • Bloom filter: Quickly determines whether a key exists at a given level, skipping levels that cannot contain it.

  • Row cache: Caches the latest version of hot rows above all persistent levels. A single-row query that misses the memory table hits the row cache before falling through to disk.

  • SuRF (succinct range filter): A range scan filter (best paper at the 2018 SIGMOD Conference) that determines whether range data exists at a given level, reducing the number of levels scanned during range queries. Asynchronous I/O and prefetching are also used for range scans.
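The filter-and-cache read path can be sketched as follows. A plain Python set stands in for the Bloom filter (a real Bloom filter admits false positives but never false negatives, which this simplification hides), and a dict stands in for the row cache; `Level`, `read`, and `probes` are illustrative names.

```python
# Sketch of the layered read path: row cache first, then each persistent
# level is skipped whenever its (simplified) filter rules the key out.

class Level:
    def __init__(self, rows):
        self.rows = rows
        self.filter = set(rows)       # stand-in for a Bloom filter

def read(key, row_cache, levels, probes):
    if key in row_cache:              # hot row: no level is touched
        return row_cache[key]
    for level in levels:
        if key not in level.filter:   # filter says "cannot be here": skip
            continue
        probes.append(level)          # this level actually gets searched
        if key in level.rows:
            row_cache[key] = level.rows[key]   # promote to the row cache
            return level.rows[key]
    return None
```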

Read operations

A block cache handles requests that miss both the memory table and the row cache, including range scan results. In standard LSM implementations, a compaction rewrites large amounts of data at once, invalidating large portions of the block cache and causing sharp read performance jitter. X-Engine reduces this through fine-grained compaction and data reuse (described below).

Compaction

Compaction in LSM trees has two goals:

  1. Control the level structure: When compaction falls behind writes, the LSM tree accumulates too many levels — a condition called an *inverted LSM* — and every read must scan more levels. X-Engine's compaction scheduler is designed to prevent this.

  2. Merge data: Records with the same key but different SNs are collapsed to keep only the version with the latest SN (where "latest" means greater than the earliest SN of any active transaction).
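The merge rule above can be sketched as follows. This is an MVCC-safe simplification, not X-Engine's code: every version newer than the oldest active transaction's SN survives, plus the single newest version at or below it (still visible to that transaction); everything older is garbage-collected.

```python
# Sketch of the compaction merge step: collapse versions of each key,
# keeping only what some active transaction could still read.

def merge(records, oldest_active_sn):
    # records: (key, sn, value) tuples, possibly drawn from several levels
    by_key = {}
    for key, sn, value in records:
        by_key.setdefault(key, []).append((sn, value))
    out = []
    for key, versions in by_key.items():
        versions.sort()                                   # ascending SN
        # Versions the oldest active transaction cannot yet see must survive.
        survivors = [v for v in versions if v[0] >= oldest_active_sn]
        # Plus the newest version it *can* see; older ones are dead.
        older = [v for v in versions if v[0] < oldest_active_sn]
        if older:
            survivors.insert(0, older[-1])
        out.extend((key, sn, val) for sn, val in survivors)
    return sorted(out)
```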

Compaction

Data reuse: reducing compaction I/O

The extent design enables data reuse. When compacting two adjacent levels, only extents and data blocks whose key ranges overlap with data at the other level need to be rewritten. Non-overlapping extents and data blocks are referenced directly in the new metadata snapshot without copying.
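The reuse decision reduces to a key-range overlap test per extent. A minimal sketch, with extents modeled as `(lo, hi)` ranges and the function name `plan_compaction` chosen for illustration:

```python
# Sketch of the data-reuse decision during compaction of two adjacent
# levels: an upper-level extent is rewritten only if its key range
# overlaps some lower-level extent; otherwise it is referenced as-is.

def plan_compaction(upper, lower):
    rewrite, reuse = [], []
    for ext in upper:
        # Two ranges (a, b) and (c, d) overlap iff a <= d and c <= b.
        if any(ext[0] <= other[1] and other[0] <= ext[1] for other in lower):
            rewrite.append(ext)       # must be merged and rewritten
        else:
            reuse.append(ext)         # referenced in the new snapshot
    return rewrite, reuse
```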

Data reuse
| Dimension | Effect |
| --- | --- |
| I/O and CPU cost | Reduced: only the overlapping portion of data is rewritten |
| Cache invalidation | Reduced: most data blocks remain unchanged, so cached blocks stay valid after compaction |
| Fragmentation (trade-off) | Fine-grained reuse at the data block level can introduce fragmentation; X-Engine's compaction scheduling policies balance reuse rate against fragmentation |

The copy-on-write technique underlies all of this: every compaction and memory table freeze writes results to new extents, then generates a new metadata index and a new metadata snapshot. The old snapshot is preserved until no active transaction references it, providing snapshot-level read isolation without locking.
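The snapshot lifecycle can be sketched with a pin-count scheme. The names (`SnapshotManager`, `pin`, `install`) are illustrative, not X-Engine's API; the point is that installing a new snapshot never disturbs readers still holding the old one.

```python
# Sketch of copy-on-write snapshot management: compaction installs a new
# metadata snapshot, while active transactions keep reading the one they
# pinned until they finish.

class SnapshotManager:
    def __init__(self, snapshot):
        self.current, self.pins = snapshot, {}

    def pin(self, txn_id):
        self.pins[txn_id] = self.current   # reader gets a fixed view
        return self.current

    def unpin(self, txn_id):
        self.pins.pop(txn_id, None)        # old snapshot freeable once unpinned

    def install(self, new_snapshot):
        old = self.current
        self.current = new_snapshot        # new readers see the new view
        # `old` stays alive as long as some pin still references it.
        return old
```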

Metadata snapshot after compaction
The data structuring technique is similar to the approach described in the paper "B-trees, Shadowing, and Clones."

FPGA-accelerated compaction

Compaction is the most resource-intensive background operation in an LSM-based engine. X-Engine uses FPGA hardware to offload and accelerate compaction — the first time hardware acceleration has been applied to an OLTP storage engine. This work is described in the paper "FPGA-Accelerated Compactions for LSM-based Key Value Store," accepted by the 18th USENIX Conference on File and Storage Technologies (FAST'20) in 2020.

Key capabilities

| Capability | Mechanism | Benefit |
| --- | --- | --- |
| High write throughput | LSM tree + transaction pipeline | More than 10x throughput vs. RocksDB |
| Low storage cost | Copy-on-write + encoding and compression on read-only data blocks | 50%–90% storage reduction vs. InnoDB |
| Fast single-row reads | Row cache above all persistence levels | Avoids disk I/O for hot rows |
| Fast range scans | SuRF range filter + async I/O + prefetching | Fewer levels scanned |
| Stable read performance | Fine-grained compaction + data reuse | Reduced cache invalidation jitter |
| Hardware-accelerated compaction | FPGA offload | Lower compaction CPU and I/O overhead |

When to use X-Engine

X-Engine is optimized for workloads with large data volumes and cost pressure, where a large proportion of data is cold and accessed infrequently. The following table maps workload characteristics to expected outcomes:

| Workload | Data temperature | Expected benefit |
| --- | --- | --- |
| Transaction history databases | Mostly cold after initial write | High storage savings (50%–90% vs. InnoDB) |
| Chat or message history | High write volume, cold data dominates over time | Write throughput + storage efficiency |
| Peak traffic scenarios (e.g., flash sales) | Mixed hot/cold | Pipeline throughput handles burst load |
| Frequently updated hot rows | Mostly hot | Limited storage benefit; evaluate InnoDB |

X-Engine was deployed in production at Alibaba Group for use cases including transaction history databases, the DingTalk chat history database, and Alibaba Double 11 peak traffic (surging to hundreds of times greater than average).

For detailed guidance on matching workload characteristics to storage engine configuration, see Best practices.
