
PolarDB: Introduction to X-Engine

Last Updated: Mar 28, 2026

X-Engine is an OLTP storage engine for MySQL, developed by the Database Products Business Unit of Alibaba Cloud. It uses a redesigned log-structured merge-tree (LSM tree) architecture to reduce storage costs and handle high-concurrency workloads at scale.

Background

Alibaba Group has been deploying MySQL databases at scale since 2010. As data volumes grew exponentially, two problems became increasingly difficult to solve by adding servers alone:

  • Processing highly concurrent transactions

  • Storing large amounts of data cost-effectively

The root cause was a mismatch between hardware and software. Modern systems run on multi-core CPUs, cache-only memory architecture (COMA), non-uniform memory access (NUMA), and heterogeneous accelerators such as GPUs and field-programmable gate arrays (FPGAs). But database software — B-tree indexing with fixed page sizes, ARIES-based (Algorithms for Recovery and Isolation Exploiting Semantics) transaction recovery, and independent lock managers — was designed for slow spinning disks. According to Turing Award winner Michael Stonebraker in the paper "OLTP Through the Looking Glass, and What We Found There," conventional relational databases spend less than 10% of their time actually processing data. The remaining 90% is spent waiting for locks, managing buffers, and synchronizing logs.

X-Engine was built to close that gap.

The LSM tree trade-off

Every storage engine design must balance three forms of amplification:

| Amplification | Definition | LSM tree | B-tree |
| --- | --- | --- | --- |
| Write amplification | Extra I/O beyond the logical write | Lower | Higher |
| Read amplification | Extra I/O to satisfy a read (multiple levels to merge) | Higher | Lower |
| Space amplification | Extra storage beyond the logical data size | Requires compaction to reclaim | Moderate |

LSM trees trade read amplification for write amplification. X-Engine addresses all three dimensions through tiered storage, a transaction pipeline, fine-grained compaction with data reuse, and multi-layer caching — accepting the read overhead of LSM while minimizing it as much as possible.
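These trade-offs can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes a leveled LSM tree with a fixed fan-out (size ratio) between levels, which is a common but not universal configuration; the formulas are rough standard estimates, not X-Engine's actual figures.

```python
# Toy estimates of the three amplification factors for a leveled LSM tree.
# Assumes a fixed fan-out between levels; numbers are rough approximations.

def lsm_amplification(levels: int, fanout: int) -> dict:
    # Write amplification: with leveled compaction, a record is rewritten
    # roughly `fanout` times at each level as it migrates downward.
    write_amp = levels * fanout
    # Read amplification (worst case): a point lookup may probe every level.
    read_amp = levels
    # Space amplification: obsolete versions linger until compaction; in a
    # leveled tree the overhead is roughly 1/fanout of the largest level.
    space_amp = 1 + 1 / fanout
    return {"write": write_amp, "read": read_amp, "space": space_amp}
```

With 4 levels and a fan-out of 10, the estimate gives write amplification around 40 but space amplification of only about 1.1, which is why LSM trees favor write-heavy, storage-cost-sensitive workloads.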

How X-Engine works

Data organization and tiered storage

X-Engine divides data across storage tiers based on access frequency:

  • Hot data (frequently accessed): stored in memory using lock-free skip lists and append-only data structures

  • Cold data (infrequently accessed): flushed to persistent storage tiers — non-volatile memory (NVM), SSDs, and HDDs — in order of cost and speed

Each persistence level is divided into extents of a fixed size. An extent is a small, complete sorted string table (SSTable) that stores data with a continuous key range. Each extent is further divided into read-only, variable-length data blocks (the equivalent of pages in conventional databases, but immutable).

All extents at each level are indexed by a metadata tree with a B-tree-like structure. The root node is a metadata snapshot — a point-in-time view of all data up to a given log sequence number (LSN). All structures except the active memory table are read-only, which is the foundation for X-Engine's snapshot-level isolation.
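A minimal sketch of this layout follows. The class names (`Extent`, `MetadataSnapshot`) are illustrative, and a sorted list with binary search stands in for the B-tree-like metadata tree; the point is that a snapshot is an immutable index over immutable extents, each covering a contiguous key range.

```python
# Sketch of extents indexed by a metadata snapshot. Assumes each extent
# covers a contiguous key range and all structures are immutable once built.
from bisect import bisect_right

class Extent:
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))   # a small, sorted table
        self.lo, self.hi = min(self.rows), max(self.rows)

class MetadataSnapshot:
    def __init__(self, extents):
        # Extents sorted by start key, like the metadata tree's leaf level.
        self.extents = sorted(extents, key=lambda e: e.lo)
        self.starts = [e.lo for e in self.extents]

    def get(self, key):
        # Binary search locates the one extent whose range may hold the key.
        i = bisect_right(self.starts, key) - 1
        if i >= 0 and self.extents[i].lo <= key <= self.extents[i].hi:
            return self.extents[i].rows.get(key)
        return None
```

Because the snapshot never mutates, any number of readers can traverse it concurrently without locks, which is the property the next sections build on.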

X-Engine architecture
The architecture and optimization techniques are described in the paper "X-Engine: An Optimized Storage Engine for Large-scale E-Commerce Transaction Processing," presented at the 2019 SIGMOD Conference. This was the first time a company from the Chinese mainland published OLTP database engine research at an international academic conference.

Transaction processing

Transaction processing in X-Engine has two phases:

  1. Read and write phase: X-Engine checks for write-write and read-write conflicts. If no conflicts are detected, all modified data is written to a transaction buffer.

  2. Commit phase: Data is written to the write-ahead log (WAL), written to memory tables, committed, and the result is returned.
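The two phases can be sketched as an optimistic concurrency scheme. This is a simplified model, not X-Engine's implementation: per-key version counters stand in for sequence numbers, and the WAL and memory-table steps of the commit phase are collapsed into a single apply loop.

```python
# Hedged sketch of the two-phase transaction flow: writes are buffered,
# conflicts are detected against versions observed at read time, and only
# a conflict-free buffer is applied at commit.

class Store:
    def __init__(self):
        self.data, self.version = {}, {}

class Txn:
    def __init__(self, store):
        self.store, self.buffer, self.read_versions = store, {}, {}

    def read(self, key):
        # Remember the version we saw, for conflict detection at commit.
        self.read_versions[key] = self.store.version.get(key, 0)
        return self.store.data.get(key)

    def write(self, key, value):
        self.buffer[key] = value          # phase 1: buffer the write

    def commit(self):
        # Phase 2: detect read-write / write-write conflicts, then apply.
        for key, seen in self.read_versions.items():
            if self.store.version.get(key, 0) != seen:
                return False              # another txn committed first
        for key, value in self.buffer.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```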

The pipeline: from group commit to parallel stages

Most storage engines use group commit to reduce I/O cost: multiple transactions are batched and their logs are flushed to disk together, combining many small I/O operations into one. However, each batch still stalls at every stage. While its logs are being flushed to disk, for example, the committing threads do nothing but wait.

X-Engine replaces this with a four-stage pipeline:

  1. Copy logs to the log buffer

  2. Flush logs to disk

  3. Write data to memory tables

  4. Commit data

Transaction threads enter the pipeline and freely pick up any available stage. Threads at different stages run concurrently, and no thread waits for another. This eliminates the stall points of group commit and increases transaction throughput by more than 10 times compared with similar engines such as RocksDB.
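The staged structure can be sketched with queues between stages, so that different transactions occupy different stages at the same time. This is a simplified model under stated assumptions: the log flush is a no-op placeholder, the final commit runs on the driver thread, and real X-Engine threads pick up stages dynamically rather than being pinned one per stage.

```python
# Sketch of a four-stage commit pipeline. Each stage hands its transaction
# to the next through a queue, so the stages run concurrently.
import queue, threading

def run_pipeline(txns):
    q1, q2, q3, done = (queue.Queue() for _ in range(4))
    log_buffer, memtable, committed = [], {}, []

    def stage(src, dst, work):
        while True:
            txn = src.get()
            if txn is None:          # shutdown marker: pass it along
                dst.put(None)
                return
            work(txn)
            dst.put(txn)

    workers = [
        # Stage 1: copy logs to the log buffer.
        threading.Thread(target=stage, args=(q1, q2, lambda t: log_buffer.append(t["id"]))),
        # Stage 2: flush logs to disk (placeholder no-op here).
        threading.Thread(target=stage, args=(q2, q3, lambda t: None)),
        # Stage 3: write data to the memory table.
        threading.Thread(target=stage, args=(q3, done, lambda t: memtable.update(t["writes"]))),
    ]
    for w in workers:
        w.start()
    for t in txns:
        q1.put(t)
    q1.put(None)
    while True:                      # Stage 4: commit, on this thread
        t = done.get()
        if t is None:
            break
        committed.append(t["id"])
    for w in workers:
        w.join()
    return committed, memtable
```

Because each queue decouples its two neighboring stages, a transaction flushing its log never blocks another transaction that is still copying into the log buffer.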

Transaction pipeline

Read operations

LSM trees accumulate records with the same primary key across multiple levels, each tagged with a sequence number (SN). A read must find the record with the correct SN according to the transaction's isolation level, which may require searching every level.

X-Engine uses three mechanisms to keep reads fast:

  • Bloom filter: Quickly determines whether a key exists at a given level, skipping levels that cannot contain it.

  • Row cache: Caches the latest version of hot rows above all persistent levels. A single-row query that misses the memory table hits the row cache before falling through to disk.

  • SuRF (succinct range filter): A range scan filter (best paper at the 2018 SIGMOD Conference) that determines whether range data exists at a given level, reducing the number of levels scanned during range queries. Asynchronous I/O and prefetching are also used for range scans.
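The filter-and-cache read path can be sketched as follows. A plain Python set stands in for the Bloom filter (a real Bloom filter admits false positives but never false negatives, which this simplification hides), and a dict stands in for the row cache; `Level`, `read`, and `probes` are illustrative names.

```python
# Sketch of the layered read path: row cache first, then each persistent
# level is skipped whenever its (simplified) filter rules the key out.

class Level:
    def __init__(self, rows):
        self.rows = rows
        self.filter = set(rows)       # stand-in for a Bloom filter

def read(key, row_cache, levels, probes):
    if key in row_cache:              # hot row: no level is touched
        return row_cache[key]
    for level in levels:
        if key not in level.filter:   # filter says "cannot be here": skip
            continue
        probes.append(level)          # this level actually gets searched
        if key in level.rows:
            row_cache[key] = level.rows[key]   # promote to the row cache
            return level.rows[key]
    return None
```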

Read operations

A block cache handles requests that miss both the memory table and the row cache, including range scan results. In standard LSM implementations, a compaction rewrites large amounts of data at once, invalidating large portions of the block cache and causing sharp read performance jitter. X-Engine reduces this through fine-grained compaction and data reuse (described below).

Compaction

Compaction in LSM trees has two goals:

  1. Control the level structure: When compaction falls behind writes, the LSM tree accumulates too many levels — a condition called an *inverted LSM* — and every read must scan more levels. X-Engine's compaction scheduler is designed to prevent this.

  2. Merge data: Records with the same key but different SNs are collapsed to keep only the version with the latest SN (where "latest" means greater than the earliest SN of any active transaction).
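The merge rule above can be sketched as follows. This is an MVCC-safe simplification, not X-Engine's code: every version newer than the oldest active transaction's SN survives, plus the single newest version at or below it (still visible to that transaction); everything older is garbage-collected.

```python
# Sketch of the compaction merge step: collapse versions of each key,
# keeping only what some active transaction could still read.

def merge(records, oldest_active_sn):
    # records: (key, sn, value) tuples, possibly drawn from several levels
    by_key = {}
    for key, sn, value in records:
        by_key.setdefault(key, []).append((sn, value))
    out = []
    for key, versions in by_key.items():
        versions.sort()                                   # ascending SN
        # Versions the oldest active transaction cannot yet see must survive.
        survivors = [v for v in versions if v[0] >= oldest_active_sn]
        # Plus the newest version it *can* see; older ones are dead.
        older = [v for v in versions if v[0] < oldest_active_sn]
        if older:
            survivors.insert(0, older[-1])
        out.extend((key, sn, val) for sn, val in survivors)
    return sorted(out)
```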

Compaction

Data reuse: reducing compaction I/O

The extent design enables data reuse. When compacting two adjacent levels, only extents and data blocks whose key ranges overlap with data at the other level need to be rewritten. Non-overlapping extents and data blocks are referenced directly in the new metadata snapshot without copying.
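The reuse decision reduces to a key-range overlap test per extent. A minimal sketch, with extents modeled as `(lo, hi)` ranges and the function name `plan_compaction` chosen for illustration:

```python
# Sketch of the data-reuse decision during compaction of two adjacent
# levels: an upper-level extent is rewritten only if its key range
# overlaps some lower-level extent; otherwise it is referenced as-is.

def plan_compaction(upper, lower):
    rewrite, reuse = [], []
    for ext in upper:
        # Two ranges (a, b) and (c, d) overlap iff a <= d and c <= b.
        if any(ext[0] <= other[1] and other[0] <= ext[1] for other in lower):
            rewrite.append(ext)       # must be merged and rewritten
        else:
            reuse.append(ext)         # referenced in the new snapshot
    return rewrite, reuse
```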

Data reuse
| Dimension | Effect |
| --- | --- |
| I/O and CPU cost | Reduced: only the overlapping portion of data is rewritten |
| Cache invalidation | Reduced: most data blocks remain unchanged, so cached blocks stay valid after compaction |
| Fragmentation (trade-off) | Fine-grained reuse at the data block level can introduce fragmentation; X-Engine's compaction scheduling policies balance reuse rate against fragmentation |

The copy-on-write technique underlies all of this: every compaction and memory table freeze writes results to new extents, then generates a new metadata index and a new metadata snapshot. The old snapshot is preserved until no active transaction references it, providing snapshot-level read isolation without locking.
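The snapshot lifecycle can be sketched with a pin-count scheme. The names (`SnapshotManager`, `pin`, `install`) are illustrative, not X-Engine's API; the point is that installing a new snapshot never disturbs readers still holding the old one.

```python
# Sketch of copy-on-write snapshot management: compaction installs a new
# metadata snapshot, while active transactions keep reading the one they
# pinned until they finish.

class SnapshotManager:
    def __init__(self, snapshot):
        self.current, self.pins = snapshot, {}

    def pin(self, txn_id):
        self.pins[txn_id] = self.current   # reader gets a fixed view
        return self.current

    def unpin(self, txn_id):
        self.pins.pop(txn_id, None)        # old snapshot freeable once unpinned

    def install(self, new_snapshot):
        old = self.current
        self.current = new_snapshot        # new readers see the new view
        # `old` stays alive as long as some pin still references it.
        return old
```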

Metadata snapshot after compaction
The data structuring technique is similar to the approach described in the paper "B-trees, Shadowing, and Clones."

FPGA-accelerated compaction

Compaction is the most resource-intensive background operation in an LSM-based engine. X-Engine uses FPGA hardware to offload and accelerate compaction — the first time hardware acceleration has been applied to an OLTP storage engine. This work is described in the paper "FPGA-Accelerated Compactions for LSM-based Key Value Store," accepted by the 18th USENIX Conference on File and Storage Technologies (FAST'20) in 2020.

Key capabilities

| Capability | Mechanism | Benefit |
| --- | --- | --- |
| High write throughput | LSM tree + transaction pipeline | More than 10x throughput vs. RocksDB |
| Low storage cost | Copy-on-write + encoding and compression on read-only data blocks | 50%–90% storage reduction vs. InnoDB |
| Fast single-row reads | Row cache above all persistence levels | Avoids disk I/O for hot rows |
| Fast range scans | SuRF range filter + async I/O + prefetching | Fewer levels scanned |
| Stable read performance | Fine-grained compaction + data reuse | Reduced cache invalidation jitter |
| Hardware-accelerated compaction | FPGA offload | Lower compaction CPU and I/O overhead |

When to use X-Engine

X-Engine is optimized for workloads with large data volumes and cost pressure, where a large proportion of data is cold and accessed infrequently. The following table maps workload characteristics to expected outcomes:

| Workload | Data temperature | Expected benefit |
| --- | --- | --- |
| Transaction history databases | Mostly cold after initial write | High storage savings (50%–90% vs. InnoDB) |
| Chat or message history | High write volume, cold data dominates over time | Write throughput + storage efficiency |
| Peak traffic scenarios (e.g., flash sales) | Mixed hot/cold | Pipeline throughput handles burst load |
| Frequently updated hot rows | Mostly hot | Limited storage benefit; evaluate InnoDB |

X-Engine was deployed in production at Alibaba Group for use cases including transaction history databases, the DingTalk chat history database, and Alibaba Double 11 peak traffic (surging to hundreds of times greater than average).

For detailed guidance on matching workload characteristics to storage engine configuration, see Best practices.
