X-Engine is an online transaction processing (OLTP) database storage engine that is developed by the Database Products Business Unit of Alibaba Cloud to suit the needs of PolarDB. This storage engine now is widely used in a number of business systems of Alibaba Group to reduce costs. These include the transaction history database and DingTalk chat history database. In addition, X-Engine is a crucial database technology that empowers Alibaba Group to withstand bursts of traffic that may surge by hundreds of times than usual during Double 11, a shopping festival in China.
X-Engine aims to cope with the challenges faced by the internal businesses of Alibaba Group. Alibaba Group has been deploying MySQL databases on a large scale since 2010. However, the explosive growth of data volume year by year still imposes the following challenges on these databases:
- To process highly concurrent transactions.
- To store large amounts of data.
You can increase the processing and storage capabilities by adding servers on which you can create more databases. However, this is not an efficient approach. Alibaba Cloud has been devoted to leveraging technical means to maximize performance with minimal resources.
The performance of the conventional database architecture has been carefully studied. Michael Stonebreaker, a leader in the database field and a winner of the Turing Award, wrote a paper on this topic: OLTP Through the Looking Glass, and What We Found There .In the paper, he pointed out that conventional general-purpose relational databases spend less than 10% of their time efficiently processing data. The remaining 90% of their time is wasted on other work, such as waiting for locked resources to be released, managing buffers, and synchronizing logs.
This is caused by significant changes to the hardware systems that we depend on in recent years. These include multi-core and many-core CPUs, new processor architectures such as the cache-only memory architecture (COMA) and the non-uniform memory access (NUMA), various heterogeneous computing devices such as GPUs and field-programmable gate arrays (FPGAs). However, the database software built on these hardware systems has not changed much. Such software includes the mechanism that fixes page sizes based on B-tree indexing, the mechanism that processes transactions and restores data by using the recovery and isolation exploiting semantics (ARIES) algorithms, and the concurrency control mechanism based on independent lock managers. These software mechanisms are designed based on slow disks and therefore cannot achieve the potential performance of the preceding hardware systems.
Alibaba Cloud has developed X-Engine to suit the needs of the hardware systems used today.
With the pluggable storage engine of MySQL, X-Engine can be seamlessly integrated with MySQL and benefit from the tiered storage architecture.
X-Engine is designed to store large amounts of data, increase the capability of processing concurrent transactions, and reduce storage costs. In most scenarios with large amounts of data, the data is not evenly accessed. Frequently accessed data (hot data) actually accounts for a small proportion. X-Engine divides data into multiple levels based on access frequency. In addition, X-Engine determines storage structures and writes the data to appropriate storage devices based on the access characteristics of each level of data.
X-Engine uses the log-structured merge-tree (LSM tree) architecture that is redesigned for tiered storage.
- X-Engine stores hot data and updated data in the memory by leveraging a number of memory database technologies to expedite the execution of transactions. These technologies include lock-free data structures and append-only data structures.
- X-Engine uses a transaction processing pipeline mechanism to run transaction processing stages in parallel, which greatly increases the throughput.
- Less frequently accessed data (cold data) is gradually deleted or merged into persistent storage levels, and stored in the hierarchical system with abundant storage devices, such as NVMs, SSDs, and HDDs.
- A lot of improvements are made to compactions that impose a significant impact on performance.
- The data storage granularity is refined based on the fact that data update hotspots are concentrated. This ensures that data is reused as much as possible in the compaction process.
- The hierarchy of the LSM tree is refined to reduce I/O and computing costs and to minimize the storage space usage incurred by compactions.
- More fine-grained access control and caching mechanisms are used to optimize read performance.
- FPGA hardware is used to accelerate compactions and further maximize the performance of your database system. This marks the first time that hardware acceleration is applied to the storage engine of an OLTP database. The achievement has been summarized into a paper titled FPGA-Accelerated Compactions for LSM-based Key Value Store .This paper has been accepted by the 18th USENIX Conference on File and Storage Technologies (FAST'20).
- The data reuse technology is used to reduce the costs of compactions and reduces performance jitter caused by data deletion from the cache.
- Queued multi-transaction processing and pipeline processing are used to reduce the thread context switching overheads and calculate the task ratio in each stage. This makes the entire pipeline work evenly and increases transaction processing performance by more than 10 times compared with other similar storage engines such as RocksDB.
- Copy-on-write is used to prevent data pages from being updated when they are read. This allows read-only data pages to be encoded and compressed and reduces the storage space usage by 50% to 90% compared with conventional storage engines, such as InnoDB.
- Bloom filter is used to quickly determine whether the target data exists, succinct range filter (SuRF) is used to determine whether the range data exists, and row cache is used to cache hot data rows to accelerate read operations.
Basic logic of LSM
The essence of LSM is that all write operations append data to the memory. Each time the written data is accumulated to a certain amount, the data is frozen as a level and then flushed to persistent storage. All rows of the written data are sorted based on primary keys, regardless of whether the data is stored in the memory or persistent storage. In the memory, data is stored in a sorted in-memory data structure, such as a skip list or B-tree. In persistent storage, data is stored in a read-only, fully sorted persistent storage structure.
To make a common storage engine support transaction processing in common storage systems, you must introduce a temporal factor, based on which we can build an independent view for each transaction. These views are not affected in the event of concurrent transactions. For example, the storage engine sorts the transactions, assigns them sequence numbers (SNs) that increase monotonically and globally, and logs the SN of each transaction. This allows the storage engine to determine visibility among independent transactions.
If data is continuously written to the LSM storage structure and none of other actions is performed, the LSM storage structure will eventually become the structure shown in the following figure.
This structure is writer-friendly, because the written data simply has to be appended to the latest memory table. To implement crash recovery, you only need to record the data to redo logs. New data does not overwrite old data, and therefore appended records form a natural multi-SN structure.
However, when more persistence levels of data are accumulated and frozen, query performance decreases. The multi-SN records generated for different transaction commits with the same primary key are distributed across different levels, as are those with different keys. In this case, read operations need to search all the levels of data and merge the found data to obtain the final results.
Compactions are introduced to LSM to resolve this problem. Compactions in LSM have two objectives:
- Control the hierarchy of LSM
In most cases, the data volume increases by a multiple number of times as the LSM level decreases. This is to improve read performance.
Data access in a storage system is localized, and a large proportion of access traffic is concentrated on a small portion of data. This is the basic prerequisite for effective operations in the cache system. In the LSM storage structure, you can store hot data at a high LSM level on high-speed storage devices, such as NVMs and DRAMs, and cold data at a low level on low-speed storage devices that cost less. This is the basis of hot and cold data separation in X-Engine.
- Merge data
Compactions continuously merge data at adjacent LSM levels and write the merged data to the lower LSM levels. During the compaction process, the system reads the to-be-merged data from two or more adjacent levels and then sort the data based on keys. If multiple records with the same key have different SNs, the system retains only the record with the latest SN (which is greater than the earliest SN of the current transaction that is being executed), discards the records with earlier SNs, and writes the record with the latest SN to a new level. This process is resource-consuming.
In addition to the separation of hot and cold data, compactions also require considerations on other factors such as the data update frequency. Queries for a large number of multi-SN records waste more I/O and CPU resources. Therefore, records that have the same key but different SNs must be preferably merged to reduce the number of SNs per record. Alibaba Cloud designs a proprietary compaction scheduling mechanism for X-Engine.
A highly optimized LSM
In X-Engine, lock-free skip lists are used in the memory tables, which expedites the execution of highly concurrent read and write queries. A proper data structure must be planned at each LSM level to ensure efficient organization of data at persistent levels.
- Data structuring
In X-Engine, each level is divided into fixed-sized extents. An extent stores the data with a continuous key range at the level. A set of meta indexes are created for the extents at each level. All of these indexes together with all of the active and immutable memory tables form a metadata tree. The metadata tree has a structure similar to the structure of the B-tree, and its root nodes are metadata snapshots. The metadata tree helps quickly locate extents.
Except for the active memory tables to which data is being written, all structures in X-Engine are read-only and cannot be modified. When a point in time is specified, for example, when the log sequence number (LSN) is 1000, the structure referenced by metadata snapshot 1 in the preceding figure contains the snapshots of all the data logged at the moment associated with the LSN 1000. This is also why this structure is called a snapshot.
The metadata structure itself does not change after it is generated. All read operations start from this snapshot structure. This is the basis on which X-Engine implements snapshot-level isolation. All operations such as compactions and memory table freezes are implemented using copy-on-write. Specifically, the result of each modification is written into a new extent. Then, a new meta index structure is generated. Finally, a new metadata snapshot is generated.
For example, each compaction generates a new metadata snapshot, as shown in the following figure.
In this example, metadata snapshot 2 is slightly different from metadata snapshot 1. Only some leaf nodes and index nodes that have changed are modified.Note This data structuring technology is similar to that presented in the paper titled B-trees, Shadowing, and Clones, which will help you understand this process.
- Transaction processing
With its lightweight write mechanism, LSM has significant advantages in write operations. However, transaction processing is not as simple as writing updated data to a system. A complex process is required to ensure atomicity, consistency, isolation, and durability (ACID). X-Engine divides the entire transaction processing process into two phases:
- The read and write phase
In the read and write phase, X-Engine checks for write-write conflicts and read-write conflicts in the transaction and determines whether the transaction can be executed, rolled back, or locked. If no transaction conflicts are detected, all the modified data is written to the transaction buffer.
- The commit phase
The commit phase includes the entire process of writing data to WALs, writing data to memory tables, committing the data, and returning results to the user. This process involves both I/O operations (logging and returning results) and CPU operations (copying logs and writing data to memory tables).
To increase the throughput during transaction processing, the system concurrently processes a large number of transactions. A single I/O operation is costly, and therefore most storage engines tend to commit a number of transactions at a time, which is called "group commit". This allows you to combine I/O operations. However, the transactions that are to be committed at a time still need to wait for a long period of time. For example, when logs are being written to a disk, nothing else is done except waiting for the data to be flushed to disks.
To further increase the throughput during transaction processing, X-Engine adopts a pipeline technology that divides the commit phase into four independent and more fine-grained stages:
- Copying logs to the log buffer
- Flushing logs to disks
- Writing data to memory tables
- Committing the data
When a transaction commit thread enters the commit phase, it can freely choose any stage of the pipeline to process the data. In this way, threads can concurrently process data at different stages. If the tasks for each stage are properly divided based on sizes, all the stages of the pipeline can be nearly fully loaded. In addition, transaction processing threads instead of background threads are used in the commit phase. Each stage is either executing tasks in a stage or processing requests. This process does not involve any waiting or switching, and therefore the capabilities of each thread are fully utilized.
- The read and write phase
- Read operations
In LSM, if multiple records with the same key have different SNs, the records with later SNs are appended to the record with the earliest SN. Records that have the same key but different SNs may be stored at different levels. The system must identify the appropriate SN of each requested record in compliance with the visibility rules that are defined based on the transaction isolation levels. In most cases, the system searches for records with the latest SNs from the highest level to the lowest level.
For single-record queries, the query process ends after the single record is found. If the record is located at a high level, for example, in a memory table, it will be returned quickly. If the record is located at a low level, for example, at a level used for random reading, the system must search downwards level by level. In this case, a bloom filter can be used to skip some levels to expedite the query, but this involves more I/O operations. A row cache is introduced into X-Engine to expedite single-row queries. The row cache stores data above all the persistent data levels. When a single-row query does not hit data in memory tables, it will hit data in the row cache. The row cache needs to store each record with the latest SN at all persistence levels. However, the records in the row cache may change. For example, every time after a read-only memory table is flushed to a persistence level, the records in the row cache must be updated accordingly. This operation is subtle and requires careful design.
For range scans, it is impossible to determine the level where the data associated with a specific key range is stored. In this case, the final result can be returned only after all levels are scanned for the data and the data is merged. X-Engine adopts a series of methods to address this problem. For example, SuRF presented at the best paper at SIGMOD 2018 provides a range scan filter to reduce the number of levels to be scanned. The asynchronous I/O and prefetching mechanism are also provided to address this problem.
The core to read operations is the cache design. A row cache handles single-row queries. A block cache handles requests missed by the row cache or range scan requests. However, in LSM, a compaction incurs updates to a large number of data blocks at a time. This causes a large amount of data in the block cache to expire instantly and results in a sharp performance jitter. The following optimizations are made to address this problem:
- Reduces the granularity of compaction.
- Reduces the amount of data modified during each compaction.
- Updates the existing cached data only when the data is modified during each compaction.
Compactions are important. The system needs to read data associated with overlapped key ranges from adjacent levels, merge the data, and write the merged data to a new level. This is the cost of simple write operations. The storage architecture of X-Engine is redesigned to optimize compactions.
As mentioned previously, X-Engine divides each level of data into fixed-sized extents. An extent is equivalent to a small but complete Sorted String Table (SSTable), which stores the data with a continuous key range at the level. A key range is further divided into smaller continuous segments that are called data blocks. Data blocks are equivalent to pages in conventional databases, except that data blocks are read-only and their lengths are not fixed.
A comparison between metadata snapshots 1 and 2 unveils the intent of the extent design. Only a small portion of overlapped data and the meta index node need to be modified each time. The structures of metadata snapshots 1 and 2 actually share a large number of data structures. This is called data reuse, and the extent size is a crucial factor that determines the data reuse rate. As a completely reusable physical structure, the extent size is minimized to reduce the amount of overlapped data. However, the extent size must be proper. If the extent size is abnormally small, a large number of indexes will be required, which increases management costs.
In X-Engine, the data reuse rate is high in compactions. Assume that you want to merge the extents that contain overlapped key ranges at Level 1 and Level 2. In this case, the merge algorithm scans the data row by row. Any physical structure, including data blocks and extents, that does not overlap with the data at other levels can be reused. The difference between the reuse of extents and the reuse of data blocks is that the meta indexes of extents can be modified while data blocks only support data copying. Data blocks are not recommended although they significantly reduce CPU utilization.
The following figure shows a typical data reuse process in a compaction.
The data reuse process is completed by using row-by-row iteration. However, this fine-grained data reuse causes data fragmentation.
Data reuse benefits the compaction itself, reduces I/O and CPU consumption during the compaction, and improves the overall performance of the system. For example, in the compaction process, data does not need to be completely rewritten, which greatly reduces the storage space occupied by written data. In addition, most of the data remains unchanged, and therefore the cached data remains valid after data updates. This reduces read performance jitters caused by the expiration of the cached data during the compaction.
In fact, optimizations to compactions are only part of what X-Engine does. X-Engine also optimizes the compaction scheduling policies and defines the method of selecting extents, the granularity of compactions, and the execution priorities of the specified compactions. These all affect the performance of the system. Although no perfect policies exist, X-Engine has accumulated valuable experience and defined a number of rules to define proper compaction scheduling policies.
For more information, see Best practices of X-Engine.
Get started with X-Engine
For more information, see X-Engine usage notes.
As a storage engine for MySQL, X-Engine must be continuously improved in terms of its compatibility with MySQL systems. Based on the most urgent needs, some features such as foreign keys will be gradually enhanced, and more data structures and index types will be supported.
The core value of X-Engine lies in cost-effectiveness. Continuously improving performance at lower costs is a long-term fundamental goal. Alibaba Cloud continues its exploration for new approaches that make X-Engine more efficient on operations, such as compaction scheduling, cache management and optimization, data compression, and transaction processing.
X-Engine will not be limited to a storage engine for standalone databases. It will serve as the core of the Alibaba Cloud proprietary distributed PolarDB to provide enterprise-grade database services.