By Gushen and Beilou
Despite the shared LSM-tree structure, X-Engine and traditional LSM-tree-based engines such as RocksDB differ in their design philosophies, as shown in the following figure.
Key design 1: On disk, X-Engine usually stores data in two layers, L1 and L2. The L0 layer is enabled only when MemTable data cannot be compacted promptly and must be staged on disk temporarily to relieve memory pressure. Under normal circumstances, a frozen MemTable is merged directly into L1 on disk.
Key design 2: During compaction between L1 and L2, X-Engine's hot-and-cold merging algorithm tends to keep frequently accessed data in L1 and move rarely accessed data to L2, where it is compressed for storage. This physically separates hot and cold data: afterwards, L1 holds only hot data and L2 only cold data, so caching L1 in memory yields a much higher hit rate.
By this design intention, when X-Engine runs normally the MemTable caches recently written data that has not yet been flushed to disk, and L1 holds the most frequently accessed data on disk, most of which is cached in memory. After this layered separation, optimizing X-Engine's read performance can be divided into two independent sub-problems:
Because the MemTable holds a comparatively small amount of data, the overall performance of X-Engine depends on how efficiently L1 data is queried in memory, on the premise that the hot-and-cold identification algorithm is accurate and the memory is large enough to cache all hot data. This brings us to the topic of this article: how to maximize the read performance of in-memory hits.
When the dataset is smaller than the free memory capacity, the read performance of X-Engine is bound by CPU resources. To analyze the performance of a CPU-bound program, we need to identify what consumes the CPU. By the definition of CPU usage, the CPU is busy either when it is executing an instruction or when it stalls waiting for data. Therefore, read performance optimization of the storage engine narrows down to two points.
The process of organizing and reading data at L1 is as follows. First, X-Engine divides the data into extents, each with a size of 2 MB. In the extents, records are encoded into 16-KB blocks, and each extent contains an index block to help locate data blocks. Overall, data organization at the L1 and L2 layers in X-Engine is an index structure similar to a B+ tree. We suppose that all operations hit the memory and a record with a key of X is read in the extent. These operations are performed in four steps.
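The extent-level lookup described above can be sketched as a toy model: an extent holds sorted data blocks plus an index block recording each block's last key. This is a compressed two-step illustration of the four-step read path, and names such as `Extent` and `get` are assumptions for illustration, not X-Engine's actual API (which is C++).

```python
# Hypothetical sketch of the L1 read path: route through the index block,
# then search the target data block. Illustrative only.
from bisect import bisect_left

class Extent:
    def __init__(self, blocks):
        # blocks: list of sorted (key, value) lists, one per 16-KB block
        self.blocks = blocks
        # index block: the last key of each data block routes a lookup
        self.index = [blk[-1][0] for blk in blocks]

    def get(self, key):
        # Step 1: binary-search the index block for the candidate block
        i = bisect_left(self.index, key)
        if i == len(self.blocks):
            return None
        # Step 2: search the data block itself for the record
        blk = self.blocks[i]
        j = bisect_left(blk, (key,))
        if j < len(blk) and blk[j][0] == key:
            return blk[j][1]
        return None

extent = Extent([[("a", 1), ("c", 2)], [("f", 3), ("k", 4)]])
print(extent.get("f"))  # -> 3
```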
Based on the preceding logic of implementing the read path and a Perf analysis on X-Engine's read processes of in-memory hits, we tried to make three improvements. (1) Optimize the encoding and searching instructions of data pages. (2) Reduce the management overhead of the buffer pool. (3) Optimize the cache flushing problem of multi-threaded operation on multiple cores. With these improvements, we managed to record a 124% improvement in the overall read performance.
In the next section, we will describe the root causes of the three problems and our solutions in detail. Meanwhile, we will evaluate the benefits of the optimization methods in each step through tests. In view of the generality of the three problems in database storage engines, these optimization methods are also applicable to InnoDB, RocksDB, and other engines.
As in RocksDB and InnoDB, index pages and data pages in X-Engine share the same format. Each block stores key-value pairs in order and applies segmented key prefix compression, where each segment is called a restart interval. The key-value pairs are followed by a restart array, which stores the offset of each restart interval within the block. During a search, the engine first performs a binary search over the restart array to locate the smallest restart interval that may contain the target key, as shown in figures 1 through 4. Then, the engine compares keys sequentially from the start of that interval, as shown in figure 5.
The search process is essentially a binary search in a pointer array. The restart interval offset stored in the restart array can be understood as a pointer to the restart interval. Perf analysis shows that this process has the following problems.
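The two-phase search just described can be sketched as follows. The layout here (fixed interval size, restart array modeled as a list of first keys) is a simplifying assumption for illustration, not X-Engine's exact on-disk format.

```python
# Toy model of a block: records grouped into restart intervals, with a
# restart array at the block tail locating each interval.
from bisect import bisect_right

INTERVAL = 4  # records per restart interval (illustrative)

def build_block(records):
    """records: sorted (key, value) pairs -> (intervals, restart_keys)."""
    intervals = [records[i:i + INTERVAL] for i in range(0, len(records), INTERVAL)]
    # the restart array lets us binary-search by each interval's first key
    restart_keys = [iv[0][0] for iv in intervals]
    return intervals, restart_keys

def block_get(block, key):
    intervals, restart_keys = block
    # 1) binary search the restart array for the last interval whose
    #    first key is <= the target key
    i = bisect_right(restart_keys, key) - 1
    if i < 0:
        return None
    # 2) scan that interval sequentially (where the real format would
    #    prefix-decompress each key before comparing)
    for k, v in intervals[i]:
        if k == key:
            return v
    return None

block = build_block([(f"key{n:03d}", n) for n in range(10)])
print(block_get(block, "key007"))  # -> 7
```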
To solve the CPU cache unfriendliness of binary searches over pointer arrays, we tried converting frequently accessed data pages into a CPU-cache-friendly two-layer B+ tree structure, clustering records that are accessed close together in time so that they are adjacent in physical space. The following figure shows the logical storage model.
The following figure shows the physical storage model.
Each node in a block is built in sequence based on the physical storage model, and data query follows a method similar to that of a B+ tree. The CPU-cache-friendly two-layer B+ tree structure improves CPU cache utilization from three perspectives.
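The physical storage model can be sketched as one flat array with the root node first and the leaf nodes laid out back-to-back behind it, so each node probe touches one contiguous, cache-line-friendly run of keys. The fanouts (16 for the root, 8 for leaves) match the values the article settles on later; all names are illustrative assumptions.

```python
# Flat-array sketch of the two-layer in-block B+ tree (illustrative).
ROOT_FANOUT, LEAF_FANOUT = 16, 8

def build_block(keys):
    """keys: sorted list (at most ROOT_FANOUT * LEAF_FANOUT of them)."""
    leaves = [keys[i:i + LEAF_FANOUT] for i in range(0, len(keys), LEAF_FANOUT)]
    root = [leaf[-1] for leaf in leaves]            # max key of each leaf
    root += [root[-1]] * (ROOT_FANOUT - len(root))  # pad root to fixed fanout
    flat = list(root)
    for leaf in leaves:                             # leaves stored contiguously
        flat += leaf + [leaf[-1]] * (LEAF_FANOUT - len(leaf))
    return flat, len(leaves)

def search(block, key):
    flat, n_leaves = block
    root = flat[:ROOT_FANOUT]
    # one probe in the root: first separator >= key names the leaf
    i = next((k for k, sep in enumerate(root) if sep >= key), None)
    if i is None or i >= n_leaves:
        return False
    start = ROOT_FANOUT + i * LEAF_FANOUT
    return key in flat[start:start + LEAF_FANOUT]   # one contiguous leaf probe
```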
This encoding format can well support range queries, and enables point-query-like acceleration for the first seek operation of a range query, without any negative impact on subsequent SCAN iterations. For more information about the optimization of B+ tree layers, see the previous article under this WeChat public account: Let Database Engines Bring out the Best of CPU Cache.
With the cache access problem for data page record queries solved, further Perf analysis shows that the compare function on the query path still consumes a large share of CPU cycles, and this computing overhead becomes even more prominent for long keys. In the new data page encoding, the multiple index keys that need to be compared are loaded into the CPU cache together on each read. Introducing a Single Instruction, Multiple Data (SIMD) instruction set therefore allows multiple index keys already in cache to be compared in parallel by a single instruction.
To use the SIMD instruction set, we need to further adjust X-Engine's data page encoding, as shown in the following figure.
Perform lossy compression on the keys. A segment, the processing unit within a SIMD register, is at most 64 bits long (in the AVX-512 instruction set, for example). Naively cutting a long key into segments for parallel comparison introduces enough extra logic to cancel out the gains of parallel computing. Instead, we split each key into three parts: the common prefix, the significant N bits, and the suffix. Skipping the shared prefix avoids redundant comparisons, and the significant N bits are highly distinguishing, where N depends on the segment size of the SIMD register. In most cases a single SIMD comparison yields the result; only when the significant N bits are equal do we need to compare the suffixes of a few index keys sequentially.
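The three-way split can be sketched as follows, with N = 64 bits (8 bytes) matching the maximum segment width mentioned above. Treating the significant bytes as a big-endian integer preserves lexicographic order, so one integer compare stands in for one SIMD segment compare. Function names and the null-padding convention are assumptions for illustration.

```python
# Sketch: split a key into (significant N bits, suffix) past a known
# common prefix, and compare the significant bits first.
def split_key(key: bytes, prefix_len: int, n_bytes: int = 8):
    body = key[prefix_len:]
    # big-endian integer over the first n_bytes keeps lexicographic order
    sig = int.from_bytes(body[:n_bytes].ljust(n_bytes, b"\x00"), "big")
    return sig, body[n_bytes:]

def compare(a: bytes, b: bytes, prefix_len: int):
    # 1) compare the significant bits: one segment-sized compare,
    #    which decides the result in most cases
    sa, ra = split_key(a, prefix_len)
    sb, rb = split_key(b, prefix_len)
    if sa != sb:
        return -1 if sa < sb else 1
    # 2) only on a tie fall back to the (rarely needed) suffix compare
    return (ra > rb) - (ra < rb)
```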
Select a fixed fanout value and apply padding. The processing bandwidth of a SIMD instruction, that is, the number of segments in the SIMD register, is a power of 2. If the fanout were kept at the square root of N, the preparation step would require many additional operations. To keep the node length CPU-cache-friendly, match the SIMD processing bandwidth, and accommodate the range of restart key counts in a data block, we fixed the leaf node fanout at 8 and the root node fanout at 16. Nodes with fewer keys than the fanout are padded with the node's largest key. The SIMD comparisons can then be hard-coded, reducing branch prediction overhead.
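The padded fixed-fanout probe can be sketched as follows. With hardware SIMD this would be one packed compare plus a movemask and popcount; Python has no intrinsics, so the data-parallel compare is emulated with a comprehension. The fanout of 8 matches the leaf fanout above; names are illustrative.

```python
# Sketch: pad a node to a fixed fanout, then select the child slot by
# counting separators less than the search key (a branch-free compare).
FANOUT = 8

def build_node(separators):
    # pad the node to the fixed fanout with its largest key
    return separators + [separators[-1]] * (FANOUT - len(separators))

def find_child(node, key):
    # emulates: SIMD packed compare of all 8 segments, then popcount
    return sum(1 for sep in node if sep < key)

node = build_node([10, 20, 30, 40, 50])
print(find_child(node, 35))  # -> 3 (child covering keys in (30, 40])
```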
Recoding index and data pages and accelerating searches with SIMD instructions together improve point query performance by up to 20% in tests with different row lengths. In a TP system, the amount of data read or written by a single request is very small; in our test scenario, only one record is operated on at a time. The data page encoding and SIMD optimization therefore improve end-to-end performance by a solid 20%.
In-memory databases and on-disk databases follow different design principles. A major difference lies in data page access: an in-memory database tends to use a pointer for direct access, whereas an on-disk database generally maps a page ID to the page's memory address through a management structure, reading from disk first if the page is not cached, before finally accessing the page. This indirection incurs additional search overhead. Previous studies have shown that the buffer pool management structure can consume more than 30% of CPU resources even when the dataset fits entirely in memory.
The virtual memory address space of the operating system is sufficient relative to the dataset size of a database. One possible solution is to use pointers for direct access in an on-disk database engine. When the memory space is insufficient, the swap mechanism of the operating system can be employed to replace memory pages. In this way, the storage engine has the same performance as an in-memory database when the dataset size is smaller than the free memory capacity, because the overhead for page ID mapping and search has been eliminated.
However, this method has a defect: when the dataset is larger than the free physical memory, the operating system's swap mechanism cannot perceive the data access patterns inside the database and may swap out a critical memory object, causing performance fluctuations.
X-Engine is a general-purpose storage engine, and the datasets it serves exceed the free memory capacity in most cases, so a buffer pool must be introduced to manage access to data pages. However, with our design's precise separation of hot and cold data, most queries hit memory whenever the user load shows clear locality. In that case, spending 30% of CPU resources on buffer pool management is sheer waste and needs to be eliminated.
The extent information, index pages, and data pages in X-Engine are managed by the buffer pool cache. Specifically, the cache manages data by using hash tables and doubly linked lists, obtains data by using hash-based searches, and maintains the sequence information of the least recently used (LRU) linked lists after hits. The overhead is incurred from the following three aspects.
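A minimal model of such a cache makes the per-hit bookkeeping visible: every access pays a hash lookup, and every hit additionally maintains the LRU order. This sketch uses an `OrderedDict` to play both roles; X-Engine's real cache is sharded and lock-protected, so this is illustrative only.

```python
# Toy buffer-pool cache: hash table for lookup + LRU order maintenance.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()   # page_id -> page object

    def get(self, page_id):
        page = self.table.get(page_id)        # hash lookup on every access
        if page is not None:
            self.table.move_to_end(page_id)   # LRU maintenance on every hit
        return page

    def put(self, page_id, page):
        self.table[page_id] = page
        self.table.move_to_end(page_id)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)    # evict the least recently used
```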
PointSwizzling is a method that optimizes away the CPU overhead of the buffer pool's indirect references by using direct pointers instead of page ID + HashMap lookups. To tackle the high computing overhead of hash-based searches in the X-Engine cache, we introduce direct access pointers for data access and use the cache only for data management. The specific solution is described as follows.
After a target index page, data page, or extent meta is loaded into memory for the first time, we attach a direct access pointer to the referencing object, so subsequent accesses use the pointer directly and skip the HashMap computation.
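This swizzling step can be sketched as follows: a reference starts as a page ID resolved through the cache's hash map, and after the first hit it is replaced with a direct pointer. Class and field names are illustrative assumptions, not X-Engine's internals.

```python
# Sketch of pointer swizzling: page-ID indirection on the first access,
# direct pointer afterwards.
class BlockRef:
    def __init__(self, page_id):
        self.page_id = page_id
        self.direct = None          # filled in after the first access

    def resolve(self, cache):
        if self.direct is not None:  # fast path: swizzled direct pointer
            return self.direct
        page = cache[self.page_id]   # slow path: hash-map lookup
        self.direct = page           # swizzle for subsequent accesses
        return page

cache = {42: "page-42-contents"}
ref = BlockRef(42)
assert ref.resolve(cache) == "page-42-contents"  # first access via hash map
assert ref.direct is not None                    # now swizzled
```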
Concurrent access to and reclamation of target memory objects. With multiple accessing threads, one thread may trigger cache eviction while another thread is still using the memory object that is to be eliminated and whose space is to be reclaimed. There are two solutions to this problem: epoch-based memory reclamation and reference-counting-based memory reclamation. We tested both and concluded that the epoch-based approach outperforms reference counting; for the specific epoch-based implementation, see the appendix. The reason is that although the reference count is an atomic variable, its frequent updates and reads by many cores cause repeated cache-line invalidations, so its performance falls short in a multi-core environment.
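The epoch-based scheme can be illustrated with a highly simplified, single-threaded toy model: readers announce the global epoch on entry, and an object retired in epoch e is only freed once every active reader has moved past e. All names are illustrative; a real implementation uses per-thread epoch slots and atomic operations.

```python
# Toy epoch-based reclamation (single-threaded model, illustrative only).
class EpochManager:
    def __init__(self):
        self.global_epoch = 0
        self.active = {}        # thread_id -> epoch it entered in
        self.retired = []       # (epoch, object) awaiting reclamation

    def enter(self, tid):
        self.active[tid] = self.global_epoch    # reader announces its epoch

    def exit(self, tid):
        del self.active[tid]

    def retire(self, obj):
        # evicted object: a reader may still hold a pointer, so defer freeing
        self.retired.append((self.global_epoch, obj))

    def advance(self):
        self.global_epoch += 1
        # free only objects retired before the oldest epoch still in use
        oldest = min(self.active.values(), default=self.global_epoch)
        freed = [o for e, o in self.retired if e < oldest]
        self.retired = [(e, o) for e, o in self.retired if e >= oldest]
        return freed
```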
With PointSwizzling, X-Engine accesses in-memory data with the CPU efficiency of an in-memory database, while the buffer pool continues to manage eviction and replacement to ensure a high memory hit rate. The introduction of PointSwizzling improves X-Engine's point query performance through the key-value interface by nearly 60% in in-memory hit scenarios.
Data page encoding and PointSwizzling both optimize the single-threaded execution path: the new data page encoding improves the CPU cache hit rate, whereas SIMD and PointSwizzling improve the efficiency of instruction execution, that is, the program's instructions per cycle (IPC).
After we analyzed the overall execution framework of X-Engine, we found another cause of low CPU cache utilization: multi-threading concurrency.
X-Engine is a storage engine in ApsaraDB RDS for MySQL and PolarDB MySQL. Its execution framework is similar to that of MySQL, implementing a one-thread-per-client or thread pool mode. For such a framework, any thread can access the data in all tables. On today's servers with 32, 64, or even 128 cores, the highly concurrent execution threads cause the following problems.
To solve these problems, industry pioneers such as H-Store and VoltDB have provided an optimization method for reference.
We made the following modifications to X-Engine's execution framework.
We set the number of active data-reading threads to the number of available CPU cores and bind specific threads to specific cores for execution to avoid frequent thread scheduling.
We also partitioned all subtables in X-Engine. A given table partition can only be accessed by threads running on specific cores, which prevents the same data from being replicated across, or bounced between, the caches of different cores.
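The two modifications above amount to a routing rule: each partition is owned by exactly one worker, and a request is dispatched to the owner of its subtable, so no data structure is shared across cores. The sketch below models the routing only (a real implementation additionally pins each worker thread to a CPU core); all names and the modulo partitioning scheme are illustrative assumptions.

```python
# Sketch of per-core shared-nothing dispatch: one owner worker per subtable.
NUM_WORKERS = 4

def owner_of(table_id):
    # static partitioning: a subtable is always served by the same worker
    return table_id % NUM_WORKERS

class Worker:
    def __init__(self):
        self.tables = {}            # private state: no locks or atomics needed

    def get(self, table_id, key):
        return self.tables.get(table_id, {}).get(key)

    def put(self, table_id, key, value):
        self.tables.setdefault(table_id, {})[key] = value

workers = [Worker() for _ in range(NUM_WORKERS)]

def dispatch_put(table_id, key, value):
    workers[owner_of(table_id)].put(table_id, key, value)

def dispatch_get(table_id, key):
    return workers[owner_of(table_id)].get(table_id, key)
```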
This per-core-level shared-nothing architecture has the following advantages.
High CPU cache efficiency. A given piece of data is cached in the L1 or L2 cache of only one CPU core, cutting cache coherence overhead, and threads incur no context switches, which would otherwise invalidate caches.
Greatly reduced overhead for thread synchronization. Data is not shared between foreground threads, so inter-thread synchronization requires no locks or atomic operations. This eliminates cache lock contention, the swizzle and unswizzle churn of pointer swizzling, and memory reclamation problems, among others.
This method can improve the read-only performance by about 30%. However, this solution has not been fully implemented. On the one hand, we need to further optimize the access to hot subtables. On the other hand, cross-table access is common in transaction engines. If a thread runs a transaction that needs to access multiple tables, the scheduling of the transaction is very challenging.
CPU: 2 sockets, 64 hardware threads
L1 cache: 32-KB data cache and 32-KB instruction cache per core
L2 cache: 256 KB
L3 cache: 32 MB
RAM: 512 GB, with NUMA effects neutralized by running the "numactl --interleave=all" command
Each key is 16 bytes and each value is 8 bytes; there are 32 tables, each containing 20 million records. After the data is written, the tables are flushed to the L1 layer. A warm-up read pass loads the data into the cache, and then the readrandom test is run.
We used the random read performance of the original X-Engine architecture as the baseline and then applied the preceding optimization methods in sequence.
First, we added the new data page encoding and the SIMD search algorithm on data pages. As a result, the performance was improved by 13.8%.
Second, we added the PointSwizzling optimization in the code. As a result, the performance was improved by 90% over the baseline. Finally, we reconstructed the X-Engine execution framework to introduce the multi-core shared-nothing architecture.
As a result, the overall performance was 123% higher than the baseline. We also analyzed the changes in the CPU L1 data cache miss rate and in the program's instructions per cycle (IPC) during the tests.
To identify the paths in which our optimization methods matter most, we broke down the performance improvement contributed by each method and summarized the findings in the following table. The table shows that the most marked improvement came from changing indirect references to data blocks into direct references. Given the large number of data blocks, and the large hash lookup table in the original implementation with its poor cache efficiency, this is not surprising. By contrast, index blocks and ExtentMeta objects are far fewer, roughly 1/128 the number of data blocks, so their hash lookups are already comparatively efficient and direct references to them yield a smaller performance gain.