What Capabilities Should an Excellent Database Storage Engine Have

Guided reading
The author of this article is Qu Shan, a senior technical expert in Alibaba's OLTP database team. As the person in charge of the self-developed high-performance and low-cost storage engine X-Engine, what capabilities should an excellent relational database storage engine in Qu Shan's eyes have?

The database kernel is divided into two layers: SQL & Storage. The SQL Layer is responsible for passing the SQL statement you enter through a series of steps (
parse/resolve/rewrite/optimize...) is converted into a physical execution plan, and is responsible for the execution of the plan. The execution plan is usually in the form of a tree, and the leaf node (executor operator) part of the tree is often responsible for data operations on a single table , these operation operators will be executed in the storage layer.

Therefore, the main job of a database storage engine is simply to access data, but the premise is to ensure the ACID (
atomicity/consistency/isolation/durability) semantics. The interface provided by the storage engine is actually relatively simple, mainly data writing/modifying/querying, transaction processing (start
transaction/commit/rollback...), modify the schema object/data dictionary (optional), data statistics, and some peripheral operation and maintenance or data import and export functions.

From a functional point of view, it seems that it is not difficult to implement a storage engine. Nowadays, there are many Key-Value Stores that have transformed themselves into database storage engines, which is nothing more than adding a set of transaction processing mechanisms. However, as the chassis of the database, a mature storage engine must consider efficiency, and how to achieve data access efficiently (maximizing performance/cost) has become the main consideration in making various trade-offs in design. It can be discussed from several main components of the storage engine:

data organization
The organization of data in memory and disk largely determines the efficiency of access. Different application scenarios have different choices. Typical examples are:

Data is stored in rows (NSM), which is friendly to transaction processing, because transaction data is always written in complete rows, which is mostly used in OLTP scenarios.
Stored by column (DSM), the same column values ​​in tuples are physically stored together, so that only the required columns need to be read, reducing a lot of I/O during large-scale data scans. In addition, the compression effect of column storage is better, which is suitable for OLAP scenarios, but transaction processing is not so convenient, and it is necessary to convert rows to columns. Therefore, most AP database transaction processing efficiency is not very high, and some even only support batch import.
Mixed storage (FSM), mixed layout of rows and columns, data is first grouped by row (Segment, SubPage), and DSM is used in the group to organize, such as PAX, RCFile, and there is also first group by column (Column Group), the specified columns in the group are NSM organizations such as Peloton's Tile. This format attempts to combine the advantages of both NSM and DSM to achieve the purpose of handling mixed load (HTAP), but it also inherits the shortcomings of both.
Therefore, when making a storage engine, you must face the problem of which storage format to choose from the beginning. Even if a major category is selected, there are countless details to be considered in each format, how to encode the fields of each data type, how to store null/not null in row storage, whether column indexing is required to speed up project operation, and whether Rearranging column values, how to compress data in column storage, etc., all need to be balanced between storage space and access speed.

In order to cope with complex application scenarios, modern databases often use more than one storage format. For example, Oracle has In-memory Column Store to convert row-stored pages into column-stored pages in memory to speed up complex queries.

When the data storage format is selected, it is also necessary to select the aggregation method of the data in disk and memory. Taking row-by-row storage as an example, most storage engines use fixed-size pages to store several consecutive rows. Of course, how to arrange data rows consecutively, there are two types of heap table (random) and index-organized table (in index order). Now the more popular LSM-Like storage engine uses pages of indeterminate size (called DataBlock), which only supports pressing The primary key index is aggregated in order; the main difference between the two methods is that the former is designed to be updatable, leaving space in each page, while the latter is read-only, and the data is stored tightly without padding, which is easy to compress. The difference between the two is actually caused by the big difference in the transaction processing mechanism, which will be discussed later.

For In-Memory Database, the way of data organization will be quite different, because there is no need to exchange data between memory and persistent storage, and the page form is generally not used in memory, but the index storage structure is directly used (such as B +Tree) directly indexes to records (tuples), without indirect reference at the page level, reducing cpu cache misses.

Cache management
The granularity of the cache is generally page, and the key lies in the cache replacement algorithm. Currently widely used LRU, LFU, ARC.. and various variant algorithms are used in the database. Another problem is how to manage memory more efficiently. In this regard, fixed-length pages have more advantages than variable-length pages.

Of course, the impact of various query patterns on the cache must also be considered. If there are many single-row queries, it will be more efficient to use a more fine-grained (such as row) cache, but the elimination strategy will be more complicated, and many new researches have begun to try to introduce machines. Learn methods to optimize cache elimination algorithms, and manage caches efficiently.

Transaction processing
The core of the storage engine ensures the ACID of the database. To ensure D, everyone's approach is similar. They all write WAL (Write Ahead Log) for recovery. The key is how to efficiently implement ACI, which is the so-called multi-version concurrency control (MVCC) mechanism.

The complete implementation of MVCC is relatively complicated and will not be elaborated for now. The key here is how to deal with data races during concurrent execution, including write-write conflicts and read-write conflicts; because the load of the database is generally more reads than writes Yes, to be efficient, read-only transactions cannot be blocked by read-write transactions, which requires that our writes cannot directly update the current data, but must have a set of capabilities to maintain multi-version data. The current storage engine manages multi-version There are two ways of data:

The written data is updated in place, and the updated old version is written to the undo chain. The writing cost is high, and the transaction processing is complicated, but it is efficient to recycle the old version of the data.
Writing data does not directly update the original data, but appends it to a new version. The writing cost is small, but reading, especially scanning, requires many levels of reading. The more serious problem is that recycling the old version of the data needs to be compacted. The price is high.
The former called ARIES algorithm is used by most of the mainstream database storage engines, and the latter structure called LSM-Tree is also used by many new storage engines and has received more and more attention.

The difference from the KV store is that the database has a strict schema, so the records in most storage engines are structured. When many KV stores are used as database storage engines, they do a layer of conversion in the middle and convert the upper layer. The processed tuples are converted into binary key-values ​​in a specific encoding method, written to KVStore, and after being read to the upper layer, they are interpreted as tuples format by the schema for the upper layer to process.

This approach certainly works, but many optimizations cannot be implemented: a. The data iteration must be the entire row, even if only one of the columns is needed, the serialization/deserialization overhead is inevitable. b. The work of project and filter cannot be delegated to the storage layer for processing; c. Without column information, it is impossible to do column-by-column coding and compression. d. Schema change can only reshape data violently... Therefore, in order to be truly efficient, more and more storage engines choose to be fully aware of the schema and store the subtle structure.

The above discussion is only a few big problems of the storage engine of a stand-alone database, and modern databases have put forward higher requirements for the storage engine, scalability, and high availability have become standard, now what to consider is how to give you The storage engine plus distributed capabilities, which in turn involve a series of more complex issues such as high availability consistency guarantee, automatic expansion, distributed transactions, etc., are far beyond the scope of this article and require a separate chapter.

Finally, let's introduce the self-developed distributed database X-DB we are developing. The storage engine uses our self-developed X-Engine. X-Engine uses a layered data storage architecture, because the goal is to store large-scale mass data, provide high concurrent transaction processing capabilities and reduce costs as much as possible.

We divide the data into multiple levels according to the data access frequency (hot and cold), design the corresponding storage structure according to the access characteristics of each level of data, and write to the appropriate storage device. X-Engine uses LSM-Tree as the architectural basis for tiered storage, and has been redesigned on top of it.

In short, the hot data layer and data update use memory storage, and use a large number of in-memory database technology (Lock-Free index structure/append only) to improve the performance of transaction processing. Several stages of processing are parallelized, greatly improving throughput. The cold (warm) data with low access frequency is gradually eliminated or merged into the persistent storage layer, combined with the current rich storage device hierarchy (NVM/SSD/HDD) for storage.

We have made a lot of optimizations on the compaction process that has a relatively large performance impact, mainly by splitting the data storage granularity, using the features that data update hotspots are more concentrated, reusing data as much as possible in the merging process, finely controlling the shape of the LSM, reducing I/O and computational costs, and at the same time greatly reduce the space enlargement in the merge process. At the same time, more fine-grained access control and caching mechanisms are used to optimize read performance. Of course, the optimization is endless, and thanks to the rich application scenarios, we have gained a lot of engineering experience in it.

X-Engine is now more than a stand-alone database storage engine, combined with our X-Paxos (distributed strong consistent high availability framework), GMS (distributed management service), and X-Trx (distributed transaction processing framework), has evolved It is a distributed database storage system.

Author: Seven Acts

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us