ClickHouse kernel analysis-MergeTree storage structure and query acceleration-Alibaba Cloud Developer Community

Note: The following analysis is based on open-source v19.15.2.2-stable


ClickHouse is a popular open-source columnar storage analytic database recently. Its core features are ultimate storage compression ratio and query performance. I am learning ClickHouse this product recently. From my personal perspective, storage is the key to determining the core competitiveness and application scenarios of a database. Therefore, I will publish a series of articles to analyze the most important ClickHouse the storage kernel in the MergeTree. This article introduces the storage format of MergeTree and thoroughly analyzes the ultimate retrieval performance of MergeTree storage.

MergeTree storage

MergeTree thought

when it comes to the word MergeTree, you may think of LSM-Tree data structure. We often use it to solve the performance problem of random disk writing. The core idea of MergeTree is the same as that of LSM-Tree. MergeTree storage structure needs to sort the data written by users and then store them in order. Ordered storage brings two core advantages:

• When the column storage file is compressed by block, the column values in the sorting key are continuous or repeated, so that the data compression of the column storage block can obtain the ultimate compression ratio.

Storage order itself is an index structure that can accelerate queries, according to the equivalent condition or range condition of the column in the sorting key, we can quickly find the approximate position range of the target row (described in detail below), this index structure does not incur additional storage overhead.

Everyone from ClickHouse official document found on a series of MergeTree table engine, including based MergeTree, have data to heavy ability of ReplacingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, have data aggregation ability of SummingMergeTree, aggregatingMergeTree. However, these MergeTree table engines with special capabilities have no difference in storage and basic MergeTree. They all add "additional merging logic" in the process of data merging ", this section describes the asynchronous Merge mechanism of MergeTree in detail.

MergeTree storage structure

to help you understand the storage structure of a table, the following table lists the DDL of a POC user. We will analyze the kernel design of MergeTree storage from this table. From the PARTITION BY statement of DDL, we can see that you have created data partitions at the hourly granularity of each zone server, and the data inside each data PARTITION is based on (action_id, scene_id, time_ts, level, uid) is used as the sorting key for orderly storage.

CREATE TABLE user_action_log (
  `time` DateTime DEFAULT CAST('1970-01-01 08:00:00', 'DateTime') COMMENT '日志时间',
  `action_id` UInt16 DEFAULT CAST(0, 'UInt16') COMMENT '日志行为类型id',
  `action_name` String DEFAULT '' COMMENT '日志行为类型名',
  `region_name` String DEFAULT '' COMMENT '区服名称',
  `uid` UInt64 DEFAULT CAST(0, 'UInt64') COMMENT '用户id',
  `level` UInt32 DEFAULT CAST(0, 'UInt32') COMMENT '当前等级',
  `trans_no` String DEFAULT '' COMMENT '事务流水号',
  `ext_head` String DEFAULT '' COMMENT '扩展日志head',
  `avatar_id` UInt32 DEFAULT CAST(0, 'UInt32') COMMENT '角色id',
  `scene_id` UInt32 DEFAULT CAST(0, 'UInt32') COMMENT '场景id',
  `time_ts` UInt64 DEFAULT CAST(0, 'UInt64') COMMENT '秒单位时间戳',
  index avatar_id_minmax (avatar_id) type minmax granularity 3
) ENGINE = MergeTree()
PARTITION BY (toYYYYMMDD(time), toHour(time), region_name)
ORDER BY (action_id, scene_id, time_ts, level, uid)
PRIMARY KEY (action_id, scene_id, time_ts, level);

The MergeTree storage structure of the table is as follows: MergeTree storage structure of a table, each data partition is independent of each other and is not logically associated. Multiple MergeTree Data Part exist in a single Data partition. Once these Data Part are generated, they are Immutable. The generation and destruction of Data Part are mainly related to writing and asynchronous Merge. The write link of MergeTree table is an extreme batch load process, Data Part does not support a single append insert. Each batch insert generates a new MergeTree Data Part. If you insert a record at a time, an independent Data Part is generated for that record, which is unacceptable. When we use the MergeTree table engine, we need to perform aggregation on the client for batch writing or create a MergeTree table based on the Distributed table to proxy MergeTree the writing and query of the table, by default, Distributed table caches the user's written data. If the data volume exceeds a certain period of time or exceeds a certain period of time, it will be asynchronously forwarded to MergeTree table. MergeTree storage engines require high real-time data visibility. The preceding figure shows the core Part of the disk files in a single MergeTree Data Part (only the storage files related to action_id and avatar_id columns are drawn). There are three main functional categories:

1. Data Files: action_id.bin, avatar_id.bin, and so on are column storage files compressed by block for a single column. ClickHouse uses a very extreme column storage mode. Here, some details are shown. A single column data may correspond to multiple column storage files, for example, if you declare a Nullable field, a nullable column is added. If you declare an Array field, an array size column is added, when dictionary compression is used, the dictionary Key is also converted into a separate column storage file. There is one small tip: When users do not need special identifiers of Null values, it is best not to declare Nullable, which is ClickHouse extremely simplified design idea.

2 Mark identification files: action_id.mrk2 and avatar_id.mrk2 are the Mark marks in the column storage file. The Mark Mark is related to two important concepts in MergeTree column storage: Granule and Block.

  • Granule is a logical concept used to divide data by row. As for how many rows are a Granule, in the old version, this is a constant set with the index_granularity parameter, that is, every certain row is a Granule. In the current version, the index_granularity_bytes parameter affects the number of rows in the Granule. This parameter specifies that the sum size of all columns in each Granule must not exceed the specified value. The main problem of setting the fixed-length Granule in the earlier version is that the data in the MergeTree is indexed by Granule granularity. This rough index granularity is used in the scenario of analyzing super large wide tables, from the storage read data size will Expansion badly, need user very cautious to set parameters.
  • Block is a compression unit in a column storage file. The Block of each column storage file contains several Granule. The specific number of Granule is controlled by the min_compress_block_size parameter. Each time a Granule data is written in the Block of a column, it checks whether the current Block Size reaches the set value. If the value is reached, it compresses the current Block and writes it to the disk.
  • From the above two points, we can see that MergeTree Block is neither a data size nor a fixed number of rows, and Granule is not a fixed-length logical concept. Therefore, we need additional information to quickly find a Granule. This is the function of Mark identification file, which records the number of rows of each Granule and the offset of the Block in which it is located in the column stored compressed file, the offset position of the Granule in the decompressed Block.

3 primary key index: primary.idx is the primary key index of the table. The definition of primary key index in ClickHouse is slightly different from that in traditional databases. Its primary key index does not have the meaning of primary key deduplication, but it still has the ability to quickly find primary key rows. ClickHouse primary key index stores the primary key values of the starting row in each Granule, while the data stored in MergeTree is strictly sorted by primary key. Therefore, when querying a given primary key condition, we can determine the possible Granule Range of the data based on the primary key index, and then combine the Mark Mark described above, we can further determine the position interval of data in the column store file. ClickHoue primary key index is a rough index with relatively balanced index construction cost and index efficiency. By default, the primary key sequence of the MergeTree is the same as that of the Order By sequence, but you can define the primary key sequence as a partial prefix of the Order By sequence.

4 partition Key indexes: minmax_time.idx and minmax_region_name.idx are the partition key indexes of the table. MergeTree stores the maximum and minimum values of partition keys in each Data Part. When the user query contains partition key conditions, irrelevant Data Part can be directly eliminated, this is a partition pruning technique commonly used in OLAP scenarios.

5Skipping index: **is the MinMax index defined in the avatar_id column. The Merge Tree in the Skipping Index is a rough Index of a type of local aggregation. When defining a skipping index, you need to set the granularity parameter. The granularity parameter specifies the number of Granule of data to be aggregated to generate index information. You also need to set aggregate functions for indexes, such as minmax, set, bloom_filter, and ngrambf_v1. Aggregate functions generate index information by counting column values in several consecutive Granule. The idea of Skipping indexes is similar to that of primary key indexes, because data is sorted by primary key, and the statistics of primary key indexes are actually the MinMax value of the primary key sequence at each Granule granularity, skipping index provides a variety of aggregate functions, which is a supplement to the primary key index. In addition, both indexes need to be designed based on your own business scenarios based on your understanding of the indexing principles.

MergeTree query

this chapter mainly analyzes the data query process on ClickHouse table engine based on MergeTree source code. I roughly divide this process into two parts: index retrieval and data scanning. The index retrieval section performs a serial operation on each MergeTree Data Part, but there is no correlation between Data Part searches. In the Data scan section, the bottom-level column scan is performed in parallel with all Data Part, and there is no correlation between column scan of each Data Part.

Index Search

when a select query is received, MergeTree stores the KeyCondition that extracts the partition keys and primary key conditions from the query. KeyCondition class implements the following three methods to determine the Mark Range that the filter conditions may meet. As mentioned in the previous chapter, the column storage Data in the MergeTree Data Part is indexed by the Mark-identified array at the granularity of Granule, and the Mark Range indicates the subscript interval in the Mark-identified array that meets the query criteria.

/// Whether the condition is feasible in the key range.
    /// left_key and right_key must contain all fields in the sort_descr in the appropriate order.
    /// data_types - the types of the key columns.
    bool mayBeTrueInRange(size_t used_key_size, const Field * left_key, const Field * right_key, const DataTypes & data_types) const;
    /// Whether the condition is feasible in the direct product of single column ranges specified by `parallelogram`.
    bool mayBeTrueInParallelogram(const std::vector<Range> & parallelogram, const DataTypes & data_types) const;
    /// Is the condition valid in a semi-infinite (not limited to the right) key range.
    /// left_key must contain all the fields in the sort_descr in the appropriate order.
    bool mayBeTrueAfter(size_t used_key_size, const Field * left_key, const DataTypes & data_types) const;

In the Index retrieval process, the partition key KeyCondition is used to crop out irrelevant data partitions, then the primary key Index is used to select rough Mark Range, and then the Skipping Index is used to filter the Mark Range generated by the primary key Index. The algorithm of selecting rough Mark Range with primary key indexes is a process of continuously splitting Mark Range, and the returned result is a set of Mark Range. The initial Mark Range covers the entire MergeTree Data Part interval. Each split will take out the last split Mark Range and split it into finer-grained Mark Range according to a certain granularity step, then eliminate the Mark Range that do not meet the conditions in the split result, and stop the split when the Mark Range reaches a certain granularity. This is a simple and efficient rough filtering algorithm.

Before using Skipping Index to filter the Mark Range returned by the primary key Index, you need to construct a Skipping Index for each IndexCondition. Different Skipping Index aggregate functions have different IndexCondition, however, the interface that determines whether the Mark Range meets the conditions is similar to the KeyCondition.

Data Sampling

after the index filtering in the previous section, we have obtained the set of Mark Range to be scanned, and the next step is the data scanning section. This section describes how to implement the data MergeTree in the Sampling. It is not implemented in the process of data scanning, but has been completed in the process of index retrieval. This practice is to achieve extreme sample efficiency. When creating a table, you can specify a column or expression in the primary key as the Sampling key. ClickHouse use the simple and rude method here: the value of the Sampling key must be of the numeric type, and the system assumes that its value is a state of random and uniform distribution. If the value type of the Sampling key is Uint32, when we set the sample ratio to 0.1, the sample is converted into a filter condition during index retrieval: the value of the Sampling key Sampling function, otherwise Sampling deviation may occur. Generally, we recommend that Sampling key is the column value plus a Hash function to randomly break up.

Data scanning

the data scanning section of MergeTree provides three different modes:

  • Final mode: This mode provides a Final merged data view for table engines such as CollapsingMergeTree and SummingMergeTree. As mentioned above, MergeTree advanced table engines based on MergeTree all adopt specific Merge logic for MergeTree Data Part. The problem is that MergeTree Data Part is an asynchronous Merge process, and users cannot see the final Data result without the final Merge into a Data Part. Therefore, ClickHouse provides a final query mode, which provides some advanced Data Part based on multiple BlockInputStream of each Merge Stream, such as DistinctSortedBlockInputStream and SummingSortedBlockInputStream, this logic is consistent with that in asynchronous Merge, so that users can see the final data results in advance.
  • Sorted mode: sort mode can be considered as a query acceleration optimization method for order by push-down storage. Because the Data in each MergeTree Data Part is ordered, therefore, when the order by condition of the sorting key is included in the user query, you only need to set a Data Part for orderly merging of Data on the BlockInputStream of each InputStream to achieve global order.
  • Normal mode: this is the most commonly used Data scanning mode for basic MergeTree tables. Parallel Data scanning is performed among multiple Data Part to achieve high throughput Data reading for a single query.

Next, we will introduce several key performance optimization points in Normal mode:

  • parallel scanning: the concurrency of Data scanning in traditional computing engines is mostly bound to the number of stored files. Therefore, MergeTree Data Part parallel scanning is a basic capability. However, MergeTree storage structure requires Data to be continuously mege and eventually merged into a Data Part, which is the most efficient way to index and Data compression. Therefore, ClickHouse adds MergeTree Data Part parallelism on the basis of Mark Range parallelism. You can set the parallelism during Data scanning. Each scan thread is assigned a Mark Range In Data Part-granularity Task, and multiple scan threads share Mark Range Task Pool, this avoids long tails in storage scans.
  • Data Cache: The data involved in MergeTree query link has different Cache designs. The primary key index and partition key index are loaded into the memory during load Data Part. The Mark file and the column storage file have corresponding MarkCache and UncompressedCache,MarkCache the binary content in the Mark file is cached directly, the UncompressedCache caches the decompressed Block data.
  • SIMD Deserialization: some column types are deserialized by using the handwritten sse command, which has some effects when the data hits the UncompressedCache.
  • PreWhere filtering: ClickHouse syntax supports additional PreWhere filtering conditions, which are determined before the Where condition. When you add a PreWhere filter condition to the SQL filter condition, the storage scan is performed in two phases. First, read the values of the columns that depend on the PreWhere condition, and then calculate whether each row meets the condition. This is equivalent to further narrowing the scanning Range based on the Mark Range. After PreWhere column scanning calculation, the ClickHouse adjusts the number of specific rows to be scanned in the Granule corresponding to each Mark, which is equivalent to discarding some rows at the beginning and end of the Granule.


as you read ClickHouse source code and learn more about its kernel implementation, I think ClickHouse is not yet a perfect analytic database. However, it still has many excellent performance optimization designs, which are originated from Yandex's real analysis scenarios and can indeed solve some business problems under massive data. I believe that in some business scenarios that are suitable for ClickHouse, it is a database that can provide users with the best performance experience.

Clickhouse product link:

more analysis articles will be released in the future. If you are interested, you can communicate and follow more. First, name the following articles:

Merge and MergeTree Mutation full parsing of MergeTree write links MergeTree Table management design: Alter, TTL, and hierarchical storage

Selected, One-Stop Store for Enterprise Applications
Support various scenarios to meet companies' needs at different stages of development

Start Building Today with a Free Trial to 50+ Products

Learn and experience the power of Alibaba Cloud.

Sign Up Now