An In-depth Discussion on the LSM Compaction Mechanism

This article discusses the LSM compaction mechanism, analyzes key-value stores in the industry, and goes through the most relevant academic research in this topic.

By Yeti

Introduction

Compaction is critical to systems that use the Log-structured Merge Tree (LSM-Tree) architecture. High-throughput writes due to the log appending mode and constant data flush from the memory to the disk when data exhausts the memory capacity are causing more and more overlap of data ranges and accumulation of data with the same key. Consequently, they have compromised the reading performance and caused space expansion. Therefore, the compaction mechanism was introduced to optimize reading performance and space problems by continuously recycling old data and merging multiple layers into one with periodic background tasks. However, the compaction policy and task scheduling method have become new problems. The seemingly simple functions require a balance among various resources, including space, I/O, CPU, and cache. This article will discuss the following aspects: the compaction policy, challenges, implementation of mainstream LSM systems, and academic research directions.

Compaction Policies

Compaction is used for garbage collection (GC) and merge sort on data, which is necessary for an LSM-Tree system to function. However, the compaction tasks lead to a lot of overhead, such as CPU resources consumed by data compression, decompression, copying and comparison, and also the disk I/O of data reads and writes. Therefore, compaction policies are needed to constrain the shape of the LSM-Tree and determine the files to be merged, the size of different tasks, and their trigger conditions. Different policies have different impacts on read-write amplification, space amplification, and the size of the temporary space. Generally, the system will support several policies and have multiple adjustable parameters, which can be selected according to application scenarios.

Basic Concepts

Read Amplification

Read amplification indicates the number of disk reads a read request causes.

Write Amplification

The write amplification n is defined as the bytes of data actually written to the disk when one byte of data needs to be written. In fact, write amplification needs to be balanced with global order. The higher the requirement of order, the more serious the write amplification will be. The B-Tree series are representatives of order at any time, and the write amplification is more serious for them. LSM-Tree postpones the sorting process to backend compaction, which reduces the write amplification. Furthermore, LSM-Tree has different compaction policies that result in different amplification.

Space Amplification

Space amplification equals the size of space occupied divided by the actual size of data. The space amplification is mainly related to the amount of uncollected but expired data, which is either old versions of data or deleted entries. Traditional databases use inexpensive disks, hence space amplification is not a major concern. Now, as solid-state drive (SSD) disks become the mainstream storage medium, we have to consider the cost.

Temporary Space

A compaction task requires a temporary space. Old data is deleted only after completing the compaction. The size of the temporary space depends on the sharding granularity of the compaction task.

Sstable

Sstable stands for ordered string table. Some systems use other names to describe the same concept. Here, when introducing the policies, we adopt sstable as the unified name.

Size-tiered Compaction

Size-tiered compaction with a low write amplification is suitable for write-intensive workloads. Its disadvantage is that the read amplification and space amplification are relatively high. Take an overview of the implementation of this policy, as shown in the following figure from Scylla. Memtable periodically flushes data to a sstable. Each level has multiple ordered runs, and the bottom level is the largest level. Size-tiered compaction tries to keep the size of sstables or runs close on the same level. When the data size of one level reaches the limit, the whole level is merged and flushed to the next level to become a larger sstable. It keeps merging in this way until the size of one level does not reach the limit.

The disadvantage of size-tiered compaction is that the read amplification and space amplification are relatively high.

For read operations, upper-level optimizations such as cache and bloom filter are ignored here. According to the amount of data to be read, most point lookups will fall to the largest level. Whereas, range lookups need to merge the results of all levels. This means the more the runs, the more severe the read amplification. In his paper, Dostoevsky analyzes the I/O costs of point lookups (considering bloom filter), short-range lookup, and long-range lookups. For more information, see Dostoevsky's paper.
The following analysis uses the case that the same batch of data is repeatedly updated or deleted. Assume that the size ratio of adjacent levels is T. When the number of runs in each level reaches T, the merging of the entire level is triggered. Then, space amplification is O(T). In actual operations, the temporary space needs to be considered and reserved to accommodate the amplification. In the Scylla test, when T is set to 4, the maximum space usage of 1.2 GB data reaches 9.3 GB, which shows that the tiered policy has a rigorous space requirement.

Leveled Compaction

The leveled compaction policy will reduce space amplification and read amplification.

An LSM-Tree consists of multiple levels. The size of each level is maintained at T times that of the previous level. When the size ratio of each adjacent level pair is the same, the write amplification is minimal.

As shown in the following figure, each leveled level is an ordered run that consists of multiple sstables. These sstables also maintain an orderly relationship with each other. When the data size of each level reaches the upper limit, this level will merge with the run of the next level. This method combines multiple runs of the level to one, reducing the read amplification and space amplification. Also, the smaller sstables provide fine-grained task splitting and control. This way, controlling the task size is actually controlling the size of the temporary space.

When the data volume ratio of adjacent levels is set to 10 and the largest level has enough data, the worst-case will be when all levels are updating the entries in the largest level. The data volume of other levels divided by that of the bottom level is about 0.11, hence the space amplification is 11%. Even if the largest level does not have enough data and fails to reach the data volume ratio, space amplification is only doubled in the worst case. Compared with the T (size ratio of adjacent levels) times space amplification of the size-tiered policy, the leveled policy results in a lower space amplification and is more suitable for scenarios with many reads and few writes.

The problem facing leveled policy is the write amplification. Similarly, analyze the worst case when the data volume ratio of adjacent levels is set to 10. The data at level a overlaps with all the data at level a+1. When level a is merged downward, all data in level a+1 is involved. Then, a total of 11 times the size of the original data is written to the disk, and the write amplification is 11 times that of the size-tiered policy. This is the worst case in theory. The actual write amplification will be closer to the aforementioned value if there are many random writes. In general, the probability that the newly written data is updated again after a short period of time is high, which conforms to the time locality. In this case, the total amount of data is almost unchanged due to the old-version data merging, and downward merging will not be triggered.

Hybrid

This solution is a combination of the tiered policy and leveled policy. Many systems use a hybrid architecture to balance between space amplification and read-write amplification. The hybrid policy has a lower space amplification and read amplification compared to the tiered policy and has a lower write amplification compared to the leveled policy.

Time Series

The characteristics of time-series data are as follows:

Index keys are related to the write time.
Data is written chronologically, and only a small amount of data does not comply with this order.
Data can be deleted only based on the time-to-live (TTL) index or by removing the entire partition.
The speed of data writing is almost constant.
Data queries often fall into a specific partition, such as a query for "values from the last hour, last day, or last week".

Therefore, many systems adopt specific compaction policies to improve query performance based on these characteristics. Each sstable has a start time tag and an end time tag. Then, sstables with similar ranges are gradually merged, hence the older the sstable is, the wider the time range is, and there is almost no time range overlapping between sstables. In this way, when a query is needed for a specific time, the corresponding sstables can be accurately selected.

Key-value Stores in the Industry

Every system selects policies and scheduling mechanisms centered on "how and when", which implies how does the policy select files and when the selection is triggered.

Systems specific to write optimization include Bigtable, HBase, and Canssdra. RocksDB is a system specific to space optimization and read optimization.

RocksDB

RocksDB supports many ways of compaction and has also made a lot of optimizations. In addition to the classic tiered and leveled policies, RocksDB has two hybrid modes, leveled-N and tiered+leveled.

Leveled

The leveled compaction of RocksDB is actually combined with tiered compaction. Level0 adopts the tiered policy, and the rest of the levels adopt the leveled policy. Each run on level0 comes from a flush of memtables and multiple runs have range overlaps. On other levels, an ordered run is composed of multiple sstables. This combination has some advantages. The write amplification is reduced, and the memtable is quickly released to mitigate memory pressure when the write load is high. Leveled policy is the default compaction policy for RocksDB.

Tiered (Universal)

Maintaining N copies of full data at the largest level leads to N times the space amplification. RocksDB uses additional parameters to limit the worst space amplification, by allowing only K runs at the largest level at most. The range of K is 2 to N.

The tiered policy in RocksDB relies on universal compaction. Users can use the universal style if they cannot cope with high write rates by using the leveled policy.

Leveled-n

Leveled-n optimizes the write amplification by allowing multiple ordered runs per level. When the compaction is performed, a sort run on the level Ln will be merged with all the runs on the other levels. The lazy compaction proposed by Dostoevsky is also based on a similar idea which is to balance read-write amplification and space amplification by adjusting the number of runs on the largest level and the size ratio T of adjacent levels.

Tiered+Leveled

The tiered+leveled policy is a hybrid policy that allows level Lk to adopt the tiered policy, whereas level Ln to adopt the leveled policy (n > k). In this case, RocksDB will keep multiple memtables and allow multiple runs at level0. In this case, you can regard it as the memory level and the level0 level are tiered, while the rest levels are leveled. However, level0 will not be merged with the memory but marked as tiered instead.

Scylla

Scylla supports the tiered and leveled compaction policies. The default policy is the size-tiered compaction policy for write optimization scenarios. Scylla also supports the incremental compaction policy as an optimized size-tiered policy and the date-tiered compaction policy that focuses on time-specific scenarios.

Incremental Compaction (IC)

Incremental compaction (IC) is an optimized version of size-tiered compaction and is proposed to solve the temporary space amplification. In the original size-tiered compaction, each sstable is an ordered file, and multiple sstables are merged into a larger sstable. Hence, at least 50% of the original space must be reserved as temporary space. Incremental compaction divides sstables into multiple shards, each of which is 1 GB in size by default. Then, the shards are merged incrementally to release the space occupied by old data. This reduces temporary space amplification.

Time-window Compaction (TWCS)

Scylla previously supported date-tiered compaction, which was initially implemented and used by Cassandra. Time-window compaction (TWCS) is a substitute for date-tiered compaction. For this compaction mode, sstables of the same time window are merged according to the tiered policy, whereas sstables of different time windows are not merged. Both TWCS and date-tiered compaction are time-series compaction modes, and they optimize read performance in time-specific scenarios.

HBase

The compaction type in HBase is divided into minor and major. Similar to Bigtable, minor compaction selects from the store files and merges some of them. Hence, the resulting data volume is low and the merge frequency is high. In addition, multi-version data is not recycled to avoid problems that involve the processing of the same keys. Major compaction merges all store files into one; hence, this compaction mode involves a large amount of data and is usually triggered at a large interval (usually in days) or is triggered manually. Major compaction additionally handles expired data.

The store files in HBase are similar to sstables. HBase has weakened the concept of layers so that each file is an ordered run. In the versions 0.94 and 0.96, HBase proposed two policies for selecting and merging files: RatioBasedCompactionPolicy and ExploringCompactionPolicy.

RatioBasedCompactionPolicy chronologically scans the store files that are not in the merging queue until the merging size, which relies on the ratio parameter, is satisfied. In addition, RatioBasedCompactionPolicy always merges old store files first. In comparison, ExploringCompactionPolicy scans all store files that are not in the merging queue and selects the ones that are smallest in size from multiple collections that meet the ratio to minimize I/O consumption. In terms of classification, both policies belong to tiered compaction policies. HBase also supports date-tiered compaction policies.

In HBase version 2.0.0, the new in-memory compaction policy is proposed. The in-memory compaction policy is designed to reduce compaction and write amplification by reducing the flush frequency. This policy is suitable for scenarios where hot data is updated repeatedly and old-version data is removed directly from the memory. However, for scenarios where most writes are unique, in-memory compaction reduces the performance due to additional CPU resource consumption.

Challenges

The compaction task has a significant impact on performance, which involves two aspects: the consumption of I/O and CPU resources in the compaction process and batch cache failures after the compaction. In addition, the LSM-Tree structure has incurred another tricky problem that the entries to be deleted cannot be removed immediately. For example, the entries to be deleted may exist on every level, so they can only be eliminated by full compaction, which consumes a lot of resources.

Resource Consumption

When a compaction task runs, many of its operations, such as compression, decompression, copying, and comparison, will consume CPU resources. At the same time, some other operations will consume I/O resources, such as data reads and writes.

Cache Failure

When a compaction task is completed, old data files become invalid, so do data in the cache. The larger the compaction task is and the hotter the data is, the more severe the cache failure becomes. If this is the case, the cache failure leads to an increase in the cache miss rate, which will incur massive read I/Os. These read I/Os will further compete with the read I/Os generated by other compaction tasks for resources and exacerbate the deterioration of read performance.

Delete Entries

In an LSM-Tree, it writes a new entry for a deletion or update operation. Being different from an update operation, a deletion operation is usually distinguished by a one-bit flag. Take deletion as an example, the accumulation of entry deletion increases the space amplification and write amplification. Even worse, the amplification will affect read performance greatly in range-delete scenarios. Currently, to remove deleted entries, the system needs to trigger full compaction, which is resource-consuming and costly. RocksDB selects files for merging according to the distribution of deletions, in order to eliminate deletion-related records and avoid redundant data merging. This is an optimization method. However, if the deletions span various files, full compaction is still inevitable.

The deletion operation also has an impact on statistics. If the LSM compaction is used in a relational database, for example, MyRocks, the inaccurate statistics will mislead the SQL optimizer in decision-making.

Lethe also mentioned in a paper the risk of infringing privacy. Due to the deletion mechanism, the data required to be deleted by the user was not physically deleted within a certain period of time. The data could even still exist after the user had been deregistered, resulting in potential legal risks.

Academic Research

Compaction Management of Distributed KV Stores

In this paper (Reference 1), the author proposed to perform offload compaction on a separate server and use incremental cache backfilling to resolve resource consumption and cache failure in distributed KV stores. The implementation and evaluation of offload compaction are based on HBase. HBase is derived from Bigtable. Therefore, their architectures are similar. HBase uses ZooKeeper to maintain consistency and evenly splits data into multiple region servers according to the key range, and its data is persisted in HDFS. HBase maintains memstores in region servers to accommodate new data writes and caches data files from HDFS in block cache.

Offload Compaction

In the HBase architecture, two new components are added: compaction manager and compaction servers. Compaction servers is dynamically added or removed. Like region servers, compaction servers read data from HDFS and then write the data back to HDFS after merging. The compaction manager is similar to the HBase master. It manages a set of compaction servers and the mappings from region servers to compaction servers. Compaction tasks of region servers will be offloaded to compaction servers, and the merged data received by compaction servers will be used to preheat the cache.

Remote Cache

After the compaction in the region server is offloaded to the remote server, the compaction data is synchronized to the memory of the remote server. The access request for this data either reads from a local disk or reads from the memory of a remote server. This is actually a choice between disk I/O and network I/O. The test result does not show obvious improvement by accessing the remote cache.

Smart Cache

As accessing the remote cache does not improve the performance, the author proposed to warm up the local cache by using remote cache increments. The method replaces outdated data that corresponds to a specified key range in order to avoid the elimination of large amounts of data in the cache. Therefore, the requested data may either fall into the old cache or the newly replaced cache. There is little chance that the data had been evicted from the cache and had not been filled back in. Needless to say, this method works pretty well, and almost found no cache miss.

In distributed systems, compaction problems are solved externally, which is beyond the ability of a single machine. Take the smart cache solution as an example. For a single machine, it cannot keep both the old and new data in memory and avoid performance churning at the same time. In distributed systems, however, the new data can be stored in a remote memory and gradually replace the data in the local cache. It turns out that the data returned for a read request is the same before and after the compaction. Hence, this method works correctly and satisfactorily.

PebblesDB

This paper (Reference 2) proposes an improved structure of LSM-Tree called FLSM. It is designed to reduce write amplification by reducing rewrite operations in the compaction process. In addition, a high-performance KV store named PebblesDB is built based on FLSM. The design concept of FLSM comes from the combination of the skip list and LSM-Tree. Write amplification actually comes from data rewrite operations in the compaction process. In addition, write amplification reduces the life span of storage devices that have limited erasure times, such as SSD disks. Hence, write amplification increases storage costs and reduces write throughput.

FLSM divides data into disjointed units by adding guards to each level. Each unit contains multiple sstables with overlapping ranges. When level i is merged into level i+1, instead of rewriting sstables on level i+1, FLSM only places the new sstables that are divided according to guards to level i+1 and will not merge them with the existing sstables on level i+1. This mechanism not only reduces write amplification but also accelerates compaction. However, it will have an impact on reads. Hence, the author also proposed a way to optimize read performance. One way is adopting seek-based compaction. The seek-based compaction uses the number of times, which the seek() function is called, to trigger compaction, in order to reduce the number of sstabels in hot guards and consequently reduce read overhead. Another way to optimize read performance is by adopting parallel seek. The parallel seek uses multiple threads to search for sstables concurrently. Each thread reads one sstable and then merges the result. In this way, even if multiple sstables exist in a guard, only a small overhead is required.

The guard information is stored in memory. Like LSM metadata, it is written to a WAL log. When a crash occurs, the manifest log can be used to restore the data.

Partitioning with guards and adopting seek-based compaction are creative; and indeed, FLSM reduces the write amplification. But in my opinion, the read problems are not solved as parallel seeking has too many restrictions. By contrast, the seek-based compaction can be further optimized though. Read requests often reflect the hotness of data and the urgency of merging. If we can make a more precise judgment here, merging will be much more effective.

Dostoevsky

In detail, this paper (Reference 3) analyzes the I/O complexity of various operations under different tiered and leveled policies. It uses the level num, adjacent level size ratio, the number of entries, buffer size, and a series of related variables to derive the specific I/O cost formula and space amplification for different operations. For point lookups, the influence of a bloom filter is taken into account. For range queries, short and long queries are analyzed respectively. In addition, lazy leveling, a combination of tiering and leveling, was proposed to optimize read performance. In lazy leveling, the largest level works in leveling mode and the other levels work in tiering mode. It is suitable for mixed workloads including updates, point lookups, and long-range lookups. Tiering is suitable for update-intensive workloads, whereas leveling is suitable for lookup-intensive workloads. To adapt to different workloads, the merge frequency is adjustable under different policies, including tiering, leveling, and lazy leveling. This mechanism is called fluid LSM, which is similar to leveled-n in RocksDB. In addition, Dostoevsky also proposed to achieve the maximum throughput by dynamically calculating the optimal configuration during execution.

The analysis in the paper shows the complexity of the compaction mechanism. It is almost impossible to achieve the best performance by using a general mechanism that consumes the least resources in different scenarios. More factors, such as the cache, deletion, task splitting granularity, and reuse technologies, must be considered in practical industrial implementation. Therefore, most systems use multiple policies and parameters to adapt to different scenarios.

Reflection

The cost of compaction on a single machine is inevitable, but different policies will lead to different read-write and space amplification results. This is also one of the academic research directions in LSM-Tree systems in recent years. The focus is on how to optimize the worst-case by properly selecting policies. Most systems support multiple compaction policies and different parameters to adapt to different business scenarios. This is a continuously improved process that tries to solve problems as they emerge. Though most of the time it works, extreme situations in complex scenarios or under mixed workloads are inevitable. Therefore, compaction based on parameter adjustment or manual operations is not elegant and efficient. In distributed scenarios, policies are regarded as a black box. Many problems are solved by external resources, and more importantly, they rely on scheduling policies.

In practical applications, most business workloads will not keep the system running in the worst-case scenario. Then, is it necessary to put in the enormous effort needed to solve the worst-case scenario? How about we "bypass" the worst case to maintain the system in a more stable state so that the system can take more tasks during the off-peak period to remove the low-level burdens and get ready for the peak period? I know this is way too ideal. Things are always more complicated than expected and progress unexpectedly. So, this makes it impossible to proceed with preset rules. What we can come up with is that we introduce scheduling control and shift focus from merging policies to scheduling policies, which schedule tasks based on the load curve to meet the resource requirements at different times. Maybe this is a more complicated topic, and we hope to see more research and exploration in this direction.

References

1) https://www.scylladb.com/2016/08/30/date-tiered-compaction-strategy/
2) Compaction management in distributed key-value datastores
3) PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
4) Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree-Based Key-Value Stores via Adaptive Removal of Superfluous Merging
5) Lethe: A Tunable Delete-Aware LSM Engine
6) https://github.com/facebook/rocksdb/wiki/Compaction
7) https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html
8) http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html