Design and Implementation of Multi Part AOF

One AOF

AOF (append-only file) persistence records each write command in an independent log file and replays the commands in that file when Redis starts, in order to recover the data.

Because AOF records every Redis write command by appending it, the AOF file grows larger as Redis processes more write commands, and the time needed to replay the commands grows with it. To solve this problem, Redis introduced the AOF rewrite mechanism (hereinafter referred to as AOFRW). AOFRW removes redundant write commands from the AOF and generates an equivalent new AOF file, which reduces the size of the AOF file.

Two AOFRW

Figure 1 shows the implementation principle of AOFRW. When AOFRW is triggered, Redis first forks a child process to perform the rewrite in the background. This operation writes a snapshot of all Redis data, as of the moment the fork is executed, to a temporary AOF file named temp-rewriteaof-bg-pid.aof.

Since the rewrite is performed in the background by the child process, the main process can still respond to user commands normally during AOF rewriting. Therefore, so that the child process can eventually obtain the incremental changes generated by the main process during rewriting, the main process not only writes each executed write command into aof_buf, but also writes a copy to aof_rewrite_buf for caching. In the later stage of the child process's rewrite, the main process uses a pipe to send the data accumulated in aof_rewrite_buf to the child process, and the child process appends the data to the temporary AOF file (see [1] for detailed principles).

When the main process handles heavy write traffic, a lot of data may accumulate in aof_rewrite_buf, so that the child process cannot consume all of it during rewriting. In that case, the remaining data in aof_rewrite_buf is processed by the main process at the end of rewriting.

When the child process completes the rewrite and exits, the main process finishes the work in backgroundRewriteDoneHandler. First, it appends the data in aof_rewrite_buf that was not consumed during rewriting to the temporary AOF file. Second, when everything is ready, Redis uses the rename operation to atomically rename the temporary AOF file to server.aof_filename, overwriting the original AOF file. At this point, the entire AOFRW process ends.

Three problems with AOFRW

1: Memory overhead

As can be seen from Figure 1, during AOFRW the main process writes the data changes made after the fork into aof_rewrite_buf. Most of the contents of aof_rewrite_buf and aof_buf are duplicates, so this introduces redundant memory overhead.

In the aof_rewrite_buffer_length field in Redis INFO, you can see the memory size occupied by aof_rewrite_buf at the current moment. As shown below, under high write traffic, aof_rewrite_buffer_length occupies almost the same memory space as aof_buffer_length, almost doubling the memory.

When the memory size occupied by aof_rewrite_buf exceeds a certain threshold, we will see the following information in the Redis log. It can be seen that aof_rewrite_buf occupies 100MB of memory space and 2135MB of data is transferred between the main process and the child process (the child process will also have the memory overhead of internally reading the buffer when reading the data through the pipe).

For the in-memory database Redis, this is not a small overhead.

The memory overhead caused by AOFRW may cause Redis memory usage to suddenly reach the maxmemory limit, which affects the writing of normal commands. It may even hit the operating system limit so that Redis is killed by the OOM Killer, leaving Redis unable to serve requests.

2: CPU overhead
CPU overhead arises in three main places:

1. During AOFRW, the main process spends CPU time writing data to aof_rewrite_buf and uses the event loop to send the data in aof_rewrite_buf to the child process.

2. In the later stage of its rewrite, the child process repeatedly reads the incremental data sent by the main process through the pipe and appends it to the temporary AOF file.

3. After the child process completes the rewriting operation, the main process will finish the work in backgroundRewriteDoneHandler. One of the tasks is to write the data that has not been consumed in aof_rewrite_buf during rewriting to a temporary AOF file. If there is a lot of data left in aof_rewrite_buf, CPU time will also be consumed here.

The CPU overhead caused by AOFRW may cause response time (RT) jitter when Redis executes commands, and may even cause clients to time out.

3: Disk IO overhead

As mentioned above, during AOFRW, the main process will not only write the executed write command to aof_buf, but also write a copy to aof_rewrite_buf. The data in aof_buf will eventually be written to the currently used old AOF file, generating disk IO. At the same time, the data in aof_rewrite_buf will also be written into the new AOF file generated by rewriting, resulting in disk IO. Therefore, the same piece of data will generate two disk IOs.

4: Code complexity

Redis uses the six pipes shown below for data transmission and control interaction between the main process and child processes, which makes the entire AOFRW logic more complicated and difficult to understand.

Four MP-AOF implementation

1: Solution overview

As the name implies, MP-AOF splits the original single AOF file into multiple AOF files. In MP-AOF, we divide AOF files into three types:

BASE: the base AOF, which is generally generated by a child process through rewriting. There is at most one BASE file.
INCR: an incremental AOF, which is generally created when AOFRW starts to execute. There may be multiple INCR files.
HISTORY: a historical AOF, which is a former BASE or INCR AOF. Every time AOFRW completes successfully, the BASE and INCR AOFs that existed before this AOFRW become HISTORY, and AOFs of the HISTORY type are automatically deleted by Redis.

In order to manage these AOF files, we introduced a manifest (list) file to track and manage these AOFs. At the same time, in order to facilitate AOF backup and copy, we put all AOF files and manifest files into a separate file directory, and the directory name is determined by the appenddirname configuration (a new configuration item in Redis 7.0).


Figure 2 shows the general flow of performing AOFRW in MP-AOF. At the beginning, we still fork a child process to perform the rewrite. At the same time, the main process opens a new AOF file of type INCR. While the child process is rewriting, all data changes are written to this newly opened INCR AOF. The rewrite performed by the child process is completely independent: during the rewrite there is no data or control interaction with the main process, and the rewrite finally generates a BASE AOF. Together, the newly generated BASE AOF and the newly opened INCR AOF represent all the data of Redis at the current moment. At the end of AOFRW, the main process is responsible for updating the manifest file: it adds the information of the newly generated BASE AOF and INCR AOF, and marks the previous BASE AOF and INCR AOFs as HISTORY (these HISTORY AOFs are deleted asynchronously by Redis). Once the manifest file is updated, the entire AOFRW process is complete.

As can be seen from Figure 2, we no longer need aof_rewrite_buf during AOFRW, so the corresponding memory consumption is removed. There is also no data transmission or control interaction between the main process and the child process, so the corresponding CPU overhead is completely removed as well. Accordingly, the six pipes mentioned above and their corresponding code are all deleted, making the logic of AOFRW simpler and clearer.

2: Key implementation

Manifest

1) Representation in memory

MP-AOF strongly relies on the manifest file, and the manifest is represented in memory as the following structure, where:

aofInfo: Indicates an AOF file information, currently only includes file name, file serial number and file type
base_aof_info: Indicates BASE AOF information, when there is no BASE AOF, this field is NULL
incr_aof_list: Used to store information of all INCR AOF files, all INCR AOF will be arranged in the order in which the files are opened
history_aof_list: used to store HISTORY AOF information, the elements in history_aof_list are moved from base_aof_info and incr_aof_list
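
The fields described above can be pictured with a small Python sketch. It is only an illustration, not the C structure used by Redis: the field names base_aof_info, incr_aof_list and history_aof_list follow the text, while the AofType values and the helper methods retire_base and retire_old_incr are assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AofType(Enum):
    # Single-letter codes are an assumption made for this sketch.
    BASE = "b"
    INCR = "i"
    HISTORY = "h"


@dataclass
class AofInfo:
    file_name: str
    file_seq: int
    file_type: AofType


@dataclass
class AofManifest:
    base_aof_info: Optional[AofInfo] = None            # None when there is no BASE AOF
    incr_aof_list: List[AofInfo] = field(default_factory=list)     # in open order
    history_aof_list: List[AofInfo] = field(default_factory=list)  # moved from the two above

    def _to_history(self, info: AofInfo) -> None:
        info.file_type = AofType.HISTORY
        self.history_aof_list.append(info)

    def retire_base(self) -> None:
        """Hypothetical helper: move the current BASE AOF to history."""
        if self.base_aof_info is not None:
            self._to_history(self.base_aof_info)
            self.base_aof_info = None

    def retire_old_incr(self) -> None:
        """Hypothetical helper: move every INCR AOF except the newest to history."""
        for info in self.incr_aof_list[:-1]:
            self._to_history(info)
        self.incr_aof_list = self.incr_aof_list[-1:]
```

Keeping history_aof_list separate from the live lists means a loader only has to look at base_aof_info and incr_aof_list, while cleanup only has to look at history_aof_list.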

2) Representation on disk

Manifest is essentially a text file containing multiple lines of records. Each line of records corresponds to an AOF file information. The information is displayed in the form of key/value pairs, which is convenient for Redis processing, easy to read and modify. The following is a possible manifest file content:

The Manifest format itself needs to have a certain degree of extensibility in order to add or support other functions in the future. For example, it can conveniently support adding new key/value and annotations (similar to the annotations in AOF), which can ensure better forward compatibility.
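
To make the key/value idea concrete, here is a minimal Python sketch of reading and writing such a file. The key names file, seq and type, the single-letter type codes, and the use of # for annotation lines are assumptions for illustration. The parser keeps unknown keys, which is one simple way to get the extensibility described above. File names containing spaces are not handled.

```python
from typing import Dict, List


def parse_manifest(text: str) -> List[Dict[str, str]]:
    """Parse one record per line; each record is a sequence of 'key value' pairs."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):   # skip blank and annotation lines
            continue
        tokens = line.split()
        if len(tokens) % 2 != 0:
            raise ValueError(f"unpaired key in manifest line: {line!r}")
        # Unknown keys are preserved, so newer fields do not break the parser.
        records.append(dict(zip(tokens[0::2], tokens[1::2])))
    return records


def format_manifest(records: List[Dict[str, str]]) -> str:
    """Serialize records back to text, one line per AOF file."""
    lines = [" ".join(f"{k} {v}" for k, v in rec.items()) for rec in records]
    return "\n".join(lines) + "\n"
```

For example, parse_manifest("file appendonly.aof.1.base.rdb seq 1 type b\n") yields a single record whose file, seq and type values are all strings.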

File naming convention

Before MP-AOF, the file name of AOF is the setting value of appendfilename parameter (default is appendonly.aof).

In MP-AOF, we use basename.suffix to name multiple AOF files. Among them, the appendfilename configuration content will be used as the basename part, and the suffix consists of three parts in the format of seq.type.format, where:

seq is the sequence number of the file, monotonically increasing from 1, BASE and INCR have independent file sequence numbers
type is the type of AOF, indicating whether the AOF file is BASE or INCR
format is used to indicate the internal encoding method of this AOF. Since Redis supports the RDB preamble mechanism, BASE AOF may be encoded in RDB format or AOF format:

Therefore, when using the default configuration of appendfilename, the possible names for BASE, INCR, and manifest files are as follows:
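
The naming rule can be sketched in a few lines of Python. The function names are hypothetical, and the manifest name (basename plus a .manifest suffix) is an assumption of this sketch. The basename.seq.type.format layout follows the text above.

```python
def aof_file_name(basename: str, seq: int, file_type: str, rdb_preamble: bool = False) -> str:
    """Build basename.seq.type.format for a BASE or INCR AOF."""
    if file_type == "base":
        # A BASE AOF may be RDB-encoded when the RDB preamble is used.
        fmt = "rdb" if rdb_preamble else "aof"
    elif file_type == "incr":
        fmt = "aof"
    else:
        raise ValueError(f"unknown AOF type: {file_type!r}")
    return f"{basename}.{seq}.{file_type}.{fmt}"


def manifest_file_name(basename: str) -> str:
    # Assumed naming for this sketch: the basename plus a '.manifest' suffix.
    return f"{basename}.manifest"


if __name__ == "__main__":
    print(aof_file_name("appendonly.aof", 1, "base", rdb_preamble=True))
    print(aof_file_name("appendonly.aof", 1, "incr"))
    print(manifest_file_name("appendonly.aof"))
```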

Compatible with old version upgrades

Since MP-AOF strongly relies on the manifest file, Redis loads the corresponding AOF files strictly according to the manifest when it starts. However, when upgrading from an old version of Redis (a version before Redis 7.0) to Redis 7.0, there is no manifest file yet. Redis must therefore be able to recognize that this is an upgrade and load the old AOF correctly and safely.

Identification is the first step in this important process. Before actually loading the AOF file, we will check whether there is an AOF file named server.aof_filename in the Redis working directory. If it exists, it means that we may be performing an upgrade from an old version of Redis. Next, we will continue to judge. When one of the following three conditions is met, we will consider it an upgrade start:

The appenddirname directory does not exist.
The appenddirname directory exists, but it contains no corresponding manifest file.
The appenddirname directory exists and contains a manifest file, the manifest contains only BASE AOF information, the name of that BASE AOF is the same as server.aof_filename, and the appenddirname directory contains no file named server.aof_filename.

The upgrade preparation operation is crash-safe. If a crash occurs in any of its three steps, we can correctly identify the situation and retry the entire upgrade operation at the next startup.
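
The identification check (an old AOF in the working directory, plus any one of the three conditions) can be sketched as follows. This is an illustration under assumptions, not the Redis implementation: the function names are hypothetical, the manifest is assumed to be named after the AOF file with a .manifest suffix, and its records are assumed to be "file ... seq ... type ..." lines with type b for BASE.

```python
import os
from typing import Dict, List


def read_manifest(path: str) -> List[Dict[str, str]]:
    """Read 'key value' records from an assumed text manifest, skipping '#' lines."""
    records = []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if not tokens or tokens[0].startswith("#"):
                continue
            records.append(dict(zip(tokens[0::2], tokens[1::2])))
    return records


def looks_like_upgrade(work_dir: str, aof_filename: str, append_dirname: str) -> bool:
    """Return True when startup should be treated as an upgrade from a single-file AOF."""
    if not os.path.isfile(os.path.join(work_dir, aof_filename)):
        return False                      # no old AOF in the working directory
    aof_dir = os.path.join(work_dir, append_dirname)
    if not os.path.isdir(aof_dir):
        return True                       # condition 1: directory missing
    manifest_path = os.path.join(aof_dir, aof_filename + ".manifest")
    if not os.path.isfile(manifest_path):
        return True                       # condition 2: manifest missing
    # Condition 3: the manifest names only a BASE AOF equal to the old file name,
    # but that file has not yet been moved into the directory.
    records = read_manifest(manifest_path)
    only_base = len(records) == 1 and records[0].get("type") == "b"
    return (
        only_base
        and records[0].get("file") == aof_filename
        and not os.path.exists(os.path.join(aof_dir, aof_filename))
    )
```

The third condition is what makes a retry possible: a manifest that already points at the old file name, without the file itself being in place, indicates an upgrade that was interrupted part-way.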

Multi-file loading and progress calculation

When Redis loads AOF, it will record the loading progress and display it through the loading_loaded_perc field of Redis INFO. In MP-AOF, the loadAppendOnlyFiles function will load AOF files according to the incoming aofManifest. Before loading, we need to calculate the total size of all AOF files to be loaded in advance, and pass it to the startLoading function, and then continuously report the loading progress in loadSingleAppendOnlyFile.

Next, loadAppendOnlyFiles loads the BASE AOF and then the INCR AOFs in sequence according to aofManifest. After all the AOF files have been loaded, stopLoading is called to end the loading state.
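
The progress calculation can be sketched in Python as follows. The function name and the report callback are hypothetical stand-ins for the roles of startLoading and loadSingleAppendOnlyFile. The point is only that the total size is computed once, up front, across all files, so the percentage stays meaningful while moving from one file to the next.

```python
import os
from typing import Callable, Sequence


def load_append_only_files(
    paths: Sequence[str],
    report: Callable[[float], None],
    chunk_size: int = 64 * 1024,
) -> int:
    """Read the files in order and report overall progress as a percentage."""
    # Total size of every file to load, computed before any loading starts.
    total = sum(os.path.getsize(p) for p in paths)
    loaded = 0
    for path in paths:                     # BASE first, then INCRs in order
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                loaded += len(chunk)       # a real loader would replay commands here
                report(100.0 * loaded / total)
    return loaded
```

If progress were computed per file instead, the reported percentage would jump back to zero each time a new INCR AOF was opened.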

AOFRW Crash Safety

When the child process completes the rewrite, it will have created a temporary AOF file named temp-rewriteaof-bg-pid.aof. At this point the file is still invisible to Redis, because it has not been added to the manifest file. For it to be recognized by Redis and loaded correctly when Redis starts, we also need to rename it according to the naming rules mentioned above and add its information to the manifest file.

Although AOF file rename and manifest file modification are two independent operations, we must ensure the atomicity of these two operations, so that Redis can correctly load the corresponding AOF at startup. MP-AOF uses two designs to solve this problem:

The name of the BASE AOF contains the file sequence number, which ensures that each newly created BASE AOF does not conflict with the previous BASE AOF;
Perform the rename operation on the AOF first, and then modify the manifest file.

For the sake of illustration, we assume that before AOFRW starts, the content of the manifest file is as follows:

After the child process completes the rewrite, the main process renames temp-rewriteaof-bg-pid.aof to appendonly.aof.2.base.rdb and adds it to the manifest. At the same time, the previous BASE and INCR AOFs are marked as HISTORY. At this point, the content of the manifest file is as follows:

At this point, the results of this AOFRW are visible to Redis, and the HISTORY AOF will be cleaned up asynchronously by Redis.

The backgroundRewriteDoneHandler function implements the above logic through seven steps:

1. Before modifying server.aof_manifest in memory, first dup a temporary manifest structure; all subsequent modifications are made to this temporary manifest. The advantage is that if a later step fails, we can simply destroy the temporary manifest to roll back the entire operation, without polluting the server.aof_manifest global data structure;
2. Get the new BASE AOF file name (denoted as new_base_filename) from the temporary manifest, and mark the previous BASE AOF (if any) as HISTORY;
3. Rename the temp-rewriteaof-bg-pid.aof temporary file generated by the child process to new_base_filename;
4. Mark all INCR AOFs from before this rewrite in the temporary manifest structure as HISTORY;
5. Persist the information in the temporary manifest to disk (persistAofManifest ensures that the modification of the manifest itself is atomic);
6. If the above steps succeed, we can safely point the in-memory server.aof_manifest pointer to the temporary manifest structure (and release the previous manifest structure). From this point on, the entire modification is visible to Redis;
7. Clean up the AOFs of the HISTORY type. This step is allowed to fail because it does not cause data consistency problems.
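
The seven steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code of backgroundRewriteDoneHandler: the manifest is modeled as a plain dict, the on-disk format and the .manifest file name are assumed, the new BASE is assumed to be RDB-encoded and to take the previous sequence number plus one, and the INCR opened at the start of the rewrite is assumed to be the last entry of the incr list.

```python
import copy
import os
import tempfile


def persist_manifest(manifest, path):
    """Write to a temp file in the same directory, then atomically replace."""
    entries = ([manifest["base"]] if manifest["base"] else []) \
        + manifest["history"] + manifest["incr"]
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        for e in entries:
            f.write(f"file {e['file']} seq {e['seq']} type {e['type']}\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def finish_rewrite(server, aof_dir, temp_aof, basename="appendonly.aof"):
    manifest_path = os.path.join(aof_dir, basename + ".manifest")
    # 1. Work on a copy so a failure leaves server["aof_manifest"] untouched.
    manifest = copy.deepcopy(server["aof_manifest"])
    # 2. Pick the new BASE name and retire the previous BASE, if any.
    old_base = manifest["base"]
    seq = old_base["seq"] + 1 if old_base else 1
    new_base_filename = f"{basename}.{seq}.base.rdb"
    if old_base:
        manifest["history"].append({**old_base, "type": "h"})
    manifest["base"] = {"file": new_base_filename, "seq": seq, "type": "b"}
    # 3. Rename the child's temp file; an exception here aborts everything.
    os.rename(temp_aof, os.path.join(aof_dir, new_base_filename))
    # 4. Retire every INCR except the one opened when this rewrite began.
    for incr in manifest["incr"][:-1]:
        manifest["history"].append({**incr, "type": "h"})
    manifest["incr"] = manifest["incr"][-1:]
    # 5. Persist the manifest atomically.
    persist_manifest(manifest, manifest_path)
    # 6. Publish the new manifest in memory.
    server["aof_manifest"] = manifest
    # 7. Best-effort cleanup; files that cannot be deleted stay in history.
    remaining = []
    for h in manifest["history"]:
        try:
            os.remove(os.path.join(aof_dir, h["file"]))
        except OSError:
            remaining.append(h)
    manifest["history"] = remaining
    persist_manifest(manifest, manifest_path)
    return new_base_filename
```

Because the rename in step 3 happens before the manifest is persisted in step 5, a crash between the two leaves the old manifest pointing at files that still exist, while the renamed file is simply not referenced yet.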

Support AOF truncate

When the process crashes, the AOF file may be left incompletely written. For example, only the MULTI of a transaction has been written, and Redis crashes before EXEC is written. By default, Redis cannot load such an incomplete AOF, but Redis supports an AOF truncate function (enabled through the aof-load-truncated configuration). The principle is to use server.aof_current_size to track the last valid file offset of the AOF, and then use the ftruncate function to delete all file contents after that offset. Although some data may be lost, the integrity of the AOF is guaranteed.

In MP-AOF, server.aof_current_size no longer indicates the size of a single AOF file but the total size of all AOF files. Because only the last INCR AOF may have the problem of incomplete writing, we introduced a separate field server.aof_last_incr_size to track the size of the last INCR AOF file. When the last INCR AOF is incompletely written, we only need to delete the file content after server.aof_last_incr_size.
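
The idea of tracking the last valid offset and truncating after it can be sketched in Python. This is a toy model rather than the Redis loader: it assumes the file contains only commands encoded as RESP arrays of bulk strings, treats an unfinished MULTI block as invalid, and uses hypothetical function names.

```python
import os


def _read_line(data: bytes, pos: int):
    end = data.find(b"\r\n", pos)
    return None if end == -1 else (data[pos:end], end + 2)


def _read_command(data: bytes, pos: int):
    """Return (args, next_pos) for one complete command, or None if incomplete."""
    header = _read_line(data, pos)
    if header is None or header[0][:1] != b"*":
        return None
    try:
        argc = int(header[0][1:])
    except ValueError:
        return None
    if argc <= 0:
        return None
    pos, args = header[1], []
    for _ in range(argc):
        bulk = _read_line(data, pos)
        if bulk is None or bulk[0][:1] != b"$":
            return None
        try:
            n = int(bulk[0][1:])
        except ValueError:
            return None
        start = bulk[1]
        if n < 0 or data[start + n:start + n + 2] != b"\r\n":
            return None                    # payload missing or cut short
        args.append(data[start:start + n])
        pos = start + n + 2
    return args, pos


def last_valid_offset(data: bytes) -> int:
    """Offset just past the last complete command that is outside an open MULTI."""
    pos = valid = 0
    in_multi = False
    while True:
        parsed = _read_command(data, pos)
        if parsed is None:
            break
        args, pos = parsed
        name = args[0].upper()
        if name == b"MULTI":
            in_multi = True
        elif name == b"EXEC":
            in_multi = False
        if not in_multi:
            valid = pos
    return valid


def truncate_last_incr(path: str) -> int:
    """Truncate the file after its last valid offset; return the bytes removed."""
    with open(path, "rb") as f:
        data = f.read()
    valid = last_valid_offset(data)
    os.truncate(path, valid)
    return len(data) - valid
```

In MP-AOF terms, only the last INCR AOF would be passed to a function like truncate_last_incr, since it is the only file that can be partially written.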

AOFRW current limiting

Redis supports automatic execution of AOFRW when the AOF size exceeds a certain threshold. When a disk failure occurs or a code bug is triggered that causes AOFRW to fail, Redis will repeatedly execute AOFRW until it succeeds. Before the advent of MP-AOF, this didn't seem to be a big problem (at most it consumed some CPU time and fork overhead). But in MP-AOF, because each AOFRW will open an INCR AOF, and only when AOFRW is successful, the previous INCR and BASE will be converted to HISTORY and deleted. Therefore, continuous AOFRW failures will inevitably lead to the coexistence of multiple INCR AOFs. In extreme cases, if the AOFRW retry frequency is high we will see hundreds or thousands of INCR AOF files.

To this end, we introduced the AOFRW rate limiting mechanism. When AOFRW has failed three times in a row, the next AOFRW is forcibly delayed by 1 minute. If that AOFRW also fails, the next one is delayed by 2 minutes, and so on: the delay doubles to 4, 8, 16... minutes, up to the current maximum delay of 1 hour.
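
The delay schedule described above amounts to a capped exponential backoff. A minimal Python sketch follows. The function name and parameters are hypothetical, and only the numbers (three failures, 1 minute, doubling, 1 hour cap) come from the text.

```python
def aofrw_delay_minutes(consecutive_failures: int, threshold: int = 3, max_delay: int = 60) -> int:
    """Delay in minutes before the next automatic AOFRW attempt."""
    if consecutive_failures < threshold:
        return 0                                   # no limiting yet
    # 1 minute at the threshold, doubling with each further failure, capped.
    return min(1 << (consecutive_failures - threshold), max_delay)


if __name__ == "__main__":
    for failures in range(0, 11):
        print(failures, aofrw_delay_minutes(failures))
```

With these defaults the cap is reached at the ninth consecutive failure, because the uncapped value would be 64 minutes.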

While AOFRW is being rate limited, we can still use the BGREWRITEAOF command to execute AOFRW immediately.

The introduction of the AOFRW current limiting mechanism can also effectively avoid the CPU and fork overhead caused by AOFRW high-frequency retries. Many RT jitters in Redis are related to fork.

Five summary

The introduction of MP-AOF has successfully resolved the negative impact that the memory and CPU overhead of the previous AOFRW had on Redis instances and even on business access. In the process of solving these problems, we also encountered many unexpected challenges, which mainly come from the huge user base and diverse usage scenarios of Redis. We therefore had to consider the problems users may encounter when using MP-AOF, such as compatibility and ease of use, and to keep changes to the Redis code as unintrusive as possible. This is the top priority in the evolution of Redis community features.

At the same time, the introduction of MP-AOF opens up more possibilities for Redis data persistence. For example, when aof-use-rdb-preamble is enabled, the BASE AOF is essentially an RDB file, so we do not need to perform a separate BGSAVE operation when taking a full backup: we can back up the BASE AOF directly. MP-AOF also supports turning off the automatic cleanup of HISTORY AOFs, so those historical AOFs can be preserved, and Redis already supports adding timestamp annotations to AOF. Based on these, we can even implement a simple PITR (point-in-time recovery) capability.

The design prototype of MP-AOF comes from the binlog implementation of Tair for Redis Enterprise Edition [2], which supports PITR and other enterprise-level capabilities to meet the needs of users in more business scenarios. Today we contribute this core capability to the Redis community, hoping that community users can also enjoy these enterprise-level features and build on them for their own business needs. For more details about MP-AOF, please refer to the related PR (#9788), which contains more of the original design and the complete code.
