Is Your Redis Slowing Down? – Part 2: Optimizing and Improving Performance

2. How to Optimize Redis

2.1 Slow Query Optimization

Avoid commands whose time complexity is O(N) or higher where possible, and perform data aggregation on the client instead.
When you do have to run an O(N) command, keep N as small as possible (N <= 300 is recommended) and fetch as little data as possible per call, so that Redis can process and return each request promptly.
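One way to keep N small is to read a large list in fixed-size windows rather than in a single call. The sketch below only computes the index windows. The helper name batched_ranges and the idea of feeding each window to a range command such as LRANGE are illustrative assumptions, not something the text prescribes.

```python
def batched_ranges(total, batch_size=300):
    """Yield inclusive (start, stop) index pairs that cover `total` elements.

    Each pair can be passed to a range command such as LRANGE key start stop,
    so that no single call returns more than `batch_size` elements.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total) - 1


# Example: a list of 700 elements read in windows of at most 300.
windows = list(batched_ranges(700))
```

Each window costs one round trip, so the batch size is a trade-off between the latency of a single command and the total number of commands.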

2.2 Centralized Expiration Optimization

There are two ways to avoid this problem:

Add a random offset to the expiration time of keys that would otherwise expire together. This spreads out the expirations and reduces the pressure on Redis when it cleans up expired keys.
If you are using Redis 4.0 or later, turn on the lazy-free mechanism. When expired keys are deleted, their memory is then released in a background thread, which avoids blocking the main thread.

The first solution is to add a random offset when setting the expiration time of the key. In pseudo code:

# Randomly expire within 5 minutes after the expiration time.
redis.expireat(key, expire_time + random(300))

The second solution is to enable the lazy-free mechanism, which is available in Redis 4.0 or later:

# Release the memory of expired keys in a background thread.
lazyfree-lazy-expire yes

At the O&M level, you need to monitor the running status of Redis. Run the INFO command on an instance to obtain all of its status metrics.

Here, focus on expired_keys, which is the cumulative number of expired keys deleted from the instance so far.

Monitor this metric and raise an alert when it increases sharply within a short period. Then compare the time of the spike with the time at which the business application reported the slowdown. If the two match, you can confirm that the latency is caused by keys expiring at the same time.
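Because expired_keys is cumulative, a monitor has to compare two samples to see a spike. The following is a minimal sketch that works on the raw text returned by INFO. The function names, the sample values, and the alert threshold are all made up for illustration.

```python
def parse_info(text):
    """Parse the text returned by the INFO command into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Stats"
        key, sep, value = line.partition(":")
        if sep:
            stats[key] = value
    return stats


def expired_keys_per_second(prev_info, curr_info, interval_seconds):
    """Return how many keys expired per second between two INFO samples."""
    prev = int(parse_info(prev_info)["expired_keys"])
    curr = int(parse_info(curr_info)["expired_keys"])
    return (curr - prev) / interval_seconds


# Made-up samples taken 60 seconds apart.
sample_before = "# Stats\r\nexpired_keys:1000\r\n"
sample_after = "# Stats\r\nexpired_keys:7000\r\n"
rate = expired_keys_per_second(sample_before, sample_after, 60)
spike = rate > 50  # hypothetical alert threshold; tune it per workload
```

The counter starts again from zero when the instance restarts, so a real monitor would also need to handle a negative difference.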

2.3 The Instance Memory Reaches the Upper Limit

Avoid storing bigkeys, which reduces the time spent releasing memory.
Change the eviction policy to random eviction, which is much faster than LRU (adjust according to business conditions).
Split the instance to spread the pressure of evicting keys across multiple instances.
If you are using Redis 4.0 or later, turn on the lazy-free mechanism so that the memory of evicted keys is released in a background thread (configuration: lazyfree-lazy-eviction yes).

2.4 Fork Time-Consuming Optimization

Control the Memory of a Redis Instance: Try to keep the memory below 10 GB. The duration of a fork depends on the size of the instance: the larger the instance, the longer the fork takes.
Configure Data Persistence Policies Reasonably: Perform RDB backups on slave nodes, preferably during off-peak hours. For businesses that are not sensitive to data loss (such as using Redis as a cache), you can disable AOF and AOF rewrite.
Do Not Deploy an ApsaraDB for Redis Instance on a Virtual Machine: The time required for a fork depends on the system, and a fork takes longer on a virtual machine than on a physical machine.
Reduce the Probability of Full Synchronization between Master and Slave: A full synchronization makes the master fork to generate an RDB file. Increase the repl-backlog-size parameter appropriately to avoid full synchronization between master and slave.

When a slave establishes synchronization, Redis first checks whether a partial synchronization is possible. This matters when the replication link has been temporarily disconnected by a fault: when the slave reconnects after recovery, Redis first attempts to synchronize only the missing data, which avoids the resource cost of a full synchronization. If the condition for partial synchronization is not met, Redis triggers a full synchronization. The decision is based on the replication backlog buffer maintained on the master. If this buffer is too small, the writes that the master receives while replication is disconnected are likely to overwrite the data in it. The offset from which the slave needs to resume can then no longer be found in the buffer, and a full synchronization is triggered. To avoid this, increase the size of the replication backlog buffer, repl-backlog-size. The default size is 1 MB. If the instance handles a large volume of writes, increase this setting.
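A common rule of thumb, offered here as an assumption rather than something the text prescribes, is to size the backlog so that it can hold all writes that arrive during the longest disconnection you want to survive without a full synchronization. The function name, the safety factor, and the sample numbers below are hypothetical.

```python
def recommended_backlog_bytes(write_bytes_per_second, max_disconnect_seconds,
                              safety_factor=2.0):
    """Estimate a repl-backlog-size large enough to survive a disconnection.

    The backlog must hold every byte the master adds to the replication
    stream while the slave is away; the safety factor adds headroom for bursts.
    """
    if write_bytes_per_second < 0 or max_disconnect_seconds < 0:
        raise ValueError("inputs must be non-negative")
    return int(write_bytes_per_second * max_disconnect_seconds * safety_factor)


# Example: 2 MB/s of replicated writes and disconnections of up to 60 seconds.
size = recommended_backlog_bytes(2 * 1024 * 1024, 60)
size_mb = size // (1024 * 1024)
```

The write rate can be estimated by sampling master_repl_offset from INFO replication twice and dividing the difference by the sampling interval.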

2.5 Multi-Core CPU Optimization

How Can We Solve This Problem?

If you want to bind the CPU, the better approach is to bind the Redis process to multiple logical cores rather than to a single one. Preferably, these logical cores should belong to the same physical core so that they can share the L1/L2 cache.

Even so, binding Redis to multiple logical cores only alleviates the competition for CPU resources among the main thread, child processes, and background threads to a certain extent.

Because these child processes and threads still switch among the bound logical cores, some performance is lost.

How Can We Optimize It Further?

You may have wondered whether the main thread, child processes, and background threads could each be bound to fixed CPU cores so that they do not switch back and forth and the CPU resources they use do not affect each other.

Redis anticipated this approach.

Redis 6.0 introduced this capability. With the configuration shown below, you can bind the main thread, background threads, the background RDB process, and the AOF rewrite process to fixed logical CPU cores.

Bind CPU Cores before Redis 6.0

taskset -c 0 ./redis-server

Bind CPU Cores in Redis 6.0 and Later

# Redis Server and I/O threads are bound to CPU cores 0,2,4,6.
server_cpulist 0-7:2
# Bind the background child thread to CPU cores 1,3.
bio_cpulist 1,3
# Bind the background AOF rewrite process to CPU cores 8,9,10, and 11.
aof_rewrite_cpulist 8-11
# Bind the background RDB process to CPU cores 1,10,11.
bgsave_cpulist 1,10-11

If you are using Redis 6.0 or later, you can use the configuration above to improve Redis performance.
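The CPU list syntax in this configuration combines single cores, ranges, and ranges with a stride. The sketch below expands such a list into core numbers, following the syntax as the comments in the configuration describe it. It is an illustrative helper (the name expand_cpulist is made up), not the parser Redis itself uses.

```python
def expand_cpulist(spec):
    """Expand a CPU list such as "0-7:2" or "1,10-11" into sorted core numbers.

    Each comma-separated item is a single core ("3"), a range ("8-11"),
    or a range with a stride ("0-7:2").
    """
    cores = set()
    for item in spec.split(","):
        item = item.strip()
        body, _, stride_text = item.partition(":")
        stride = int(stride_text) if stride_text else 1
        first, _, last = body.partition("-")
        start = int(first)
        end = int(last) if last else start
        if stride <= 0 or end < start:
            raise ValueError(f"invalid cpulist item: {item!r}")
        cores.update(range(start, end + 1, stride))
    return sorted(cores)


server_cores = expand_cpulist("0-7:2")
bgsave_cores = expand_cpulist("1,10-11")
```

Expanding each list before deploying makes it easier to confirm that the lists do not overlap unintentionally.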

Reminder: Redis performance is generally good enough. Unless you have particularly stringent performance requirements, we do not recommend binding the CPU.

2.6 Check Whether the Redis Memory Is Swapped

First, find the process ID of the Redis instance:

$ redis-cli info | grep process_id
process_id: 5332

Then, go to the directory of this process under /proc on the machine where Redis is running.

$ cd /proc/5332

Finally, run the following command to view the memory usage of the Redis process. Only part of the output is shown here:

$ cat smaps | egrep '^(Swap|Size)'
Size: 584 kB
Swap: 0 kB
Size: 4 kB
Swap: 4 kB
Size: 4 kB
Swap: 0 kB
Size: 462044 kB
Swap: 462008 kB
Size: 21392 kB
Swap: 0 kB
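In smaps output, each Size line describes one memory block of the process, and the Swap line that follows shows how much of that block has been swapped out. The sketch below pairs the two lines and totals the swap. It uses the sample output shown above; the function name swap_by_block is made up for illustration.

```python
import re

# Matches "Size:" and "Swap:" lines only (not, for example, "SwapPss:").
_LINE = re.compile(r"^(Size|Swap):\s+(\d+)\s+kB", re.MULTILINE)


def swap_by_block(smaps_text):
    """Return a list of (size_kb, swap_kb) pairs, one per memory block."""
    blocks = []
    size = None
    for name, value in _LINE.findall(smaps_text):
        if name == "Size":
            size = int(value)
        elif size is not None:
            blocks.append((size, int(value)))
            size = None
    return blocks


sample = """\
Size: 584 kB
Swap: 0 kB
Size: 4 kB
Swap: 4 kB
Size: 4 kB
Swap: 0 kB
Size: 462044 kB
Swap: 462008 kB
Size: 21392 kB
Swap: 0 kB
"""
blocks = swap_by_block(sample)
total_swap_kb = sum(swap for _, swap in blocks)
largest = max(blocks, key=lambda block: block[1])
```

In this sample, almost all of the 462044 kB block has been swapped out, which is the kind of result that calls for action.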

Once memory swapping occurs, the most direct solution is to add memory to the machine. If the instance belongs to a Redis sharded cluster, you can also add instances to the cluster so that the data is spread across more instances and each instance needs less memory.

2.7 Memory Large Pages

While a child process created by fork is persisting data, every write in the main process triggers a copy-on-write of the affected memory page. If large memory pages are used, Redis must copy an entire 2 MB page even if the client modifies only 100 bytes of data. With regular memory pages, only 4 KB is copied. Consequently, when clients modify or write a large amount of data, the large memory page mechanism causes a large volume of copying, which interferes with normal memory access in Redis and eventually slows it down.

First, check whether large memory pages (transparent huge pages) are enabled. Run the following command on the machine where the ApsaraDB for Redis instance runs:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

If the value shown in brackets is always, the large memory page mechanism is enabled. If it is never, the mechanism is disabled.
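The active option is the one wrapped in square brackets, so a script has to extract it instead of comparing the whole line. A minimal sketch follows; the function name active_thp_mode is an assumption made for this example.

```python
def active_thp_mode(content):
    """Return the option shown in square brackets, e.g. "always"."""
    for token in content.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode found in: %r" % content)


# The string below is the sample output shown above.
mode = active_thp_mode("[always] madvise never")
needs_attention = mode == "always"
```

On a real machine, the input would be the content of /sys/kernel/mm/transparent_hugepage/enabled.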

I do not suggest using the large memory page mechanism in a production environment. Disabling it is simple; just run the following command:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

The advantage of the large memory page mechanism provided by the operating system is that, to a certain extent, it reduces the number of times an application has to request memory.

However, Redis is sensitive to performance and latency, so each memory request should take as little time as possible. Therefore, I do not recommend enabling this mechanism on Redis machines.

2.8 Using Lazy Free for Deletion

Supported Versions: Redis 4.0 +

2.8.1 Actively Delete Key Using Lazy Free

UNLINK Command

127.0.0.1:7000> LLEN mylist
(integer) 2000000
127.0.0.1:7000> UNLINK mylist
(integer) 1
127.0.0.1:7000> SLOWLOG get
1) 1) (integer) 1
   2) (integer) 1505465188
   3) (integer) 30
   4) 1) "UNLINK"
      2) "mylist"
   5) "127.0.0.1:17015"
   6) ""

Note: The DEL command is still a blocking delete operation.

FLUSHALL/FLUSHDB ASYNC

127.0.0.1:7000> DBSIZE
(integer) 1812295
127.0.0.1:7000> flushall // Synchronously clears the instance data. 1.8 million keys take 1020 milliseconds.
OK
(1.02s)
127.0.0.1:7000> DBSIZE
(integer) 1812637
127.0.0.1:7000> flushall async // Asynchronously clears the instance data. 1.8 million keys take about 9 milliseconds.
OK
127.0.0.1:7000> SLOWLOG get
1) 1) (integer) 2996109
   2) (integer) 1505465989
   3) (integer) 9274       // The command takes about 9 milliseconds to run.
   4) 1) "flushall"
      2) "async"
   5) "127.0.0.1:20110"
   6) ""

2.8.2 Passively Delete Key Using Lazy Free

Lazy free can also be applied to passive deletion. Currently, there are four scenarios, each controlled by its own configuration parameter. All four are disabled by default.

lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
slave-lazy-flush no

lazyfree-lazy-eviction

lazyfree-lazy-eviction controls whether the lazy-free mechanism is used when Redis memory usage reaches maxmemory and an eviction policy is set. If lazy free is enabled in this scenario, the memory of evicted keys may not be released promptly, so Redis memory usage can exceed the maxmemory limit. Test this scenario against your own workload before relying on it. We do not recommend setting it to yes in the production environment.

lazyfree-lazy-expire

lazyfree-lazy-expire controls whether the lazy-free mechanism is used when a key with a TTL expires. Enabling it is recommended in this scenario because the cleanup of expired keys already adjusts its speed adaptively.

lazyfree-lazy-server-del

Some commands perform an implicit DEL when they operate on a key that already exists (the RENAME command, for example). If the target key already exists, Redis deletes it first. If that target key is a bigkey, the deletion blocks and causes performance problems. This parameter addresses this type of problem, and enabling it is recommended.

slave-lazy-flush

During full data synchronization, the slave runs flushall to clear its own data before loading the RDB file from the master. This parameter determines whether that flush is performed asynchronously. If memory changes are small, we recommend enabling it. Doing so reduces the time required for full synchronization, which in turn reduces the memory growth on the master caused by a swelling output buffer.

2.8.3 Lazy Free Monitoring

Lazy free exposes only one metric for monitoring: lazyfree_pending_objects, which is the number of keys that Redis has handed to lazy free and that are still waiting to be reclaimed. It does not reflect the number of elements in a single large key or the amount of memory waiting to be reclaimed. The value is therefore only a rough reference for monitoring the efficiency of lazy free or the number of keys that have piled up. For example, a small number of keys pile up in the flushall async scenario.

# info memory

# Memory
lazyfree_pending_objects:0

Note: unlinkCommand(), which implements the UNLINK command, and the function that implements the DEL command both call the same function, delGenericCommand(), to delete a key. Its lazy argument indicates whether the key should be freed lazily. If so, the dbAsyncDelete() function is called.

However, lazy free is not necessarily used for every UNLINK command. Redis estimates the cost of releasing the key and performs lazy free only when the cost is greater than LAZYFREE_THRESHOLD (64).

The cost is computed by the lazyfreeGetFreeEffort() function. If the key is a collection type and uses the corresponding encoding, the cost is the number of elements in the collection; otherwise, the cost is 1.

Examples:

A list key containing 100 elements has a free cost of 100.
A 512 MB string key has a free cost of 1. As this shows, the lazy-free cost in Redis is mainly related to time complexity.
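The decision described above can be modeled in a few lines. This is a simplified Python model, not the Redis source: it treats every Python list, set, or dict as a collection and ignores the encoding condition, and the function names are made up for illustration.

```python
LAZYFREE_THRESHOLD = 64  # threshold quoted in the text above


def free_effort(value):
    """Simplified cost model: element count for collections, otherwise 1."""
    if isinstance(value, (list, set, dict)):
        return len(value)
    return 1


def freed_in_background(value):
    """Return True if the cost is high enough for the value to be freed lazily."""
    return free_effort(value) > LAZYFREE_THRESHOLD


list_key = ["item"] * 100          # a list with 100 elements
string_key = "x" * (1024 * 1024)   # a large string (1 MB to keep the example light)
small_list = ["item"] * 10         # below the threshold
```

Under this model, the 100-element list is freed in the background, while the large string and the small list are freed synchronously because their cost does not exceed the threshold.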

2.9 AOF Optimization

Redis provides a configuration item that stops the background thread from flushing to disk (that is, from triggering the fsync system call) while a child process is performing an AOF rewrite.

This is equivalent to temporarily setting appendfsync to none during the AOF rewrite. The configuration is listed below:

# During the AOF rewrite, the AOF background thread does not flush to disk.
# This is equivalent to temporarily setting appendfsync to none during this period.
no-appendfsync-on-rewrite yes

If you turn on this configuration item and the instance goes down during an AOF rewrite, more data is lost. You need to weigh performance against data safety.

If the disk resources are occupied by other applications, the fix is relatively simple: find out which application is writing heavily to the disk and migrate it to another machine so that it does not affect Redis.

If you have high requirements for both Redis performance and data safety, we recommend optimizing at the hardware level: switch to SSDs to improve disk I/O capability and make sure that sufficient disk resources are available during AOF rewrites. Also, run Redis on a dedicated machine whenever possible.

2.10 Swap Optimization

Increase the memory of the machine so that Redis has enough memory to use.
Free up memory on the machine so that enough is available for Redis, and then release the swap used by Redis so that Redis uses memory again.

In most cases, releasing swap requires restarting the Redis instance. To avoid affecting your business, perform a master-replica switchover first, release the swap on the original master node and restart it, and then switch back after its data has been synchronized from the new master.

As a preventive measure, monitor the memory and swap usage of the Redis machine, raise an alert when memory is insufficient or swap is in use, and handle it promptly.
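On Linux, one possible source for such a monitor is /proc/meminfo, which reports SwapTotal and SwapFree. The sketch below computes the swap in use from that text. The sample values are made up, and alerting on any swap usage at all is an assumption that you may want to relax.

```python
def swap_used_kb(meminfo_text):
    """Return SwapTotal minus SwapFree, in kB, from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in ("SwapTotal", "SwapFree"):
            values[name] = int(rest.split()[0])
    return values["SwapTotal"] - values["SwapFree"]


# Made-up sample in the format of /proc/meminfo.
sample = """\
MemTotal:       16384000 kB
MemAvailable:     512000 kB
SwapTotal:       2097148 kB
SwapFree:        1572860 kB
"""
used = swap_used_kb(sample)
should_alert = used > 0  # hypothetical policy: alert on any swap usage
```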

3. Redis Slowdown Troubleshooting Steps

Obtain the baseline performance of the Redis instance in the current environment.
Are slow query commands used? If so, replace them with other commands, or move aggregate computations to the client.
Do many keys share the same expiration time? For keys that expire in batches, add a random number to the expiration time of each key to avoid simultaneous deletion.
Is there a bigkey? To delete a bigkey on Redis 4.0 or later, use the asynchronous thread mechanism directly to reduce blocking of the main thread. On versions earlier than 4.0, use the SCAN command to iterate and delete incrementally. For queries and aggregations over a bigkey collection, use the SCAN command and complete the operation on the client.
What is the AOF configuration level? Does the business really need that level of reliability? If you need high performance and can tolerate some data loss, set no-appendfsync-on-rewrite to yes so that AOF rewriting and fsync do not compete for disk I/O resources and increase Redis latency. If you need both high performance and high reliability, it is best to use a high-speed solid-state disk for the AOF log.
Is the memory usage of the Redis instance too large? Has swapping occurred? If so, increase the memory of the machine, or use a Redis cluster to spread the key-value pairs and the memory pressure of a single Redis instance. Also avoid running Redis on the same machine as other memory-hungry applications.
Is the transparent large page mechanism enabled in the environment where the Redis instance runs? If so, turn it off.
Is Redis running in a master-slave setup? If so, limit the data size of the master instance to 2 to 4 GB to prevent the slave from being blocked while it loads a large RDB file during master-slave replication.
Are Redis instances running on multi-core CPUs or NUMA machines? On a multi-core CPU, you can bind a Redis instance to a physical core. On a NUMA architecture, make sure that the Redis instance and the network interrupt handler run on the same CPU socket.
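For the bigkey item above, incremental deletion with the SCAN family can be sketched as follows for a hash. The client is assumed to offer hscan, hdel, and delete with the same shapes as redis-py. The FakeClient class is only an in-memory stand-in so that the loop can be run without a server, and the key name and batch size are made up.

```python
def delete_big_hash(client, key, batch_size=100):
    """Delete a large hash in small batches instead of one blocking DEL.

    `client` is expected to offer hscan, hdel, and delete with the same
    shapes as redis-py; any object with those three methods works.
    """
    cursor, removed = 0, 0
    while True:
        cursor, fields = client.hscan(key, cursor=cursor, count=batch_size)
        if fields:
            removed += client.hdel(key, *fields.keys())
        if cursor == 0:
            break
    client.delete(key)  # remove the key itself if anything is left
    return removed


class FakeClient:
    """In-memory stand-in used only to exercise the loop without a server."""

    def __init__(self, hashes):
        self.hashes = hashes

    def hscan(self, name, cursor=0, count=10):
        items = list(self.hashes.get(name, {}).items())
        page = dict(items[:count])
        return (1 if len(items) > count else 0), page

    def hdel(self, name, *fields):
        target = self.hashes.get(name, {})
        return sum(1 for f in fields if target.pop(f, None) is not None)

    def delete(self, *names):
        return sum(1 for n in names if self.hashes.pop(n, None) is not None)


fake = FakeClient({"big": {f"f{i}": "v" for i in range(250)}})
removed = delete_big_hash(fake, "big")
```

Each HSCAN and HDEL call touches only a small batch, so the main thread is never blocked for long. Lists, sets, and sorted sets would need their own variants of the same loop.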
