A Detailed Explanation of the Detection and Processing of BigKey and HotKey in Redis

This blog introduces BigKey and HotKey in Redis and discusses what leads to HotKeys or BigKeys, how to recognize HotKeys or BigKeys, further, the problem it brings, and ways you can take to handle.

By Yanquan

Preface

When using Redis, we often encounter BigKey and HotKey. If not detected and processed in time, BigKey and HotKey are likely to degrade service performance, deteriorate user experience, and even cause large-scale system failures.

Definition of BigKey and HotKey

We can often see the definitions of BigKey and HotKey in Guideline for Developing With Redis in companies, or in numerous articles about Redis best practices on the Internet. Even though the criteria for judging BigKey and HotKey are different, it is clear that the dimensions of judging are the same. BigKey is usually determined by the data size and number of members, while HotKey is determined by the frequency and number of requests Redis receives.

Definition of BigKey

Generally, we call a key that contains large size of data or a large number of members and lists as BigKey. There are some examples to help you fully understand the characteristics of BigKey, find as below.

A STRING key whose size of value is 5 MB (the data is too large).
A LIST key with 20,000 elements (the number of elements in the list is excessive).
A ZSET key with 10,000 members (the number of members is excessive).
A HASH key whose size is 100MB even if only contains 1,000 members (the size of the key is too large).

It should be noted that the definition of BigKey might be different according to actual use cases and business scenarios of Redis. This is to say that you should judge taking all factors into consideration. In given examples, you can see some specific numbers for the size, members, and elements in a key, which is not a common definition of a BigKey, it's only to simplify the explanation, cannot be regarded as a standard in fact.

Definition of HotKey

When the workload of access to a key is significantly higher than that of other keys, we can call it a HotKey. To help you better understand what HotKeys looks like, here are some examples, please check as below.

The total workload of access per second of a Redis instance is 10,000, while that of one key reaches 7,000. (Obviously, this key is a HotKey whose access is significantly higher than that of other keys.)
A HASH key with thousands of members and a total size of 1MB receives a large amount of HGETALL requests each second. (In this case, we call it a HotKey since accesses to one key cost significantly more bandwidth than that of other keys.)
A ZSET key with tens of thousands of members receives a large number of ZRANGE requests each second. (It's clear that the CPU run time is significantly higher than that spent on requests of other keys. Similarly, we can say this kind of key with large CPU consuming is a HotKey.)

Problems Caused by BigKey and HotKey

When using Redis, BigKey and HotKey bring various problems. The most common ones are performance degradation, access timeout, access skew and data skew.

Typical Problems Caused by BigKey

Clients get delayed responses and it feels like the Redis is slowing down.
The Redis memory usage keeps growing which triggers OOM, or write blocking and eviction of important keys occurs because of reaching the maxmemory.
Access skew occurs which may cause a Redis instance to reach the performance bottleneck, so that the whole cluster also reaches the performance bottleneck. In this case, the memory usage of one node in Redis Cluster usually far exceeds that of other nodes because of access requsts for BigKeys on it, and the minimum granularity of data migration in Redis Cluster is key that the memory on that node cannot be balanced. This is to say the problem cannot be resolved unless you find a method to divide BigKeys into small keys.
All services the Redis should be offering are affected as Redis is getting slower, which is all because read requests on a BigKey occupy all the bandwidth of Redis.
Synchronization interruption or failover occurs because of long time blocking of primary database when deleting a BigKey.

Typical Problems Caused by HotKey

HotKeys usually have long CPU run time, which deteriorates Redis performance and affects other requests.
Hotspot on some Redis nodes/machines, instead of on keys that sharding to different Redis nodes, often prevent you from taking full advantage of your Redis Cluster. As the memory/cpu load of these nodes will be quite heavy, while other nodes are not efficiently used.
As for panic buying and flash sales scenarios, oversold often occurs because the number of read requests for inventory on keys of corresponding goods is too much beyond the instance performance that the Redis cannot handle.
The traffic on HotKeys suddenly surges to the maximum threshold the Redis can tolerate, even causing the cache service to crash, which is commonly referred to as a Redis avalanche. If this is the case, a large number of requests will directly hit backend database, causing a high load to the database layer that the database may be unable to withstand and downtime, thus affecting the business.

Common Causes of BigKey and HotKey

Insufficient business planning, incorrect use of Redis, accumulation of invalid data, and sudden increase in access can cause BigKey and HotKey issues. For example:

1) BigKey: Use Redis for inappropriate data types can introduce BigKey problems. For example, using String keys to store large binary files will make the values of the keys too large.

2) BigKey: Insufficient planning and design before business launch and no proper sharding policies or split plans to divide members in an individual key to multiple keys, resulting in an excessive number of members in a particular key.

3) BigKey: No regular cleanup on invalid data in HASH keys, resulting in a continuous increase of members in HASH keys, which may bring BigKey problems.

4) HotKey: Unexpected increase of access traffic owing to hot products, hot news, KOL (Key Opinion Leader) live streaming event or an online games battle that invloves a large number of players.

5) BigKey: Logical failures on business side that prevents LIST keys from being consumed, which result in an increasing number of members in the corresponding Key with no trend to decrease.

Dicover BigKey and HotKey in Redis

【Discover BigKeys】

It's not that difficult to discover BigKeys and HotKeys since there are many ways and means to analyze keys in Redis and to find out "problem" keys, such as built-in functions, open source tools of Redis, and Key Analysis functions in CloudDBA which you can find on the ApsaraDB for Redis console.

Discover BigKey and HotKey with Built-in Functions of Redis

Some built-in commands and tools in Redis can help us find these problem keys. If you already have a clear analysis target, for example, there are some specious keys in your mind that might be the BigKey and HotKey, run the following commands to analyze.

Analyze keys by using Redis built-in commands

[Built-in command] debug object <key_name> (not recommended)

You may choose to use the debug object command to analyze the keys. This command can analyze the key according to the incoming object (the key name) and return a variety of information, where the value of serializedlength is the serialization length of the key. You can choose to use this value to determine whether the corresponding key is a BigKey according to your determination criteria.

It should be noted that the serialization length of the key is not equal to its actual size in memory. In addition, debug object is a debugging command and costs a lot to run, it's generally regarded as a command in high risk. It blocks other requests during its its run time, in other words, the Redis is out of service until its execution is completed. The serialization length of the incoming object (key name) determines the time occupied, the larger the key is then the more time it spends. Therefore, this command is not recommended to use in production environments to analyze BigKey.

[Built-in command] memory usage <key_name> (not recommended)

Redis has provided the MEMORY USAGE command since the version 4.0 to help you analyze the memory usage of keys. Even if its execution cost is lower than the debug object command, it still risks blocking when analyzing BigKey, because its time complexity is O(N),

[Built-in command] strlen, hlen, scard, zcard, llen, xlen (recommended)

We recommend analyzing keys in a less risky way. Redis provides different commands to get the length or number of members of keys with different data types, as shown in the following table:

By using the preceding Redis built-in commands, we can easily and safely analyze keys without affecting online services. However, the length it results usually doesn't equal to a key's memory use, so it can only be used for reference.

Discover BigKeys by using the BigKeys parameter in redis-cli

If you do not have a clear target key for analysis but want to find out the BigKey in Redis, you can use redis-cli to achieve this goal by adding '--BigKeys' in the end.

[example] redis-cli -h xxx -p xxx -a xxx --bigKeys

Redis provides the BigKeys parameter to enable redis-cli to analyze all keys in the entire Redis instance in a traversal manner and return a standard summary report. It's convenient and safe, but it's also very obvious to see that the analysis results cannot be customized.

The BigKeys parameter can only output the largest keys of all six Redis data types respectively. If you want to analyze only the STRING keys or find the HASH keys with more than 10 members, then it's impossible to achive by using the BigKeys parameter. What you can turn to if customized report is really in need? There are some open source projects on GitHub that can implement the enhanced version of BigKeys parameter so that the results can be customized according to the configuration. In addition, you can write a python or other scripts with a loop for all data types, using all commands listed in the table above, combining with scan functions to realize a BigKey analysis tool at the Redis instance level.

Similarly, this solution returns a result that is not so accurate and real-time, it's only for reference.

【Discover HotKeys】

Discover HotKeys by using HotKeys parameter in redis-cli

Redis has provided the HotKeys parameter since the version 4.0 to facilitate instance-level HotKey analysis. It can return the number of times that all keys have been accessed but the prerequisite before using is to set the maxmemory-policy to 'allkeys-lfu', which is only supported after Redis 4.0 and later versions.

Discover HotKeys from the business layer

Every access to Redis comes from the business layer. This means that we can record, asynchronously summarize, and analyze the access to Redis by adding corresponding codes to the business layer. In this way, you can accurately analyze HotKeys in real time, but at the expense of business code simplicity.

Discover HotKeys by using monitor command for emergency

The monitor command of Redis can print all requests in Redis according to the fact, including access time, client IP, commands, and keys. In emergency, we can execute the monitor command for a short period of time and redirect the output to the file because it causes extra cost of CPU, memory and network. After terminating the monitor command, HotKeys during this period can be found by classifying and analyzing the requests in the file.

As we mentioned before, the monitor command consumes the CPU, memory, and network resources of ApsaraDB for Redis. Therefore, for a Redis with a high workload, the monitor command may make things worse. Meanwhile, the timeliness of this asynchronous collection and analysis solution is poor, and the accuracy of the analysis depends on the execution duration of the monitor command. So, the accuracy of the result is not good enough in most online scenarios where the command cannot be executed for a long time.

Discover BigKeys by Using Open Source Tools

The popularity of Redis enables us to easily find a large number of open source solutions to solve the current problems we are facing, such as getting accurate analysis reports without affecting online services.

Use redis-rdb-tools to discover BigKeys in a customized way

The redis-rdb-tools is great if you want to accurately analyze the real memory usage of all keys in a Redis instance according to your own standards. It also can avoid disrupting online services, and you can get a concise, easy-to-understand report after the analysis finishes.

Redis rdstools allows you to perform customized analysis on RDB files of Redis. Since the analysis of RDB files works offline, it has no impact on online services. This is its biggest advantage but also its biggest disadvantage because offline analysis represents poor timeliness of analysis results. For a large RDB file, the analysis may last a long time.

Discover BigKeys and HotKeys by using the Key Analysis Service on Public Cloud

For some public cloud service vendors, you can find one-click functions like 'Key Analysis Service', which allows you to analyze all the keys in the Redis instance in real time to discover the current and historical presence of BigKeys and HotKeys. In addition, it helps you know which BigKeys and HotKeys have appeared in the running timeline of Redis, so that you can have a comprehensive and accurate judgment on the running status of the entire Redis instance.

CloudDBA on the ApsaraDB for Redis console

CloudDBA is an intelligent service for Alibaba Cloud database services (ApsaraDB). For ApsaraDB for Redis, it supports real-time analysis and discovery of BigKeys and HotKeys.

The Key Analysis Service is a kernel based function, which can directly discover and output information about BigKeys and HotKeys from the Redis kernel layer. It captures information from kernel which is the brain of the Redis and where all things organized, in other words, the data comes from the original source, therefore, the result is accurate and efficient. And because there is no extra calculating involved just some simple information capturing, it almost has no impact on performance. You can use this function by clicking "Key Analysis" under CloudDBA, as shown in figure 1-1:

Figure 1-1: CloudDBA on the ApsaraDB for Redis console

There are two tabs on the page of Key Analysis,'Real-time' and 'History', which allows you to analyze keys in the corresponding Redis instance in different time dimensions:

Real-time: Analyze the current instance immediately and display all the existing BigKeys and HotKeys.
History: Display BigKeys and HotKeys that have appeared in the instance recently. On the History page, all BigKeys and HotKeys that have appeared are recorded, even if these keys no longer exist now. This function can well reflect the historical status of keys in Redis and help trace the history problems or problems with damaged scenes.

Managing BigKeys and HotKeys

Now, we have found the problematic keys in Redis through various means, it is time to resolve to prevent potential problems in the future.

Common Methods for Dealing with BigKey

Split BigKey

For a string type BigKey, you can consider splitting it into multiple key values. For hash or list types, consider splitting them into multiple hash or list types. But you should ensure that the number of members of each key is within a reasonable range. To avoid uneven memory space, the splitting plan of BigKeys plays a significant especially for a Redis Cluster.

Clean up BigKey

Choose the right storage for your data is very important, particularly you can store data that is not suitable for Redis to other storage media and delete in Redis. As mentioned before, BigKeys may cause interruption to Redis cluster synchronization, so when deleting a key, it's better to use UNLINK command which can slowly and gradually clean up the incoming keys in a non-blocking way and it's provided since Redis 4.0. Through UNLINK, you can safely delete all kinds of BigKeys.

Monitor the memory usage level of Redis at all times

The sudden occurrence of a BigKey problem can often catch us off guard. Therefore, finding and dealing with it before any further problems is an important means to keep the service stable. We can monitor the system and set a reasonable memory alarm threshold for Redis to remind us that BigKeys may be generated at this time. For example, alarm will be sent when Redis memory usage exceeds 70% or Redis memory growth rate exceeds 20% in one hour.

Through monitoring, we can solve the problem before it occurs. For example, when the failure of the consumer program of LIST causes the number of members in a list to grow continuously and no trend to decrease, we can turn alarms into warnings so the owner of the module may notice and fix the problem in time, by which we can avoid failures.

Regularly clean up invalid data

For example, when we incrementally write a large amount of data to a HASH key, ignoring the timeliness of these data. These accumulated invalid data will cause the generation of a BigKey. You can clean up the invalid data by using scheduled tasks. HSCAN and HDEL are recommended to use, by which invalid data can be cleaned up without blocking.

Use Alibaba Cloud Tair (Enterprise Edition of Redis) to avoid tedious work of invalid data cleanup

You may have too many HASH keys, and a large number of invalid members needs to be cleaned up. For this scenario, scheduled tasks can no longer clean up invalid data in a timely manner. However, you can use Alibaba Cloud's Tair to solve such problems well.

Tair is the Enterprise Edition of Alibaba Cloud ApsaraDB for Redis. It provides a large number of additional advanced features except for all the features of ApsaraDB for Redis, including the high-performance features.

TairHash is a hash data type that allows you to set the expiration time and version for a field. Not only does It support a wide range of data interfaces and provide high processing performance like Redis Hash, but it also breaks the hash restriction that a hash key can only have expiration time in addtion to its value, and apparently it supports both expiration time and version setting for a TarHash key. This greatly improves the flexibility of the hash data and simplifies business code in many scenarios.

TairHash uses the efficient Active Expire algorithm to achieve the function of discovering and deleting expired data with almost no impact on the response time. The reasonable use of such advanced functions can liberate a large number of Redis O&M and fault handling work and reduce the complexity of the business code. By doing so, O&M personnel can devote their effort to other more valuable work, and R&D personnel can have more time to focus on business code.

Common Methods for Dealing with HotKeys

Copy HotKeys in the Redis Cluster

As mentioned before, the minimum migration granularity is key, when the memory usage of a particular node in a Redis Cluster increased to a high level because of a HotKey, it can not be back to normal unless you can split the HotKey into pieces and distribute to other nodes. If you are facing such scenario, you can copy the corresponding HotKeys and migrate them to other nodes. For example, you can copy three keys with exactly the same content of the HotKey 'foo' and name them 'foo2', 'foo3', and 'foo4'. Then, migrate these three copies to other nodes to share the workload of a single node.

However the disadvantage of this solution is obvious that the code needs to be modified by linkage and key copy brings data consistency challenges. That is, compared with updating one key, you need to update multiple keys at the same time now. In many cases, this is recognized as a temporary solution for some difficult environment.

Use read/write splitting architecture

If HotKeys are generated from read requests, read/write splitting is a good solution. When using the read/write splitting architecture, the read request workload on each Redis node can be decreased by continuously adding secondary nodes.

However, this solution increases the complexity of business code and the complexity of the Redis cluster architecture. For example, for Redis layer, on the one hand, forwarding layers (such as Proxy and LVS) is needed now for multiple secondary nodes to implement load balancing. On the other hand, considering with high availability of the service, increased node failure rate and node tolerance should be taken into consideration when adding more secondary nodes. The changes of the Redis cluster architecture bring greater challenges to monitoring, O&M, and failure processing.

However, all these are extremely simple in Alibaba Cloud ApsaraDB for Redis which provides out-of-box services. At the same time, when business grows, Alibaba Cloud's ApsaraDB for Redis allows users to adjust the cluster architecture by changing configurations. For example, the primary/secondary architecture can be changed to read/write splitting architecture, the read/write splitting architecture can be changed to a cluster, the primary/secondary architecture can be changed to a cluster that supports read/write splitting, and Redis Community Edition can be directly changed to Redis Enterprise Edition (Tair) that supports a large number of advanced features.

The read/write splitting architecture can be used for dealing with HotKey issues but it also has shortcomings. In scenarios with a large number of requests, the read/write splitting architecture will incur an inevitable delay, that's where dirty data may be read. Therefore, the read/write splitting architecture is not appropriate in scenarios where both read and write workload is high and high data consistency is required.

Use the Proxy Query Cache feature of Alibaba Cloud Tair

Proxy Query Cache is one of the enterprise-level features of Alibaba Cloud Tair (Redis Enterprise Edition). Its principle is shown in figure 2-1:

Figure 2-1: Tair QueryCache principle

Alibaba Cloud ApsaraDB for Redis identifies hotkeys in instances based on efficient sorting and statistical algorithms. After enabling this function, Proxy servers will cache requests and corresponding query results based on the rules you set. Proxy servers cache only request and response data of a HotKey, instead of the entire key. If a proxy server receives a duplicate request within the validity period, the proxy server directly returns a response to the client without the need to interact with backend data shards. This improves the read speed and reduces the performance impact of HotKeys on data shards to avoid request skew.

So far, the same request from the client no longer needs to interact with the Redis behind Proxy. Instead, Proxy returns the data directly. The access request to the HotKey is processed by multiple proxy servers instead of a single Redis node, greatly reducing the HotKey workload on one particular Redis node. Moreover, the Proxy Query Cache function of Tair also provides a large number of commands to facilitate users to view and manage HotKeys. For example, run the querycache keys command to view all cached HotKeys, and run querycache listall to obtain all cached commands.

Combining Key Analysis Service in CloudDBA and Tair Proxy Query Cache feature, using ApsaraDB for Redis can reduce your cost of O&M and improve R&D efficiency.

Use Global Distributed Cache for Redis on Alibaba Cloud

For HotKeys with frequent write instead of frequent read access over and over, then traditional Redis read/write splitting and cluster structures have nothing to do for resolving this kind of HotKey issues. What you need now is Global Distributed Cache feature which can extend the write capability of Redis by parallel writes upon a same key on distributed nodes, reducing the write workload to a particular node.

Global Distributed Cache for Redis of Alibaba Cloud, also known as Global Replica, is an active geo-redundancy database system that is developed based on ApsaraDB for Redis. It is ideal for business scenarios in which multiple sites in different regions provide services at the same time. It helps enterprises quickly realize similar active geo-redundancy architecture to Alibaba. Global Distributed Cache for Redis solves the cross-region and cross-country active problems. In addition, as it supports distributed write on up to three nodes, it can be used to improve write performance as well at up to three times. The architecture of Global Distributed Cache for Redis is shown in figure 2-2:

Figure 2-2: Architecture of Global Distributed Cache for Redis

Compared with traditional Redis synchronization middleware, Global Distributed Cache for Redis features high reliability, high throughput, low latency, and high synchronization accuracy.