Detect and Fix Large Keys and Hot Keys in Tair

Large keys and hot keys can degrade service performance, cause request timeouts, or even lead to system failures. This topic describes how to quickly find and optimize large and hot keys. It also analyzes their causes and effects, and provides preventive measures to reduce their impact on your business.

Step 1: Quickly find large and hot keys

Alibaba Cloud console tools

Tair and Redis provide the Top Key Statistics and Offline Full Key Analysis features in the console to help you quickly find large and hot keys.

Method	Limits	Description	Procedure
Top Key Statistics (Recommended)	This feature is supported only by Redis Open-Source Edition 5.0 or later, and memory-optimized and persistent memory instances of Tair (Enterprise Edition).	Displays real-time information about the top three large keys and hot keys for each data structure in each shard. Lets you view the historical information of large and hot keys for the last four days.	Log on to the console and go to the Instances page. In the top navigation bar, select the region in which the instance that you want to manage resides. Then, find the instance and click the instance ID. In the navigation pane on the left, click CloudDBA > Top Key Statistics or Offline Key Analysis.
Offline Full Key Analysis	This feature is not supported for disk-based instances.	Performs a custom analysis of the RDB backup file to obtain information such as memory usage, distribution, and expiration time of keys. The analysis is not real-time and can be time-consuming for large RDB files. Cannot analyze hot key information.

If your instance does not support these features, you can use the following methods.

Other methods to find large and hot keys

Method	Pros and cons	Description
Use the --bigkeys, --memkeys, and --hotkeys options of redis-cli	Pros: Convenient, fast, and safe. Cons: The analysis results are not customizable and may have low accuracy or timeliness. This method traverses all keys in the instance, which can affect performance.	The --bigkeys, --memkeys, and --hotkeys options of redis-cli retrieve overall key statistics and the largest or hottest key for each data structure. The differences are as follows: --bigkeys: Statistics for large keys. For collection types such as lists, it reports the number of elements. --memkeys: Statistics for large keys. It reports the memory occupied by the value of the key. --hotkeys: Statistics for hot keys. Supported data structures: STRING, LIST, HASH, SET, ZSET, and STREAM. For example, the command for --bigkeys is `redis-cli -h r-***************.redis.rds.aliyuncs.com -a <password> --bigkeys`.
Analyze target keys using built-in commands	Pros: Minimal impact on online services. Cons: The returned serialized length of a key is not the same as its actual length in memory. Therefore, the result is not precise and should only be used as a reference.	For keys of different data structures, you can use the following low-risk commands to determine whether they are large keys. STRING type: The STRLEN command returns the number of bytes in the value of the key. LIST type: The LLEN command returns the number of elements in the list. HASH type: The HLEN command returns the number of fields in the hash. SET type: The SCARD command returns the number of members in the set. ZSET type: The ZCARD command returns the number of members in the sorted set. STREAM type: The XLEN command returns the number of entries in the stream. Note The DEBUG OBJECT and MEMORY USAGE commands are resource-intensive and have a time complexity of O(N). They can block the instance and are not recommended.
Locate hot keys at the business layer	Pros: Accurately and promptly locates hot keys. Cons: Increases the complexity of business code and may slightly degrade performance.	You can add code at the business layer to record instance access and perform asynchronous analysis.
Use the redis-rdb-tools tool for custom analysis of large keys	Pros: Supports custom analysis with no impact on online services. Cons: The analysis is not real-time and can be time-consuming for large RDB files.	Redis-rdb-tools is an open-source tool written in Python that supports custom analysis of RDB snapshot files. After you download the RDB file, you can analyze the memory usage of all keys in the instance and perform flexible queries as needed.
Find hot keys using the MONITOR command	Pros: Convenient and safe. Cons: Consumes CPU, memory, and network resources. The results may not be timely or accurate.	The MONITOR command prints all requests sent to the instance, including time, client information, commands, and key information. In an emergency, you can briefly run the MONITOR command and save the output to a file. After stopping the MONITOR command, you can analyze the requests in the file to identify hot keys during that period. Note Because the MONITOR command can significantly degrade instance performance, do not use the MONITOR command except in special circumstances.
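
The offline analysis of saved MONITOR output can be sketched as follows. This is a minimal illustration, not a complete parser: it assumes each line has the usual `<timestamp> [<db> <client>] "COMMAND" "arg" ...` shape and that the key is the first argument, which does not hold for multi-key commands or for commands such as EVAL. The function name `count_keys` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches one double-quoted token, allowing backslash escapes inside it.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"')

def count_keys(lines):
    """Count how often each key appears as the first argument of a command."""
    counts = Counter()
    for line in lines:
        # Drop the "<timestamp> [<db> <client>]" prefix, then tokenize the command.
        _, sep, rest = line.partition("] ")
        if not sep:
            continue  # not a command line, for example the initial "OK"
        tokens = _TOKEN.findall(rest)
        if len(tokens) >= 2:  # tokens[0] is the command, tokens[1] the assumed key
            counts[tokens[1]] += 1
    return counts

# Hypothetical sample of saved MONITOR output.
sample = [
    'OK',
    '1700000000.000001 [0 10.0.0.1:50000] "GET" "product:42"',
    '1700000000.000002 [0 10.0.0.2:50001] "HGET" "user:1001" "name"',
    '1700000000.000003 [0 10.0.0.1:50000] "GET" "product:42"',
    '1700000000.000004 [0 10.0.0.3:50002] "PING"',
]
top = count_keys(sample).most_common(2)
print(top)  # [('product:42', 2), ('user:1001', 1)]
```

Because the counting runs on a saved file, it adds no load to the instance beyond the MONITOR session itself.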

Step 2: Optimize large and hot keys

Large keys

Solution	Scenarios	Recommended actions
Clean up expired data	A large amount of expired data has accumulated, such as uncleaned incremental data in a HASH.	You can use the HSCAN command with the HDEL command to clean up invalid data. This prevents the instance from being blocked, which can occur when you clean up a large amount of data at once.
Compress large keys	Compressible data such as JSON and XML text data, including logs and configurations.	You can enable compression during serialization, such as GZIP or Snappy. You can use a binary serialization protocol, such as Protocol Buffers. Note Compression and decompression operations consume extra CPU resources and may affect processing performance.
Split large keys	Frequently accessed HASH, ZSET, and other data structures, such as leaderboards.	You can split keys based on business logic, such as by user ID or time range. You can use a sharding key design, such as user:1001:shard1 and user:1001:shard2. Splitting large keys can effectively prevent data skew.
Offload large keys	Large files or Binary Large Objects (BLOBs) of the String type.	You can move data that is not suited to Tair or Redis to other storage systems, such as OSS, and then delete it from the instance. For Redis Open-Source Edition 4.0 and later: You can use the UNLINK command to safely delete large or even extra-large keys. This command cleans up keys asynchronously to avoid blocking the main thread. For versions earlier than Redis Open-Source Edition 4.0: You can use the SCAN command to traverse and delete data in batches. This avoids the main-thread blocking that deleting a large amount of data at once can cause.
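
The sharding key design mentioned under "Split large keys" could look like the following sketch. It assumes that members are routed to sub-keys by a CRC32 hash of the member name. The helper names `shard_key` and `all_shard_keys` and the choice of CRC32 are illustrative assumptions, not part of Tair or Redis.

```python
import zlib

def shard_key(base_key, member, shard_count):
    """Return the sub-key that should hold `member`, e.g. 'user:1001:shard2'."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # CRC32 is stable across processes, unlike Python's built-in hash() for str.
    index = zlib.crc32(member.encode("utf-8")) % shard_count
    return f"{base_key}:shard{index + 1}"

def all_shard_keys(base_key, shard_count):
    """List every sub-key, for operations that must read the whole collection."""
    return [f"{base_key}:shard{i + 1}" for i in range(shard_count)]

# The same member always maps to the same sub-key.
key = shard_key("user:1001", "order:20240101", 4)
print(key in all_shard_keys("user:1001", 4))  # True
```

A read or write for a single member touches only one sub-key, while an operation over the whole collection has to visit every key returned by `all_shard_keys`. Changing `shard_count` later remaps members, so the count is best fixed before the data grows.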

Hot keys

Solution	Scenarios	Recommended actions
Replicate hot keys in a cluster architecture	A hot key is stored as a whole in a single shard, and requests cannot be distributed by migrating partial data.	Copy the hot key and migrate the replicas to other data shards. For example, copy a hot key named `foo` to create three identical keys named `foo2`, `foo3`, and `foo4`. Migrate these three keys to other data shards to relieve the pressure on the single data shard with the hot key. Note The disadvantage of this solution is that you must modify your code to maintain multiple replicas, and it is difficult to ensure data consistency between them. For example, an update operation must be synchronized across all replicas. Use this solution as a temporary measure to mitigate urgent issues.
Enable read/write splitting	Read-heavy and write-light workloads	Enable read/write splitting so that read requests are served by read-only nodes. If the read request load is still high after you enable this feature, add more read-only nodes to further relieve the load. Note In scenarios with extremely high request volumes, primary/secondary synchronization inevitably lags, so reads can return stale (dirty) data. Therefore, do not enable read/write splitting in scenarios with high read and write pressure and strict data consistency requirements.
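
The hot key replication approach can be sketched at the application layer as follows. This is a minimal illustration under stated assumptions: `client` is any object that exposes `get` and `set` (for example, a redis-py client), and the function names are hypothetical. The code does not control which data shard each copy lands on. That placement still has to be arranged separately.

```python
import random

def replica_names(key, copies):
    """Return the original key plus its copies: foo, foo2, foo3, ..."""
    return [key] + [f"{key}{i}" for i in range(2, copies + 2)]

def read_hot_key(client, key, copies, rng=random):
    """Read from one randomly chosen replica to spread the read load."""
    return client.get(rng.choice(replica_names(key, copies)))

def write_hot_key(client, key, value, copies):
    """Write to every replica; a failure part-way leaves replicas inconsistent."""
    for name in replica_names(key, copies):
        client.set(name, value)

print(replica_names("foo", 3))  # ['foo', 'foo2', 'foo3', 'foo4']
```

The write path shows why this is only a temporary measure: every update costs one write per replica, and nothing here makes those writes atomic.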

Step 3: Prevent large and hot keys from affecting your business

Causes of large and hot keys

In Tair and Redis, the minimum unit for data distribution is a key. A single key is stored in a specific data shard and is not split. Factors such as insufficient business planning, accumulation of invalid data, and sudden increases in access volume can all lead to the generation of large and hot keys in an instance. Examples include the following:

Category	Cause
Large key	Using Tair and Redis in unsuitable scenarios, which can result in excessively large key values. For example, using a String-type key to store large binary file data. Insufficient planning and design before the business goes online. Members within keys are not reasonably split, resulting in an excessive number of members in some keys. Failure to periodically clean up invalid data, causing members of HASH-type keys to continuously increase. A code failure on the consumer side of a business that uses LIST-type keys, causing the members of the corresponding key to only increase.
Hot key	Unexpected sharp increases in access volume. Examples include a sudden hit product, a hot news story with soaring traffic, a streamer's event in a live channel that generates many likes, or a battle between multiple guilds in a game involving many players in a specific area.

Effects of large and hot keys

Category	Effect
Large key	Slows down command execution on the client. When the instance memory reaches the maxmemory limit, it can cause operations to be blocked, important keys to be evicted, or even an out-of-memory (OOM) error. In a cluster architecture, the memory usage of one data shard far exceeds that of others, preventing balanced use of memory resources across shards. Executing read requests on a large key can saturate the instance's network bandwidth, slowing down its own services and affecting related services. Deleting a large key can easily block the primary database for a long time, which may trigger a synchronization break or a primary/secondary failover.
Hot key	Consumes a large amount of CPU resources and may increase network bandwidth usage, which affects other requests and reduces overall performance. In a cluster architecture, it causes access skew, where one data shard is heavily accessed while others are idle. This can lead to issues such as exhausting the connection limit for that shard and rejecting new connection requests. In flash sale scenarios, the request volume for the inventory key of a product may exceed the instance's processing capacity, leading to overselling. If the request pressure on a hot key exceeds the instance's capacity, it can easily cause a cache breakdown. This means many requests are directed to the backend storage layer, causing a surge in storage access or even a breakdown, which in turn affects other services.

Prevention strategies

Strategy	Description
Configure monitoring and alerts	Set reasonable alert thresholds for metrics such as CPU usage, memory usage, and Connections. For example, set an alert for when memory usage exceeds 70% or when memory grows by more than 20% in one hour. When an alert is triggered, follow the instructions in Step 1 and Step 2 of this topic to locate and optimize large and hot keys so that you can resolve the issue before it affects your business.
Use Tair (Enterprise Edition) to avoid cleaning up invalid data	For scenarios involving large keys of the hash type, Tair (Enterprise Edition) provides an enhanced data structure, TairHash. It supports setting an expiration time and version for each field. When used correctly, TairHash can significantly reduce O&M workload, reduce the complexity of business code, and help you handle issues caused by large and hot keys.
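
The example thresholds from "Configure monitoring and alerts" (memory usage above 70%, or growth of more than 20% in one hour) can be expressed as a small check. This is only a sketch of the arithmetic, assuming memory samples are collected elsewhere as `(unix_timestamp, used_bytes)` pairs. The function name `memory_alerts` and its parameters are hypothetical and are not a console or API feature.

```python
def memory_alerts(samples, max_bytes, usage_limit=0.70, growth_limit=0.20, window_s=3600):
    """Return triggered alert names for samples of (unix_timestamp, used_bytes), oldest first."""
    alerts = []
    if not samples:
        return alerts
    now, used = samples[-1]
    # Absolute threshold: current usage as a fraction of the memory limit.
    if used / max_bytes > usage_limit:
        alerts.append("usage")
    # Growth threshold: compare with the oldest sample still inside the window.
    in_window = [(t, b) for t, b in samples if now - t <= window_s]
    _, baseline = in_window[0]
    if baseline > 0 and (used - baseline) / baseline > growth_limit:
        alerts.append("growth")
    return alerts

# Hypothetical samples: 100 -> 125 bytes within one hour is 25% growth.
samples = [(0, 100), (1800, 110), (3600, 125)]
print(memory_alerts(samples, max_bytes=200))  # ['growth']
print(memory_alerts(samples, max_bytes=150))  # ['usage', 'growth']
```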