Tair: Identify and handle large keys and hotkeys

Last Updated: Jul 05, 2024

If large keys or hotkeys in Tair are not identified and handled promptly, they can cause performance degradation, a poor user experience, and even large-scale failures. This topic describes the causes of large keys and hotkeys, the issues that they may cause, and how to identify and optimize them.

Definitions of large key and hotkey

Important

The numbers used in the following examples are for reference only. You must determine whether a key is a large key or a hotkey based on the actual usage of the instance.

Term

Description

Example

large key

The size of a key and the number of members in the key determine whether the key is considered a large key.

  • A key that is large in size is considered a large key. For example, a STRING key that is 5 MB in size is considered a large key.

  • A key that has a large number of members is considered a large key. For example, a ZSET key that has 10,000 members is considered a large key.

  • A key whose member data is large in size is considered a large key. For example, a HASH key is considered a large key if the key has only 2,000 members but these members have a total size of 100 MB.

hotkey

The frequency at which a key is requested determines whether the key is considered a hotkey.

  • A key that receives a large number of queries per second (QPS) is considered a hotkey. For example, if a Tair instance has a total QPS of 10,000 and one key in the instance receives 7,000 QPS, the key is considered a hotkey.

  • A key that has a high bandwidth usage is considered a hotkey. For example, if a HASH key that has thousands of members and is 1 MB in size receives a large number of HGETALL commands per second, the key is considered a hotkey.

  • A key that has a high CPU utilization is considered a hotkey. For example, if a ZSET key that has tens of thousands of members receives a large number of ZRANGE commands per second, the key is considered a hotkey.
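The criteria above can be expressed as simple threshold checks. The following is a minimal sketch only: the `KeyStats` type, the function names, and the threshold constants are hypothetical and simply reuse the reference numbers from the examples above, which must be tuned for each instance.

```python
from dataclasses import dataclass

# Illustrative thresholds taken from the examples above; tune them per instance.
LARGE_VALUE_BYTES = 5 * 1024 * 1024   # e.g. a STRING key that is 5 MB in size
LARGE_MEMBER_COUNT = 10_000           # e.g. a ZSET key that has 10,000 members
HOT_QPS_SHARE = 0.7                   # e.g. 7,000 of 10,000 total QPS


@dataclass
class KeyStats:
    name: str
    size_bytes: int     # total size of the value, including all members
    member_count: int   # 1 for a STRING key
    qps: float          # requests per second received by this key


def is_large_key(stats: KeyStats) -> bool:
    """A key is large if its total size or its member count is large."""
    return (stats.size_bytes >= LARGE_VALUE_BYTES
            or stats.member_count >= LARGE_MEMBER_COUNT)


def is_hotkey(stats: KeyStats, instance_qps: float) -> bool:
    """A key is hot if it receives a large share of the instance's QPS."""
    if instance_qps <= 0:
        return False
    return stats.qps / instance_qps >= HOT_QPS_SHARE
```

This sketch covers only the size, member count, and QPS criteria. The bandwidth and CPU criteria depend on which commands are run against the key and are not modeled here.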

Issues caused by large keys and hotkeys

Category

Description

Large keys

  • It takes longer for the client to run commands.

  • An operation may be blocked, an important key may be evicted, or an out of memory (OOM) error may occur when the memory usage of a Tair instance reaches the upper limit specified by the maxmemory parameter.

  • The memory usage of a data shard in a Tair cluster instance far exceeds that of other data shards, which results in imbalanced memory usage across data shards in the instance.

  • When a read request is made for a large key, the response time may increase and other services may be affected. This is because the bandwidth of the Tair instance to which the key belongs is exhausted.

  • The primary database may be blocked for an extended period of time while a large key is being deleted. This may lead to a synchronization failure or a master-replica switchover.

Hotkeys

  • Hotkeys consume large amounts of CPU resources, which slows down the response to requests for regular keys and degrades the performance of Tair instances.

  • Request skews may take place for Tair cluster instances. Request skews occur when one data shard in an instance receives a large number of requests while other data shards in the instance remain idle. In this situation, the maximum number of connections to a data shard may be reached and new connections to the shard may be rejected.

  • During flash sales, overselling may occur if the key corresponding to a commodity receives more requests than can be handled by Tair.

  • A cache breakdown may occur if a hotkey receives more requests than can be handled by Tair. In this case, a large number of requests are directly sent to the backend storage, and a backend storage breakdown may occur. This affects other services.

Causes of large keys and hotkeys

Large keys and hotkeys may occur for a variety of reasons, such as incorrect use of Tair, insufficient workload planning, accumulation of invalid data, and traffic spikes.

  • Large keys

    • Incorrect use of Tair: If Tair is used in an improper scenario, the size of a key may be larger than necessary. For example, if a STRING key is used to store a large binary file, the size of the key may be larger than necessary.

    • Insufficient workload planning: Before a feature is released, a failure to sufficiently plan for workloads can result in problems. For example, members may not be properly split between keys and some keys may have more members than required.

    • Accumulation of invalid data: This occurs when invalid data is not deleted on a regular basis. For example, the number of members of a HASH key constantly increases when invalid data is not cleared in a timely manner.

    • Code failures: If a failure occurs in the code of a consumer application that uses LIST keys, members are only added to the keys and are never removed.

  • Hotkeys

    • Unexpected traffic spikes: Unexpected traffic spikes may occur for a variety of reasons, such as high product popularity, hot news, a large number of "likes" flooding in from the viewers of a livestream, or battles between multiple large teams in a game.

Identify large keys and hotkeys

Tair provides a variety of methods for you to identify large keys and hotkeys.

Method

Advantage and disadvantage

Description

Use the real-time key statistics feature (recommended)

  • Advantages: This method features high precision and minimal impact on performance.

  • Disadvantages: The number of keys displayed is limited, but sufficient for most common scenarios.

You can use the real-time key statistics feature to display the statistics of large keys and hotkeys in an instance in real time. You can also query the historical statistics of large keys and hotkeys that were generated within the last four days. You can use this feature to obtain key statistics such as memory usage and access frequency. Then, you can troubleshoot issues and optimize instances based on the statistics.

Use the offline key analysis feature

  • Advantages: This method allows you to analyze historical backup files without affecting online services.

  • Disadvantages: This method does not allow for rapid analysis, and it takes longer to analyze large Redis Database (RDB) files.

The offline key analysis feature allows you to analyze RDB backup files of Tair instances in a customized manner and identify large keys in these instances. You can view the statistics of keys in an instance, such as their memory usage, distribution, and time-to-live (TTL). You can use these statistics to optimize the instance and prevent issues such as insufficient memory and performance degradation that are caused by the improper distribution of keys.

Identify large keys and hotkeys by using the bigkeys and hotkeys parameters in redis-cli

  • Advantages: This method is convenient, fast, and secure.

  • Disadvantages: This method does not support customized analysis, provides limited precision, and does not allow for rapid analysis.

Redis provides the bigkeys parameter to enable redis-cli to traverse all keys in a Tair instance and return the overall statistics of keys and the largest keys of each data type. The bigkeys parameter can return statistics for keys of the STRING, LIST, HASH, SET, ZSET, and STREAM types. Sample command: redis-cli -h r-***************.redis.rds.aliyuncs.com -a <password> --bigkeys.

Note

If you want to analyze only large keys of the STRING type or identify the HASH keys that have more than 10 members, the bigkeys parameter cannot fulfill your needs.

Starting from Redis 4.0, the hotkeys parameter is provided to help you quickly identify hotkeys. Sample command: redis-cli -h r-***************.redis.rds.aliyuncs.com -a <password> --hotkeys.

Analyze a specific key by using built-in commands in Redis

  • Advantages: This method is convenient and has little impact on online services.

  • Disadvantages: The returned serialized length of a key is not equal to the actual length of the key in the memory. This method provides limited precision and is for reference only.

The following section lists low-risk commands for analyzing keys of various data types to determine whether a key is a large key:

  • For a STRING key, run the STRLEN command. This command returns the length (number of bytes) of a string value stored at the key.

  • For a LIST key, run the LLEN command. This command returns the length of a list value stored at the key.

  • For a HASH key, run the HLEN command. This command returns the number of members in the key.

  • For a SET key, run the SCARD command. This command returns the number of members in the key.

  • For a ZSET key, run the ZCARD command. This command returns the number of members in the key.

  • For a STREAM key, run the XLEN command. This command returns the number of members in the key.

Note

The DEBUG OBJECT and MEMORY USAGE commands consume large amounts of resources when they are run. In addition, the time complexity of these commands is O(N), which indicates that these commands may block Tair instances. Therefore, we recommend that you do not use these commands.
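The per-type commands listed above can be wrapped in a small helper. The following is a minimal sketch that assumes a client object exposing an `execute_command` method in the style of the redis-py library. The `key_length` function and the `LENGTH_COMMAND` table are hypothetical names introduced for illustration.

```python
# Maps the value reported by TYPE to the low-risk length command listed above.
LENGTH_COMMAND = {
    "string": "STRLEN",
    "list": "LLEN",
    "hash": "HLEN",
    "set": "SCARD",
    "zset": "ZCARD",
    "stream": "XLEN",
}


def key_length(client, key):
    """Return (type, length) for key, or (type, None) for unsupported types."""
    key_type = client.execute_command("TYPE", key)
    if isinstance(key_type, bytes):
        # Clients that do not decode responses return bytes.
        key_type = key_type.decode()
    command = LENGTH_COMMAND.get(key_type)
    if command is None:
        return key_type, None
    return key_type, client.execute_command(command, key)
```

The returned value is a byte length for STRING keys and a member count for the other types, so it is only a rough indicator of memory usage.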

Identify hotkeys at the business layer

  • Advantages: This method can identify hotkeys in a timely and accurate manner.

  • Disadvantages: To implement this method, you must write business code that has increased complexity. In addition, this method may degrade performance.

This method allows you to add code to the business layer to record requests that were sent to Tair instances and asynchronously analyze the collected statistics.
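One way to record requests at the business layer is an in-process counter that is drained periodically by a background task. The following is a minimal sketch. The `KeyAccessRecorder` class is a hypothetical helper, and where `record` is called depends on how your application wraps its Tair client.

```python
import threading
from collections import Counter


class KeyAccessRecorder:
    """Counts key accesses in process so they can be analyzed asynchronously."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def record(self, key):
        """Call this each time the application sends a request for key."""
        with self._lock:
            self._counts[key] += 1

    def drain_top(self, n):
        """Return the n most requested keys and reset the counters."""
        with self._lock:
            top = self._counts.most_common(n)
            self._counts.clear()
        return top
```

Calling `drain_top` at a fixed interval, for example once per second, yields per-interval request counts that can be compared against a hotkey threshold. The lock on every request is part of the performance cost noted above.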

Identify large keys in a customized manner by using the redis-rdb-tools project

  • Advantages: This method supports customized analysis without affecting online services.

  • Disadvantages: This method does not allow for rapid analysis, and it takes longer to analyze large RDB files.

redis-rdb-tools is an open source tool written in Python that can be used to analyze Tair RDB files in a customized manner. You can analyze the memory usage of all keys in a Tair instance, and query and analyze the statistics of each key in a fine-grained manner.

Identify hotkeys by using the MONITOR command

  • Advantages: This method is convenient and secure.

  • Disadvantages: This method consumes CPU, memory, and network resources, provides limited precision, and does not allow for rapid analysis.

The MONITOR command that is available in Tair can display the statistics of all requests related to an instance, including statistics about time, clients, commands, and keys.

In case of an emergency, you can run the MONITOR command and export the output to a file. After you stop the MONITOR command, you can analyze and classify the requests in the output to identify the hotkeys generated during the emergency period.

Note

The MONITOR command significantly degrades the performance of Tair instances. We recommend that you use the MONITOR command only in special cases.
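The exported MONITOR output can be classified with a short script. The following is a minimal sketch that assumes the standard Redis MONITOR line format (timestamp, database and client address in brackets, then the quoted command and arguments) and treats the first argument of each command as the key, which does not hold for every command. The `count_keys` function is a hypothetical helper.

```python
import re
import shlex
from collections import Counter

# Example line: 1339518083.107412 [0 127.0.0.1:60866] "get" "foo"
MONITOR_LINE = re.compile(r'^\d+\.\d+ \[\d+ [^\]]*\] (.+)$')


def count_keys(lines):
    """Count how often each key appears as the first argument of a command."""
    counts = Counter()
    for line in lines:
        match = MONITOR_LINE.match(line.strip())
        if not match:
            continue  # skip "OK" and anything that is not a command line
        try:
            parts = shlex.split(match.group(1))
        except ValueError:
            continue  # skip lines with quoting that shlex cannot parse
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts
```

For example, `count_keys(open("monitor.log")).most_common(10)` would list the ten most frequently requested keys in a file named `monitor.log`, where the file name is an assumption.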

Optimize large keys and hotkeys

Category

Optimization method

Large keys

  • Split large keys

    For example, you can split a HASH key that contains tens of thousands of members into multiple HASH keys that each have an appropriate number of members. For Tair cluster instances, you can split large keys to balance the memory usage across multiple data shards.

  • Delete large keys

    You can store data that is unsuitable for Tair in other storage engines and delete the data from Tair.

    Note

    You can run the UNLINK command to safely delete large keys or super large keys from a Tair instance. This command removes a key from the keyspace immediately and reclaims the memory of the key asynchronously in a background thread, which prevents the instance from being blocked.

  • Monitor the memory usage of Tair

    You can specify appropriate alert thresholds in the monitoring system for the memory usage of a Tair instance. For example, you can specify 70% as the alert threshold for the memory usage of a Tair instance and 20% as the alert threshold for the memory usage increase of the Tair instance over a 1-hour period. This allows you to prevent potential problems. For example, you can configure thresholds to generate alerts in advance so that you have time to prevent an increase in the number of keys caused by the failure of a consumer application that uses LIST keys. For more information, see Alert settings.

  • Delete expired data on a regular basis

    The accumulation of expired data leads to large keys. For example, if you incrementally write a large amount of data to a HASH key and ignore the TTL of the data, the HASH key may end up as a large key. You can use scheduled tasks to delete invalid data.

    Note

    To prevent Tair from being blocked when you delete invalid hash data, we recommend that you run the HSCAN command to iterate over the fields in batches and run the HDEL command to delete the invalid fields.

  • Use Tair

    If you have a large number of HASH keys and want to delete a large number of invalid members from some keys, scheduled tasks may not be able to delete the invalid members in a timely manner. In this case, you can use the TairHash data structure provided by Tair.

    TairHash is a HASH data type that allows a TTL and a version number to be specified for each field. Like Redis HASH, TairHash provides a variety of data interfaces and high processing performance. However, Redis HASH allows a TTL to be specified only for an entire key, whereas TairHash allows TTLs and version numbers to be specified for individual fields. This makes TairHash more flexible and simplifies business development in most scenarios. In addition, TairHash uses an active expiration algorithm to check the TTL of fields and delete expired fields. This process does not increase the database response time.

    Such advanced features can be used to improve O&M efficiency, reduce troubleshooting workloads, and simplify business code. For more information, see exHash.
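The key-splitting method described at the start of this list can be implemented by deriving a sub-key name from each field. The following is a minimal sketch that assumes fields are strings and that a fixed shard count is chosen in advance. The `shard_key` function and the `base:index` naming scheme are hypothetical.

```python
import zlib


def shard_key(base_key, field, shards):
    """Return the name of the sub-key that should hold field."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    # CRC32 is deterministic across processes, unlike Python's built-in hash().
    index = zlib.crc32(field.encode("utf-8")) % shards
    return f"{base_key}:{index}"
```

With a redis-py-style client, a write would then look like `client.hset(shard_key("profile", field, 16), field, value)`, and a read would compute the same sub-key before it calls `hget`. Because the shard count is part of the mapping, changing it later requires the existing fields to be redistributed.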

Hotkeys

  • Replicate hotkeys for Tair cluster instances

    Requests made for a hotkey in a data shard cannot be redistributed to other data shards in the instance because the smallest unit at which a hotkey can be migrated in a Tair cluster instance is a key. This results in a constant high workload for a single data shard. In this case, you can replicate the hotkey in the data shard to generate identical keys and migrate these new keys to other data shards. For example, you can replicate a hotkey named foo in a data shard to generate three identical hotkeys named foo2, foo3, and foo4. Then, you can migrate foo2, foo3, and foo4 to other data shards to reduce the pressure on the data shard that contains foo.

    Note

    The disadvantage of this method is that you must modify the corresponding code and data inconsistency may occur because you must update multiple keys instead of one key. For this reason, we recommend that you consider this method only as a temporary solution.

  • Use a read/write splitting architecture

    If the accumulation of read requests causes hotkeys, you can change your instance into a read/write splitting instance to reduce the read pressure on each data shard of the instance, or increase the number of replica nodes for the instance. However, the read/write splitting architecture increases the complexity of both the business code and the Tair cluster instance. You must provide server load balancing tools such as proxies and Linux Virtual Server (LVS) for multiple replica nodes and prepare to deal with the increased failure rate that results from a significant increase in the number of replica nodes. If you change your Tair instance into a cluster instance, you may encounter bigger challenges in monitoring, O&M, and troubleshooting.

    In response to these challenges, Tair provides out-of-the-box solutions. As your needs evolve, you can change your instance architecture by making a configuration change, such as changing a master-replica instance into a read/write splitting instance or a read/write splitting instance into a cluster instance. For more information, see Change the configurations of an instance.

    Note

    The read/write splitting architecture also has its disadvantages. If a large number of requests are sent to a read/write splitting instance, some amount of latency is unavoidable, and dirty data may be read from the instance. Therefore, the read/write splitting architecture is not the optimal solution for scenarios that have high requirements for read and write capabilities and data consistency.

  • Use the proxy query cache feature of Tair

    Tair uses effective sorting and statistical algorithms to identify hotkeys that receive more than 5,000 queries per second (QPS). After you enable the proxy query cache feature, proxy nodes cache request and response data of hotkeys based on the rules you set. Proxy nodes cache only request and response data of a hotkey, instead of the entire key. If a proxy node receives a duplicate request within the validity period of the cached data, the proxy node directly returns the response of the request to the client without the need to interact with backend data shards. This improves the read speed, reduces the impacts of hotkeys on the performance of data shards, and prevents skewed requests.

    After the proxy query cache feature is enabled for a Tair instance, duplicate requests from clients are directly sent to proxy nodes instead of backend Tair nodes. The proxy nodes then return responses to the clients. Requests made for hotkeys can be processed by multiple proxy nodes instead of a single Tair node. This significantly reduces the hotkey workloads on Tair nodes. The proxy query cache feature of Tair also provides a variety of commands for you to query and manage the proxy query cache. For example, you can run the querycache keys command to query all cached hotkeys and run the querycache listall command to query all cached commands. For more information, see Use proxy query cache to address issues caused by hotkeys.
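The hotkey replication method described earlier in this list, in which foo is copied to foo2, foo3, and foo4, requires the application to write to every copy and read from one of them. The following is a minimal sketch that assumes a client object with `set` and `get` methods in the style of redis-py. The helper names are hypothetical, and the sketch does not control which data shard each copy is stored on.

```python
import random


def replica_names(base_key, copies):
    """Return base_key plus numbered copies, e.g. foo, foo2, foo3, foo4."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return [base_key] + [f"{base_key}{i}" for i in range(2, copies + 1)]


def write_all(client, base_key, value, copies):
    """Write the same value to every copy; copies may briefly disagree."""
    for name in replica_names(base_key, copies):
        client.set(name, value)


def read_any(client, base_key, copies, rng=random):
    """Read from one randomly chosen copy to spread the load."""
    return client.get(rng.choice(replica_names(base_key, copies)))
```

The writes in `write_all` are not atomic across copies, which is the source of the data inconsistency risk noted for this method.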