Discover and resolve the hotkey issue - ApsaraDB for Redis

Overview

Causes

The hotkey issue can have the following two causes:

Users consume far more data than is produced, as with hot sale items, hot news, hot comments, and celebrity live streaming.
The hotkey issue tends to occur unexpectedly, for example, during price promotions on popular commodities during Double 11. When one of these commodities is browsed or purchased tens of thousands of times, the resulting burst of requests for the same key causes the hotkey issue. The issue also tends to occur in read-heavy scenarios where far more read requests than write requests are processed, such as hot news, hot comments, and celebrity live streaming.
In these cases, hotkeys are accessed much more frequently than other keys. As a result, most of the user traffic is concentrated on a specific Redis instance, and that instance may reach a performance bottleneck.
The load on a single partition exceeds the performance threshold of the server. When data is stored on the server side, it is partitioned, and every request for a given key is served by the one server that holds that key. When the load on that key exceeds the server's performance threshold, the hotkey issue occurs.
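
To illustrate how partitioning concentrates load, here is a minimal sketch. It assumes a hypothetical four-shard cluster that places keys with a CRC32 hash; `NUM_SHARDS`, `shard_for`, and the key names are illustrative and do not reflect the actual slot mapping used by ApsaraDB for Redis. The point is only that the same key always maps to the same shard, so one popular key can load one server far more than the others.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # assumption: illustrative cluster size


def shard_for(key: str) -> int:
    """Map a key to a shard; the same key always maps to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def shard_load(requests):
    """Count how many requests each shard receives."""
    load = Counter()
    for key in requests:
        load[shard_for(key)] += 1
    return load


# Hypothetical traffic: one hot item plus 1,000 ordinary keys.
requests = ["item:hot"] * 10_000 + [f"item:{i}" for i in range(1_000)]
load = shard_load(requests)
hot_shard = shard_for("item:hot")
print(f"shard {hot_shard} handles {load[hot_shard]} of {sum(load.values())} requests")
```

Adding more shards does not help here, because all requests for `item:hot` still arrive at a single shard.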

Impacts of the hotkey issue

The traffic is aggregated and reaches the upper limit of the physical network adapter.
Excessive requests queue up, and the partitioning service stops responding.
The database is overloaded and the service is interrupted.

When the traffic generated by hotkey requests on a server exceeds the upper limit of the server's network adapter, the concentrated traffic prevents the server from providing other services. If hotkeys are densely distributed, a large number of hotkeys are cached. When the cache capacity is exhausted, the partitioning service stops responding. After the caching service stops responding, newly generated requests are forwarded to the backend database. Because the database has much lower throughput than the cache, it is easily exhausted by a large number of requests. The exhaustion of the database leads to service interruption and a dramatic degradation of performance.
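
As a rough worked example of the network adapter limit, the sketch below estimates the bandwidth that a single hotkey consumes. All numbers (a 10 KB value, 200,000 requests per second, a 10 Gbit/s adapter) are assumptions chosen for illustration, not measurements of any real instance, and protocol overhead is ignored.

```python
def hotkey_bandwidth_gbps(value_bytes: int, requests_per_second: int) -> float:
    """Outbound bandwidth in Gbit/s needed to serve one key, ignoring protocol overhead."""
    return value_bytes * 8 * requests_per_second / 1e9


NIC_LIMIT_GBPS = 10.0  # assumption: a 10 Gbit/s network adapter

# Assumed workload: a 10 KB value read 200,000 times per second.
needed = hotkey_bandwidth_gbps(value_bytes=10_000, requests_per_second=200_000)
saturated = needed > NIC_LIMIT_GBPS
print(f"needed: {needed} Gbit/s, adapter saturated: {saturated}")
```

Under these assumed numbers, one key alone needs more bandwidth than the adapter provides, which leaves nothing for the other keys on the same server.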

Common solutions

Common solutions modify the server or the client to improve performance.

Use a server cache

The client sends requests to the server. The server provides a multi-thread service and maintains a cache space that is managed by an LRU policy. When the server is congested, it responds to requests directly from the cache instead of forwarding them to the database. Only after the congestion is cleared does the server send client requests to the database and rewrite the data to the cache. In this solution, the cache is both accessed and rebuilt by the server.

However, this solution has the following issues:

Multiple threads may rebuild the cache concurrently when the cache fails
The cache must be rebuilt whenever an entry is missing
Dirty reading
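
The server cache and its rebuild problem can be sketched as follows. This is a minimal illustration, not the design of any particular service: `ServerLRUCache` and `fake_db` are hypothetical names, and a single global lock is used so that only one thread rebuilds a missing entry. A real service would likely use finer-grained locking, because this version serializes all readers while the database is queried.

```python
import threading
from collections import OrderedDict


class ServerLRUCache:
    """LRU cache in front of a database; one lock prevents duplicate rebuilds."""

    def __init__(self, capacity, load_from_db):
        self._capacity = capacity
        self._load_from_db = load_from_db  # hypothetical database read function
        self._data = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)  # mark as most recently used
                return self._data[key]
            # Cache miss: rebuild while holding the lock, so concurrent
            # threads wait for this result instead of querying the database.
            value = self._load_from_db(key)
            self._data[key] = value
            if len(self._data) > self._capacity:
                self._data.popitem(last=False)  # evict the least recently used entry
            return value


db_calls = []


def fake_db(key):
    """Stand-in for a database query; records every call."""
    db_calls.append(key)
    return f"value-of-{key}"


cache = ServerLRUCache(capacity=2, load_from_db=fake_db)
threads = [threading.Thread(target=cache.get, args=("k1",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"database queries for k1: {db_calls.count('k1')}")
```

The sketch does not address dirty reading: a value that changes in the database stays stale in the cache until the entry is evicted.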

Use Memcache and Redis

In this solution, a separate cache is deployed on the client to resolve the hotkey issue. The client first accesses the service layer and then the cache layer of the same server. This solution has the following advantages: nearby access, high speed, and no bandwidth limit. However, it has the following disadvantages:

Wasted memory resources
Dirty reading

Use a local cache

Using a local cache has the following issues:

Hotkeys must be detected in advance.
The cache capacity is limited.
Data can remain inconsistent for a long time.
Some hotkeys may be missed.
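
The capacity and inconsistency issues can be seen in a minimal local cache sketch. `LocalTTLCache` is a hypothetical class written for this illustration, and the time-to-live expiry it uses is an assumption about how such a cache might bound staleness. The clock is injectable so that the example runs without sleeping.

```python
import time


class LocalTTLCache:
    """In-process cache with a fixed capacity and per-entry expiry."""

    def __init__(self, ttl_seconds, max_entries, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max_entries = max_entries
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # served locally; may be stale
        value = loader(key)
        # Limited capacity: when the cache is full, new keys are not stored.
        if key in self._entries or len(self._entries) < self._max_entries:
            self._entries[key] = (value, now + self._ttl)
        return value


backend = {"k1": "v1"}  # dict standing in for the remote store
now = [0.0]             # fake clock, advanced manually
cache = LocalTTLCache(ttl_seconds=5, max_entries=1, clock=lambda: now[0])

first = cache.get("k1", backend.get)
backend["k1"] = "v2"                  # the value changes remotely
stale = cache.get("k1", backend.get)  # still within the TTL
now[0] = 6.0
fresh = cache.get("k1", backend.get)  # the entry has expired and is reloaded
print(first, stale, fresh)
```

A longer TTL reduces backend traffic but lengthens the window in which the client reads outdated data.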

If traditional hotkey solutions are all defective, how can the hotkey issue be resolved?

ApsaraDB for Redis solutions to the hotkey issue

Read/write splitting solution

The nodes in the architecture serve the following purposes:

Load balancing is implemented at the Server Load Balancer (SLB) layer.
Read/write splitting and automatic routing are implemented at the proxy layer.
Write requests are processed by the master node.
Read requests are processed by the read replica nodes.
High availability (HA) is implemented on the replica node and the master node.

In practice, the client sends requests to SLB, and SLB distributes these requests to multiple proxies. The proxies identify, classify, and then distribute the requests. For example, a proxy node sends all write requests to the master node and all read requests to the read replica nodes. The read replica nodes can be scaled out to solve the hotkey reading issue. Read/write splitting supports flexible scaling for hotkey reading and can store a large number of hotkeys. It is also client-friendly, because the routing is handled entirely at the proxy layer.
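
The classification step performed by a proxy can be sketched as follows. This is a minimal illustration under stated assumptions: `ReadWriteSplittingProxy` is a hypothetical class, `READ_COMMANDS` is a small illustrative subset of read-only Redis commands, nodes are represented by plain strings, and round-robin is an assumed replica selection policy.

```python
import itertools

# Assumption: a tiny illustrative subset of read-only Redis commands.
READ_COMMANDS = {"GET", "MGET", "EXISTS", "TTL", "HGET"}


class ReadWriteSplittingProxy:
    """Routes writes to the master node and reads to replicas in round-robin order."""

    def __init__(self, master, replicas):
        self._master = master
        self._replicas = list(replicas)
        self._next_replica = itertools.cycle(self._replicas)

    def add_replica(self, replica):
        """Scale out read capacity by adding another read replica."""
        self._replicas.append(replica)
        self._next_replica = itertools.cycle(self._replicas)

    def route(self, command: str) -> str:
        """Return the node that should handle the command."""
        name = command.split()[0].upper()
        if name in READ_COMMANDS and self._replicas:
            return next(self._next_replica)
        return self._master


proxy = ReadWriteSplittingProxy("master", ["replica-1", "replica-2"])
write_target = proxy.route("SET item:hot 42")
read_targets = [proxy.route("GET item:hot") for _ in range(4)]
print(write_target, read_targets)
```

Because reads of the same hotkey are spread over all replicas, adding a replica with `add_replica` raises the read capacity for that key without any change on the client.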

Hot data solution

In this solution, hotkeys are actively discovered and stored to resolve the hotkey issue. The client accesses an SLB instance and requests are distributed to a proxy node through the SLB instance. Then, the proxy node forwards the requests to the backend Redis instances.

In this solution, the cache is added on the server side: each proxy node has a local cache that uses the LRU algorithm to cache hot data. A hotkey computing module is added to each backend data node to identify the hot data and return it to the proxy nodes.

The proxy architecture has the following benefits:

The proxy nodes cache the hot data, and their read capacity can be scaled out.
The database nodes compute the hot data set periodically.
The database returns the hot data to the proxy nodes.
The proxy architecture is transparent to the client, so no client-side changes are required.

Process hotkeys

Read hot data

The processing of hotkeys is divided into two jobs: writing and reading. During the data writing process, SLB receives data K1 and writes it to a Redis database through a proxy node. If the backend hotkey computing module determines that K1 has become a hotkey, the proxy node caches it. The client can then read K1 directly from the proxy node without the request reaching the backend Redis instance. Because the proxy nodes can be scaled out, the read capacity for hot data can be increased as needed.
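
The read path through a proxy node can be sketched as follows. This is a minimal illustration, not the actual proxy implementation: `HotDataProxy` and `mark_hot` are hypothetical names, a dict stands in for the backend Redis instance, and the `backend_reads` counter exists only to make the effect visible.

```python
from collections import OrderedDict


class HotDataProxy:
    """Proxy node that serves keys reported as hot from a local LRU cache."""

    def __init__(self, backend, capacity):
        self._backend = backend  # dict standing in for the backend Redis instance
        self._capacity = capacity
        self._hot = OrderedDict()
        self.backend_reads = 0

    def mark_hot(self, key):
        """Called when the backend reports that a key has become hot."""
        self._hot[key] = self._backend.get(key)
        self._hot.move_to_end(key)
        while len(self._hot) > self._capacity:
            self._hot.popitem(last=False)  # evict the least recently used hotkey

    def set(self, key, value):
        self._backend[key] = value
        if key in self._hot:
            self._hot[key] = value  # keep the local copy in step with this proxy's writes

    def get(self, key):
        if key in self._hot:
            self._hot.move_to_end(key)
            return self._hot[key]  # served by the proxy; the backend is not touched
        self.backend_reads += 1
        return self._backend.get(key)


redis_backend = {}
proxy = HotDataProxy(redis_backend, capacity=100)
proxy.set("K1", "v1")
proxy.get("K1")       # not hot yet, so the backend is read
proxy.mark_hot("K1")  # the backend has reported K1 as a hotkey
values = [proxy.get("K1") for _ in range(1_000)]
print(proxy.backend_reads)
```

A write that arrives through a different proxy node would not update this node's local copy, which is one reason such a cache cannot guarantee complete consistency.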

Discover hot data

The database first counts the requests for each key within a specified cycle. When the number of requests for a key reaches a threshold, the database identifies the key as a hotkey and stores it in an LRU list. When a client sends a request through a proxy node and Redis finds that the target key is a hotkey, Redis enters the feedback phase and marks the data in its response so that the proxy node can cache it.

The database uses the following methods to compute the hot data:

Hot data statistics based on statistical thresholds
Hot data statistics based on statistical cycles
Statistics collection method based on the version number without resetting the initial value

Computing hotkeys on the database has a minor impact on the performance and occupies only a small amount of memory.
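
Threshold-based and cycle-based counting can be sketched as follows. This is a minimal illustration: `HotkeyDetector` is a hypothetical class, and it clears its counters at the start of each cycle, which is a simpler technique than the version-number method listed above. Timestamps are passed in explicitly so that the example is deterministic.

```python
from collections import Counter, OrderedDict


class HotkeyDetector:
    """Counts requests per key in fixed cycles and keeps hotkeys in an LRU list."""

    def __init__(self, threshold, cycle_seconds, max_hotkeys):
        self._threshold = threshold
        self._cycle = cycle_seconds
        self._max_hotkeys = max_hotkeys
        self._counts = Counter()
        self._cycle_start = 0.0
        self._hot = OrderedDict()

    def record(self, key, now):
        """Count one request at time `now`; return True if the key is in the hot list."""
        if now - self._cycle_start >= self._cycle:
            self._counts.clear()  # a new cycle begins; counters start from zero
            self._cycle_start = now
        self._counts[key] += 1
        if self._counts[key] >= self._threshold:
            self._hot[key] = True
            self._hot.move_to_end(key)
            while len(self._hot) > self._max_hotkeys:
                self._hot.popitem(last=False)  # drop the least recently hot key
        return key in self._hot

    def hot_keys(self):
        return list(self._hot)


detector = HotkeyDetector(threshold=3, cycle_seconds=1.0, max_hotkeys=2)
k1_results = [detector.record("k1", 0.1 * i) for i in range(3)]
k2_first = detector.record("k2", 0.3)
k2_next_cycle = detector.record("k2", 1.5)
print(k1_results, detector.hot_keys())
```

Memory use stays small because only per-key counters for the current cycle and a bounded hot list are kept.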

Comparison of two solutions

The preceding analysis shows that, compared with the traditional solutions, the two Alibaba Cloud solutions make significant improvements in resolving the hotkey issue. Both the read/write splitting solution and the hot data solution can be scaled out, and both are transparent to the client, though neither can ensure complete data consistency. The read/write splitting solution supports storing a larger amount of hot data, while the proxy-based hot data solution is more cost-effective.