If the performance metric values of specific data shards of an ApsaraDB for Redis instance are much higher than those of other data shards, the ApsaraDB for Redis instance may have data skew issues. These performance metrics include the memory usage, CPU utilization, bandwidth usage, and latency. If the instance data is severely skewed, exceptions, such as key evictions, out of memory (OOM) errors, and prolonged responses, may occur even when the memory usage of the instance is low.
The cluster architecture of ApsaraDB for Redis is built in a distributed fashion. The storage space of a cluster instance is split into 16,384 slots, and each data shard stores and handles the data in a specific range of slots. For example, assume that a cluster instance has three data shards. The slots of this instance are split into the following ranges: [0, 5460], [5461, 10922], and [10923, 16383]. When you write a key to a cluster instance or update a key of a cluster instance, the client determines the slot to which the key belongs by using the following formula: Slot = CRC16(key) % 16384. Then, the client writes the key to that slot. In theory, this mechanism evenly distributes keys among data shards and keeps metrics such as memory usage and CPU utilization at almost the same level across data shards.
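The slot calculation can be sketched in Python. Redis Cluster uses the CRC16-CCITT (XMODEM) variant of CRC16, with polynomial 0x1021 and an initial value of 0. The function names below are illustrative, not part of any client library.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0.

    This is the CRC16 variant that Redis Cluster uses for key hashing.
    """
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            # Shift left; XOR with the polynomial when the top bit is set.
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def key_to_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster slots: CRC16(key) % 16384."""
    return crc16_xmodem(key.encode()) % 16384


# The XMODEM check value: CRC16 of "123456789" is 0x31C3.
print(hex(crc16_xmodem(b"123456789")))  # 0x31c3
print(key_to_slot("foo"))
```

Because the slot depends only on the key, the same key always maps to the same data shard, and a large key space is spread roughly evenly across the 16,384 slots.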
However, in actual practice, data skew may occur due to a lack of advanced planning, unusual data writes, or data access spikes.
You can view the metrics of data shards on the Data Node tab of the Performance Monitor page in the console. If a metric of a data shard is consistently 20% or more higher than that of the other data shards, data skew may be present. The larger the gap between shards, the more severe the data skew.
- The queries per second (QPS) of Replica 1 is much higher than that of the other data shards. This is data access skew, which can lead to high CPU utilization and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.
- The data volume of Replica 2 is much larger than that of the other data shards. This is data volume skew, which can lead to high memory usage and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.
This topic describes how to determine whether data skew occurs, what may cause data skew, and how to handle this issue. You can also refer to this topic to troubleshoot high memory usage, CPU utilization, bandwidth usage, and latency for Tair standard instances.
Check for data skew issues for an instance
- Use the diagnostic report feature to check whether data skew is present on the current instance.
- On the Instance Information page, choose View monitoring data to view the metrics of data shards.
If data skew is present, you can use provisional solutions as a contingency measure. The following table describes these provisional solutions. These solutions provide temporary relief for data skew, but do not resolve the root cause.
To temporarily alleviate the impact of data skew, you can also reduce requests to large keys and hotkeys. However, issues caused by large keys and hotkeys can be resolved only on the business side. We recommend that you identify the cause of data skew in your instance at the earliest opportunity and fix it on the business side to optimize instance performance. For more information, see Causes and solutions.
| Issue | Possible cause | Provisional solution |
| --- | --- | --- |
| Memory usage skew | Large keys and hash tags | Upgrade your instance specifications. For more information, see Change the configurations of an instance. |
| Bandwidth usage skew | Large keys, hotkeys, and resource-intensive commands | Increase the bandwidth of one or more specific data shards. For more information, see Adjust the bandwidth of an ApsaraDB for Redis instance. **Note**: You can increase the bandwidth of a data shard to at most three times its default maximum bandwidth. If this measure still does not resolve the data skew, we recommend that you make modifications on the business side. |
| CPU utilization skew | Large keys, hotkeys, and resource-intensive commands | No provisional solutions are available. Check your instance, identify the cause, and then make modifications on the business side. |
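To find out which keys drive a CPU utilization or bandwidth skew, you can sample key access frequency on the client side. The following minimal sketch uses a hypothetical `HotkeySampler` helper (not a feature of ApsaraDB for Redis or any client library) that counts accesses per key so that unusually hot keys stand out:

```python
from collections import Counter


class HotkeySampler:
    """Client-side hotkey sampling sketch (hypothetical helper).

    Call record() on each Redis read or write in your application,
    then periodically inspect top() to spot keys with outsized QPS.
    """

    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def record(self, key: str) -> None:
        self.counts[key] += 1

    def top(self, n: int = 5) -> list[tuple[str, int]]:
        return self.counts.most_common(n)


# Simulated traffic: one key receives the vast majority of requests.
sampler = HotkeySampler()
for key in ["item:1"] * 90 + ["item:2"] * 7 + ["item:3"] * 3:
    sampler.record(key)
print(sampler.top(2))  # [('item:1', 90), ('item:2', 7)]
```

In production you would sample a fraction of requests rather than all of them, and reset the counter periodically so that the ranking reflects current traffic.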
Causes and solutions
To resolve the root cause of data skew, we recommend that you evaluate your business growth and prepare for it in advance. For example, split large keys and write data in a manner that matches the expected access patterns.
| Cause | Description |
| --- | --- |
| Large keys | A large key is determined based on the size of the key and the number of members in the key. Typically, large keys occur in key-value data structures such as hash, list, set, and zset when these structures store a large number of members or oversized members. Large keys are one of the main culprits of data skew. For more information, see Identify and handle large keys and hotkeys. |
| Hotkeys | Hotkeys are keys whose QPS is much higher than that of other keys. Hotkeys commonly appear during stress testing on a single key, or during flash sales on the keys of popular merchandise. For more information, see Identify and handle large keys and hotkeys. |
| Resource-intensive commands | Each command has a metric called time complexity that measures resource and time consumption. In most cases, the higher the time complexity of a command, the more resources the command consumes. |
| Hash tags | Tair distributes a key to a specific data shard based on the slot that is calculated from the key. If a key contains a hash tag, which is a substring enclosed in curly braces such as {user}, only the hash tag is used in the slot calculation. As a result, all keys that share the same hash tag are assigned to the same slot. Extensive use of the same hash tag can concentrate data on a single data shard and cause data skew. |
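The hash tag rule can be sketched as follows. Per the Redis Cluster specification, if a key contains a `{` followed by a `}` with at least one character between them, only that substring is hashed; otherwise, the whole key is hashed. The function name below is illustrative:

```python
def effective_hash_key(key: str) -> str:
    """Return the part of the key that is used for slot calculation.

    Redis Cluster hash tag rule: hash only the substring between the
    first '{' and the first '}' after it, if that substring is nonempty;
    otherwise, hash the whole key.
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:
            return key[start + 1:end]
    return key


# Keys that share the hash tag {user:42} are hashed to the same slot:
print(effective_hash_key("{user:42}:profile"))  # user:42
print(effective_hash_key("{user:42}:orders"))   # user:42
print(effective_hash_key("plain-key"))          # plain-key
print(effective_hash_key("{}:empty-tag"))       # {}:empty-tag (empty tag is ignored)
```

This is why hash tags are useful for multi-key operations within one slot, but overusing a single tag funnels all of those keys onto one data shard.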