When the metrics of one data shard in a Tair cluster instance vastly exceed those of the other data shards, the cluster may be experiencing data skew. Relevant metrics include memory usage, CPU utilization, bandwidth usage, and latency. When data skew occurs, exceptions such as data eviction, out-of-memory (OOM) errors, and increased latency may occur even when the overall memory usage of the instance is relatively low.
Background information
The cluster architecture of Tair is distributed. The storage space of a cluster instance is split into 16,384 slots, and each data shard stores and handles the data in a specific range of slots. For example, assume that a cluster instance has three data shards. The slots of this instance are split into the following ranges: [0, 5460], [5461, 10922], and [10923, 16383]. When a client writes or updates a key, it determines the slot to which the key belongs by using the following formula: Slot = CRC16(key) % 16384. Then, the client sends the request to the data shard that serves that slot. In theory, this mechanism evenly distributes keys among data shards and keeps metrics such as memory usage and CPU utilization at almost the same level across data shards.
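The following Python sketch illustrates this routing rule. It assumes the CRC-16/XMODEM variant that open source Redis Cluster uses for the CRC16(key) step and also mimics the hash tag behavior described later in this topic; it is only an illustration, not a replacement for the routing logic of your client library.

```python
def crc16_xmodem(data: bytes) -> int:
    # CRC-16/XMODEM (polynomial 0x1021, initial value 0x0000), assumed to match
    # the CRC16 used for cluster key slotting.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # If the key contains a non-empty hash tag {...}, only the content inside
    # the braces is hashed. Otherwise, the whole key is hashed.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# {item}id1, {item}id2, and {item}id3 share the {item} hash tag and therefore
# map to the same slot, which is how hash tags can cause data skew.
for k in ("{item}id1", "{item}id2", "{item}id3", "plain_key"):
    print(k, "->", key_slot(k))
```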
However, in practice, data skew may occur because of a lack of advance planning, unusual data writes, or data access spikes.
You can view the metrics of data shards on the Data Node tab of the Performance Monitor page in the console. If a metric of a data shard is consistently more than 20% higher than that of the other shards, data skew may be present. The larger the gap, the more severe the skew.
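As a rough aid, the following Python sketch applies this 20% rule of thumb to per-shard metric values. The shard names and numbers are placeholders; in practice, you would read the values from the Performance Monitor page or your own monitoring pipeline.

```python
def skewed_shards(metric_by_shard: dict, threshold: float = 0.20) -> list:
    """Return the shards whose metric exceeds the average of the other shards
    by more than the threshold (20% by default)."""
    flagged = []
    for shard, value in metric_by_shard.items():
        others = [v for s, v in metric_by_shard.items() if s != shard]
        baseline = sum(others) / len(others)
        if baseline > 0 and (value - baseline) / baseline > threshold:
            flagged.append(shard)
    return flagged

# Placeholder memory usage percentages for a three-shard instance.
memory_usage_pct = {"shard-0": 41.0, "shard-1": 44.0, "shard-2": 78.0}
print(skewed_shards(memory_usage_pct))  # ['shard-2']
```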

For example:

- The queries per second (QPS) of Key 1 on Replica 1 is much higher than that of other keys. This is data access skew, which can lead to high CPU utilization and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.
- The size of Key 5 on Replica 2 is much larger than that of other keys. This is data volume skew, which can lead to high memory usage and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.
This topic describes how to determine whether data skew occurs, what may cause data skew, and how to handle this issue. You can also refer to this topic to troubleshoot high memory usage, CPU utilization, bandwidth usage, and latency for Tair standard instances.
Check for data skew
- Use the diagnostic report feature to check whether data skew is present on the current instance.
- On the Instance Information page, choose View monitoring data to view the metrics of data shards.
Provisional solutions to data skew
If data skew is present, you can apply the provisional solutions described in the following table. These solutions provide temporary relief but do not resolve the root cause.
To temporarily alleviate the impact of data skew, you can also reduce requests for large keys and hotkeys. However, issues caused by large keys and hotkeys can be fully resolved only on the business side. We recommend that you identify the cause of data skew in a timely manner and optimize your workload accordingly. For more information, see Causes and solutions.
Issue | Possible cause | Provisional solution |
---|---|---|
Memory usage skew | Large keys and hash tags | Upgrade your instance specifications. For more information, see Change the configurations of an instance. |
Bandwidth usage skew | Large keys, hotkeys, and resource-intensive commands | Increase the bandwidth of one or more specific data shards. For more information, see Manually adjust the bandwidth of a Tair instance. Note: You can increase the bandwidth to up to three times the maximum bandwidth of a data shard. If this does not resolve the data skew, we recommend that you make modifications on the business side. |
CPU utilization skew | Large keys, hotkeys, and resource-intensive commands | No provisional solutions are available. Check your instance, identify the cause, and then make modifications on the business side. |
Causes and solutions
To eliminate the root cause of data skew, we recommend that you evaluate your business growth and prepare for it in advance. For example, you can split large keys and write data in a manner that conforms to the expected usage.
Cause | Description | Solution |
---|---|---|
Large keys | A large key is identified based on its size and the number of members that it contains. Large keys typically occur in data structures such as hash, list, set, and zset when these structures store too many members or members that are too large. Large keys are one of the main culprits of data skew. For more information, see Identify and handle large keys and hotkeys. A minimal client-side scan for large key candidates is also sketched after this table. | |
Hotkeys | Hotkeys are keys whose QPS is much higher than that of other keys. Hotkeys commonly appear during stress testing on a single key or during flash sales on the keys of popular merchandise. For more information, see Identify and handle large keys and hotkeys. | |
Resource-intensive commands | Each command has a time complexity that indicates how many resources it consumes and how long it takes. In most cases, the higher the time complexity of a command, the more resources the command consumes. For example, the time complexity of the HGETALL command is O(n), which means that the command consumes resources in proportion to the number of fields in the hash. Similarly, a SET or GET command with a large payload also consumes a large amount of resources on the data shard. | |
Hash tags | If a key contains a hash tag (a substring enclosed in {}), Tair calculates the slot only from the content inside the braces. For example, the {item}id1, {item}id2, and {item}id3 keys are stored on the same data shard because they share the same {item} hash tag. As a result, the memory usage and resource consumption of that data shard surge. | |
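Where the built-in tooling described in Identify and handle large keys and hotkeys is not convenient, a minimal client-side scan can also surface large key candidates. The following Python sketch uses redis-py with placeholder connection settings and a placeholder size threshold; SCAN is incremental and non-blocking, but you should still run it during off-peak hours, and behavior may differ slightly when the instance is accessed through a proxy.

```python
import redis

# Placeholder endpoint and credentials; replace with your own instance settings.
r = redis.Redis(host="r-example.redis.rds.aliyuncs.com", port=6379,
                password="***", decode_responses=True)

LARGE_KEY_BYTES = 10 * 1024 * 1024  # example threshold: flag keys over ~10 MB

for key in r.scan_iter(count=500):       # incremental, non-blocking iteration
    size = r.memory_usage(key) or 0      # approximate serialized size in bytes
    if size >= LARGE_KEY_BYTES:
        print(f"large key candidate: {key} (~{size / 1024 / 1024:.1f} MB)")
```

For hotkeys and resource-intensive commands, prefer reducing payload sizes and replacing full-collection reads such as HGETALL with incremental commands such as HSCAN where your data model allows it.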