When the metrics of one data shard in a Tair cluster instance vastly exceed those of the other data shards, the cluster may be experiencing data skew. Relevant metrics include memory usage, CPU utilization, bandwidth usage, and latency. When data skew occurs, exceptions such as data eviction, out-of-memory (OOM) errors, and increased latency may appear even when the overall memory usage of the instance is relatively low.
The cluster architecture of Tair is built in a distributed fashion. The storage space of a cluster instance is split into 16,384 slots. Each data shard stores and handles data in specific slots. For example, assume that a cluster instance has three data shards. Slots of this instance are split into the following parts: [0, 5460], [5461, 10922], and [10923, 16383]. When you write a key to a cluster instance or update a key of a cluster instance, the client determines the slot to which the key belongs by using the following formula:
Slot = CRC16(key) % 16384. The client then sends the key to the data shard that serves that slot. In theory, this mechanism evenly distributes keys across data shards and keeps metrics such as memory usage and CPU utilization at roughly the same level on each shard.
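The slot calculation above can be sketched in Python. Redis-compatible clusters use the CRC-16/XMODEM variant (polynomial 0x1021, initial value 0); this is an illustrative sketch, not Tair's internal source:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM, as used by Redis-compatible clusters for slot hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16,384 cluster slots."""
    return crc16(key.encode()) % 16384

print(key_slot("foo"))  # 12182, which falls in the [10923, 16383] range of the third shard in the example above
```

In the three-shard example, the key `foo` therefore lands on the shard that serves slots [10923, 16383].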
However, in actual practice, data skew may occur due to a lack of advanced planning, unusual data writes, or data access spikes.
Typically, data skew occurs when specific data shards consume far more resources than the other data shards.
You can view the metrics of data shards on the Data Node tab of the Performance Monitor page in the console. If a metric of a data shard is consistently 20% or more higher than that of the other data shards, data skew may be present. The larger and more persistent the gap, the more severe the skew.
Data skew may occur even if keys are evenly distributed among data shards, as shown in the preceding figure.
The queries per second (QPS) of Replica 1 is much higher than that of the other replicas. This is data access skew, which can lead to high CPU utilization and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.
The size of Replica 2 is much larger than that of the other replicas. This is data volume skew, which can lead to high memory usage and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.
This topic describes how to determine whether data skew occurs, what may cause data skew, and how to handle this issue. You can also refer to this topic to troubleshoot high memory usage, CPU utilization, bandwidth usage, and latency for Tair standard instances.
Check for data skew
Provisional solutions to data skew
If data skew is present, you can use the following provisional solutions as a contingency measure. These solutions provide temporary relief for data skew but do not resolve the root cause.
You can also temporarily alleviate data skew by reducing requests for large keys and hotkeys. However, issues related to large keys and hotkeys can be fully resolved only on the business side. We recommend that you identify the cause of data skew in a timely manner and make the corresponding changes on the business side to optimize instance performance. For more information, see Causes and solutions.
Memory usage skew
Common causes: large keys and hash tags.
Provisional solution: upgrade your instance specifications. For more information, see Change the configurations of an instance.
Bandwidth usage skew
Common causes: large keys, hotkeys, and resource-intensive commands.
Provisional solution: increase the bandwidth of the affected data shards, up to three times the default maximum bandwidth of a data shard. For more information, see Manually adjust the bandwidth of a Tair instance. If this measure still does not resolve the data skew, we recommend that you make modifications on the business side.
CPU utilization skew
Common causes: large keys, hotkeys, and resource-intensive commands.
Provisional solution: none. Check your instance, identify the cause, and then make modifications on the business side.
Causes and solutions
To eliminate the root cause of data skew, we recommend that you evaluate your business growth and plan capacity ahead of time. For example, split large keys in advance and design key naming and access patterns to match the expected usage.
Whether a key is a large key is determined by the size of its value and the number of members it contains.
Typically, large keys occur in data structures such as hash, list, set, and zset when these structures store too many members or members that are too large. Large keys are one of the main culprits behind data skew. For more information, see Identify and handle large keys and hotkeys.
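One common business-side fix for a large hash is to split it into smaller sub-hashes so that its members, and the load they generate, spread across multiple slots and shards. A minimal sketch, in which the bucket count and the `base_key:bucket` naming scheme are illustrative assumptions, not a Tair feature:

```python
import zlib

NUM_BUCKETS = 16  # illustrative; choose based on the size of the original large key

def bucket_key(base_key: str, field: str) -> str:
    """Deterministically route a field of a large hash to one of several sub-hashes.

    Because the routing depends only on the field name, a field is always
    read from and written to the same sub-hash.
    """
    bucket = zlib.crc32(field.encode()) % NUM_BUCKETS
    return f"{base_key}:{bucket}"
```

Instead of `HSET user:events <field> <value>`, the application would call `HSET bucket_key("user:events", field) <field> <value>`, so the sixteen sub-hashes land in different slots and can be served by different shards.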
Hotkeys are keys whose QPS is much higher than that of other keys. Hotkeys commonly appear during stress tests against a single key, or during flash sales, when the keys of popular merchandise receive a surge of requests. For more information, see Identify and handle large keys and hotkeys.
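A common business-side mitigation for a read-heavy hotkey is to write several copies of the value under suffixed key names, so the copies hash to different slots, and have each client read a randomly chosen copy. A minimal sketch in which the copy count and the `:copy:i` naming scheme are assumptions for illustration, not a Tair API:

```python
import random

NUM_COPIES = 8  # illustrative; more copies spread reads across more shards

def copy_keys(hot_key: str) -> list[str]:
    """All key names under which the hot value is duplicated at write time."""
    return [f"{hot_key}:copy:{i}" for i in range(NUM_COPIES)]

def read_key(hot_key: str) -> str:
    """Pick one copy at random so that reads spread across the copies."""
    return random.choice(copy_keys(hot_key))
```

At write time the application updates every key in `copy_keys("product:123")`; at read time it fetches `read_key("product:123")`. The trade-off is write amplification and brief inconsistency between copies during updates.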
Each command has a metric called time complexity that measures its resource and time consumption. In most cases, the higher the time complexity of a command, the more resources the command consumes. For example, the time complexity of the KEYS command is O(N), where N is the number of keys in the database. Running such resource-intensive commands against a shard that stores a large number of keys can drive up the CPU utilization of that shard.
Tair distributes a key to a specific data shard based on the slot calculation Slot = CRC16(key) % 16384. If a key contains a hash tag, which is a non-empty substring enclosed in curly braces such as {user}, only the content inside the braces is used to calculate the slot. As a result, all keys that share the same hash tag are stored in the same slot and on the same data shard, which can cause memory usage skew on that shard.
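Hash tags change which part of the key is hashed: only the substring inside the first pair of curly braces is used, so keys that share a tag land in the same slot. The rule can be sketched as follows, mirroring the documented Redis Cluster behavior rather than Tair internals:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM used for cluster slot hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def hash_tag(key: str) -> str:
    """Return the effective string that is hashed for slot selection.

    Only the first '{' and the first '}' after it matter, and the tag
    must be non-empty; otherwise the whole key is hashed.
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # non-empty tag only
            return key[start + 1:end]
    return key

def key_slot(key: str) -> int:
    return crc16(hash_tag(key).encode()) % 16384

# Keys that share the tag {user:42} always map to the same slot (and shard):
print(key_slot("{user:42}.following") == key_slot("{user:42}.followers"))  # True
```

This is why heavy use of a single hash tag concentrates data and traffic on one shard: every tagged key bypasses the otherwise even distribution.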