When the metrics of a data shard in a Tair cluster instance vastly exceed those of other data shards, data skew may occur in the cluster. Metrics that may affect this outcome include memory usage, CPU utilization, bandwidth usage, and latency. In this case, exceptions such as data eviction, out-of-memory errors, and higher latency may occur even when the overall memory usage of the instance is relatively low.

Background information

The cluster architecture of Tair is built in a distributed fashion. The storage space of a cluster instance is split into 16,384 slots. Each data shard stores and handles data in specific slots. For example, assume that a cluster instance has three data shards. The slots of this instance are split into the following parts: [0, 5460], [5461, 10922], and [10923, 16383]. When you write or update a key in a cluster instance, the client determines the slot to which the key belongs by using the following formula: Slot = CRC16(key) % 16384. Then, the client sends the request to the data shard that serves the slot. In theory, this mechanism evenly distributes keys among data shards and keeps metrics such as memory usage and CPU utilization at almost the same level across data shards.
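The formula above can be sketched in Python. This sketch assumes the CRC-16/XMODEM variant (polynomial 0x1021, initial value 0) that the Redis cluster protocol uses; the shard_of helper hard-codes the three-shard slot ranges from the example:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def slot_of(key: str) -> int:
    """Slot = CRC16(key) % 16384."""
    return crc16(key.encode()) % 16384

def shard_of(slot: int) -> int:
    """Map a slot to a shard index for the three-shard example:
    [0, 5460] -> 0, [5461, 10922] -> 1, [10923, 16383] -> 2."""
    if slot <= 5460:
        return 0
    return 1 if slot <= 10922 else 2
```

Because the slot depends only on a checksum of the key name, keys written with unrelated names spread pseudo-randomly across the shards.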

However, in actual practice, data skew may occur due to a lack of advanced planning, unusual data writes, or data access spikes.

Note Typically, data skew occurs when the resources of specific data shards are in much higher demand than those of the other data shards.

You can view the metrics of data shards on the Data Node tab of the Performance Monitor page in the console. If a metric of a data shard is consistently 20% or more higher than that of the other data shards, data skew may be present. The larger the gap in the metrics, the more severe the data skew.
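As an illustration of the 20% rule of thumb, the following hypothetical helper flags shards whose metric value exceeds the average of the other shards by more than a given threshold. The function name and input shape are assumptions for this sketch; the metric values themselves would come from your monitoring data:

```python
def skewed_shards(metric_by_shard: dict, threshold: float = 0.2) -> list:
    """Return the names of shards whose metric exceeds the mean of the
    remaining shards by more than `threshold` (20% by default)."""
    names = list(metric_by_shard)
    flagged = []
    for name in names:
        others = [metric_by_shard[n] for n in names if n != name]
        baseline = sum(others) / len(others)  # mean over the other shards
        if baseline > 0 and metric_by_shard[name] > baseline * (1 + threshold):
            flagged.append(name)
    return flagged
```

For example, a shard at 80% memory usage while its peers sit around 40% would be flagged, whereas three shards all near 50% would not.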

Data skew may occur even if keys are evenly distributed among data shards, as the following examples show:
  • The queries per second (QPS) of Key 1 on Replica 1 is much higher than that of other keys. This is data access skew, which can lead to high CPU utilization and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.
  • The size of Key 5 on Replica 2 is much larger than that of other keys. This is data volume skew, which can lead to high memory usage and bandwidth usage on the replica. As a result, the performance of all keys on the replica is affected.

This topic describes how to determine whether data skew occurs, what may cause data skew, and how to handle this issue. You can also refer to this topic to troubleshoot high memory usage, CPU utilization, bandwidth usage, and latency for Tair standard instances.

Check for data skew

  • Use the diagnostic report feature to check whether data skew is present on the current instance.
  • On the Instance Information page, choose Performance Monitor > Data Node to view the metrics of data shards. For more information, see View monitoring data.

Provisional solutions to data skew

If data skew is present, you can apply the following provisional solutions as a contingency measure. These solutions provide temporary relief for data skew, but do not resolve the root cause.

To temporarily alleviate the impact of data skew, you can also reduce requests for large keys and hotkeys. However, issues related to large keys and hotkeys can be resolved only on your business side. We recommend that you identify the cause of data skew in a timely manner and fix it on the business side to optimize instance performance. For more information, see Causes and solutions.

Issue: Memory usage skew
Possible causes: Large keys and hash tags
Provisional solution: Upgrade your instance specifications. For more information, see Change the configurations of an instance.
  • Tair initiates a precheck for data skew during an instance specification change. If the instance type that you select cannot handle the data skew issue, Tair reports an error. In this case, select an instance type that has higher specifications and try again.
  • After you upgrade your instance specifications, memory usage skew may be alleviated. However, usage skew may still occur on bandwidth and CPU resources.

Issue: Bandwidth usage skew
Possible causes: Large keys, hotkeys, and resource-intensive commands
Provisional solution: Increase the bandwidth of one or more specific data shards. For more information, see Manually adjust the bandwidth of a Tair instance.
  Note You can increase the bandwidth of a data shard to up to three times its default maximum bandwidth. If this measure does not resolve the data skew, we recommend that you make modifications on the business side.

Issue: CPU utilization skew
Possible causes: Large keys, hotkeys, and resource-intensive commands
Provisional solution: None available. Check your instance, identify the cause, and then make modifications on the business side.

Causes and solutions

To solve the root cause of data skew, we recommend that you evaluate your business growth and make the necessary preparations for future growth. You can take measures to split large keys and write data in a manner that conforms to the expected usage.

Large keys

Whether a key is a large key is determined by the size of its value and the number of members that it contains.

Typically, large keys occur in data structures such as hash, list, set, and zset when a structure stores too many members or members that are too large. Large keys are one of the main causes of data skew. For more information, see Identify and handle large keys and hotkeys.

  • Do not use large keys.
  • Split a hash key that contains tens of thousands of members into multiple hash keys that have a proper number of members.
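The splitting approach in the bullets above can be sketched as follows. The bucketed_key helper and the bucket count are illustrative assumptions, not part of any Tair API; the helper routes each hash field to one of several smaller hash keys:

```python
import zlib

def bucketed_key(base: str, field: str, buckets: int = 16) -> str:
    """Route a hash field to one of `buckets` smaller hash keys,
    e.g. 'big_hash' -> 'big_hash:0' ... 'big_hash:15'."""
    bucket = zlib.crc32(field.encode()) % buckets
    return f"{base}:{bucket}"
```

On the write path, a command such as HSET bucketed_key("big_hash", field) field value replaces HSET big_hash field value; reads compute the same bucketed key. Because the bucketed keys have different names, they hash to different slots and can land on different data shards.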
Hotkeys

Hotkeys are keys that have a much higher QPS than other keys. Hotkeys commonly appear during stress testing on a single key or during flash sales on the keys of popular merchandise. For more information, see Identify and handle large keys and hotkeys.
Resource-intensive commands

Each command has a time complexity that measures its resource and time consumption. In most cases, the higher the time complexity of a command, the more resources the command consumes. For example, the time complexity of the HGETALL command is O(n), which means that the command consumes resources in proportion to the number of fields in the hash. Similarly, if a SET or GET command carries a large payload, the command also consumes a large amount of resources on the data shard.
  • Query the slow logs of this data shard.
  • Do not use resource-intensive commands. To disable specific commands, specify the #no_loose_disabled-commands parameter.
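For hash keys, one way to avoid the O(n) cost of a single HGETALL is to read the key in fixed-size batches, which is the idea behind HSCAN. The following plain-Python sketch simulates the pattern with a dict standing in for a real hash key; the function name and batch size are assumptions for this illustration:

```python
from itertools import islice

def scan_in_batches(fields: dict, count: int = 100):
    """Yield the hash's fields in batches of at most `count` entries,
    so no single call has to touch the whole key at once."""
    it = iter(fields.items())
    while True:
        batch = dict(islice(it, count))
        if not batch:
            return
        yield batch
```

Spreading the reads over many small batches keeps any single request cheap, which reduces CPU and bandwidth spikes on the data shard that holds the key.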
Hash tags

If a key contains a hash tag, which is the content enclosed in braces ({}), Tair calculates the slot of the key based only on the hash tag. For example, the {item}id1, {item}id2, and {item}id3 keys are stored in the same data shard because they share the same hash tag {item}. If many such keys exist, the memory usage and resource consumption of that data shard surge.
  • Do not use {} in the name of a key.
    Note If you must use {} in the name of a key, make sure that different keys contain different content in {}. This way, the keys are distributed across multiple data shards.
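The hash tag rule can be sketched as a small helper that extracts the portion of a key used for slot calculation. This sketch assumes the Redis cluster hash tag convention (the first non-empty {} section wins), which matches the behavior described above; keys with the same routing key map to the same slot and therefore the same data shard:

```python
def routing_key(key: str) -> str:
    """Return the portion of the key used for slot calculation:
    the content of the first non-empty {...} section, or the whole key
    if no such section exists."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:  # ignore empty tags like "{}"
            return key[start + 1 : end]
    return key
```

For example, routing_key("{item}id1") and routing_key("{item}id2") both return item, which is why these keys land on the same data shard, while keys with different content in {} can spread across shards.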