In a Tair (Redis OSS-compatible) cluster instance, data skew occurs when specific data nodes carry a disproportionate share of memory, CPU, or bandwidth load. When skew is severe enough, individual nodes can trigger key evictions, out of memory (OOM) errors, and prolonged response times — even when the instance's overall memory usage looks healthy.
Diagnose data skew
Follow these steps to confirm whether data skew exists and identify its source.
In the left-side navigation pane of the instance details page, go to CloudDBA > Real-time Performance and check whether Memory Usage is balanced across data nodes. A data node is skewed if its metrics are consistently more than 20% higher than those of other data nodes. In the example below, db-0 has significantly higher memory usage than db-1 and db-2, which confirms data skew.
Note: The instance diagnostics feature also detects data skew. For details, see Perform diagnostics on an instance.
Compare the total key counts across data nodes. If the key counts are roughly equal (as in the example, where db-0, db-1, and db-2 each hold approximately 2.59–2.60 million keys), the skew is caused by large keys on the overloaded node — not by uneven key distribution.
Note: If key counts are significantly uneven, the issue is likely key name design, for example hash tags routing many keys to the same data node. Keys that share the same content inside {} are always assigned to the same hash slot. To fix this, remove {} from key names in your application code. If you must use {}, make sure each key has unique content between the braces.
Use offline key analysis to identify large keys. This feature analyzes backup files without affecting running workloads and returns the top 500 large keys. In the example below, the key mylist is identified as a large key.
(Optional) Connect directly to a specific data node for deeper analysis.
Proxy mode: Use the ISCAN command (an in-house Alibaba Cloud command) to run SCAN on a specific data node. For details, see In-house commands for instances in proxy mode.
Direct connection mode: Call the DescribeDBNodeDirectVipInfo API to get the virtual IP address (VIP) of each data node, then connect your client directly.
Apply a fix:
Short-term: Upgrade the instance specifications.
Long-term (recommended): Restructure your application logic to split large keys.
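The 20% rule of thumb from the first diagnostic step can be expressed as a small check. The Python sketch below is illustrative only: the function name `find_skewed_nodes`, the sample figures, and the choice to compare each node against the average of the other nodes are assumptions, not part of any Tair tooling.

```python
from statistics import mean

def find_skewed_nodes(usage_by_node, threshold=0.20):
    """Return nodes whose metric exceeds the mean of the other nodes by more than `threshold`."""
    skewed = []
    for node, value in usage_by_node.items():
        others = [v for n, v in usage_by_node.items() if n != node]
        if not others:
            continue  # a single node has nothing to be compared against
        baseline = mean(others)
        if baseline > 0 and (value - baseline) / baseline > threshold:
            skewed.append(node)
    return skewed

# Hypothetical per-node memory usage in MB (made-up numbers for illustration).
sample = {"db-0": 2048, "db-1": 1100, "db-2": 1080}
print(find_skewed_nodes(sample))
```

Because the guideline says "consistently", a real check would likely apply this to several samples taken over time and only flag a node that exceeds the threshold in most of them.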
How data skew works
Tair cluster instances split storage into 16,384 hash slots. Each data node owns a range of slots. For example, in a 3-shard cluster:
Shard 1: hash slots [0, 5460]
Shard 2: hash slots [5461, 10922]
Shard 3: hash slots [10923, 16383]
When a key is written, the client determines its target slot using Slot = CRC16(key) % 16384, then routes the write to the corresponding data node. In theory, this distributes keys evenly. In practice, data skew can develop because of large keys, hot keys, hash tags, or resource-intensive commands.
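The slot calculation can be sketched in a few lines of Python. This assumes Tair uses the same CRC16 variant (XMODEM, polynomial 0x1021, initial value 0) as open-source Redis Cluster, which `binascii.crc_hqx` computes when seeded with 0. The function names and the `SHARD_RANGES` table are illustrative, and hash tags are deliberately ignored here.

```python
import binascii

SLOT_COUNT = 16384

# Slot ranges from the 3-shard example above (inclusive bounds).
SHARD_RANGES = {
    "Shard 1": (0, 5460),
    "Shard 2": (5461, 10922),
    "Shard 3": (10923, 16383),
}

def key_slot(key: str) -> int:
    """Slot = CRC16(key) % 16384, ignoring hash tags."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOT_COUNT

def shard_for(key: str) -> str:
    """Look up which example shard owns the key's slot."""
    slot = key_slot(key)
    for shard, (low, high) in SHARD_RANGES.items():
        if low <= slot <= high:
            return shard
    raise ValueError(f"slot {slot} is outside every configured range")

for name in ("foo", "hello"):
    print(name, key_slot(name), shard_for(name))
```

Under these assumptions the key foo maps to slot 12182, which falls in the range owned by Shard 3 in this example. The mapping depends only on the key name, not on the size of the value or how often it is accessed, which is why evenly distributed keys can still produce an overloaded node.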
The diagram below shows two typical skew scenarios where keys are distributed evenly (two keys per node) but one node is still overloaded:
Data access skew: key1 on Shard 1 receives far more queries per second (QPS) than other keys, driving up CPU utilization and bandwidth usage on that node and degrading performance for all keys it hosts.
Data volume skew: key5 on Shard 2 is 1 MB, much larger than other keys, causing high memory and bandwidth usage on that node.
Monitor data nodes on the Data Node tab of the Performance Monitor page. When a single data node reaches 100% memory usage, it triggers key eviction using the volatile-lru eviction policy by default.
Provisional solutions
These measures reduce pressure in the short term. For a permanent fix, see Causes and solutions.
| Issue | Possible causes | Provisional solution |
|---|---|---|
| Memory usage skew | Large keys, hash tags | Upgrade the instance specifications. Important: The system runs a data skew precheck during upgrades. If the selected instance type cannot absorb the skew, the system returns an error; select a higher-specification type and retry. Upgrading memory specifications may shift skew to bandwidth or CPU. |
| Bandwidth usage skew | Large keys, hot keys, resource-intensive commands | Increase the bandwidth of affected data nodes. Bandwidth can be increased up to 6× the default, capped at 192 Mbit/s. If this does not resolve the skew, address the root cause at the application level. |
| CPU utilization skew | Large keys, hot keys, resource-intensive commands | Add data nodes to spread load across more nodes. Also optimize expensive commands, for example by reducing the number of keys fetched in each SCAN call. Note: Scale during off-peak hours, because data migration during scaling is CPU-intensive. |
Causes and solutions
Address data skew at the root cause to eliminate it permanently.
Large keys
A large key is a key whose value size or member count is significantly higher than typical keys. Large keys are most common in Hash, List, Set, and Sorted Set (ZSet) data structures when a single key accumulates too many members or stores very large values. Large keys are one of the primary drivers of data volume skew.
Solutions:
Split a Hash key that holds tens of thousands of members into multiple smaller Hash keys. Use a common key prefix to identify the collection, for example product:1001:field1, product:1001:field2. To retrieve multiple fields across the split keys, use MGET.
If splitting is not immediately feasible, use HGET or HMGET instead of HGETALL to fetch only the fields you need rather than the entire key.
Prevent large keys from forming in the first place by enforcing size limits in your application.
For details on identifying and safely deleting large keys without impacting your workload, see Identify and handle large keys and hot keys.
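One possible way to split a large Hash is to spread its fields over a fixed number of smaller bucket keys. The sketch below is an assumption about how an application might do this, not a scheme prescribed by Tair: the helper names, the `:index` suffix, and the default of 16 buckets are all hypothetical. It only computes key names and does not talk to a server.

```python
import zlib

def bucket_key(base_key: str, field: str, buckets: int = 16) -> str:
    """Map one field of a logical Hash to one of `buckets` smaller Hash keys."""
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base_key}:{index}"

def group_fields_by_bucket(base_key, fields, buckets=16):
    """Group requested fields by the bucket key that stores them."""
    grouped = {}
    for field in fields:
        grouped.setdefault(bucket_key(base_key, field, buckets), []).append(field)
    return grouped

# Hypothetical fields of a product record.
wanted = ["price", "stock", "title", "vendor"]
for key, fields in group_fields_by_bucket("product:1001", wanted).items():
    print(key, fields)
```

Each entry in the grouped result corresponds to one read of the listed fields from that bucket key. The bucket keys intentionally contain no {} so that they can hash to different slots; the bucket count should stay fixed once data is written, because changing it changes where every field lives.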
Hot keys
A hot key receives far more QPS than other keys. Hot keys commonly appear during single-key stress tests or in flash sale scenarios tied to a specific product ID. High QPS on one key can saturate the CPU and bandwidth of the node that hosts it, degrading performance for all keys on that node.
Solutions:
Use the proxy query cache feature to serve hot reads from cache rather than hitting the data node directly. For details, see Identify and handle large keys and hot keys.
Prevent hot keys from forming by distributing access across multiple keys in your application design.
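As a rough illustration of distributing access across multiple keys, an application could keep several copies of a read-heavy value under different key names and pick one at random for each read. This is a sketch under stated assumptions: the `:copy:N` naming and both helper functions are hypothetical, and the approach only suits data that can tolerate brief inconsistency while all copies are being rewritten.

```python
import random

def replica_keys(hot_key: str, copies: int = 8) -> list:
    """All key names that hold a copy of the hot value; a write must update each one."""
    return [f"{hot_key}:copy:{i}" for i in range(copies)]

def pick_read_key(hot_key: str, copies: int = 8, rng=random) -> str:
    """Choose one copy at random so reads spread across the copies."""
    return f"{hot_key}:copy:{rng.randrange(copies)}"

# Hypothetical flash-sale item.
print(replica_keys("item:42", copies=4))
print(pick_read_key("item:42", copies=4))
```

Because each copy has a different name, the copies generally hash to different slots and so can land on different data nodes, which spreads the read QPS. The trade-off is extra memory and a more expensive write path.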
Resource-intensive commands
Every Redis command has a time complexity that reflects how resource consumption scales with input size. Commands with O(N) or higher complexity can consume significant node resources when the input is large. For example, HGETALL is O(N), where N is the number of fields in the Hash, so its cost grows in proportion to the size of the Hash.
To identify and reduce resource-intensive command usage:
Check the slow log for commands exceeding your latency threshold.
Use latency insights to find commands consuming the most resources.
Replace high-cost commands with lower-cost alternatives — for example, use HGET or HMGET instead of HGETALL to fetch only the fields you need.
Disable specific commands that should not be used in production by setting the #no_loose_disabled-commands parameter on the Parameter Settings page.
Hash tags
When a key name contains {}, Tair computes the hash slot from the content inside {} rather than from the whole key name. Keys that share the same {} content are therefore always assigned to the same data node, regardless of how many other nodes exist. For example, {item}id1, {item}id2, and {item}id3 all land on the same node, concentrating memory and compute load there.
Solutions:
Remove {} from key names.
If you need to use {} for a specific reason, make sure different keys have different content inside {} so they distribute across multiple data nodes.
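The effect of hash tags on slot assignment can be checked offline. The sketch below assumes Tair follows the open-source Redis Cluster rule: only the text between the first { and the first following } is hashed, and only when that text is non-empty. It also assumes the same CRC16 variant as open-source Redis, which `binascii.crc_hqx` computes when seeded with 0. The function names are illustrative.

```python
import binascii

def hash_tag_or_key(key: str) -> str:
    """Return the hash tag content if the key has a non-empty one, else the whole key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key

def key_slot(key: str) -> int:
    """CRC16 of the hashed portion of the key, modulo 16384."""
    return binascii.crc_hqx(hash_tag_or_key(key).encode("utf-8"), 0) % 16384

# Keys sharing one tag collapse onto one slot; keys with distinct tags are hashed independently.
shared = ["{item}id1", "{item}id2", "{item}id3"]
distinct = ["{item1}id", "{item2}id", "{item3}id"]
print({k: key_slot(k) for k in shared})
print({k: key_slot(k) for k in distinct})
```

Running a check like this against a sample of real key names before deployment is one way to spot a tag that would concentrate many keys on a single slot.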
FAQ
Can I upgrade only the data node that has the large key?
No. Tair (Redis OSS-compatible) does not support upgrading individual data nodes. Specification upgrades apply to all data nodes in the instance simultaneously.
Can I eliminate large keys by adding data nodes and redistributing keys?
No. Adding data nodes redistributes keys at the key level, but large keys cannot be split automatically during redistribution. After resharding, the large key moves to a single node, and that node will still have higher memory usage than others. The only effective solution is to manually split the large key in your application.