If the memory alert for a Tair (Redis OSS-compatible) instance indicates high memory usage or if your application encounters out of memory (OOM) exceptions, but the performance monitoring data indicates low memory usage, you can refer to this topic to troubleshoot the issue.
Problem description
Symptom 1:
You receive a memory alert for an instance, indicating that the memory usage exceeds the threshold. For example, the alert is triggered when the average memory usage is greater than or equal to 90% for three consecutive checks. However, the monitoring page in the console shows that the memory usage is significantly lower than the threshold.
Symptom 2:
Your application encounters the "command not allowed when used memory > 'maxmemory'" exception, but the monitoring page in the console shows that the memory is not fully occupied or that only one data node has high memory usage.
Causes
Why is the monitored memory usage different from the reported memory usage?
If the memory usage displayed on the monitoring page differs from the memory usage in the alert information, your instance may be a cluster instance. In that case, the alert may refer to a single data node, while the monitoring page you are viewing shows instance-level information.
Check whether nodeId = <Instance ID>-db-<Number> is included in the instance details in the alert information that you receive. If it is, only the memory usage of the data node identified by <Instance ID>-db-<Number> exceeds the threshold.
Perform the following steps to check whether the memory usage of the data node is the same as the memory usage in the alert information:
Log on to the console and go to the Instances page. In the top navigation bar, select the region in which the instance that you want to manage resides. Then, find the instance and click the instance ID.
In the left-side navigation pane, click Performance Monitoring.
Click the Data Node tab and select the data node that corresponds to <Instance ID>-db-<Number>. Check whether the memory usage of the data node is the same as the memory usage in the alert information.
Why is the memory usage of a data node significantly higher than the memory usage of other data nodes?
If the memory usage of one or more data nodes in a cluster instance is significantly higher than that of the other data nodes, data skew may have occurred. You can use the instance diagnostics feature to check whether data skew exists on the current instance.
Why does memory skew occur?
In most cases, memory skew occurs due to the following reasons:
Large keys exist.
The cluster instance uses the cyclic redundancy check (CRC) algorithm to calculate the slot to which a key belongs and writes data to the data node to which the slot belongs.
If a particular key stores a significant number of fields or fields that are large in size, the key may become excessively large and cause memory skew even if keys are evenly distributed across different data nodes.
Hash tags are used.
When you use hash tags such as user:{1000}:name, the instance performs CRC calculation only on the string that is enclosed in the curly braces and maps keys with the same hash tag to the same slot. This way, the keys reside on the same data node. If identical hash tags are configured for a large number of keys, data may be concentrated on a single data node and cause memory skew.
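The slot mapping described above can be illustrated with a short Python sketch. It assumes that the instance follows the key hashing rules of open source Redis Cluster (CRC16 with the XMODEM polynomial, 16384 slots, and hashing only the content of the first non-empty pair of curly braces). The function names are hypothetical and are not part of any Tair SDK.

```python
SLOT_COUNT = 16384  # assumption: same slot count as open source Redis Cluster


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hashed_part(key: bytes) -> bytes:
    """Return the part of the key that is hashed.

    If the key contains '{' followed later by '}' with at least one
    character in between, only that substring is hashed. Otherwise the
    whole key is hashed.
    """
    start = key.find(b"{")
    if start == -1:
        return key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        return key
    return key[start + 1:end]


def key_slot(key: str) -> int:
    """Compute the slot for a key under the assumptions stated above."""
    return crc16_xmodem(hashed_part(key.encode("utf-8"))) % SLOT_COUNT


if __name__ == "__main__":
    # Keys that share the hash tag {1000} map to the same slot.
    for name in ("user:{1000}:name", "user:{1000}:email", "user:1000:name"):
        print(name, key_slot(name))
```

Running the sketch against a sample of your own key names shows whether a small set of hash tags directs most keys to the same slots.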
Solutions
Check whether large keys exist and split the large keys
Identify large keys
You can use the offline key analysis feature to identify large keys. For more information, see Use the offline key analysis feature. For other methods, see Identify and handle large keys and hotkeys.
Split large keys
For example, you can split a HASH key that contains tens of thousands of members into multiple HASH keys that have the appropriate number of members. For cluster instances, you can split large keys to balance the memory usage across multiple data shards.
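As an illustration of this splitting approach, the following Python sketch distributes the fields of one large HASH across a fixed number of smaller HASH keys. The sub-key naming scheme (base key plus a bucket number), the bucket count, and the function names are assumptions for this example. The sketch only computes the target layout and does not connect to an instance.

```python
import zlib
from collections import defaultdict


def sub_key(base_key: str, field: str, shards: int) -> str:
    """Pick the sub-key that stores a field, using a stable hash of the field name."""
    bucket = zlib.crc32(field.encode("utf-8")) % shards
    return f"{base_key}:{bucket}"


def split_hash(base_key: str, mapping: dict, shards: int) -> dict:
    """Group the fields of one large hash into at most `shards` smaller hashes."""
    parts = defaultdict(dict)
    for field, value in mapping.items():
        parts[sub_key(base_key, field, shards)][field] = value
    return dict(parts)


if __name__ == "__main__":
    big = {f"field{i}": str(i) for i in range(50000)}
    parts = split_hash("bigkey", big, 16)
    for name in sorted(parts):
        print(name, len(parts[name]))
```

To read or write a single field later, the application calls sub_key with the same base key and bucket count to locate the smaller HASH. Because the sub-key names contain no hash tag, they can map to different slots and therefore to different data shards.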
Check whether hash tags are used
If hash tags are used, consider splitting a hash tag into multiple hash tags based on your business requirements. This way, data is evenly distributed across different data nodes.
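One possible way to split a hash tag is to append a bucket number to it, so that keys that previously shared one tag are spread over several tags. The following Python sketch shows the idea. The key layout prefix:{tag:bucket}:item_id, the bucket count, and the function name are hypothetical and must be adapted to your own naming scheme.

```python
import zlib


def split_tag_key(prefix: str, tag: str, item_id: str, buckets: int = 8) -> str:
    """Build a key whose hash tag is the original tag plus a bucket number.

    The bucket is derived from a stable hash of item_id, so the same item
    always resolves to the same key.
    """
    bucket = zlib.crc32(item_id.encode("utf-8")) % buckets
    return f"{prefix}:{{{tag}:{bucket}}}:{item_id}"


if __name__ == "__main__":
    # Instead of order:{shop42}:<id> for every order, use up to 8 distinct tags.
    for order_id in ("1001", "1002", "1003"):
        print(split_tag_key("order", "shop42", order_id))
```

Keys in different buckets no longer share a slot, so multi-key commands and transactions that relied on the original hash tag can be used only among keys in the same bucket.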
Upgrade instance specifications
Upgrading the instance specifications to increase the memory allocated to each shard can serve as a temporary workaround that mitigates the impact of memory skew. It does not remove the cause of the skew. For more information, see Change the configurations of an instance.
The system initiates a precheck for data skew during instance specification change. If the instance type that you select cannot handle the data skew issue, the system reports an error. Select an instance type that has higher specifications and try again.
After you upgrade the instance specifications, memory usage skew may be alleviated. However, skew may also occur on bandwidth and CPU resources.