Diagnostic reports help you evaluate the operational conditions of a Tair instance and identify anomalies on the instance based on statistics such as performance level, skewed request distribution, and slow logs.

Components of a diagnostic report

  • Basic instance information: displays basic information of an instance such as the instance ID, instance type, engine version, and the zone in which the instance is deployed.
  • Summary: displays the score of the instance health status and describes the reasons why points are deducted.
  • Performance level: displays the statistics and states of important performance metrics related to the instance.
  • TOP 10 nodes that receive the greatest number of slow queries: displays the top 10 data nodes that receive the greatest number of slow queries and provides information about the slow queries.

Basic instance information

This section displays the instance ID, instance type, engine version, and the zone in which the instance is deployed.

Summary

This section displays the diagnostic results and the score of the instance health status. The highest score is 100. If your instance scores less than 100, you can check the diagnostic items and details.

Figure 1. Summary

Performance level

This section displays the statistics and states of key performance metrics related to the instance. You must pay attention to performance metrics that are in the Hazard state.

Note If your instance runs in a cluster architecture or a read/write splitting architecture, you must check whether the performance metrics are skewed and check for skewed data nodes. For more information, see Cluster architecture and Read/write splitting architecture. In addition, we recommend that you focus on the data nodes with higher loads based on the curve charts of each performance metric in the Top 5 Nodes section.
Figure 2. Performance level
The following content describes each performance metric, its threshold, the impact when the threshold is exceeded, and the possible causes and troubleshooting methods.

CPU Utilization (threshold: 60%)

Impact: When a Tair instance has high CPU utilization, the throughput of the instance and the response time of clients are affected. In some cases, clients may fail to respond.

Possible causes:

  • The instance runs commands of high time complexity.
  • Hotkeys exist.
  • Connections are frequently established.

For more information about how to troubleshoot these issues, see Troubleshoot high CPU utilization on a Tair instance.
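As a starting point, you can check which commands have a high average latency by using the INFO commandstats metric. The following sketch assumes that you use the redis-py client; the endpoint, port, and password are placeholders that you must replace with the values of your own instance.

  import redis

  # Placeholder connection settings. Replace them with your instance endpoint and credentials.
  r = redis.Redis(host='r-example.redis.rds.aliyuncs.com', port=6379,
                  password='your-password', decode_responses=True)

  # INFO commandstats reports the call count and average latency (usec_per_call) of each command.
  stats = r.info('commandstats')
  slowest = sorted(stats.items(), key=lambda kv: kv[1]['usec_per_call'], reverse=True)[:10]
  for name, s in slowest:
      print(f"{name}: calls={s['calls']}, usec_per_call={s['usec_per_call']}")

Commands that show a high usec_per_call value, such as commands with O(N) time complexity, are common contributors to high CPU utilization.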

Memory Usage (threshold: 80%)

Impact: When the memory usage of a Tair instance continuously increases, keys may be frequently evicted, response time increases, and queries per second (QPS) becomes unstable. This affects your business.

Possible causes:

  • The memory is exhausted.
  • A large number of large keys exist.

For more information about how to troubleshoot these issues, see Troubleshoot the high memory usage on a Tair instance.
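Offline key analysis is the recommended way to find large keys. As a lightweight supplement, the following sketch samples the keyspace and flags keys whose in-memory size exceeds 1 MB. It assumes the redis-py client, placeholder connection settings, and an engine version of Redis 4.0 or later, which is required for the MEMORY USAGE command. Because SCAN iterates the keyspace, run it during off-peak hours.

  import redis

  # Placeholder connection settings. Replace them with your instance endpoint and credentials.
  r = redis.Redis(host='r-example.redis.rds.aliyuncs.com', port=6379,
                  password='your-password', decode_responses=True)

  # Sample the keyspace and report keys larger than 1 MB.
  THRESHOLD = 1024 * 1024
  for key in r.scan_iter(count=100):
      size = r.memory_usage(key) or 0
      if size > THRESHOLD:
          print(f'{key}: {size} bytes')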

Connections Usage of data nodes (threshold: 80%)

Impact: When the number of connections to a data node reaches the upper limit, new connection requests may time out or fail.

Note
  • This metric is collected when clients connect to a Tair cluster instance in direct connection mode. For more information about the direct connection mode, see Enable the direct connection mode.
  • This metric is not collected when clients connect to a Tair cluster instance or a read/write splitting instance by using proxy nodes. In this case, you must monitor the number of connections on the proxy nodes. For more information, see View monitoring data.

Possible causes:

  • User traffic spikes.
  • Idle connections are not released for an extended period of time.

For more information about how to troubleshoot these issues, see Session management.
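When connection usage is high, it often helps to check how many of the connections are idle. The following sketch lists connections that have been idle for more than 300 seconds by using the CLIENT LIST command. It assumes the redis-py client and placeholder connection settings; on instances that are accessed through proxy nodes, the output reflects only the connections seen by the node that you connect to.

  import redis

  # Placeholder connection settings. Replace them with your instance endpoint and credentials.
  r = redis.Redis(host='r-example.redis.rds.aliyuncs.com', port=6379,
                  password='your-password', decode_responses=True)

  clients = r.client_list()
  idle = [c for c in clients if int(c.get('idle', 0)) > 300]
  print(f'{len(clients)} connections in total, {len(idle)} idle for more than 300 seconds')
  for c in idle:
      print(c.get('addr'), c.get('age'), c.get('cmd'))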

Inbound Traffic and Outbound Traffic (threshold: 80%)

Impact: When the inbound or outbound traffic exceeds the maximum bandwidth provided by the instance type, the performance of clients is affected.

Possible causes:

  • Workloads spike.
  • Large keys are frequently read or written.

For more information about how to troubleshoot these issues, see Troubleshoot high traffic usage on a Tair instance.
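The following sketch estimates the current inbound and outbound traffic of the node that you connect to by sampling the total_net_input_bytes and total_net_output_bytes counters from INFO stats twice. It assumes the redis-py client and placeholder connection settings; compare the results with the maximum bandwidth of your instance type.

  import time
  import redis

  # Placeholder connection settings. Replace them with your instance endpoint and credentials.
  r = redis.Redis(host='r-example.redis.rds.aliyuncs.com', port=6379,
                  password='your-password', decode_responses=True)

  def net_counters():
      stats = r.info('stats')
      return stats['total_net_input_bytes'], stats['total_net_output_bytes']

  INTERVAL = 10                        # sampling interval in seconds
  in1, out1 = net_counters()
  time.sleep(INTERVAL)
  in2, out2 = net_counters()
  print(f'inbound:  {(in2 - in1) * 8 / INTERVAL / 1e6:.2f} Mbit/s')
  print(f'outbound: {(out2 - out1) * 8 / INTERVAL / 1e6:.2f} Mbit/s')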

If your instance runs in the cluster architecture or read/write splitting architecture, the system measures the overall access performance of the instance based on the preceding performance metrics and displays the results in the diagnostic report. For more information, see Cluster architecture and Read/write splitting architecture. The following table describes the criteria used to determine skewed requests, possible causes of skewed requests, and troubleshooting methods.

Note If the diagnostic report indicates that the instance has skewed requests for a specific performance metric, you must check the nodes to which the skewed requests are directed.
Criterion

A performance metric of the instance is considered skewed if both of the following conditions are met:

  • The peak values of the performance metric on all data nodes of the instance exceed the following thresholds:
    • CPU utilization: 10%
    • Memory usage: 20%
    • Inbound and outbound traffic: 5 Mbit/s
    • Connection usage: 5%
  • The balance score is greater than 1.3. The balance score is calculated by using the following formula: Balance score = Maximum of the average values of all data nodes/Median of the average values of all data nodes.

    For example, a Tair instance contains four data nodes whose average CPU utilization values are 10%, 30%, 50%, and 60%. The median of these values is 40% and the maximum is 60%, so the balance score is 60/40 = 1.5. Because 1.5 is greater than 1.3, the system considers the CPU utilization of the instance skewed. The sketch after this list reproduces the calculation.
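The following sketch reproduces the balance score calculation with the example values above. The node_averages list is a stand-in for the average values that you read from the monitoring data of your own instance.

  from statistics import median

  # Average CPU utilization of each data node, in percent (example values from the text).
  node_averages = [10, 30, 50, 60]

  balance_score = max(node_averages) / median(node_averages)
  print(balance_score)  # 60 / 40 = 1.5, which is greater than 1.3, so the metric is skewed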

Possible causes

  • A data node contains excessive large keys.
  • A data node contains hotkeys.
  • Hash tags are improperly configured.
    Note If keys are configured with the same hash tag, the keys are stored on the same data node. If a large number of keys are configured with the same hash tag, the node can be overwhelmed by requests for these keys. The sketch after this list shows how hash tags determine the data node on which a key is stored.
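In Redis Cluster, only the substring inside the first pair of braces (the hash tag) is hashed when a key contains one, so all keys that share a hash tag map to the same slot and therefore to the same data node. The following sketch is a self-contained illustration of this mapping; the example key names are hypothetical.

  # CRC16 (XMODEM variant) as used by Redis Cluster to map keys to slots.
  def crc16(data: bytes) -> int:
      crc = 0
      for byte in data:
          crc ^= byte << 8
          for _ in range(8):
              crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
              crc &= 0xFFFF
      return crc

  def key_slot(key: str) -> int:
      # If the key contains a non-empty {...} section, only that section is hashed.
      start = key.find('{')
      if start != -1:
          end = key.find('}', start + 1)
          if end > start + 1:
              key = key[start + 1:end]
      return crc16(key.encode()) % 16384

  # Both keys share the hash tag {user:1001}, so they map to the same slot and data node.
  print(key_slot('{user:1001}:profile'), key_slot('{user:1001}:cart'))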

TOP 10 nodes that receive the greatest number of slow queries

This section displays the top 10 data nodes that receive the greatest number of slow queries and statistics about the slow queries. The statistics come from the following slow logs:

  • The slow logs of data nodes that are stored in the system audit logs. These slow logs are retained only for four days.
  • The slow logs that are stored on the data node. Only the most recent 1,024 log entries are retained. You can use redis-cli to connect to the instance and run the SLOWLOG GET command to view these slow logs, as shown in the sketch after this list.
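If you prefer a client library over redis-cli, the following sketch retrieves the most recent slow log entries by using the redis-py equivalent of the SLOWLOG GET command. The connection settings are placeholders for your own instance.

  import redis

  # Placeholder connection settings. Replace them with your instance endpoint and credentials.
  r = redis.Redis(host='r-example.redis.rds.aliyuncs.com', port=6379,
                  password='your-password', decode_responses=True)

  # Each entry contains the log ID, the execution time in microseconds, and the command.
  for entry in r.slowlog_get(10):
      print(entry['id'], entry['duration'], entry['command'])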
Figure 3. Slow queries

You can analyze the slow queries, determine whether improper commands are run, and resolve the issues based on the following causes and solutions.

Cause: Commands that have a time complexity of O(N) or that consume a large amount of CPU resources are run, such as KEYS *.

Solution: Evaluate and disable commands that pose high risks and consume a large amount of CPU resources, such as FLUSHALL, KEYS, and HGETALL. For more information, see Disable high-risk commands.

Cause: Large keys are frequently read from and written to the data nodes.

Solution: Analyze and evaluate the large keys. For more information, see Offline key analysis. Then, split these large keys based on your business requirements.
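If your application needs to enumerate keys, an incremental SCAN is a safer alternative to KEYS *, because it does not block the instance while the whole keyspace is traversed. The following sketch assumes the redis-py client, placeholder connection settings, and a hypothetical session:* key prefix.

  import redis

  # Placeholder connection settings. Replace them with your instance endpoint and credentials.
  r = redis.Redis(host='r-example.redis.rds.aliyuncs.com', port=6379,
                  password='your-password', decode_responses=True)

  # Iterate keys in small batches instead of fetching them all at once with KEYS *.
  for key in r.scan_iter(match='session:*', count=500):
      print(key)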