If a Tair instance experiences high CPU utilization, both the throughput of the instance and the response time of applications that connect to it are affected. In extreme cases, an application may stop responding. If the average CPU utilization is higher than 50% and the average peak CPU utilization exceeds 90% for more than 5 minutes, the stability of the application may be affected. Monitor this condition closely and troubleshoot it promptly.
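The thresholds above can be turned into a simple check over monitoring samples. The following is a minimal sketch, not an official rule: the function name, the fixed sampling interval, and the reading of "for more than 5 minutes" as a consecutive run of samples above 90% are all assumptions made for illustration.

```python
from statistics import mean


def cpu_needs_attention(samples, interval_seconds=60,
                        avg_threshold=50.0, peak_threshold=90.0,
                        peak_duration_seconds=300):
    """Return True if CPU samples match the warning condition.

    samples: CPU utilization percentages collected at a fixed interval
    (hypothetical input; export it from your monitoring system).
    """
    if not samples:
        return False

    average_is_high = mean(samples) > avg_threshold

    # Longest run of consecutive samples above the peak threshold.
    longest_run = current_run = 0
    for value in samples:
        current_run = current_run + 1 if value > peak_threshold else 0
        longest_run = max(longest_run, current_run)

    # Assumption: "more than 5 minutes" means the run lasts strictly longer.
    peak_lasts_too_long = longest_run * interval_seconds > peak_duration_seconds
    return average_is_high and peak_lasts_too_long
```

With one sample per minute, six consecutive samples above 90% combined with an overall average above 50% would trigger the check.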
Search for and disable commands that cause high CPU utilization
Commands that consume large amounts of CPU resources have a time complexity of O(N) or higher. In most cases, the higher the time complexity of a command, the more CPU resources it consumes and the higher the CPU utilization becomes. For more information about the time complexity of each command, visit the Redis official website.
Because Tair processes commands on a single thread, a command that consumes large amounts of CPU resources causes pending requests to pile up in the queue. This slows down the response of applications. In some cases, a Tair instance may be overwhelmed by pending requests, and an application may be disconnected because these requests time out. In addition, requests may be directly forwarded to backend databases and cause a cache avalanche.
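One common way to avoid a single long-running O(N) command such as KEYS is to iterate over the keyspace incrementally with SCAN, so that each call does a small amount of work and other requests can run in between. The sketch below assumes a client object that exposes `scan(cursor, match=..., count=...)` and returns `(next_cursor, keys)`, as the redis-py `Redis.scan` method does. The function name and batch size are hypothetical.

```python
def scan_keys(client, pattern, batch_size=100):
    """Yield keys that match pattern by using SCAN instead of KEYS.

    client is assumed to provide scan(cursor, match=..., count=...)
    returning (next_cursor, keys). count is only a hint to the server
    about how much work to do per call.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor, match=pattern, count=batch_size)
        yield from keys
        # The server signals the end of the iteration with cursor 0.
        if int(cursor) == 0:
            break
```

Note that SCAN may return the same key more than once, so callers that need unique keys should deduplicate the results.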
Use the performance monitoring feature to identify the time period during which CPU utilization is high. For more information, see View performance monitoring data.
Use the following methods to identify the commands that consume large amounts of CPU resources:
Audit logs record modification and deletion operations that are performed on Tair instances. You can query audit logs to analyze the commands and trends within a specific time period. This allows you to identify the commands that consume large amounts of CPU resources. For more information, see View audit logs.
Slow logs record commands that run longer than the specified threshold. You can identify commands that consume large amounts of CPU resources based on the statements and durations that are recorded in slow logs. For more information, see Query slow logs.
Note: The amount of time that it takes to execute a statement is measured in microseconds.
Assess and disable commands that cause a high risk or consume large amounts of CPU resources, such as FLUSHALL, KEYS, and HGETALL. For more information, see Disable high-risk commands.
Optimize your application and do not frequently sort data.
Optional: Use one of the following methods to modify the instance based on your business requirements:
Change the architecture of the instance to read/write splitting to distribute commands or applications that consume large amounts of CPU resources. For more information about the read/write splitting architecture, see Read/write splitting instances.
Change the instance into a performance-enhanced instance and use the multi-threading feature of performance-enhanced instances to lower the CPU utilization of the instance. For more information about performance-enhanced instances, see DRAM-based instances.
Note: For more information about how to change the architecture and series type of an instance, see Change the configurations of an instance.
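The slow log analysis described in the steps above can be sketched as a small aggregation that groups entries by command name and converts the recorded microseconds to milliseconds. This is an illustrative sketch only: the entry shape (a dict with `command` and `duration` fields) is an assumption, and the function name is hypothetical. Adapt it to however you export slow log data.

```python
from collections import defaultdict


def rank_slow_commands(entries, top=5):
    """Rank commands by the total time they spent in the slow log.

    entries: iterable of dicts with 'command' (full statement, str or
    bytes) and 'duration' (microseconds). This shape is assumed.
    Returns a list of (command_name, call_count, total_milliseconds).
    """
    totals = defaultdict(lambda: [0, 0])  # name -> [count, total_us]
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        if not parts:
            continue
        name = parts[0].upper()
        totals[name][0] += 1
        totals[name][1] += entry["duration"]

    ranked = sorted(totals.items(), key=lambda item: item[1][1], reverse=True)
    return [(name, count, total_us / 1000.0)
            for name, (count, total_us) in ranked[:top]]
```

Commands that appear at the top of this ranking are candidates for the assessment in the step that covers disabling high-risk commands.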
Optimize hotkeys
Issue:
A cluster instance or a read/write splitting instance is used. For more information, see Cluster architecture and Read/write splitting architecture. The CPU utilization is high on some data nodes in the Tair instance.
Solution:
Enable the proxy query cache feature. After you enable this feature, proxy nodes cache the request and response data of hotkeys. If a proxy node receives a duplicate request during the validity period of the cached data, the proxy node directly returns a response to the client without the need to interact with backend data shards. This helps prevent skewed requests caused by hotkeys that receive a large number of read requests. For more information, see Use proxy query cache to address issues caused by hotkeys.
Note: This feature is supported only for DRAM-based instances that use the cluster architecture.
Analyze the slow logs and audit logs of the affected nodes to identify the hotkeys on each node. After you identify the hotkeys, you can resolve the issue or at least reduce CPU utilization. For more information, see Use the real-time key statistics feature.
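As a rough illustration of hotkey analysis, the sketch below counts how often each key appears in a sample of command lines, such as lines extracted from audit logs. The input format (one command per line as plain text) and the function name are assumptions. The sketch also assumes that the key is the first argument, which holds for single-key commands such as GET, SET, and HGET but not for every command.

```python
from collections import Counter


def top_keys(commands, limit=3):
    """Return the most frequently accessed keys in a command sample.

    commands: iterable of strings such as "GET user:1" (assumed format).
    Only the first argument of each command is treated as the key.
    """
    counts = Counter()
    for line in commands:
        parts = line.split()
        if len(parts) >= 2:  # skip commands without arguments, e.g. PING
            counts[parts[1]] += 1
    return counts.most_common(limit)
```

A key that accounts for a large share of the sampled commands on a single node is a likely hotkey.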
Optimize short-lived connections
Issue:
Connections to a Tair instance are frequently established, which consumes large amounts of instance resources. In this case, CPU utilization is high, the number of established connections is large, and the number of queries per second (QPS) does not reach the expected value.
Solution:
Change short-lived connections into persistent connections. For example, create a JedisPool connection pool. For more information, see TairJedis.
Change the instance into a DRAM-based instance to optimize the processing of short-lived connections.
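The idea behind a connection pool is to create a small number of persistent connections once and reuse them, instead of opening a new connection for every request. The following is a minimal, library-independent sketch of that idea. The class name and the `factory` callable are hypothetical, and the sketch omits health checks and reconnection. In practice, use the pool that your client library provides, such as JedisPool.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Keep a fixed number of persistent connections and reuse them."""

    def __init__(self, factory, size=4):
        # factory is a caller-supplied function that opens one connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free (or raises queue.Empty on timeout).
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return the connection to the pool instead of closing it.
            self._idle.put(conn)
```

With this pattern, the number of connections that the instance must accept stays at the pool size no matter how many requests the application sends.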
Disable AOF persistence
Issue:
By default, append-only file (AOF) persistence is enabled for Tair instances. If a Tair instance runs with heavy loads, frequent AOF operations may increase CPU utilization.
Solution:
Disable AOF persistence if this does not adversely affect your business. In addition, you can back up the Tair data during off-peak hours or during the maintenance window to minimize the impact.
If you disable AOF persistence on a DRAM-based instance, you can use only backup sets to restore instance data. Proceed with caution. For more information, see Restore data from a backup set to a new instance.
Optimize proxy node connections and the use of pipelines
Issue:
When you view the performance trends of a cluster instance or read/write splitting instance in the Tair console, the CPU utilization of proxy nodes is unevenly distributed, and large differences exist between the maximum and minimum CPU utilization.
Solution:
Use the performance trends feature to check whether connection usage is evenly distributed. For more information, see Performance trends.
Perform the following operations based on whether connection usage is evenly distributed:
If connection usage is unevenly distributed, restart the client on which the business application is deployed, or restart the proxy nodes, to redistribute connections.
If connection usage is evenly distributed, the uneven CPU utilization is usually caused by a large number of pipeline or batch operations. Reduce the size of these operations. For example, you can split one large operation into multiple smaller operations.
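Splitting one large pipeline or batch operation into smaller ones can be sketched as a simple chunking helper. The function name and the batch size are hypothetical. Each yielded batch would then be sent as its own pipeline by your client library.

```python
def split_into_batches(commands, batch_size):
    """Yield consecutive slices of commands with at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(commands), batch_size):
        yield commands[start:start + batch_size]
```

A suitable batch size depends on the workload, so treat any specific value as a starting point for testing rather than a recommendation.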
Evaluate the service performance
The preceding methods are used to optimize the performance of your instance. If the average CPU utilization still exceeds 50% during normal business operations, the instance may have a performance bottleneck.
To resolve this issue, first check for commands and requests from application hosts that may degrade the instance performance. If such commands or requests exist, you must optimize your business system. If no such commands or requests are found but the CPU utilization is still high, we recommend that you upgrade the instance specifications to ensure business stability. You can also upgrade the instance to a cluster instance or read/write splitting instance. For more information about how to upgrade an instance, see Change the configurations of an instance.
To ensure business stability, we recommend that you purchase a pay-as-you-go instance before you upgrade the instance. You can release this pay-as-you-go instance after you complete the stress and compatibility tests.