CPU resources are of paramount importance to a database and are a major consideration in daily O&M. High CPU utilization causes high application response time (RT) and stuttering, and can even lead to cluster hangs and HA switchovers, which seriously affect your business. Therefore, you must set a threshold for CPU monitoring and take immediate action when the threshold is reached. Otherwise, serious consequences may follow.
Countermeasures against high CPU utilization
As business grows, your cluster may no longer meet the requirements. Growing traffic drives up cluster usage and CPU utilization. In the performance curve, a metric (QPS or IOPS) shows an upward trend, consistent with the upward trend of CPU utilization.
If the CPU becomes a bottleneck, the current cluster specifications are insufficient. You must add read-only nodes to the cluster or scale up the cluster.
If your cluster cannot be connected or a running DML statement fails, check whether the CPU resources of the cluster are sufficient. If not, scale up the cluster to ensure system stability.
If most business scenarios involve read requests, you can add read-only nodes to distribute read requests. For more information, see Add or remove read-only nodes.
If most business scenarios involve write requests, adding read-only nodes does not improve performance. You must scale up the cluster, for example, changing the specifications from 4-core to 8-core. For more information, see Manually change the specifications of a cluster.
Unexpected increase of CPU utilization
Many factors can cause an unexpected increase in CPU utilization. This topic focuses on slow queries, a large number of active threads, inappropriate kernel configurations, and system bugs.
Slow queries
An increase in CPU utilization is generally due to unreasonable SQL statements, which cause slow queries and a surge in active threads. However, you must distinguish whether high CPU utilization results from slow queries, or whether a shortage of other resources causes both the slow queries and the high CPU utilization.
On the Slow SQL Query page of the PolarDB console, view the information about slow queries. For more information, see Slow SQL queries.
If slow query data is displayed, you can analyze slow queries. If the values of the Scanned Rows column on the Slow Log Details tab are much larger than the values of the Returned Rows column, high CPU utilization is due to slow queries.
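In addition to the console, you can inspect the slow query log directly with SQL. The following is a minimal sketch that assumes your account can read the mysql system database and that the log_output parameter includes TABLE so that slow queries are also written to the mysql.slow_log table; by default the log may only be written to a file:
-- Find queries that scan many rows but return few, a typical sign of a missing index
SELECT start_time, query_time, rows_examined, rows_sent, sql_text
FROM mysql.slow_log
WHERE rows_examined > 10000 AND rows_sent < 100
ORDER BY rows_examined DESC
LIMIT 20;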
Note: You mainly analyze TP queries and must exclude count queries. Some AP queries have large values of the Scanned Rows column.
Only a small amount of data is read and written in TP queries. If a query scans a large amount of data, it is very likely that no indexes are created. For example, if the number of scanned rows is more than 10,000 and the number of returned rows is 1 after you execute the following query statement, the index is missing for the name column:
SELECT * FROM table1 WHERE name='testname';
You can execute the following statement to check whether an index is created for the name column:
SHOW INDEX FROM table1;
If the name column does not have an index, you can execute the following statement to add an index on the column and eliminate the slow queries caused by excessive data scans:
ALTER TABLE table1 ADD KEY ix_name (name);
If the name column has an index, you can execute the following statement to view the execution plan of the SQL statement and check whether the index is used:
EXPLAIN SELECT * FROM table1 WHERE name='testname';
If the name column has an index but the index is not used, the possible reason is that an incorrect execution plan is generated due to inaccurate statistics. You can execute the following statement to regenerate the statistics on the table and correct the execution plan:
ANALYZE TABLE table1;
After the preceding statement is executed, execute the following statement to check whether the correct index is used:
EXPLAIN SELECT * FROM table1 WHERE name='testname';
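In the EXPLAIN output, the key column shows the index that the optimizer actually selected. An illustrative result (the values are not from a real run) after the statistics are corrected might look like the following, with ix_name listed in both the possible_keys and key columns:
id: 1  select_type: SIMPLE  table: table1  type: ref  possible_keys: ix_name  key: ix_name  rows: 1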
Lots of active threads
A large number of active threads inevitably leads to high CPU utilization. In the MySQL implementation, each CPU core can process only one request at a time. For example, a cluster of the 16-core specification can process at most 16 requests at a time. Note that the requests here are at the kernel level, not in terms of application concurrency. You can view session details in the PolarDB console.
Excluding exceptional requests caused by slow queries, active threads surge as network traffic increases. If the overall traffic and request trends are consistent with the accumulation trend of active threads, all cluster resources are in use. In this case, you must add read-only nodes to the cluster or scale up the cluster.
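To quantify the accumulation at the kernel level, you can compare the number of running threads with the number of CPU cores of your current specification. This is a rough check; Threads_running counts connection threads that are not sleeping:
SHOW GLOBAL STATUS LIKE 'Threads_running';  -- values persistently well above the core count indicate saturation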
When active threads reach a critical level, CPU contention occurs and a large number of mutex locks are generated in the kernel. In this case, the performance curve is characterized by high CPU utilization, a large number of active threads, and low I/O or low QPS. This can also happen at business peaks, when connections are established at a very high rate and CPU contention causes requests to accumulate. You can enable the thread pool feature for the cluster to mitigate this issue. For more information, see Thread pool. If the number of active threads decreases after the thread pool is enabled, check whether requests accumulate on the application side. If high CPU load persists together with surging active threads, consider scaling up the cluster.
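To confirm whether the thread pool feature is exposed and enabled on your cluster, you can list the related parameters. The exact parameter names depend on the kernel version, so treat this as a non-authoritative check:
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';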
If connection storms at the front end cause a sudden accumulation of requests on the cluster side, the traffic is exceptional. Such traffic generally comes from data crawling. In this case, you can enable SQL throttling to reject the requests. For more information, see Session management.
Inappropriate kernel configurations
The default values of the PolarDB parameters are intended for common scenarios and may not suit special business workloads. You can modify the values of the PolarDB parameters as needed. Some issues may not occur at the initial stage when only a small amount of data is involved, but appear as the data volume increases.
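Before you request a parameter change in the console, you can confirm the current runtime value of the parameter in question. A minimal sketch, using innodb_buffer_pool_size only as an example name; substitute the parameter you plan to tune:
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';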
A common issue is memory contention. In MySQL, memory is mainly used for data caching, and the data in memory is iterated continuously. Typically, the buffer pool and innodb_adaptive_hash_index consume the most memory. Within the entire database system, the cache area handles the most frequent data exchange. Insufficient memory and memory page contention may result in data accumulation and slow queries. A typical symptom is that the CPU is exhausted and slow queries occur. If this issue is not caused by missing indexes, it is due to memory contention.
For example, when you execute the TRUNCATE TABLE statement, MySQL scans the buffer pool to evict all data pages of the table to be truncated. In a cluster with high specifications, if the innodb_buffer_pool_instances parameter is set to 1 and the concurrency is relatively high, contention may occur. This issue may be found even at the initial stage. Solutions include setting the innodb_buffer_pool_instances parameter to the number of CPU cores and dividing the buffer pool into buckets.
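You can check the current value as shown below. Note that innodb_buffer_pool_instances is not dynamically adjustable in MySQL, so the change itself is made through the parameter settings in the console; the statement is a read-only check:
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_instances';  -- compare the value with the number of CPU cores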
Another scenario is memory contention for innodb_adaptive_hash_index (AHI). The symptom is that a large number of hash0hash.cc waits are displayed when you execute the following statement:
SHOW ENGINE INNODB STATUS;
In the AHI field of the output, obvious data skew occurs. For this issue, you can set the innodb_adaptive_hash_index parameter to off to disable the AHI feature. The AHI feature may also degrade performance in read and write scenarios, and disabling it has little impact on the overall business.
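On a self-managed MySQL instance, the change can be made at runtime as sketched below; on a PolarDB cluster, the parameter is typically modified on the parameter settings page of the console instead:
SET GLOBAL innodb_adaptive_hash_index = OFF;
SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index';  -- verify that the value is now OFF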
System bugs
System bugs are very rare. For example, process deadlocks or full table scans caused by a reset of table statistics may result in high CPU utilization. With the rapid iteration of the service, CPU issues caused by system bugs have become less and less common. Troubleshooting them can be difficult because kernel information is involved. We recommend that you Submit a ticket to contact Alibaba Cloud technical support for troubleshooting.