CPU is a core database resource and a primary focus during daily operations. Excessive CPU usage can increase application response time, cause service slowdowns, and in severe cases, cause a database instance to hang or create high availability issues, seriously impacting production workloads. Therefore, you should set a safe threshold for CPU utilization and take immediate action if it is exceeded to prevent severe, unexpected consequences.
CPU utilization from business growth
As your business grows, you may outgrow your current cluster specification. Performance charts will typically show a metric, such as QPS or IOPS, trending upward in a pattern similar to the CPU utilization trend
If the CPU is the bottleneck, your cluster specification is likely insufficient for your business traffic. Resolve this by adding a read-only node to your database cluster or scaling up the cluster specification.
If you cannot connect to the database cluster or a running DML statement fails, check whether the CPU resources of your current cluster specification are sufficient for your workload. If you confirm that CPU resources are inadequate, scale up the cluster promptly to maintain system stability.
If your workload consists mostly of read requests, you can add a read-only node to scale out the cluster and distribute the read traffic. For more information, see Add or remove nodes.
If your workload is mostly write requests, adding a read-only node will not improve performance. In this case, you need to manually scale up the cluster specification, for example, by upgrading from a 4-core to an 8-core specification.
Troubleshooting high CPU utilization
Troubleshooting an unexpected increase in CPU utilization can be complex. This topic describes several common causes, including slow queries, a high number of active threads, improper kernel configuration, and system bugs.
Slow queries
High CPU utilization is often caused by inefficient SQL statements that result in slow queries and a buildup of active threads. However, you must first determine whether slow queries are the root cause of high CPU utilization or if a different resource bottleneck is slowing down queries and indirectly increasing CPU utilization.
You can view slow queries in the PolarDB console by navigating to . On the Slow Log Details tab, if the value for Scanned Rows is significantly higher than the value for Returned Rows, it indicates that slow queries are causing the high CPU utilization.
This analysis focuses on TP queries and therefore excludes count queries. Some AP queries also have a very large number of scanned rows.
TP queries involve a very small amount of data reads and writes. If a query scans a large amount of data, it is highly likely that an index is missing. For example, if a query in the slow query list shows that more than 10,000 rows were scanned but only one row was returned, it is a clear indication that an index is missing on the name column.
SELECT * FROM table1 WHERE name='testname';You can use the following statement to check whether an index exists on the name column.
SHOW index FROM table1;If the
namecolumn does not have an index, you can add an index with the following statement to eliminate slow queries caused by such large-scale data scans.ALTER TABLE table1 ADD KEY ix_name (name);If the
namecolumn has an index, you can use the following statement to view the execution plan of the SQL statement and confirm whether the correct index is used.EXPLAIN SELECT * FROM table1 WHERE name='testname';If you find that an index exists on the
namecolumn but is not used, this may be because inaccurate statistics have resulted in an incorrect execution plan. You can run the following statement to regenerate the statistics on the table to correct the plan.ANALYZE TABLE table1;After the command completes, check the execution plan again to confirm that the correct index is now used:
EXPLAIN SELECT * FROM table1 WHERE name='testname';
High active thread count
A high active thread count always increases CPU utilization. In MySQL, each CPU core can process only one request at a time. For example, a 16-core cluster can handle a maximum of 16 concurrent requests at the kernel level, which is distinct from application-level concurrency. You can view session information in the PolarDB console by navigating to .
If you rule out slow queries as the cause of request processing failures, an accumulation of active threads is typically due to an increase in production traffic. You can view the performance curves to verify this. If the overall traffic and request trends are consistent with the trend of active thread accumulation, it indicates that the cluster resources have reached their limit. To resolve this issue, you must add read-only nodes to the database cluster or scale up the cluster specifications.
When the number of active threads reaches a critical point, CPU contention can occur, generating numerous mutex locks in the kernel. Performance charts will typically show high CPU utilization, a high active thread count, and low IOPS or QPS. Another cause is a sudden traffic surge, where a high rate of new connections also leads to CPU contention and a request backlog. You can often mitigate this issue by enabling the cluster's thread pool feature for flow control. If the active thread count is reduced, check whether tasks are still backlogged on the application side. If CPU load and the active thread count remain high, you should also consider scaling up the cluster.
A frontend connection storm can also cause an instantaneous traffic spike on the cluster. This is abnormal traffic, often caused by web crawlers. You can use SQL throttling to reject requests. For more information, see Session Management.
Improper kernel configuration
The default parameters for a self-managed MySQL instance are configured for general-purpose scenarios and may not be optimal for all workloads and may require fine-tuning. Some issues may not appear in the early stages of an application when the data volume is small, but can emerge as the data grows over time and specific conditions are met.
Memory contention is a common problem. In the MySQL architecture, memory is primarily used for data caching. The most commonly used memory areas are the buffer pool and innodb_adaptive_hash_index. The cache area for the entire database system is where data is exchanged most frequently. If there is insufficient memory or memory page contention, an accumulation of various exceptions and slow queries can occur. A typical symptom is a sudden spike in CPU usage to its maximum level, accompanied by slow queries. If an investigation reveals that the issue is not a missing index, the problem may lie with the memory system.
For example, when a truncate table operation is performed, MySQL traverses the buffer pool to evict all data pages of the table being truncated. In a large-scale cluster, if innodb_buffer_pool_instances is set to 1 and concurrency is relatively high, contention issues may occur. This issue can be discovered early in the service lifecycle, and you can typically avoid it by aligning the value of innodb_buffer_pool_instances with the number of CPU cores and partitioning the buffer pool into buckets.
Another scenario is contention for innodb_adaptive_hash_index. A clear symptom is a large number of hash0hash.cc waits.
SHOW ENGINE innodb STATUS;In the AHI section of the output, you will see significant data skew.
insert 0, delete mark 0, delete 0
Hash table size 25499819, node heap has 20720 buffer(s)
Hash table size 25499819, node heap has 25111 buffer(s)
Hash table size 25499819, node heap has 23884 buffer(s)
Hash table size 25499819, node heap has 16835 buffer(s)
Hash table size 25499819, node heap has 23132 buffer(s)
Hash table size 25499819, node heap has 189284 buffer(s)
Hash table size 25499819, node heap has 38864 buffer(s)
Hash table size 25499819, node heap has 49094 buffer(s)
5469.32 hash searches/s, 5282.36 non-hash searches/sFor this type of issue, you can disable the innodb_adaptive_hash_index parameter, which in turn disables the AHI feature. Data shows that in mixed read-and-write scenarios, AHI can have a negative performance impact, but disabling it does not significantly affect overall business.
System bugs
System bugs are a relatively rare cause of high CPU issues. Examples from earlier versions include process deadlocks and full table scans caused by table statistics being reset to zero. As the product evolves, CPU issues caused by system bugs have become less common. However, troubleshooting these problems often requires deep kernel-level information and can be difficult to resolve on your own. We recommend that you contact us for technical support.