Problem description
High CPU utilization or system load can cause the following symptoms:
Application service issues
SSH remote connections are slow, unresponsive, or even fail to connect.
The response time for your website or application increases significantly, and pages load slowly.
Requests frequently time out, API calls fail, and overall service capacity is noticeably reduced.
System resource issues
The instance CPU utilization consistently exceeds 80% and may even approach 100%.
The system load (load average) consistently exceeds the number of logical CPU cores (for example, a load average greater than 4 on a 4-core machine).
CloudMonitor has triggered high-load alerts by SMS or email.
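The load-versus-cores comparison described above can be sketched as a small shell helper. This is a minimal illustration, not part of any monitoring product: load_status is a hypothetical function name, and the rule it applies (load average greater than the core count) is the one stated in the symptom list.

```shell
# load_status is a hypothetical helper name used only for illustration.
# $1 = load average (for example 5.20), $2 = number of logical CPU cores
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        # "+ 0" forces a numeric comparison
        if (load + 0 > cores + 0) print "overloaded"
        else print "ok"
    }'
}
```

On a live Linux instance, one way to call it is: load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". A single reading is only a snapshot; the symptom described above is a load average that stays above the core count.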
Causes
CPU-intensive processes: Specific processes consume excessive CPU resources due to application logic issues such as infinite loops, complex computational tasks, or high-concurrency requests.
I/O bottleneck: Frequent disk read/write operations or insufficient storage performance cause processes to wait a long time for I/O, increasing the system load average.
Kernel or system calls: Frequent context switching, kernel tasks, or driver issues lead to high CPU utilization in kernel space.
Malicious or abnormal programs: The instance is infected with a cryptocurrency miner or a Trojan, or rootkit-hidden processes are consuming significant computing resources.
Solution
First, use the top tool to identify the specific metric (user space, kernel space, or I/O wait) that is causing the high CPU utilization or system load. Then, use tools like perf, iotop, or vmstat to perform a deeper analysis based on the metric type. Finally, take appropriate action to optimize or resolve the issue.
Step 1: Identify the CPU bottleneck
- Log on to the ECS instance using a VNC connection.
  - Go to ECS console - Instances. In the top navigation bar, select the target region and resource group.
  - Go to the details page of the target instance. Click Connect and select VNC. Enter the username and password to log on to the ECS instance.
- Check the system load and process activity.
    sudo top
- Identify the cause.
  In the top interactive interface, press P to sort processes by CPU utilization in descending order. Identify the PID and COMMAND of the top consumer.
  - If an application process (such as java, python, or php-fpm) consistently has a CPU utilization above 80%, see Handle high CPU business processes.
  - If the I/O wait (wa) value in the %Cpu(s) line is consistently above 20%, while user space (us) and kernel space (sy) values are low, and the load average is much higher than the number of CPU cores, it indicates the CPU is often idle, waiting for disk I/O. See Handle disk I/O bottlenecks.
    When a process is waiting for disk I/O to complete, it enters the D state (uninterruptible sleep) and cannot be terminated. A large number of processes in the D state indicates a slow disk response. This forces the CPU to wait, increasing the system load.
  - If the sy (system) value in the %Cpu(s) line is consistently above 30%, it usually indicates that the kernel is frequently executing system calls or handling interrupts. See Handle high kernel or system call activity.
  - If the si (softirq) value in the %Cpu(s) line is consistently above 15%, it indicates high network traffic. See Handle high network interrupt activity.
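The threshold checks in this step can be sketched as a filter over the %Cpu(s) summary line. This is a simplified illustration: classify_cpu_line is a hypothetical helper name, it tests only the wa, sy, and si thresholds given above (20%, 30%, 15%), and it assumes the common procps-ng layout in which each value precedes its label.

```shell
# classify_cpu_line is a hypothetical helper; thresholds come from the text above.
# Reads one "%Cpu(s):" line on stdin and prints which branch of Step 1 applies.
classify_cpu_line() {
    awk '{
        gsub(/,/, " ")                    # commas become field separators
        for (i = 2; i <= NF; i++) {       # each value sits just before its label
            if ($i == "sy") sy = $(i - 1)
            if ($i == "wa") wa = $(i - 1)
            if ($i == "si") si = $(i - 1)
        }
        if (wa + 0 > 20)      print "io-wait"
        else if (sy + 0 > 30) print "kernel"
        else if (si + 0 > 15) print "softirq"
        else                  print "other"
    }'
}
```

A possible live invocation is: LC_ALL=C top -bn2 -d1 | grep '%Cpu' | tail -n 1 | classify_cpu_line. Taking the second sample avoids the first frame, which may not reflect the most recent interval. An "other" result means none of the three thresholds was exceeded, so the per-process list is the next place to look.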
Step 2: Analyze and resolve the issue
Handle high CPU business processes
Analyze and optimize code:
Use performance profiling tools to locate hot spots in your code.
Java applications: Use jstack <PID> to export thread stacks. Search for threads in the RUNNABLE state and check if the call stack is stuck in a specific method for an extended period.
C/C++ applications: Use perf top -p <PID> to view the specific function symbols consuming the most CPU.
Based on the analysis, optimize algorithms, fix infinite loops, or reduce unnecessary computations.
Upgrade resources: If the bottleneck is due to normal business growth, upgrade the instance type.
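For the Java case above, a common way to connect a hot thread to its stack is to take the thread ID shown by top -H -p <PID> and look it up in the jstack output. The sketch below assumes a JDK whose thread dumps print the native thread ID as a hexadecimal nid=0x... field; some newer JDK releases may print it in decimal, in which case no conversion is needed. tid_to_nid is a hypothetical helper name.

```shell
# tid_to_nid is a hypothetical helper name used only for illustration.
# $1 = decimal thread ID, as shown in the PID column of: top -H -p <PID>
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Usage sketch (requires a JDK and a running JVM; 12345 is a made-up thread ID):
#   jstack <PID> | grep -A 15 "nid=$(tid_to_nid 12345)"
```

The grep context of 15 lines is an arbitrary choice that usually shows enough of the call stack to see which method the thread is executing.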
Handle disk I/O bottlenecks
Identify the high-I/O process using this guide: Troubleshoot high disk I/O load on a Linux instance.
Check for a buildup of processes in the D state:
sudo ps -axjf | grep " D"
Take corrective actions:
Application optimization: Reduce log levels and add indexes to database queries to reduce disk read/write operations.
Storage upgrade: Upgrade the disk category, for example, from ESSD PL1 to ESSD PL2/PL3, to improve IOPS and throughput. Disk IOPS performance is subject to instance-level limits. If the instance type's IOPS limit is lower than the disk's capability, you must upgrade the instance type to utilize the full performance.
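System restart: If many processes are stuck in the D state and cannot be terminated, restarting the system clears them. Treat this as a last resort, because the processes can build up again if the underlying I/O bottleneck remains.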
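The D-state check above matches the string " D" anywhere in the line, so it can also match unrelated text such as command names. A stricter variant, sketched here, filters on the process state column only. count_dstate is a hypothetical helper name, and the ps options assume the procps version of ps found on most Linux distributions.

```shell
# count_dstate is a hypothetical helper name used only for illustration.
# Reads process state codes (one per line) on stdin and counts those
# that start with D (uninterruptible sleep).
count_dstate() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Live usage:
#   ps -eo stat= | count_dstate
# To list the processes themselves, with the kernel function they wait in:
#   ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```

A count that stays above zero across several runs is more meaningful than a single reading, because processes pass through the D state briefly during normal I/O.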
Handle high kernel or system call activity
Check for context switching. Run the vmstat 1 command and observe the value in the cs (context switch) column. If the value consistently exceeds 100,000, it indicates excessive context switching. Check if the application is creating and destroying too many threads.
Check for kernel tasks. If the kswapd0 process shows high CPU utilization, physical memory is insufficient, and the kernel is frequently reclaiming memory. Consider upgrading the instance type.
When physical memory is low, kswapd0 frequently scans, reclaims, and swaps out memory pages. These compute-intensive tasks consume significant CPU resources, which causes high utilization.
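The context-switch check above can be sketched as a filter that averages the cs column of vmstat output. avg_context_switches is a hypothetical helper name. The sketch locates the column by its header rather than by position, and skips the first data row, which vmstat reports as an average since boot.

```shell
# avg_context_switches is a hypothetical helper name used only for illustration.
# Reads vmstat output on stdin and prints the mean of the "cs" column,
# ignoring the first data row (averages since boot).
avg_context_switches() {
    awk '
        col == 0 { for (i = 1; i <= NF; i++) if ($i == "cs") col = i; next }
        $col ~ /^[0-9]+$/ { if (++row > 1) { sum += $col; n++ } }
        END { if (n) printf "%d\n", sum / n; else print 0 }
    '
}

# Live usage (six samples, one second apart):
#   vmstat 1 6 | avg_context_switches
```

A result that stays above the 100,000 figure mentioned above across repeated runs points to the thread creation and destruction pattern described in the text.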
Handle high network interrupt activity
Analyze traffic: Use tools like iftop or iptraf-ng to analyze the source and type of network traffic.
Check the configuration: For high network workloads, enable multi-queue for the network interface card (NIC) to distribute interrupts across multiple CPU cores.
Perform a security check: Go to Security Center to check for network attacks.
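The multi-queue check above can be sketched with ethtool, which reports the maximum and current number of combined queues for an interface. queue_headroom is a hypothetical helper name, eth0 is an assumed interface name, and whether multi-queue is available depends on the instance type and NIC driver.

```shell
# queue_headroom is a hypothetical helper name used only for illustration.
# Reads `ethtool -l <iface>` output on stdin. The first "Combined:" line is
# the pre-set maximum; the second is the current hardware setting.
queue_headroom() {
    awk '$1 == "Combined:" { v[++n] = $2 }
         END { if (n >= 2) printf "max=%s current=%s\n", v[1], v[2] }'
}

# Live usage (eth0 is an assumed interface name):
#   ethtool -l eth0 | queue_headroom
# If current is lower than max, the queue count can be raised, for example:
#   sudo ethtool -L eth0 combined 4
```

The value 4 in the last command is only an example; use the maximum reported for the interface.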
Recommendations
Configure monitoring and alerting: Set reasonable alert thresholds for metrics such as CPU utilization, system load, and I/O wait for early warnings. For historical analysis, use the atop tool to record and review Linux system metrics.
Perform regular security inspections: Periodically use Security Center to scan for vulnerabilities, detect and remove viruses, and run a baseline check to fix potential security risks.
Perform regular performance audits: Conduct periodic performance and code reviews to identify and resolve potential bottlenecks.
Plan capacity: Plan your capacity based on business growth trends to ensure system resources can handle future load.