Resolve high CPU utilization or load on a Linux instance

Problem description

High CPU utilization or system load can cause the following symptoms:

Service disruptions
- Slow response or unresponsiveness during SSH remote connections, and in severe cases, connection failures.
- Slow page loads and increased response times for websites or applications.
- Frequent request timeouts, API call failures, and reduced transaction processing capacity.
System resource anomalies
- Sustained CPU utilization above 80%, sometimes approaching 100%.
- The system load (load average) consistently exceeds the number of logical CPU cores (for example, a load average greater than 4 on a 4-core machine).
- You receive high-load alerts via SMS or email from the cloud monitoring platform.
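The load-versus-cores comparison above can be checked from the command line. The following is a minimal sketch; the helper name load_exceeds_cores is hypothetical, and the commented usage line assumes that /proc/loadavg and nproc are available on the instance.

```shell
# Succeeds (exit status 0) when the load average is greater than the core count.
# $1: load average, for example 5.20
# $2: number of logical CPU cores
load_exceeds_cores() {
  awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Usage on a live system. Field 3 of /proc/loadavg is the 15-minute average,
# which reflects sustained rather than momentary load:
#   load_exceeds_cores "$(cut -d ' ' -f 3 /proc/loadavg)" "$(nproc)" && echo "load is high"
```

The 15-minute average is used in the usage line because the symptom describes a load that stays above the core count, not a short spike.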

Causes

Compute-intensive processes: A specific process consumes excessive CPU resources due to code logic issues (such as infinite loops), complex computational tasks, or high-concurrency requests.
I/O bottlenecks: Frequent disk I/O or poor storage performance can cause long I/O wait times, increasing the system load average.
Kernel or system calls: Frequent context switching, kernel tasks, or driver issues lead to high CPU utilization in system mode.
Malicious software: Cryptocurrency miners, trojans, or rootkits can consume significant resources. These processes are often hidden to evade detection.

Solutions

First, use the top tool to identify the specific metric causing the high CPU usage or load (user space, kernel space, or I/O wait). Then, use tools like perf, iotop, or vmstat to perform a deeper analysis based on the metric type. Finally, optimize or resolve the issue.

Step 1: Identify the CPU bottleneck metric

Log on to an ECS instance using a VNC connection.
1. Go to ECS console - Instances. In the top navigation bar, select the target region and resource group.
2. Go to the details page of the target instance. Click Connect and select VNC. Enter the username and password to log on to the ECS instance.
View the system load and process activity.
```
sudo top
```
Identify the cause of the issue.
In top, press the P key to sort processes by CPU utilization in descending order. Note the PID and COMMAND of the process that is consuming the most CPU.
- If a business process (such as java, python, or php-fpm) consistently shows CPU utilization above 80%, see Handle high CPU business processes.
- If the I/O wait (wa) in the %Cpu(s) line is consistently above 20%, while user space (us) and kernel space (sy) usage are low, and the load average far exceeds the number of CPU cores, it indicates that the CPU is often idle while waiting for disk responses. In this case, see Handle disk I/O bottlenecks.
  When a process waits for a disk I/O operation to complete, it enters the D state (uninterruptible sleep). In this state the process does not respond to signals, so it cannot be terminated until the I/O operation completes. A large number of processes in the D state indicates a slow disk response, which keeps tasks queued and increases the system load.
- If sy (system) in the %Cpu(s) line is consistently above 30%, it usually means the kernel is frequently executing system calls or handling interrupts. See Handle high kernel or system call activity.
- If si (softirq) in the %Cpu(s) line is consistently above 15%, it usually indicates heavy network traffic, because most packet processing runs in soft interrupts. See Handle high network interrupt activity.
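The threshold checks on the %Cpu(s) line can be sketched as a small script. This is an illustration only: the function name classify_cpu_line and the category labels are hypothetical, it simplifies the I/O wait rule to the wa value alone, and it assumes top prints decimal points (run top with LC_ALL=C if your locale uses decimal commas). The per-process check for business processes still has to be done in the process list.

```shell
# Classify a "%Cpu(s)" summary line from top using the thresholds in this step.
# $1: the summary line, for example
#     "%Cpu(s):  5.0 us,  2.0 sy,  0.0 ni, 60.0 id, 30.0 wa,  0.0 hi,  3.0 si,  0.0 st"
classify_cpu_line() {
  printf '%s\n' "$1" | awk '
    {
      sub(/^[^:]*:/, "")                 # drop the "%Cpu(s):" prefix
      n = split($0, parts, ",")          # one "value label" pair per part
      for (i = 1; i <= n; i++) {
        if (split(parts[i], kv, " ") >= 2) val[kv[2]] = kv[1] + 0
      }
      if (val["wa"] > 20)      print "io-wait"   # see Handle disk I/O bottlenecks
      else if (val["sy"] > 30) print "kernel"    # see Handle high kernel or system call activity
      else if (val["si"] > 15) print "softirq"   # see Handle high network interrupt activity
      else                     print "none"      # no summary threshold exceeded
    }'
}

# Usage on a live system (batch mode, single iteration):
#   classify_cpu_line "$(LC_ALL=C top -b -n 1 | grep -m 1 'Cpu(s)')"
```

A single sample only shows the current moment. Because the thresholds apply to sustained values, repeat the check several times before drawing a conclusion.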

Step 2: Analyze and resolve the issue

Handle high CPU business processes

Analyze and optimize code:
Use performance analysis tools to identify code hotspots.
- Java applications: Use jstack <PID> to export the thread stack. Search for threads in the RUNNABLE state and examine their call stacks to identify long-running or stuck methods.
- C/C++ applications: Use perf top -p <PID> to view the function symbols that are consuming the most CPU.
Based on the analysis, optimize algorithms, fix infinite loops, or reduce unnecessary computations.
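For Java applications, a common way to narrow the jstack output is to find the busiest thread with top -H -p <PID> and then look up that thread in the dump by its native thread ID (the nid field). The sketch below assumes this workflow; the function names are hypothetical. Depending on the JDK version, nid may be printed in hexadecimal or in decimal, so the pattern accepts both. Check the output of your own JDK.

```shell
# Convert a decimal thread ID (as shown by top -H) to the 0x-prefixed
# hexadecimal form used in the nid field of many jstack dumps.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the stack of one thread from a jstack dump.
# $1: PID of the Java process
# $2: decimal thread ID of the busy thread, taken from top -H -p <PID>
hot_java_thread_stack() {
  pid=$1
  tid=$2
  # Match either the hexadecimal or the decimal form of the thread ID.
  jstack "$pid" | grep -E -A 20 "nid=($(tid_to_nid "$tid")|$tid) "
}
```

For example, if top -H -p 4242 shows thread 12345 at the top, hot_java_thread_stack 4242 12345 prints the header line of that thread and the 20 lines that follow it.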
Upgrade resources: If the bottleneck is due to normal business growth, upgrade the instance type.

Handle disk I/O bottlenecks

Identify the process with high I/O. For more information, see Troubleshoot high disk I/O load on a Linux instance.
Check for processes in the D state:
```
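# List processes whose state starts with D (the header line is kept).
ps -eo pid,ppid,stat,wchan:32,comm | awk 'NR == 1 || $3 ~ /^D/'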
```
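To gauge how many processes are in the D state without scanning the listing by eye, a counting helper can be sketched as follows. The function name count_d_state is hypothetical. It reads one process state per line, which is the output format of ps -eo stat= on systems that use procps.

```shell
# Count the states that start with D (uninterruptible sleep).
# stdin: one process state per line, for example the output of `ps -eo stat=`
count_d_state() {
  awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Usage on a live system:
#   ps -eo stat= | count_d_state
```

A count that stays high across repeated runs is more meaningful than a single sample, because processes pass through the D state briefly during normal I/O.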
Take the following actions:
- Application optimization: Reduce logging levels or add indexes to database queries to minimize disk reads and writes.
- Storage upgrade: Upgrade the disk category, for example, from ESSD PL1 to ESSD PL2/PL3, to improve IOPS and throughput. Disk IOPS performance is subject to instance-level limits. If the instance type's IOPS limit is lower than the disk's capability, you must upgrade the instance type to utilize the full performance.
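- System restart: Processes in the D state cannot be killed. If many processes remain stuck in the D state, restarting the system clears them. Treat this as a last resort, because the problem can recur if the underlying I/O bottleneck is not addressed.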

Handle high kernel or system call activity

Check for context switching. Run the vmstat 1 command and observe the value in the cs (context switch) column. If the value consistently exceeds 100,000, it indicates excessive context switching. Check if the application is creating and destroying too many threads.
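Because the threshold applies to a sustained value, it can help to average the cs column over several samples instead of reading it by eye. The following sketch uses a hypothetical function name, avg_context_switches. It locates the cs column from the vmstat header and skips the first data row, which reports averages since boot.

```shell
# Average the cs (context switches per second) column of vmstat output.
# stdin: output of a command such as `vmstat 1 10`
avg_context_switches() {
  awk '
    # Header row: remember which column is labelled "cs".
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "cs") col = i; next }
    # Data rows start with a number.
    col && $1 ~ /^[0-9]+$/ {
      rows++
      if (rows == 1) next          # first row is the average since boot
      sum += $col
      n++
    }
    END { if (n) printf "%d\n", sum / n; else print 0 }
  '
}

# Usage on a live system (10 samples at 1-second intervals):
#   vmstat 1 10 | avg_context_switches
```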
Check for kernel tasks. High kswapd0 utilization indicates that the system is low on physical memory and the kernel is frequently performing memory reclamation. In this case, upgrade the instance type.
When physical memory is insufficient, kswapd0 frequently scans pages, reclaims memory, and swaps pages out. These compute-intensive tasks consume a large amount of CPU resources, leading to high utilization.
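To corroborate that high kswapd0 utilization stems from memory pressure, you can compare MemAvailable with MemTotal in /proc/meminfo. The sketch below uses a hypothetical function name, mem_available_pct, and assumes a kernel that reports the MemAvailable field. It prints the percentage only; what counts as too low depends on the workload.

```shell
# Print available memory as an integer percentage of total memory.
# stdin: text in /proc/meminfo format
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total) printf "%d\n", avail * 100 / total }
  '
}

# Usage on a live system:
#   mem_available_pct < /proc/meminfo
```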

Handle high network interrupt activity

Analyze traffic: Use tools like iftop or iptraf-ng to analyze the source and type of network traffic.
Check the configuration: For high network workloads, enable multi-queue for the network interface card (NIC) to distribute interrupts across multiple CPU cores.
Perform a security check: Go to Security Center to check for network attacks.
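Before changing the multi-queue configuration, it can be useful to see whether receive softirqs are concentrated on one CPU core. The sketch below reads the NET_RX row of /proc/softirqs and prints the share handled by the busiest core. The function name net_rx_busiest_share is hypothetical, and the counters are cumulative since boot, so the result describes the long-term distribution rather than the current moment.

```shell
# Print the percentage of NET_RX softirqs handled by the busiest CPU core.
# stdin: text in /proc/softirqs format
net_rx_busiest_share() {
  awk '
    $1 == "NET_RX:" {
      for (i = 2; i <= NF; i++) {
        total += $i
        if ($i > max) max = $i
      }
    }
    END { if (total) printf "%d\n", max * 100 / total }
  '
}

# Usage on a live system:
#   net_rx_busiest_share < /proc/softirqs
# The current and maximum queue counts of a NIC can be inspected with
# ethtool -l <interface>. The interface name depends on your instance.
```

On a 4-core instance, a value near 25 suggests an even spread, and a value near 100 suggests that a single core handles almost all receive processing.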

Recommendations

Configure monitoring and alerting: Set appropriate alert thresholds for metrics such as CPU utilization, system load, and I/O wait to receive early warnings. To review and analyze historical system metrics, use the atop tool.
Perform regular security inspections: Periodically use Security Center to scan for vulnerabilities, detect and remove viruses, and run a baseline check to fix potential security risks.
Perform regular performance audits: Periodically audit code and system configurations to identify and resolve potential bottlenecks before they impact production.
Plan capacity: Plan capacity based on business growth trends to ensure that system resources can handle future load increases.