CPU utilization is a key indicator of business system health. To ensure stability, CPU utilization must remain within a reasonable range. Excessively high CPU utilization can cause service latency or even outages. You can inject high CPU utilization faults into an ECS instance to test how the business system responds to specific CPU loads, evaluate system recovery capabilities, and verify the effectiveness of monitoring and alert mechanisms. Based on the drill results, you can develop response strategies. This ensures that the system resumes normal operation quickly when high CPU utilization occurs in the production environment, reducing the risk of business interruption.
How it works
This solution uses the Cloud Assistant plugin ecs-fault-highcpu. The plugin starts the AliFaultHighCpu process to consume CPU time slices at a specified duty cycle.
Instructions
Prerequisites
Cloud Assistant Agent is installed on the ECS instance for which you want to perform a drill.
The status of Cloud Assistant is Normal on the ECS instance. For more information, see Check Cloud Assistant status and handle anomalies.
Inject a fault
As a user with sudo access privileges, run the ecs-fault-highcpu Cloud Assistant plugin.
sudo acs-plugin-manager --exec --plugin ecs-fault-highcpu --params inject,[cpu-percent=paramA],[cpu-list=paramB]
The parameters in [] are optional.
cpu-percent (optional): The target CPU utilization percentage. If not specified, the default value is 100.
Note: The cpu-percent value represents the CPU utilization of the injection process. The instance's total CPU utilization also includes the load from other running processes.
cpu-list (optional): The specific vCPU cores to target. For example, cpu-list=0-2/4 applies the load to vCPU cores 0, 1, 2, and 4. If not specified, the load is applied to all vCPU cores.
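To make the parameter syntax concrete, the following sketch builds the value passed to --params and expands a cpu-list expression into individual core numbers. The helper names build_inject_params and expand_cpu_list are hypothetical and are not part of the plugin. The expansion assumes that / separates entries and - marks an inclusive range, as the 0-2/4 example suggests; the plugin's own parser may accept other forms.

```shell
# Hypothetical helpers for illustration; not part of ecs-fault-highcpu.

# Build the --params value. Empty arguments are left out, so the plugin
# defaults apply (cpu-percent=100, all vCPU cores).
build_inject_params() (
  params="inject"
  if [ -n "$1" ]; then params="$params,cpu-percent=$1"; fi
  if [ -n "$2" ]; then params="$params,cpu-list=$2"; fi
  echo "$params"
)

# Expand a cpu-list expression such as 0-2/4 into "0 1 2 4".
# Assumption: "/" separates entries and "-" marks an inclusive range.
expand_cpu_list() (
  out=""
  IFS='/'
  for entry in $1; do
    case "$entry" in
      *-*)
        i=${entry%-*}
        end=${entry#*-}
        while [ "$i" -le "$end" ]; do
          out="$out $i"
          i=$((i + 1))
        done
        ;;
      *)
        out="$out $entry"
        ;;
    esac
  done
  echo "${out# }"
)

# Example (values chosen for illustration): 50% load on vCPU cores 0, 1, 2, and 4.
# sudo acs-plugin-manager --exec --plugin ecs-fault-highcpu \
#   --params "$(build_inject_params 50 0-2/4)"
```

Both functions run in a subshell, so the temporary IFS change and helper variables do not leak into the calling shell.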
Verify that the fault injection was successful.
On the ECS instance, run the top command. A successful injection increases CPU utilization. The sum of CPU time spent in kernel mode (sy) and user mode (us) should approximate the specified cpu-percent value.
In the CloudMonitor CPU utilization chart, verify that CPU utilization increases after the fault is injected.
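As a scripted alternative to reading top interactively, the us + sy figure can be estimated from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch for a Linux instance; the function name cpu_us_sy_percent and the 5-second sampling interval are assumptions chosen for illustration. The result is averaged over all vCPU cores, so it will likely be lower than cpu-percent when cpu-list restricts the load to a subset of cores.

```shell
# Hypothetical helper: percentage of CPU time spent in user (us) plus
# kernel (sy) mode between two samples of the "cpu" line in /proc/stat.
cpu_us_sy_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # Fields: cpu user nice system idle iowait irq softirq steal ...
      total = 0
      for (f = 2; f <= 9; f++) total += $f
      busy = $2 + $4            # user + system
      if (NR == 1) { t0 = total; b0 = busy }
      else         { dt = total - t0; db = busy - b0 }
    }
    END { if (dt > 0) printf "%.0f\n", 100 * db / dt; else print 0 }'
}

# Usage on the ECS instance:
# before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
# cpu_us_sy_percent "$before" "$after"
```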

Recover from the fault
Use one of the following methods to recover the ECS instance.
Method 1 (Recommended): Run the fault recovery command on the ECS instance and verify that CPU utilization drops to its pre-injection level.
sudo acs-plugin-manager --exec --plugin ecs-fault-highcpu --params recover
As shown in the figure below, CPU utilization has dropped to its pre-injection level, indicating that the system has returned to a normal state.

Method 2: Terminate the process named AliFaultHighCpu. To prevent issues with subsequent fault injections, run the recovery command from Method 1 after you terminate the AliFaultHighCpu process.
sudo kill <AliFaultHighCpu PID>
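Method 2 can be sketched as a small script that looks up the PID, terminates the process, and then runs the recovery command from Method 1. The function name stop_highcpu is hypothetical, and the sketch assumes that pgrep is available on the instance.

```shell
# Sketch of Method 2; stop_highcpu is a hypothetical helper name.
stop_highcpu() {
  # pgrep -x matches the exact process name and prints one PID per line.
  pids=$(pgrep -x AliFaultHighCpu)
  if [ -z "$pids" ]; then
    echo "AliFaultHighCpu is not running"
    return 0
  fi
  # $pids is intentionally unquoted so that each PID becomes its own argument.
  sudo kill $pids
  # Run the recovery command afterwards so that later injections are not affected.
  sudo acs-plugin-manager --exec --plugin ecs-fault-highcpu --params recover
}

# Usage:
# stop_highcpu
```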