Server downtime is a common issue caused by software and hardware abnormalities, and is virtually unavoidable. This topic describes how to simulate a kernel fault on an Elastic Compute Service (ECS) instance to cause downtime. The drill lets you test how your business system responds to downtime, inspect its recovery capabilities, and verify that monitoring and alert mechanisms work as expected. You can then develop response strategies based on the drill results, so that the system can quickly resume normal operation after downtime occurs in the production environment, which reduces the risk of business interruption.
Implementation
A downtime drill uses the ecs-fault-oscrash Cloud Assistant plugin to trigger a kernel panic by using the sysrq module. Then, the system automatically restarts and resumes normal operation.
The drill may affect your business. Make sure that your business system has high availability (HA) capabilities and kdump is enabled. For information about how to enable kdump, see How do I enable the kdump service on a Linux instance?
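Before you run the drill, a quick local check can confirm that a kdump capture kernel is loaded. The following is a minimal sketch, not part of the plugin: it reads /sys/kernel/kexec_crash_loaded, which the Linux kernel sets to 1 when a crash kernel is loaded, and prints the current Magic SysRq setting. The function name kdump_state is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Pre-drill check (a sketch): report whether a kdump capture kernel appears to be loaded.
# The kernel exposes "1" in /sys/kernel/kexec_crash_loaded when a crash kernel is loaded.

kdump_state() {
    # $1: path to the kexec_crash_loaded file (a parameter so the function can be tested)
    file="${1:-/sys/kernel/kexec_crash_loaded}"
    if [ -r "$file" ] && [ "$(cat "$file")" = "1" ]; then
        echo "loaded"
    else
        echo "not-loaded"
    fi
}

echo "kdump capture kernel: $(kdump_state)"

# Print the Magic SysRq setting for reference; see the kernel documentation
# for the meaning of each value.
if [ -r /proc/sys/kernel/sysrq ]; then
    echo "kernel.sysrq = $(cat /proc/sys/kernel/sysrq)"
fi
```

If the result is not-loaded, enable kdump as described in the linked topic before you inject the fault. Otherwise, no crash dump is expected to be captured for later analysis.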
Procedure
Prerequisites
Cloud Assistant Agent is installed on the ECS instance for which you want to perform a drill.
The status of Cloud Assistant is Normal on the ECS instance. For more information, see View the status of Cloud Assistant and handle anomalies.
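As a local complement to the console check, you can confirm on the instance that the agent service is running. The following sketch assumes that Cloud Assistant Agent runs as the systemd unit aliyun.service, which may differ on your image. It does not replace the Normal status shown in the console, and the helper name describe_state is hypothetical.

```shell
#!/bin/sh
# Sketch: local check that the Cloud Assistant Agent service is running.
# ASSUMPTION: the agent runs as the systemd unit "aliyun.service"; verify this on your image.
UNIT="aliyun.service"

describe_state() {
    # Map the output of `systemctl is-active` to a short message.
    case "$1" in
        active) echo "agent service is running" ;;
        *)      echo "agent service is not running (state: ${1:-unknown})" ;;
    esac
}

if command -v systemctl >/dev/null 2>&1; then
    # `systemctl is-active` exits non-zero for inactive units, so ignore its exit code.
    state=$(systemctl is-active "$UNIT" 2>/dev/null || true)
    describe_state "$state"
else
    echo "systemctl not found; check the agent with your init system's tools"
fi
```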
Inject a fault
Connect to the ECS instance as a user with sudo privileges.
For more information, see Use Workbench to connect to a Linux instance over SSH.
Run the ecs-fault-oscrash Cloud Assistant plugin.
sudo acs-plugin-manager --exec --plugin ecs-fault-oscrash --params inject
The following command output indicates that the ecs-fault-oscrash plugin is run.
Check whether a fault is injected.
If an event of the Instance Restart Due To Instance Error type appears in unexpected O&M events, a fault is injected.

On the ECS instance, run the uptime command to check the system runtime and determine whether the ECS instance restarted.
The following command output shows that the ECS instance restarted at 18:21:46, indicating that a fault is injected.
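The restart check can also be scripted, for example as part of an automated drill report. The following is a minimal sketch that reads the seconds since boot from /proc/uptime and compares the value with a time window. The 900-second window and the function names are assumptions chosen for illustration, so adjust them to the time that has passed since you injected the fault.

```shell
#!/bin/sh
# Sketch: decide whether the instance rebooted recently, based on /proc/uptime.

uptime_seconds() {
    # The first field of /proc/uptime is the number of seconds since boot
    # (with a fractional part); keep only the integer part.
    # $1: path to the uptime file (a parameter so the function can be tested)
    cut -d. -f1 "${1:-/proc/uptime}"
}

rebooted_within() {
    # $1: uptime in seconds, $2: window in seconds
    if [ "$1" -le "$2" ]; then
        echo "yes"
    else
        echo "no"
    fi
}

WINDOW=900  # hypothetical window: 15 minutes since the fault was injected

if [ -r /proc/uptime ]; then
    echo "rebooted in the last ${WINDOW}s: $(rebooted_within "$(uptime_seconds)" "$WINDOW")"
fi
```

On most Linux distributions, uptime -s or who -b prints the boot time directly, which you can compare with the time at which you ran the plugin.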

Recover from the fault
In this drill, the ECS instance automatically restarts and resumes normal operation. If the ECS instance fails to restart, you can force restart it in the ECS console. For more information, see Restart an instance.