Perform a downtime drill to test service performance on an ECS instance

Server downtime is a common issue caused by software and hardware abnormalities, and is virtually unavoidable. This topic describes how to simulate a kernel fault on an Elastic Compute Service (ECS) instance to cause downtime. The drill lets you test how your business system responds to downtime, inspect its recovery capabilities, and verify that monitoring and alert mechanisms work as expected. You can then develop response strategies based on the drill results. This helps the system quickly resume normal operation when downtime occurs in the production environment and reduces the risk of business interruption.

Implementation

The downtime drill uses the ecs-fault-oscrash Cloud Assistant plugin, which triggers a kernel panic by using the sysrq module. The system then restarts automatically and resumes normal operation.
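For readers unfamiliar with sysrq, the following is a minimal sketch of the generic Linux interface for forcing a kernel crash. It is an assumption that the plugin relies on this interface in this form; the code is not the plugin's implementation. The `trigger_crash` function and its `TRIGGER_FILE` parameter are hypothetical names, added so that the function can be pointed at a scratch file for a dry run.

```shell
#!/bin/sh
# Sketch of the generic Linux sysrq crash trigger.
# Assumption: this illustrates the standard kernel interface only and is
# NOT the actual code of the ecs-fault-oscrash plugin.

# trigger_crash [TRIGGER_FILE]
# Writes the sysrq command 'c' to the trigger file. When the target is the
# real /proc/sysrq-trigger and the caller is root, the kernel crashes
# immediately. TRIGGER_FILE is a hypothetical parameter for dry runs.
trigger_crash() {
    trigger_file="${1:-/proc/sysrq-trigger}"
    # Flush dirty pages first so that less data is lost in the panic.
    sync
    echo c > "$trigger_file"
}
```

Calling `trigger_crash /tmp/sysrq-dry-run` only writes the letter `c` to a scratch file. Calling it without an argument as root crashes the instance, so for the drill itself, use the plugin command described in the procedure.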

Important

The drill may affect your business. Before you start, make sure that your business system has high availability (HA) capabilities and that kdump is enabled on the instance. For information about how to enable kdump, see How do I enable the kdump service on a Linux instance?
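As a quick pre-flight check, the sketch below tests whether a kdump crash kernel is currently loaded by reading the kernel's `/sys/kernel/kexec_crash_loaded` flag, which contains `1` when a crash kernel is loaded. The function name `crash_kernel_loaded` and its optional file parameter are hypothetical. The check complements the linked kdump instructions and does not replace them.

```shell
#!/bin/sh
# Pre-flight sketch: is a kdump crash kernel loaded?

# crash_kernel_loaded [STATE_FILE]
# Returns 0 if STATE_FILE contains "1". STATE_FILE defaults to the kernel's
# sysfs flag; the parameter is hypothetical and exists only for testing.
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}

if crash_kernel_loaded; then
    echo "kdump crash kernel is loaded"
else
    echo "kdump crash kernel is NOT loaded; enable kdump before the drill"
fi
```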

Procedure

Prerequisites

Cloud Assistant Agent is installed on the ECS instance for which you want to perform a drill.
The status of Cloud Assistant is Normal on the ECS instance. For more information, see View the status of Cloud Assistant and handle anomalies.

Inject a fault

Connect to the ECS instance as a user with sudo privileges.
For more information, see Use Workbench to connect to a Linux instance over SSH.
Run the ecs-fault-oscrash Cloud Assistant plugin.
```
sudo acs-plugin-manager --exec --plugin ecs-fault-oscrash --params inject
```
The command output indicates that the ecs-fault-oscrash plugin has run.
Check whether the fault was injected.
- If an event of the Instance Restart Due To Instance Error type appears in unexpected O&M events, the fault was injected.
- On the ECS instance, run the uptime command to check the system runtime and determine whether the ECS instance restarted.
  In this example, the command output shows that the ECS instance restarted at 18:21:46, which indicates that the fault was injected.
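The restart check can also be scripted. The sketch below compares the kernel's boot ID (`/proc/sys/kernel/random/boot_id`, which changes on every boot) before and after the drill. The function names and the snapshot path `/var/tmp/drill_boot_id` are hypothetical. The snapshot is assumed to live on a persistent disk so that it survives the restart.

```shell
#!/bin/sh
# Sketch: detect a restart by comparing boot IDs before and after the drill.

# save_boot_id SNAPSHOT_FILE [BOOT_ID_FILE]
# Stores the current boot ID in SNAPSHOT_FILE. Run this before injecting
# the fault. BOOT_ID_FILE is a hypothetical parameter for testing.
save_boot_id() {
    cat "${2:-/proc/sys/kernel/random/boot_id}" > "$1"
    # Flush to disk so that the snapshot survives the kernel panic.
    sync
}

# rebooted_since SNAPSHOT_FILE [BOOT_ID_FILE]
# Returns 0 if the current boot ID differs from the saved one.
rebooted_since() {
    [ "$(cat "$1")" != "$(cat "${2:-/proc/sys/kernel/random/boot_id}")" ]
}

# Hypothetical usage:
#   save_boot_id /var/tmp/drill_boot_id        # before running the plugin
#   rebooted_since /var/tmp/drill_boot_id \
#       && echo "instance restarted"            # after reconnecting
```

A boot ID comparison avoids reading timestamps by hand, but it only shows that a restart happened. The O&M event remains the way to confirm the cause.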

Recover from the fault

In this drill, the ECS instance automatically restarts and resumes normal operation. If the ECS instance fails to restart, you can force restart it in the ECS console. For more information, see Restart an instance.