Elastic Compute Service: Downtime drill

Last Updated: Apr 14, 2025

Server downtime, caused by software or hardware faults, is common and virtually unavoidable. This topic describes how to simulate a kernel fault on an Elastic Compute Service (ECS) instance to cause downtime, so that you can test how your business system responds, check its recovery capabilities, and verify that your monitoring and alerting mechanisms work. You can then develop response strategies based on the drill results. This helps the system resume normal operation quickly after downtime occurs in the production environment and reduces the risk of business interruption.

Implementation

A downtime drill uses the ecs-fault-oscrash Cloud Assistant plugin, which triggers a kernel panic by using the sysrq module. The system then restarts automatically and resumes normal operation.

Important

The drill may affect your business. Before you start, make sure that your business system has high availability (HA) capabilities and that kdump is enabled. For information about how to enable kdump, see How do I enable the kdump service on a Linux instance?
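Before you run the drill, it can help to confirm on the instance itself that a crash kernel is actually loaded. The following is a minimal sketch, not part of the official procedure. It assumes that the kernel exposes /sys/kernel/kexec_crash_loaded (1 means a crash kernel is loaded), and the function name crash_kernel_loaded is made up for this example.

```shell
#!/bin/sh
# Pre-drill check (sketch): is a crash kernel loaded so that kdump can capture a dump?
# Assumption: /sys/kernel/kexec_crash_loaded exists and contains 1 when kdump is armed.

crash_kernel_loaded() {
    # $1: optional path to the status file, overridable so the logic can be tried without root
    status_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$status_file" ] && [ "$(cat "$status_file")" = "1" ]
}

if crash_kernel_loaded; then
    echo "kdump: crash kernel loaded"
else
    echo "kdump: crash kernel NOT loaded; enable kdump before the drill"
fi
```

If the check reports that no crash kernel is loaded, enable kdump as described in the linked topic and run the check again before you inject the fault.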

Procedure

Prerequisites

Inject a fault

  1. Connect to the ECS instance as a user with sudo privileges.

    For more information, see Use Workbench to connect to a Linux instance over SSH.

  2. Run the ecs-fault-oscrash Cloud Assistant plugin.

    sudo acs-plugin-manager --exec --plugin ecs-fault-oscrash --params inject

    The following command output indicates that the ecs-fault-oscrash plugin has run.

    [Screenshot: command output of the ecs-fault-oscrash plugin]

  3. Check whether a fault is injected.

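    • In the ECS console, check the unexpected O&M events. If an event of the Instance Restart Due To Instance Error type appears, the fault was injected.

      [Screenshot: Instance Restart Due To Instance Error event in the unexpected O&M events list]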

    • On the ECS instance, run the uptime command to check how long the system has been running and determine whether the instance restarted.

      The following command output shows that the ECS instance restarted at 18:21:46, which indicates that the fault was injected.

      [Screenshot: uptime command output]
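As a complement to the uptime check, you can detect the restart by comparing the kernel boot ID before and after the drill. This is a minimal sketch under stated assumptions: it relies on /proc/sys/kernel/random/boot_id, which Linux regenerates on every boot, and on uptime -s from procps-ng. The state file path and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: detect a reboot by comparing the kernel boot ID before and after the drill.
# STATE_FILE is a hypothetical location; it must survive the reboot (so not a tmpfs path).
STATE_FILE="${STATE_FILE:-$HOME/.downtime-drill-boot-id}"
BOOT_ID_FILE="${BOOT_ID_FILE:-/proc/sys/kernel/random/boot_id}"

save_boot_id() {
    # Record the boot ID of the currently running system.
    cat "$BOOT_ID_FILE" > "$STATE_FILE"
}

rebooted_since_save() {
    # Succeeds (exit status 0) when the current boot ID differs from the saved one.
    [ -f "$STATE_FILE" ] && [ "$(cat "$STATE_FILE")" != "$(cat "$BOOT_ID_FILE")" ]
}

case "${1:-}" in
    save)
        save_boot_id && echo "boot ID saved to $STATE_FILE"
        ;;
    check)
        if rebooted_since_save; then
            echo "instance rebooted; up since $(uptime -s)"
        else
            echo "no reboot detected since the boot ID was saved"
        fi
        ;;
esac
```

Saved under a name of your choice, for example drill-check.sh, the script would be run with the save argument before you inject the fault and with the check argument after you reconnect to the instance.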

Recover from the fault

In this drill, the ECS instance automatically restarts and resumes normal operation. If the ECS instance fails to restart, you can force restart it in the ECS console. For more information, see Restart an instance.
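If you prefer the command line to the console, a forced restart can also be requested through the RebootInstance API operation with ForceStop set to true. The sketch below only builds and prints the command for review. It assumes that the aliyun CLI is installed and configured with credentials, and the instance ID and region ID shown are placeholders, not values from this topic.

```shell
#!/bin/sh
# Sketch: build a forced-restart command for the aliyun CLI.
# Assumptions: the aliyun CLI is installed and configured; the IDs below are placeholders.

force_restart_cmd() {
    # $1: instance ID, $2: region ID
    printf 'aliyun ecs RebootInstance --region %s --InstanceId %s --ForceStop true\n' "$2" "$1"
}

# Print the command instead of running it, so it can be reviewed first.
force_restart_cmd "i-exampleinstanceid" "cn-hangzhou"
```

Printing rather than executing is deliberate: a forced restart is equivalent to a power cycle, so run the printed command manually only after you have confirmed that the instance did not come back on its own.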