Elastic Compute Service: Fault drills

Last Updated: Apr 23, 2025

Fault drills are critical to ensuring system stability. By injecting specific, controllable faults into a system, you can verify and enhance its high availability, train relevant personnel in emergency response, and validate fault handling mechanisms, thereby reducing the Mean Time To Repair (MTTR) when real faults occur. Alibaba Cloud provides Cloud Assistant plugins named in the ecs-fault-{scenename} or ACS-ECS-{scenename} format to inject faults into Elastic Compute Service (ECS) instances. These plugins let you perform drills precisely and conveniently, improving their effectiveness and efficiency.

Benefits

  • Free and open source: All executed plugins, including their fault injection source code and executable files, are automatically saved in the Cloud Assistant directory.

  • Scenario-based: Each plugin can be used only in one type of drill scenario. You can download and use plugins based on your drill scenarios.

  • Convenient and efficient: All ECS instances on which Cloud Assistant is installed can run the drill plugins. You can complete the installation and execution of a plugin with only one command.
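
For example, you can list the plugins that are available to an instance before choosing a drill. The command below assumes a Linux instance with the Cloud Assistant Agent running and uses the --list option of acs-plugin-manager to list available plugins; the exact output depends on your region and agent version.

    # List the Cloud Assistant plugins available to this instance,
    # including the fault drill plugins described in the scenarios below.
    sudo acs-plugin-manager --list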

Supported scenarios for fault drills

Downtime drill

Server downtime is a common issue caused by software and hardware abnormalities, and is virtually unavoidable. You can simulate a kernel fault on an ECS instance to cause downtime, which lets you test your business system's response, inspect system recovery capabilities, and verify the effectiveness of monitoring and alert mechanisms. You can then develop response strategies based on the drill results. This ensures that the system can quickly resume normal operation after downtime occurs in the production environment, reducing the risk of business interruption.
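
At the operating-system level, a downtime drill typically triggers a kernel crash. The following sketch shows a generic way to do this on Linux through the sysrq interface; it only illustrates the fault type, is not how the Alibaba Cloud plugin is implemented, and crashes the instance immediately, so run it on a test instance only.

    # WARNING: this crashes the kernel immediately. Test instances only.
    # Enable the sysrq interface, then trigger a kernel crash (panic).
    echo 1 | sudo tee /proc/sys/kernel/sysrq
    echo c | sudo tee /proc/sysrq-trigger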

High CPU utilization drill

To ensure business stability, CPU utilization must be maintained within a reasonable range; excessively high CPU utilization can cause business latency or even interruption. You can inject high CPU utilization faults into an ECS instance to test how the business system responds to specific CPU loads, inspect system recovery capabilities, and verify the effectiveness of monitoring and alert mechanisms. You can then develop response strategies based on the drill results. This ensures that the system can quickly resume normal operation when high CPU utilization occurs in the production environment, reducing the risk of business interruption.
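
As a generic illustration of this fault type (not the plugin's implementation), a high CPU load can be produced on Linux with one busy loop per vCPU and removed by stopping those loops; the drill plugin handles injection and recovery for you.

    # Inject: start one busy loop per vCPU (illustration only)
    for i in $(seq "$(nproc)"); do yes > /dev/null & done
    # Recover: stop all of the busy loops
    pkill -x yes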

OOM drill

You can perform an out-of-memory (OOM) drill on an ECS instance by using an injected process that continuously consumes memory. This allows you to test whether business processes can be terminated as expected, inspect system recovery capabilities, and verify the effectiveness of monitoring and alert mechanisms. You can then develop response strategies based on the drill results. This ensures that the system can quickly resume normal operation when OOM occurs in the production environment, reducing the risk of business interruption.
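
As a generic illustration of this fault type, a process that keeps allocating memory puts the instance under memory pressure and may eventually be terminated by the kernel's OOM killer. The sketch below assumes the stress-ng tool is installed, and the percentage and timeout are example values; it is not the plugin's implementation.

    # Inject: allocate and hold about 90% of physical memory for 5 minutes
    stress-ng --vm 1 --vm-bytes 90% --vm-keep --timeout 300s
    # Recover early if needed: stop the memory workload
    pkill -x stress-ng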

Network drills

  • Network packet loss drill: Network packet loss is a common network failure. Situations such as network congestion, hardware failures, and link interference may cause network packet loss. This drill scenario verifies the system alert and recovery mechanisms when network packet loss occurs.

  • Network interruption drill: Network failures are a common issue for ECS instances. Causes include hardware link abnormalities, carrier network fluctuations, and system configuration issues, any of which can break network connections and make ECS instances unavailable for long periods. This drill scenario verifies the monitoring and recovery capabilities of your business when one of its nodes becomes unavailable.

  • Network delay drill: Network delay affects the response speed of applications and services, and high network delay degrades user experience. Factors that lead to network delay include increased network traffic and unstable lines. This drill scenario verifies the system alert and recovery mechanisms when network delay occurs.
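
The network faults above are commonly emulated on Linux with the tc/netem traffic control tool. The sketch below is only an illustration of the fault types, not the plugins' implementation, and assumes the primary network interface is eth0; the drill plugins handle injection and recovery for you. Note that injecting these faults on the interface you use to reach the instance can cut off your own session.

    # Inject 10% packet loss on eth0
    sudo tc qdisc add dev eth0 root netem loss 10%
    # Or switch the rule to a 100 ms delay instead
    sudo tc qdisc change dev eth0 root netem delay 100ms
    # Recover: remove the netem rule
    sudo tc qdisc del dev eth0 root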

Disk and I/O drills

  • High disk capacity utilization drill: Excessively high disk capacity utilization may cause system performance degradation, system crashes, and data loss due to data accumulation and temporary file buildup. You can perform high disk capacity utilization drills to verify that the system can run stably and does not encounter data loss if disk capacity utilization is high or even full on an ECS instance.

  • Disk I/O hang drill: An I/O hang occurs when the system cannot complete read or write operations, which causes process or system exceptions. I/O hangs can be caused by various factors, including hardware failures, driver issues, file system errors, network latency, or network congestion. This poses risks to businesses, such as performance degradation, service delays, and data inconsistency. You can perform disk I/O hang drills to verify the alert and recovery mechanisms of the system when a disk I/O hang occurs.

  • High disk I/O load drill: High I/O load is a common fault that may be caused by excessive business process load, unexpected resource usage by non-business processes, or insufficient memory. High I/O load poses risks to your business system, such as performance degradation and data loss. You can perform high disk I/O load drills to verify the alert and recovery mechanisms of the system when the disk I/O load is high.
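
As a generic illustration of the disk fault types above (not the plugins' implementation), high capacity utilization can be simulated by allocating a large file, and high I/O load by sustained direct writes. The target path and sizes below are assumptions; adjust them to the disk you are drilling on.

    # Inject high disk capacity utilization: allocate a large placeholder file
    sudo fallocate -l 50G /mnt/drill-fill.img
    # Inject high disk I/O load: write 10 GiB with direct I/O, bypassing the page cache
    sudo dd if=/dev/zero of=/mnt/drill-io.img bs=1M count=10240 oflag=direct
    # Recover: remove the test files
    sudo rm -f /mnt/drill-fill.img /mnt/drill-io.img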

System resource drills

  • High system load drills: System load measures the system workload as the average number of processes in the runnable and uninterruptible states over a specific time interval. Monitoring load is important for determining the current load of your business system, generating alerts, and taking response measures at the earliest opportunity.

  • PID insufficiency drills: In operating systems, a process identifier (PID) is a number used to uniquely identify a process, which can be reused after a process is terminated. Although it is difficult to exhaust PIDs, accidental exhaustion can still occur. If PIDs are accidentally exhausted, new processes cannot be created and services may be suspended, which affects business capabilities. It is necessary to simulate PID exhaustion or service suspension scenarios to test the high availability of your services.

  • System time jump drills: Time jump refers to a sudden change in the system clock. You must ensure the accuracy of system time and the consistency of time across various system components in production systems. Otherwise, exceptions occur on various time-sensitive services, such as logs and synchronization backups. You can perform time jump drills to test whether the system can promptly synchronize and restore the correct time and recover business when a system time jump occurs.
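
As a generic illustration of a time jump (not the plugin's implementation), you can set the system clock forward and then let the time service step it back. The sketch below assumes chrony is the active time daemon; substitute your own NTP tooling if it is not.

    # Inject: jump the system clock forward by one hour
    sudo date -s "@$(( $(date +%s) + 3600 ))"
    # Recover: force the time daemon to step the clock back to the correct time
    sudo chronyc makestep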

Procedure

This best practice relies on Cloud Assistant and Cloud Assistant plugins, with commands in the following syntax:

  • Fault injection

    sudo acs-plugin-manager --exec --plugin {plugin-name} --params inject,paramA=a,paramB=b
  • Fault recovery

    sudo acs-plugin-manager --exec --plugin {plugin-name} --params recover
Note
  • Replace {plugin-name} with the name of the actual Cloud Assistant plugin.

  • The plugins support the injection (inject) and recovery (recover) actions.

  • For fault injection, parameters are passed as key-value pairs separated by commas (,), with the key and value in each pair connected by an equal sign (=).
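
The following is a hypothetical end-to-end example that follows the syntax above. The plugin name ecs-fault-example and the durationSecond parameter are placeholders; list the plugins on your instance (for example, with acs-plugin-manager --list) or see the scenario documentation to find the real plugin names and parameters for your drill.

    # Inject a fault (the plugin name and parameters below are placeholders)
    sudo acs-plugin-manager --exec --plugin ecs-fault-example --params inject,durationSecond=300
    # Recover from the fault when the drill is complete
    sudo acs-plugin-manager --exec --plugin ecs-fault-example --params recover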