Fault drills are critical to ensuring system stability. By injecting specific, controllable faults into a system, you can verify and improve its high availability, train the emergency response capabilities of the relevant personnel, and validate fault handling mechanisms, which reduces the Mean Time To Repair (MTTR) when real faults occur. Alibaba Cloud provides Cloud Assistant plugins, named in the ecs-fault-{scenename} or ACS-ECS-{scenename} format, that inject faults into Elastic Compute Service (ECS) instances. These plugins let you run drills precisely and conveniently, which makes the drills more effective and efficient.
Benefits
Free and open source: All executed plugins are automatically saved in the Cloud Assistant directory, including fault injection source code and executable files.
Scenario-based: Each plugin can be used only in one type of drill scenario. You can download and use plugins based on your drill scenarios.
Convenient and efficient: All ECS instances on which Cloud Assistant is installed can run the drill plugins. You can complete the installation and execution of a plugin with only one command.
Supported scenarios for fault drills
Drill scenario | Description |
Server downtime | Server downtime is a common issue caused by software and hardware abnormalities, and is virtually unavoidable. You can simulate kernel faults on an ECS instance to cause downtime, to test your business system's response to downtime, inspect system recovery capabilities, and verify the effectiveness of monitoring and alert mechanisms. You can then develop response strategies based on the drill results. This ensures that the system can quickly resume normal operation after downtime occurs in the production environment, reducing the risk of business interruption. |
High CPU utilization | To ensure stability for your business, CPU utilization must be maintained within a reasonable range. Excessively high CPU utilization can cause business latency, or even interruption. You can inject high CPU utilization faults into an ECS instance to test how the business system responds to specific CPU loads, inspect system recovery capabilities, and verify the effectiveness of monitoring and alert mechanisms. You can then develop response strategies based on the drill results. This ensures that the system can quickly resume normal operation when high CPU utilization occurs in the production environment, reducing the risk of business interruption. |
Out of memory (OOM) | You can perform an out of memory (OOM) drill on an ECS instance by using an injection process to continuously consume memory. This allows you to test whether business processes can be terminated as expected, inspect system recovery capabilities, and verify the effectiveness of monitoring and alert mechanisms. You can then develop response strategies based on the drill results. This ensures that the system can quickly resume normal operation when OOM occurs in the production environment, reducing the risk of business interruption. |
Network drills | |
Disk and I/O drills | |
System resource drills | |
Procedure
This best practice relies on Cloud Assistant and Cloud Assistant plugins, with commands in the following syntax:
Fault injection
sudo acs-plugin-manager --exec --plugin {plugin-name} --params inject,paramA=a,paramB=b
Fault recovery
sudo acs-plugin-manager --exec --plugin {plugin-name} --params recover
Replace {plugin-name} with the name of the actual Cloud Assistant plugin.
The plugins support the injection (inject) and recovery (recover) actions.
For fault injection, the parameters follow the inject action as key-value pairs: separate the pairs with commas (,) and join each key to its value with an equal sign (=).
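The parameter format described above can be sketched as a small shell helper that assembles the command line and prints it instead of running it, so that it can be reviewed before a drill. This is only an illustrative sketch: the function names build_inject_params and fault_command, the plugin name ecs-fault-example, and the parameters paramA and paramB are hypothetical placeholders, not real plugin or parameter names.

```shell
# Build the value passed to --params for an injection:
# "inject" followed by comma-separated key=value pairs.
# The body runs in a subshell, so its variables stay private.
build_inject_params() (
  params="inject"
  for pair in "$@"; do
    case "$pair" in
      ?*=*) params="${params},${pair}" ;;   # accept key=value only
      *) echo "invalid parameter (expected key=value): $pair" >&2; exit 1 ;;
    esac
  done
  printf '%s\n' "$params"
)

# Print (do not execute) the full command for the given plugin and action.
# Usage: fault_command <plugin-name> inject [key=value ...]
#        fault_command <plugin-name> recover
fault_command() (
  plugin="$1"
  action="$2"
  shift 2
  case "$action" in
    inject)
      params=$(build_inject_params "$@") || exit 1
      ;;
    recover)
      params="recover"
      ;;
    *)
      echo "unknown action: $action" >&2
      exit 1
      ;;
  esac
  printf 'sudo acs-plugin-manager --exec --plugin %s --params %s\n' "$plugin" "$params"
)

# Hypothetical plugin and parameter names, for illustration only.
fault_command ecs-fault-example inject paramA=a paramB=b
fault_command ecs-fault-example recover
```

Printing the command rather than executing it keeps the sketch safe to try on any machine; on an ECS instance with Cloud Assistant installed, the printed line can be run once the plugin name and parameters have been replaced with real values.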