Test service performance when disk I/O hangs occur in ECS by performing a disk I/O hang drill

An I/O hang occurs when the operating system cannot complete read or write operations, which causes processes or the operating system itself to stall or fail. I/O hangs can be caused by hardware failures, driver issues, file system errors, network latency, or network congestion. They expose businesses to risks such as performance degradation, service delays, and data inconsistency. This topic describes how to run a drill that injects a disk I/O hang into an ECS instance and then recovers from it.

Limits

Elastic Compute Service (ECS) instances must run a Linux distribution that is compatible with control group version 1 (cgroup v1), such as Alibaba Cloud Linux 3 or Alibaba Cloud Linux 2.
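One common way to check which cgroup version an instance uses is to look at the file system type mounted at /sys/fs/cgroup. The following sketch assumes GNU stat is available. The cgroup_version function name is illustrative and is not part of the plug-in. On hosts that use a hybrid layout, the result can still be v1 even though a unified hierarchy is also mounted.

```shell
#!/bin/sh
# Map the file system type mounted at /sys/fs/cgroup to a cgroup version.
#   tmpfs     -> cgroup v1 (per-controller hierarchies are mounted below it)
#   cgroup2fs -> cgroup v2 (unified hierarchy)
cgroup_version() {
    case "$1" in
        tmpfs) echo "v1" ;;
        cgroup2fs) echo "v2" ;;
        *) echo "unknown" ;;
    esac
}

# Report the version for the current host. "none" is used if stat fails.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "none")
echo "cgroup version: $(cgroup_version "$fstype")"
```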

Implementation

The Cloud Assistant plug-in named ACS-ECS-IoHang uses control groups (cgroups) to throttle disk I/O. Cgroups are a Linux kernel mechanism that limits the resource usage of one or more processes, which allows fine-grained control over CPU, memory, I/O, and network bandwidth.
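This topic does not document the internals of the plug-in, so the following is only a minimal sketch of how the generic cgroup v1 blkio controller throttles a disk. A rule of the form major:minor bytes_per_second is written to the blkio.throttle.read_bps_device and blkio.throttle.write_bps_device files of a cgroup. The throttle_rule and apply_throttle function names are hypothetical helpers.

```shell
#!/bin/sh
# Format a cgroup v1 blkio throttle rule: "<major>:<minor> <bytes_per_second>".
throttle_rule() {
    printf '%s:%s %s\n' "$1" "$2" "$3"
}

# Hypothetical helper: apply the same read and write limit for one device.
# $1 = cgroup directory, $2 = major, $3 = minor, $4 = bytes per second
apply_throttle() {
    mkdir -p "$1" || return 1
    throttle_rule "$2" "$3" "$4" > "$1/blkio.throttle.read_bps_device" &&
    throttle_rule "$2" "$3" "$4" > "$1/blkio.throttle.write_bps_device"
}
```

On a cgroup v1 host, running `apply_throttle /sys/fs/cgroup/blkio/iohang-demo 253 16 1024` as root and then writing a process ID to the cgroup.procs file in that directory would limit that process to about 1 KiB/s on device 253:16. The cgroup name iohang-demo and the device numbers are example values. Use the major and minor numbers that lsblk reports for your disk.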

Procedure

Prerequisites

Cloud Assistant Agent is installed on the ECS instance for which you want to perform a drill.
The status of Cloud Assistant is Normal on the ECS instance. For more information, see View the status of Cloud Assistant and handle anomalies.

Fault injection

Log on to an ECS instance.
For more information, see Use Workbench to connect to a Linux instance over SSH.
Run the ACS-ECS-IoHang plug-in with sudo privileges.
```
sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params inject,disk=vda,[duration=paramA]
```
The parameters in brackets ([]) are optional. Take note of the following parameters:
- disk (required): the destination disk. You can run the lsblk command to view the disks attached to the instance. If you want to inject faults into all disks, set disk to all.
- duration (optional): the duration of fault injection. Unit: seconds. Default value: 300.
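Because the optional duration parameter changes the shape of the --params value, a small wrapper can help avoid malformed input. The following sketch builds the value from the two parameters described above. The build_inject_params function name is hypothetical, and the validation rules (non-empty disk, whole-number duration) are assumptions rather than documented plug-in behavior.

```shell
#!/bin/sh
# Build the --params value for an injection.
# $1 = disk name (for example vdb, or all), $2 = optional duration in seconds
build_inject_params() {
    disk=$1
    duration=$2
    [ -n "$disk" ] || { echo "disk is required" >&2; return 1; }
    if [ -z "$duration" ]; then
        # No duration given: the plug-in default applies.
        echo "inject,disk=$disk"
        return 0
    fi
    case "$duration" in
        *[!0-9]*) echo "duration must be a whole number of seconds" >&2; return 1 ;;
    esac
    echo "inject,disk=$disk,duration=$duration"
}
```

For example, `sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params "$(build_inject_params vdb 120)"` passes `inject,disk=vdb,duration=120` to the plug-in.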
Check whether the fault injection is performed as expected.
- Check the command output. If it shows the injection parameters and the major and minor device numbers of the disk, the fault injection is successful.
- Check whether the business read and write speeds meet expectations.

Fault recovery

(Recommended) Method 1: Wait for automatic recovery after timeout.
Note
Fault injection automatically times out when the specified duration elapses (300 seconds by default). If a fault is injected into a system disk, automatic recovery may not occur. In this case, forcefully restart the instance in the ECS console. For more information, see Restart an instance.
Method 2: If faults are injected only into data disks, run the following command to resolve the I/O hang on the ECS instance:
```
sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params recover
```

Example

Run the lsblk command to view and select a disk attached to an ECS instance. In this example, the vdb data disk is used for fault injection.

To make the effect of the drill easy to observe, simulate a workload that reads data from the vdb data disk. Skip this step if you use an actual business workload for the drill.
```
sudo dd if=/dev/vdb of=/dev/null
```
Query the I/O usage by running the iotop command.
Note
If the iotop tool is not installed, run a command to install iotop based on the Linux distribution:
1. Alibaba Cloud Linux 3, Alibaba Cloud Linux 2, or CentOS 7
```
sudo yum install -y iotop
```
2. Ubuntu or Debian
```
sudo apt install -y iotop
```
Perform fault injection.
```
sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params inject,disk=vdb,duration=120
```
The command output shows the injection parameters and the major and minor device numbers of the specified disk, which indicates that the injection was successful.
Check the fault injection effect.
Check the I/O usage by running the iotop command. The current disk read speed decreases to 0 B/s.
Wait for fault recovery.
After the fault injection times out, the disk read speed of the simulated business process recovers.
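If iotop is not available, the read rate of a disk can also be estimated from /proc/diskstats, where the sixth field of each line is the cumulative number of 512-byte sectors read. The following is a minimal sketch. The sectors_read and read_rate function names are hypothetical.

```shell
#!/bin/sh
# Print the cumulative number of sectors read for a device.
# $1 = device name (for example vdb), $2 = stats file (default /proc/diskstats)
sectors_read() {
    awk -v dev="$1" '$3 == dev { print $6 }' "${2:-/proc/diskstats}"
}

# Print the average read rate in bytes per second over an interval.
# $1 = device name, $2 = interval in seconds (default 1)
read_rate() {
    before=$(sectors_read "$1")
    [ -n "$before" ] || { echo "device not found: $1" >&2; return 1; }
    sleep "${2:-1}"
    after=$(sectors_read "$1")
    # /proc/diskstats counts 512-byte sectors regardless of the device block size.
    echo $(( (after - before) * 512 / ${2:-1} ))
}
```

Running `read_rate vdb 2` during the drill is expected to print a value at or near 0 while the fault is in effect and a larger value after recovery, provided that a workload is reading from the disk.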