An I/O hang occurs when the operating system cannot complete read or write operations, which causes processes or the operating system itself to behave abnormally. I/O hangs can be caused by hardware failures, driver issues, file system errors, network latency, or network congestion, and they expose businesses to risks such as performance degradation, service delays, and data inconsistency. This topic describes how to run a drill that injects a disk I/O hang and verifies that the instance recovers from it.
Limits
Elastic Compute Service (ECS) instances must run Linux distributions compatible with control group version 1 (cgroup v1), such as Alibaba Cloud Linux 3 and 2.
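One way to confirm this limit before a drill is to look at the file system type mounted at /sys/fs/cgroup. The following is a minimal sketch, assuming GNU coreutils stat is available; the function names are illustrative and not part of any Alibaba Cloud tooling. A hybrid setup (cgroup v1 controllers plus a unified sub-mount) is reported as v1 here.

```shell
#!/bin/sh
# Map the file system type mounted at /sys/fs/cgroup to a cgroup version.
# cgroup2fs -> unified hierarchy (v2); tmpfs -> per-controller hierarchy (v1).
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Detect the version on the current host (GNU coreutils stat assumed).
detect_cgroup_version() {
    fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "none")
    cgroup_version_from_fstype "$fstype"
}

echo "cgroup version: $(detect_cgroup_version)"
```

If the result is v2 or unknown, the instance may not satisfy the limit described above.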
Implementation
The Cloud Assistant plug-in named ACS-ECS-IoHang uses control groups (cgroups) to throttle disk I/O. A cgroup is a Linux kernel mechanism that limits the resource usage of one or more processes, which provides fine-grained control over CPU, memory, I/O, and network bandwidth.
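To make the mechanism concrete, the following sketch shows how disk I/O throttling is generally configured with the cgroup v1 blkio controller. It is not the plug-in's confirmed implementation: the mount point, the group name iohang_demo, the device numbers, the PID, and the 1 byte/s limit are all assumptions chosen for illustration. The script only prints the commands; it does not apply them.

```shell
#!/bin/sh
# Sketch of cgroup v1 blkio throttling. This is NOT the plug-in's confirmed
# implementation; the group name and limit are hypothetical.
BLKIO_ROOT=/sys/fs/cgroup/blkio   # usual cgroup v1 mount point (assumption)
GROUP=iohang_demo                 # hypothetical cgroup name

# Build a throttle rule in the "MAJOR:MINOR BYTES_PER_SECOND" format that
# the blkio.throttle.*_bps_device files accept.
throttle_rule() {
    printf '%s %s\n' "$1" "$2"
}

# Print (rather than execute) the commands that would throttle reads and
# writes on one device for one process. Running them requires root.
print_throttle_commands() {
    dev=$1; bps=$2; pid=$3
    echo "mkdir -p $BLKIO_ROOT/$GROUP"
    echo "echo '$(throttle_rule "$dev" "$bps")' > $BLKIO_ROOT/$GROUP/blkio.throttle.read_bps_device"
    echo "echo '$(throttle_rule "$dev" "$bps")' > $BLKIO_ROOT/$GROUP/blkio.throttle.write_bps_device"
    echo "echo $pid > $BLKIO_ROOT/$GROUP/cgroup.procs"
}

# Example: limit device 253:16 to 1 byte/s for PID 1234 (both hypothetical).
print_throttle_commands 253:16 1 1234
```

A very low bytes-per-second limit makes reads and writes from the affected processes stall, which is consistent with the behavior this drill produces.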
Procedure
Prerequisites
Cloud Assistant Agent is installed on the ECS instance for which you want to perform a drill.
The status of Cloud Assistant is Normal on the ECS instance. For more information, see View the status of Cloud Assistant and handle anomalies.
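Before you start, it can help to confirm that the command-line tools used in this topic are present on the instance. The following is a minimal sketch with illustrative function names. It only checks that the binaries are on PATH and does not verify the Cloud Assistant status, which must still be checked as described above.

```shell
#!/bin/sh
# Return success if a command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether the tools used in this drill are present.
check_drill_tools() {
    for tool in acs-plugin-manager lsblk; do
        if have_cmd "$tool"; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
        fi
    done
}

check_drill_tools
```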
Fault injection
Log on to an ECS instance.
For more information, see Use Workbench to connect to a Linux instance over SSH.
Use the sudo user to run the ACS-ECS-IoHang plug-in.
sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params inject,disk=vda,[duration=paramA]
The parameters in brackets ([]) are optional. Take note of the following parameters:
disk (required): the destination disk. You can run the lsblk command to view the disks attached to the instance. If you want to inject faults into all disks, set disk to all.
duration (optional): the duration of fault injection. Unit: seconds. Default value: 300.
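The parameter rules above can be sketched as a small helper that assembles the --params value and rejects an obviously invalid duration. The function name build_inject_params is hypothetical. The script prints the final command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Build the --params value for an inject call.
# $1: disk name (for example vdb) or "all"; $2: optional duration in seconds.
build_inject_params() {
    disk=$1
    duration=${2:-}
    [ -n "$disk" ] || { echo "disk is required" >&2; return 1; }
    params="inject,disk=$disk"
    if [ -n "$duration" ]; then
        case "$duration" in
            *[!0-9]*) echo "duration must be a whole number of seconds" >&2; return 1 ;;
        esac
        params="$params,duration=$duration"
    fi
    echo "$params"
}

# Print the full command instead of running it, so it can be reviewed first.
echo "sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params $(build_inject_params vdb 120)"
```

When duration is omitted, the helper leaves it out of the string so that the plug-in's default of 300 seconds applies.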
Check whether the fault injection is performed as expected.
If the following command output appears, the fault injection is successful.

Check whether the business read and write speeds meet expectations.
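If no business-level metric is at hand, disk read throughput can be estimated from the kernel counters in /proc/diskstats. The following is a minimal sketch, not part of the plug-in. The function names are illustrative, and the approach relies on the documented layout of /proc/diskstats, in which field 3 is the device name and field 6 is the number of sectors read.

```shell
#!/bin/sh
# Print the cumulative number of sectors read for a block device.
# $1: device name (for example vdb); $2: stats file (default /proc/diskstats).
sectors_read() {
    awk -v dev="$1" '$3 == dev { print $6; found = 1 } END { exit !found }' "${2:-/proc/diskstats}"
}

# Convert a sector delta over an interval into bytes per second.
# /proc/diskstats counts in 512-byte sectors.
# $1: sectors before; $2: sectors after; $3: interval in seconds.
read_bytes_per_sec() {
    echo $(( ($2 - $1) * 512 / $3 ))
}

# Sample a device twice and report its read throughput.
# $1: device name; $2: interval in seconds (default 5).
sample_read_rate() {
    interval=${2:-5}
    before=$(sectors_read "$1") || return 1
    sleep "$interval"
    after=$(sectors_read "$1") || return 1
    echo "$1 read rate: $(read_bytes_per_sec "$before" "$after" "$interval") B/s"
}
```

For example, running sample_read_rate vdb 5 before and during the drill gives two figures to compare. A rate at or near 0 B/s during the drill is consistent with a successful injection.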
Fault recovery
(Recommended) Method 1: Wait for automatic recovery after timeout.
Note: By default, fault injection automatically times out after 300 seconds. If fault injection is performed on a system disk, automatic fault recovery may not occur. In this case, forcefully restart the instance in the ECS console. For more information, see Restart an instance.
Method 2: If faults are injected only into data disks, run the following command to resolve the I/O hang on the ECS instance:
sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params recover
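For drills on data disks, the inject and recover commands can be wrapped so that recovery is always attempted when the script exits, including on Ctrl+C. The following is a sketch under stated assumptions: the wrapper functions run, recover, and drill are hypothetical, DRY_RUN defaults to 1 so that commands are only printed, and it is assumed (not confirmed by this topic) that running recover after the fault has already timed out is harmless.

```shell
#!/bin/sh
# Drill wrapper sketch: always attempt recovery when the script exits.
# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover() {
    run sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params recover
}

# $1: data disk name; $2: duration in seconds.
drill() {
    # Recover on normal exit and on Ctrl+C or termination.
    trap recover EXIT
    trap 'exit 1' INT TERM
    run sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params "inject,disk=$1,duration=$2"
    # Observe the workload here before the script exits.
}
```

Calling drill vdb 60 with DRY_RUN=1 prints the inject command and, on exit, the recover command. Do not use this pattern for a system disk, because Method 2 applies only to data disks.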
Example
Run the lsblk command to view and select a disk attached to an ECS instance. In this example, the vdb data disk is used for fault injection.
To intuitively observe the drill effect, simulate a business scenario of reading data from the vdb data disk. Skip this step if you use an actual business scenario for the drill.
sudo dd if=/dev/vdb of=/dev/null
Query the I/O usage by running the iotop command.
Note: If the iotop tool is not installed, run a command to install iotop based on the Linux distribution:
Alibaba Cloud Linux 3 or 2 or CentOS 7
sudo yum install -y iotop
Ubuntu or Debian
sudo apt install -y iotop

Perform fault injection.
sudo acs-plugin-manager --exec --plugin ACS-ECS-IoHang --params inject,disk=vdb,duration=120
The following command output indicates the injection parameters and the major and minor device numbers of the specified disk. This indicates that the injection was successful.

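To cross-check the major and minor device numbers reported by the plug-in, you can read them from sysfs. The following is a minimal sketch. The function name disk_dev_numbers is illustrative, and the optional second argument exists only so that the function can be exercised against a test directory.

```shell
#!/bin/sh
# Print the MAJOR:MINOR pair of a whole disk by reading sysfs.
# $1: disk name (for example vdb); $2: sysfs root (default /sys).
disk_dev_numbers() {
    dev_file="${2:-/sys}/block/$1/dev"
    [ -r "$dev_file" ] || { echo "no such disk: $1" >&2; return 1; }
    cat "$dev_file"
}
```

For example, disk_dev_numbers vdb prints a pair such as 253:16 (the actual numbers depend on the instance). The command lsblk -dno MAJ:MIN /dev/vdb reports the same information.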
Check the fault injection effect.
Check the I/O usage by running the iotop command. The current disk read speed decreases to 0 B/s.
Wait for fault recovery.
The following figure shows that after the fault injection times out, the disk read speed of the simulated business process recovers.
