Problem description
The website or application hosted on the Elastic Compute Service (ECS) instance responds slowly or times out, or I/O-related errors appear in the logs.
The IOPS, throughput, or device utilization (%util) metrics for the instance are excessively high.
Causes
Process with high I/O consumption: A process on the instance, such as database reads/writes, heavy log writing, or a backup job, generates many disk read/write requests and consumes the entire disk I/O bandwidth.
Disk performance bottleneck: The normal I/O demand of your current business exceeds the performance limit (IOPS or throughput) of the disk attached to the instance.
Solutions
Use iostat to confirm the disk bottleneck, then use iotop to identify the specific process. Finally, optimize or upgrade resources as needed.
Step 1: Confirm the disk I/O bottleneck
Log on to the ECS instance.
Go to ECS console - Instances. In the top navigation bar, select the target region and resource group.
Go to the details page of the target instance. Click Connect and select Workbench. Follow the prompts on the page to log on to the terminal.
Use iostat to monitor the disk I/O status.
Install the sysstat package.
Alibaba Cloud Linux / CentOS / Fedora
sudo yum install -y sysstat
Ubuntu / Debian
sudo apt install -y sysstat
openSUSE
sudo zypper install -y sysstat
Run iostat to refresh the data every 2 seconds.
iostat -d -x -k 2
Analyze the iostat output. Pay attention to the following metrics:
r/s, w/s: The number of read and write requests per second (IOPS).
rkB/s, wkB/s: The amount of data read and written per second (throughput).
%util: The disk I/O utilization. If this value is consistently close to 100%, the disk device is saturated.
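To judge whether %util is "consistently" high rather than spiking in a single sample, the readings can be averaged over several reports. The following is a minimal sketch, not part of the original procedure: the function name flag_saturated and the default threshold of 90 are illustrative assumptions. It looks up the %util column by its header name because the column order differs between sysstat versions.

```shell
#!/usr/bin/env bash
# Read `iostat -d -x` output on stdin and print each device whose average
# %util across all reports is at or above a threshold.
# Argument: threshold in percent (default 90, an illustrative value).
flag_saturated() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    # Header row of each report: remember which column holds %util.
    /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    # Data row: accumulate %util per device.
    col && NF >= col && $col ~ /^[0-9.]+$/ {
      sum[$1] += $col
      n[$1]++
    }
    END {
      for (dev in sum) {
        avg = sum[dev] / n[dev]
        if (avg >= limit) printf "%s %.1f\n", dev, avg
      }
    }
  '
}

# Example usage (5 reports, 2 seconds apart; LC_ALL=C keeps "." as the decimal separator):
#   LC_ALL=C iostat -d -x -k 2 5 | flag_saturated 90
```

Note that the first iostat report shows averages since boot, so it pulls the computed average toward the long-term value. If your sysstat version supports the -y option, adding it omits that first report.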
Step 2: Locate the process with high I/O consumption
After you confirm that the disk I/O is saturated, you need to identify the process that is causing the high load.
Use iotop to view the I/O activity of processes in real time.
Install iotop.
Alibaba Cloud Linux / CentOS / Fedora
sudo yum install -y iotop
Ubuntu / Debian
sudo apt install -y iotop
openSUSE
sudo zypper install -y iotop
Display active I/O processes.
sudo iotop -o
Analyze the iotop output.
Identify the process: Find the process with the highest I/O in the DISK WRITE or DISK READ column.
Additional context: The IO> column shows the percentage of time the process spends waiting for I/O.
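If iotop cannot be installed, a rough alternative is to read the per-process counters that the Linux kernel exposes in /proc/<pid>/io. The sketch below is an assumption-laden fallback rather than the documented method: the function name top_io_procs is hypothetical, the counters are cumulative since each process started (not a live rate like iotop shows), and reading other users' processes normally requires root.

```shell
#!/usr/bin/env bash
# Rank processes by cumulative disk bytes (read_bytes + write_bytes)
# taken from /proc/<pid>/io.
# Arguments: number of rows to print (default 5), proc root (default /proc).
# Output columns: total_bytes pid command read_bytes write_bytes
top_io_procs() {
  local count="${1:-5}" root="${2:-/proc}" dir pid name
  for dir in "$root"/[0-9]*; do
    [ -r "$dir/io" ] || continue            # skip processes we cannot inspect
    pid="${dir##*/}"
    name="$(cat "$dir/comm" 2>/dev/null)"   # short command name
    {
      awk -v pid="$pid" -v name="${name:-?}" '
        $1 == "read_bytes:"  { r = $2 }
        $1 == "write_bytes:" { w = $2 }
        END { printf "%.0f %s %s %.0f %.0f\n", r + w, pid, name, r, w }
      ' < "$dir/io"
    } 2>/dev/null                           # the process may exit while we read it
  done | sort -rn | head -n "$count"
}

# Example usage:
#   sudo bash -c "$(declare -f top_io_procs); top_io_procs 10"
```

Because the totals are cumulative, a long-running process can rank high even if it is idle now. Running the function twice a few seconds apart and comparing the values gives a better picture of current activity.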
Step 3: Analyze and handle the abnormal process
Analyze the cause: Possible causes include slow SQL queries, a high log level, or frequent file read/write operations.
Resolution methods (including but not limited to):
Database: Check the slow query log. Optimize SQL statements and indexes.
Log service: Lower the log level of the application (for example, from DEBUG to INFO) and configure log rotation.
File I/O strategies: Check the file read/write logic. Use a memory cache and increase the buffer size.
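When heavy log writing is the suspected cause, it can help to see which files are both large and being written right now. The following sketch is a heuristic only: the function name recent_large_files, the default directory /var/log, and the default thresholds are illustrative assumptions, and your application may write its logs elsewhere.

```shell
#!/usr/bin/env bash
# List files that were modified within the last N minutes and exceed a
# size threshold, largest first (size in KiB as reported by du, then path).
# Arguments: directory (default /var/log), minutes (default 10),
#            minimum size in KiB (default 102400, that is 100 MiB).
recent_large_files() {
  local dir="${1:-/var/log}" minutes="${2:-10}" min_kib="${3:-102400}"
  find "$dir" -type f -mmin "-$minutes" -size "+${min_kib}k" \
    -exec du -k {} + 2>/dev/null | sort -rn
}

# Example usage:
#   sudo bash -c "$(declare -f recent_large_files); recent_large_files /var/log 10 102400"
```

Files that appear here repeatedly are candidates for a lower log level or for log rotation.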
Step 4: Evaluate and upgrade disk performance (Optional)
If you cannot reduce the load through application layer optimization, you can upgrade the disk to improve performance.
Evaluate requirements: Determine the performance target based on the actual IOPS and throughput reported by iostat.
Perform the upgrade: Based on the performance target, upgrade the disk to a higher specification.
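One way to turn iostat samples into a performance target is to take the peak observed IOPS and throughput for the disk and add headroom. The sketch below assumes this approach. The function name peak_io_target and the default headroom factor of 1.3 are illustrative, not values from this document, and the samples should be collected during a representative busy period.

```shell
#!/usr/bin/env bash
# Read `iostat -d -x -k` output on stdin and print, for one device, the peak
# IOPS (r/s + w/s) and peak throughput in kB/s (rkB/s + wkB/s) over all
# samples, each multiplied by a headroom factor.
# Arguments: device name, headroom factor (default 1.3, an illustrative value).
peak_io_target() {
  local device="$1" headroom="${2:-1.3}"
  awk -v dev="$device" -v k="$headroom" '
    # Header row: map column names to positions (order varies by sysstat version).
    /^Device/ {
      split("", c)
      for (i = 1; i <= NF; i++) c[$i] = i
      next
    }
    # Data row for the requested device: track the peaks.
    $1 == dev && ("r/s" in c) && ("w/s" in c) && ("rkB/s" in c) && ("wkB/s" in c) {
      iops = $(c["r/s"]) + $(c["w/s"])
      kbps = $(c["rkB/s"]) + $(c["wkB/s"])
      if (iops > max_iops) max_iops = iops
      if (kbps > max_kbps) max_kbps = kbps
    }
    END { printf "iops=%.0f kBps=%.0f\n", max_iops * k, max_kbps * k }
  '
}

# Example usage (30 samples, 2 seconds apart, for a device named vdb):
#   LC_ALL=C iostat -d -x -k 2 30 | peak_io_target vdb 1.3
```

If the disk was already saturated while sampling, the observed peak reflects the disk's limit rather than the real demand, so treat the result as a lower bound.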
Recommendations
Configure monitoring and alerts: For key disk metrics such as %util, IOPS, and throughput, configure reasonable alert thresholds (for example, 80%) to detect issues early.
Application I/O optimization: At the application layer, implement caching whenever possible to reduce direct disk reads and writes. For write-intensive scenarios, consider using asynchronous or batch writes to smooth out I/O peaks.