This topic describes how to troubleshoot the issues that cause high I/O on an ApsaraDB RDS for PostgreSQL instance. The I/O performance of your RDS instance varies based on three factors: the storage medium, the database engine architecture, and the SQL statements that are executed to scan or modify a specific amount of data.

Terms

  • I/O: The read and write operations that are performed on a disk.
  • I/O latency: The amount of time that is required to perform a read or write operation.
  • Logical I/O: The read and write operations that are performed by an application on a file system.
  • Physical I/O: The read and write operations that are performed by a file system on a disk.
  • Sequential I/O: The read and write operations that are sequentially performed on a disk.
  • Random I/O: The read and write operations that are performed at random on a disk.
  • Synchronous write: After an application initiates a write operation, the application is blocked until the write operation is complete.
  • Asynchronous write: After an application initiates a write operation, the application continues to process the next request regardless of whether the write operation is complete.

Troubleshoot high I/O issues that occur when a large number of rows are scanned by using SQL statements

  • Symptom:

    The value of the Data Disk IOPS metric indicates a high read IOPS. In most cases, the data that the database client reads by using SQL statements is fetched from the shared buffer. If specific data cannot be fetched from the shared buffer, the system reads the data from the disk. If the read IOPS is high, the system may read a large amount of data from the disk because a large number of rows are scanned by using SQL statements.

    Note You can view the metrics of your RDS instance on the Standard Monitoring tab of the Monitoring and Alerts page in the ApsaraDB RDS console.

    The value of the Operation Rows metric indicates an increased number of rows that are scanned by using full table scans. This means that the database client has read a large amount of data.

  • Solution:

    Use the Performance Insight feature to identify the SQL statements that cause problems. Then, optimize indexes or increase the specifications of your RDS instance to handle higher loads. You can also identify these statements from within the database, as shown in the example query after the following note.

    Note To use the Performance Insight feature, you can log on to the ApsaraDB RDS console, choose Autonomy Service > Diagnostics in the left-side navigation pane, and then click the Performance Insight tab on the page that appears.
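    If you prefer to work in SQL, the following query is a minimal sketch that lists the statements that read the most data blocks from disk. It assumes that the pg_stat_statements extension is available and enabled on your RDS instance:
    -- shared_blks_read counts blocks read from disk (physical reads).
    -- shared_blks_hit counts blocks served from the shared buffer.
    SELECT query,
           calls,
           shared_blks_read,
           shared_blks_hit
    FROM pg_stat_statements
    ORDER BY shared_blks_read DESC
    LIMIT 10;
    A statement whose shared_blks_read value is large relative to its shared_blks_hit value is a likely candidate for index optimization.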

Troubleshoot high I/O issues that are caused by data imports

  • Symptom:

    The value of the Data Disk IOPS metric indicates a high write IOPS, which may occur when you import a large amount of data.

    Note You can view the metrics of your RDS instance on the Standard Monitoring tab of the Monitoring and Alerts page in the ApsaraDB RDS console.
  • Solution: Use the Performance Insight feature to analyze your RDS instance. If the analysis results show that the high write IOPS occurs when you import data, we recommend that you do not import data into your RDS instance during peak hours.
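    As a quick cross-check, you can also compare the cumulative row-insertion counters of the database before and after the suspected import window. The following query is a minimal sketch; the database name postgres is only an example:
    -- Counters are cumulative since stats_reset. Sample the values twice and
    -- compare the deltas to confirm that the high write IOPS coincides with bulk inserts.
    SELECT datname,
           tup_inserted,
           tup_updated,
           tup_deleted,
           stats_reset
    FROM pg_stat_database
    WHERE datname = 'postgres';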

Troubleshoot high I/O issues that are caused by vacuuming

VACUUM is run to delete the dead tuples (junk data) that are generated in tables on which UPDATE and DELETE statements are executed, and to reclaim the storage that is occupied by the dead tuples in a timely manner.
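
For example, you can vacuum a table manually and let PostgreSQL report how many dead row versions it removes. The following sketch uses the table t_all from the examples later in this section:
    -- VERBOSE prints the number of dead row versions that are removed
    -- and the number of pages that are scanned.
    VACUUM VERBOSE t_all;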

  • Symptom:

    The value of the Data Disk IOPS metric indicates a high read IOPS and a high write IOPS, but the value of the Operation Rows metric does not indicate an increased number of scanned rows. In this case, the high read IOPS and the high write IOPS may be caused by vacuuming or checkpointing.

    Note You can view the metrics of your RDS instance on the Standard Monitoring tab of the Monitoring and Alerts page in the ApsaraDB RDS console.
    Run the select * from pg_stat_progress_vacuum; command to query the pg_stat_progress_vacuum view. The view shows the vacuum operations that are being performed and the progress of each vacuum operation. A large number of vacuum operations can cause a high read IOPS and a high write IOPS.
    -[ RECORD 1 ]------+--------------
    pid                | 109229
    datid              | 13593
    datname            | postgres
    relid              | 40435
    phase              | scanning heap
    heap_blks_total    | 943453
    heap_blks_scanned  | 937101
    heap_blks_vacuumed | 0
    index_vacuum_count | 0
    max_dead_tuples    | 291
    num_dead_tuples    | 0
    You can also run the select * from pg_stat_user_tables where relname = 't_all'; command to query vacuum operations that are complete on a specific table.
    -[ RECORD 1 ]-------+------------------------------
    relid               | 40435
    schemaname          | public
    relname             | t_all
    seq_scan            | 4547
    seq_tup_read        | 35959634448
    idx_scan            |
    idx_tup_fetch       |
    n_tup_ins           | 91514896
    n_tup_upd           | 0
    n_tup_del           | 0
    n_tup_hot_upd       | 0
    n_live_tup          | 91514896
    n_dead_tup          | 0
    n_mod_since_analyze | 0
    last_vacuum         | 2020-11-20 14:25:19.077927+08
    last_autovacuum     |
    last_analyze        |
    last_autoanalyze    | 2020-11-20 11:59:34.641906+08
    vacuum_count        | 1
    autovacuum_count    | 0
    analyze_count       | 0
    autoanalyze_count   | 5
  • Solution:

    Log on to the ApsaraDB RDS console and open the Parameters page of your RDS instance. Then, modify the values of the vacuum-related parameters to reduce the impact of vacuum operations on I/O.

    • The autovacuum_vacuum_cost_delay parameter specifies the length of time for which the vacuum process is suspended after the cost of the vacuum process reaches the maximum cost that is specified by the autovacuum_vacuum_cost_limit parameter. Unit: ms.
    • The autovacuum_vacuum_cost_limit parameter specifies the maximum cost that the vacuum process is allowed to accumulate before it is suspended.
    Notice If the vacuuming speed is too low, serious table bloat may occur.
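    Before you modify the parameters, you can check their current values and find the tables that accumulate the most dead tuples. The following queries are a minimal sketch:
    -- Current cost-based delay settings for autovacuum.
    SHOW autovacuum_vacuum_cost_delay;
    SHOW autovacuum_vacuum_cost_limit;
    -- Tables with the most dead tuples. These tables are the most likely to bloat
    -- if vacuuming is slowed down too much.
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;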

Troubleshoot high I/O issues that are caused by checkpointing

  • Symptom:
    Run the select * from pg_stat_bgwriter; command to query the pg_stat_bgwriter view, which contains the checkpoints_req metric. The checkpoints_req metric indicates the number of checkpoints that are requested manually or triggered because the amount of write-ahead log (WAL) data that is written reaches the value of the max_wal_size parameter. In this example, the value of the checkpoints_req metric is 160.
    -[ RECORD 1 ]---------+------------------------------
    checkpoints_timed     | 8271
    checkpoints_req       | 160
    checkpoint_write_time | 9109283
    checkpoint_sync_time  | 22878
    buffers_checkpoint    | 1228674
    buffers_clean         | 829714
    maxwritten_clean      | 293
    buffers_backend       | 3731358
    buffers_backend_fsync | 0
    buffers_alloc         | 2290169
    stats_reset           | 2020-06-28 15:32:00.021466+08
  • Solution:

    Check whether a large amount of data is being imported. If you need to import a large amount of data, we recommend that you break the data into batches and import one batch at a time. This way, you can reduce I/O.
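    To estimate how often checkpoints are forced by WAL volume instead of the checkpoint_timeout timer, you can compare the two counters in the pg_stat_bgwriter view. The following query is a minimal sketch:
    -- A high proportion of requested checkpoints indicates that max_wal_size is
    -- reached frequently, for example during bulk imports.
    SELECT checkpoints_timed,
           checkpoints_req,
           round(checkpoints_req::numeric
                 / nullif(checkpoints_timed + checkpoints_req, 0), 2) AS requested_ratio,
           stats_reset
    FROM pg_stat_bgwriter;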

Note Checkpointing refers to the operation of reading dirty pages from the shared buffer and writing the dirty pages to the disk. Checkpointing is triggered in the following situations:
  • The time period that is specified by the checkpoint_timeout parameter elapses. The parameter specifies the interval at which your RDS instance reads dirty pages from the shared buffer and writes the dirty pages to the disk. The write operations are asynchronous. The dirty pages are divided into small batches, which are written to the disk one by one. This reduces the impact of checkpointing on I/O.
  • The amount of WAL data that is written reaches the value of the max_wal_size parameter. WAL records are written to the WAL files in a cyclic order. When the amount of WAL data that is written reaches the value of the max_wal_size parameter, checkpointing is triggered to synchronously write dirty pages to the disk at the maximum speed that is allowed. This has a significant impact on I/O.
  • A manual checkpointing operation is performed. In this case, dirty pages are synchronously written to the disk at the maximum speed that is allowed. This has a significant impact on I/O.
  • Your RDS instance is restarted. When your RDS instance is restarted, a checkpointing operation is performed to synchronously write dirty pages to the disk at the maximum speed that is allowed. This has a significant impact on I/O.
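
You can view the thresholds for the first two triggers directly in the database, and you can issue a manual checkpoint with a single statement if your account has the required privilege. The following is a minimal sketch; run CHECKPOINT only during off-peak hours because it writes all dirty pages at the maximum speed that is allowed:
    -- View the current trigger thresholds.
    SHOW checkpoint_timeout;
    SHOW max_wal_size;
    -- Force an immediate checkpoint. This causes an I/O spike.
    CHECKPOINT;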