ApsaraDB RDS for MySQL provides various performance metrics and supports custom views. You can select metrics to create custom views based on your business requirements. ApsaraDB RDS for MySQL provides powerful diagnostic capabilities to detect events on your RDS instance at the earliest opportunity and automatically diagnose the events to analyze root causes and provide suggestions. ApsaraDB RDS for MySQL also allows you to select a time period during which you want to perform manual diagnostics. This helps you understand the status of your RDS instance.

Description

The standard monitoring feature of ApsaraDB RDS for MySQL is upgraded and integrated with the performance trend feature of Database Autonomy Service (DAS) to provide more performance monitoring metrics and support custom views. You can select metrics to create custom views based on your business requirements. For more information, see Performance trends. ApsaraDB RDS for MySQL provides powerful diagnostic capabilities to detect events on your RDS instance at the earliest opportunity and automatically diagnose the events to analyze root causes and provide suggestions. ApsaraDB RDS for MySQL also allows you to select a time period during which you want to perform manual diagnostics.

  • For more information about the performance parameters that are related to the metrics, see Performance parameters.
  • DAS is a stable, secure, and efficient cloud service that is designed based on machine learning and expert experience to implement automated perception, recovery, optimization, O&M, and security assurance for databases. DAS helps prevent service failures that are caused by manual operations. For more information, see Overview of DAS.

Procedure

View standard monitoring information

  1. Access RDS Instances, select a region at the top, and then click the ID of the target RDS instance.
  2. In the left-side navigation pane, click Monitoring and Alerts.
  3. On the Standard Monitoring tab, click the Standard View or Custom View tab and perform the following operations based on your business requirements:
    • On the Standard View tab, select a time range to view the trend of metrics and statistics about various events.
      You can click Add Trend Comparison to view the comparison of performance trends in a specific time range based on the metrics. Add Trend Comparison
      Note When you specify a time range, the end time must be later than the start time, and the interval between the start time and the end time cannot exceed 30 days.
      • Event statistics section: You can click View Details to go to the Performance Events tab. On this tab, you can view the details about exceptions, optimization events, and auto scaling events. The details include the events that are scheduled, being executed, and executed.
      • Trend charts section:
        • In addition to the classic view, the system provides diagnostic views for common issues. This allows you to quickly identify the causes of issues based on the trends of key metrics in the diagnostic views.

          The following diagnostic views are provided: Memory OOM diagnosis, Read-only instance delay diagnosis, Diagnosis of Space Full Problem, CPU jitter diagnosis, and Large transaction recognition diagnosis. You can select a diagnostic view based on your business requirements. For more information, see Use diagnostic views.

          Note You can click the Metric icon to the right of a monitoring item to view the metrics of the monitoring item.
          Select View
        • If you select Classic View, you can click More Metrics to select the monitoring items that you want the system to display.
        • You can select the severity levels of events. If events of the selected severity levels are detected, the system displays the detected events in the MySQL CPU Utilization/Memory Usage and Session trend charts.

          If you select Classic View, you can click the event in a trend chart and view the diagnostic result in the event list.

          Event monitoring
        • You can specify a time range and click Diagnose in the trend chart of a monitoring item to analyze the metrics in the selected time range.
        • You can click Common Causes in the trend chart of a monitoring item to view the common causes of the exceptions in the monitoring item.
        • You can click Details in the trend chart of a monitoring item to expand the chart. You can also change the time range to view the change trends of the monitoring item in a specific time range.
    • On the Custom View tab, click Add Monitoring Dashboard to create a custom dashboard to display the metrics that you want to monitor.
      • You can click Add Node and Metrics to select the RDS instances and metrics for the dashboard.
      • You can set the Display Mode parameter to Merge Display or Separate Display.
        • If you select Merge Display, multiple metrics are displayed in the same trend chart.
        • If you select Separate Display, each metric is displayed in a separate trend chart.
          • You can configure the Chart Layout parameter to specify the number of trend charts of metrics in each row.
          • You can click Details in the trend chart of a metric to expand the chart. You can also change the time range to view the change trends of the metric in a specific time range.

View legacy monitoring information

  1. Access RDS Instances, select a region at the top, and then click the ID of the target RDS instance.
  2. In the left-side navigation pane, click Monitoring and Alerts.
  3. On the Legacy Monitoring tab, click Resource Monitoring, Engine Monitoring, or Deploy Monitoring. Then, specify a time range. If you use RDS Cluster Edition, you can also select an cluster ID or node ID. The following table describes the monitoring items.
    Note When you specify a time range, the end time must be later than the start time, and the interval between the start time and the end time cannot exceed 30 days.
    TabMonitoring itemDescription
    Resource MonitoringCPU Utilization and Memory Usage (%)The CPU utilization and memory usage of the RDS instance. This excludes the CPU utilization and memory usage of the operating system.
    Disk Space (MB)The disk usage of the RDS instance. The disk usage provides the following information:
    • Disk Space Used
    • Data Space Used
    • Log Space Used
    • Temporary File Space Used
    • System File Space Used

    Unit: MB.

    IOPSThe IOPS of the RDS instance.
    Total ConnectionsThe number of active connections to the RDS instance and the total number of connections to the RDS instance.
    Network Traffic (KB)The volumes of inbound and outbound traffic of the RDS instance per second. Unit: KB.
    Engine MonitoringTPS/QPSThe average number of transactions per second (TPS) and the average number of SQL statements that are executed per second.
    InnoDB Buffer Pool Read Hit Ratio, Usage Ratio, and Dirty Block Ratio (%)The read hit ratio, usage, and dirty ratio of the InnoDB buffer pool.
    InnoDB Read/Write Volume (KB)The amount of data that InnoDB reads and writes per second. Unit: KB.
    InnoDB Buffer Pool Read/Write FrequencyThe number of read and write operations that InnoDB performs per second.
    InnoDB Log Reads, Writes, and fsync() OperationsThe number of physical writes to log files by InnoDB per second, the number of log write requests by InnoDB per second, and the number of fsync writes to log files by InnoDB per second.
    Temporary Tables Automatically Created on Hard Disk when MySQL Statements Are ExecutedThe number of temporary tables that are automatically created on the hard disk when SQL statements are executed.
    MySQL_COMDMLThe number of SQL statements that are executed per second. ApsaraDB RDS supports the following SQL statements:
    • Insert
    • Delete
    • Insert_Select
    • Replace
    • Replace_Select
    • Select
    • Update
    MySQL_RowDMLThe number of operations that InnoDB performs per second. The following items are included:
    • The number of physical writes to log files per second
    • The number of rows on which InnoDB performs operations per second. This includes the number of rows that are read from InnoDB tables per second, the number of rows that are updated in InnoDB tables per second, the number of rows that are deleted from InnoDB tables per second, and the number of rows that are inserted into InnoDB tables per second.
    MyISAM Read/Write FrequencyThe number of reads from the buffer pool by MyISAM per second, the number of writes to the buffer pool by MyISAM per second, the number of reads from the hard disk by MyISAM per second, and the number of writes to the hard disk by MyISAM per second.
    MyISAM Key Buffer Read/Write/Usage Ratio (%)The read hit ratio, write hit ratio, and usage of the MyISAM key buffer per second.
    MySQL Thread StatusThe number of active threads and the number of connected threads.
    Note The number of active threads may be greater than the number of connected threads. This is because threads may be used for other workloads than connection establishments.
    Redo Log Size Written by InnoDB per Second (Bytes/s)The amount of log data that is written to the redo log files per second. Unit: bytes.
    MySQL_ROW_LOCKThe information about row-level locks in InnoDB. The monitoring item consists of the following metrics:
    • innodb_row_lock_waits: the average number of times the row-level locks have been waited for.
    • innodb_row_lock_time: the maximum time of a row-level lock that has been waited for. Unit: milliseconds.
    • innodb_row_lock_avg: the average time of row-level locks that have been waited for. Unit: milliseconds.
    MySQL_SelectScanThe number of full table scans.
    Deploy MonitoringReplication Thread Status of Secondary InstancesThe status of the threads that are used to replicate data to the secondary RDS instance.
    • I/O Thread Status of Secondary Instances: The value 1 indicates that the thread is normal, and the value 0 indicates that the thread is lost.
    • SQL Thread Status of Secondary Instances: The value 1 indicates that the thread is normal, and the value 0 indicates that the thread is lost.
    Replication Latency of Secondary Instances (s)The latency of data replication to the secondary RDS instance. Unit: seconds.

Use diagnostic views

Memory OOM diagnosis

Memory OOM diagnosis
You can analyze and resolve the out of memory (OOM) issue based on the trend charts of monitoring items that are displayed in the Memory OOM diagnosis view.
  • Memory Usage:
    • If the memory usage slowly and continuously increases for a long period of time, such as more than seven days, and the usage of the InnoDB buffer pool remains unchanged, a memory leak may occur.
    • If the memory usage suddenly increases and the usage of the InnoDB buffer pool remains unchanged, your services may encounter traffic spikes.
    • If both the memory usage and the usage of the InnoDB buffer pool increase, InnoDB caches as it is accessed and works as expected.
  • Resident Memory: shows the memory capacity.
  • Open files, Temp File Size, Temp Disk Tables, and Sort Rows: show memory consumption.
The increase in memory usage is positively correlated with business metrics. In most cases, SQL statements that cause the increase in memory usage cannot be traced due to OOM errors before the execution of the SQL statements are completed. We recommend that you perform the following operations:
  • Check business logs to identify the cause of the increase in memory usage.
  • Increase the memory capacity and enable the SQL Explorer and Audit feature. This way, you can identify the cause of the increase in memory usage based on the points in time at which SQL statements are executed. For more information, see Use the SQL Explorer and Audit feature on an ApsaraDB RDS for MySQL instance.

Read-only instance delay diagnosis

Read-only instance delay diagnosis
You can analyze and resolve latency-related issues of read-only RDS instances based on the trend charts of monitoring items that are displayed in the Read-only instance delay diagnosis view.
  • Active Session: shows whether metadata locks exist and cause congestions.

    In most cases, if a large amount of data is queried, metadata locks cannot be obtained when DDL statements are executed. As a result, the DDL statements block other sessions and cause an accumulation of connections.

  • Resident Memory: shows the memory capacity.
  • DML Rows Processed, Pages Requested, DML/DDL Operations, and Temp Disk Space Used: show common business metrics.
  • Replication Delay: shows the replication latency.

Diagnosis of Space Full Problem

Diagnosis of Space Full Problem

You can analyze and resolve exhausted storage issues based on the trend charts of monitoring items that are displayed in the Diagnosis of Space Full Problem view.

You can check the types of files that occupy storage and the storage usage trend of each type of file. In most cases, the following types of files occupy storage:
  • Data files (user_data_size): You can use the storage analysis feature to check the storage usage of each database and table. Then, you can expand the storage capacity or delete unnecessary data based on your business requirements. For more information about the storage analysis feature, see Use the storage analysis feature for an ApsaraDB RDS for MySQL instance. For more information about how to handle data files, see What do I do if an ApsaraDB RDS for MySQL instance is in the Locked state because its storage capacity is exhausted by data files?
  • Temporary files (temp_file_size): Temporary tables may be generated when you execute SQL statements to sort and group data or associate tables. Binary log cache files are generated before large transactions are committed. These tables and files occupy storage. For more information about how to handle temporary files, see What do I do if an ApsaraDB RDS for MySQL instance is in the Locked state because its storage capacity is exhausted by temporary files?
  • Binary logs (binlog_size): Binary logs are quickly generated by large transactions. These logs occupy storage. For more information about how to handle binary logs, see What do I do if the storage capacity of an ApsaraDB RDS for MySQL instance is exhausted by binary log files?.
    Note If binary logs are subscribed, the binary logs may not be deleted at the earliest opportunity and occupy storage.
  • Undo logs (undo_log_size): In most cases, undo logs cannot be cleared due to long-running queries. You need to check whether queries that are executed for a long period of time and are not completed exist.
    Note In MySQL 5.6 and earlier versions, no separate tablespaces are provided for undo logs.
  • Slow query logs (slowlog_size): If slow query logs occupy a large amount of storage, we recommend that you execute the TRUNCATE statement to clear slow query logs during off-peak hours.
    Note The TRUNCATE statement is supported in MySQL 5.7 with a minor engine version of 20210630 or later and in MySQL 8.0 with a minor engine version of 20210930 or later.
  • General query logs (general_log_size): By default, the general_log parameter is disabled in MySQL. If you enable the general_log parameter, extra storage is occupied. We recommend that you execute the TRUNCATE statement to clear general query logs during off-peak hours.
    Note The TRUNCATE statement is supported in MySQL 5.7 with a minor engine version of 20210630 or later and in MySQL 8.0 with a minor engine version of 20210930 or later.

CPU jitter diagnosis

CPU jitter diagnosis
You can analyze and resolve CPU jitter issues based on the trend charts of monitoring items that are displayed in the CPU jitter diagnosis view. The following types of monitoring items are strongly correlated with CPU utilization.
  • Business monitoring items:
    • Page Request: In most cases, the trend of the number of requests in the buffer pool fluctuates based on CPU utilization.
    • Rows Processed: shows the relationship between CPU utilization and the number of rows that are processed by the system. You can check whether the number of processed rows suddenly increases when CPU utilization changes based on this monitoring item.
    • Queries: shows the types of SQL statements that are executed when CPU utilization changes.
  • Connection-related monitoring items:

    Thread Running: High concurrency leads to high CPU utilization. Metadata locks or row locks cause an accumulation of connections, which affects CPU utilization.

Common causes of CPU jitters:
  • The Page Request or Rows Processed monitoring item changes. In this case, you can select the period of time during which CPU utilization changes and click Diagnose to obtain details about the issue.
  • The number of active connections increases. In this case, you need to resolve the issue from the business side.

Large transaction recognition diagnosis

Large transaction recognition diagnosis
You can analyze and resolve large transaction-related issues based on the trend charts of monitoring items that are displayed in the Large transaction recognition diagnosis view.
  • Threads Connected, Temp File Size, and Binlog space: show whether large transactions exist. If one of the following events occurs, a large transaction exists in the database:
    • Active sessions accumulate.
    • The storage that is occupied by temporary files increases and then decreases.
    • The storage that is occupied by temporary files decreases, but the storage that is occupied by binary logs increases.
  • Rows Processed, Logical Page Write, and Queries per Second: show the type of a large transaction.

    For example, if a few queries are executed but a large number of rows are deleted, a large transaction that deletes data exists.

Large transactions cause the blocking of binary log writes.
  • When large transactions exist, the storage that is occupied by temporary tables (binlog cache) gradually increases and then remains stable.
  • When the storage that is occupied by temporary tables remains stable, the storage that is occupied by binary logs increases. Binary logs are globally written in serial mode. As a result, other transactions are blocked, which causes the accumulation of connections.
  • If the RDS instance runs RDS High-availability Edition, the statements that are executed to check the status of primary and secondary RDS instances for primary/secondary switchovers are blocked. As a result, a primary/second switchover is directly performed on the RDS instance.

We recommend that you split large transactions into small transactions and separately execute the small transactions. For example, you can add a WHERE clause to the DELETE statement to limit the volume of data that is deleted at a time and split the deletion operation into smaller operations.