AnalyticDB for MySQL: Diagnostics

Last Updated: Apr 26, 2024

AnalyticDB for MySQL Data Warehouse Edition (V3.0) provides the diagnostics feature, which allows you to diagnose the running status of a cluster within a specific period of time. The feature jointly analyzes monitoring data, log data, and the status of databases and tables, and evaluates cluster health from multiple aspects, such as resource usage, workload changes, SQL queries, operators, and storage, to help you efficiently identify and resolve issues.

Procedure

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Clusters page, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, click Monitoring Information.

  3. Drag the pointer over the curve of a metric or select a time range from the drop-down list and click Diagnose.

    Important

    This feature supports diagnostics only on time ranges of up to 1 hour within the last 14 days.

    On the page that appears, you can view the diagnostic results of the cluster.

Interpret the diagnostic report

The diagnostic results include information about bad SQL detection, abnormal pattern detection, business metric detection, compute layer detection, and storage layer detection. You can use the diagnostic results to identify and resolve issues.

Bad SQL detection

Bad SQL detection includes the following metrics: Top Stages SQL, Most Memory-Consuming SQL, Most CPU-Consuming SQL, Top Data Read SQL, Top Data Output SQL, and Most Time-Consuming SQL. Up to 10 SQL queries are displayed in descending order for each metric. The system performs self-diagnostics on each SQL query of the Most Time-Consuming SQL metric. You can click the icon to the left of the Actions column to view the diagnostic results of each SQL query. To analyze an SQL query in depth, click Diagnose in the Actions column to go to the query details page and view the execution plan.

Abnormal pattern detection

Abnormal pattern detection displays abnormal patterns within a specific period of time. For example, the number of submissions or the peak memory of a pattern dramatically increases. Abnormal pattern detection includes the following metrics: Peak Memory, Query Duration, Operator Cost, Amount of Output Data, and Query Submission. Each metric consists of the following fields: SQL Pattern, Detection Result Details, and Other Related Metrics.

  • SQL Pattern: helps you determine the SQL statement that corresponds to the abnormal pattern. A sketch of how queries map to patterns follows this list.

  • Detection Result Details: displays the reasons for the abnormal pattern, such as comparison results against data from previous days.

  • Other Related Metrics: displays other related metrics, such as peak memory, operator cost, and query duration, to facilitate data analysis.
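
An SQL pattern groups queries that differ only in their constants. As a rough, hypothetical illustration (not the pattern-extraction logic that AnalyticDB for MySQL actually uses), the following Python sketch normalizes literals so that such queries map to the same pattern:

    import re

    def sql_pattern(sql):
        # Replace string and numeric literals with a placeholder so that
        # queries that differ only in constants share one pattern.
        pattern = re.sub(r"'[^']*'", "?", sql)
        pattern = re.sub(r"\b\d+(\.\d+)?\b", "?", pattern)
        return re.sub(r"\s+", " ", pattern).strip()

    q1 = "SELECT * FROM orders WHERE user_id = 42 AND status = 'paid'"
    q2 = "SELECT * FROM orders WHERE user_id = 7 AND status = 'open'"
    assert sql_pattern(q1) == sql_pattern(q2)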

Business metric detection

Business metric detection includes cluster connections detection and query response time detection.

Cluster connections detection

This detection compares data within a specific period of up to 1 hour on the current day against the same period on each of the last three days. If the number of connections on the current day is more than 20% higher for 80% of the compared data points against each of the last three days, the number of connections is considered to have grown. You must check for new business traffic or abnormal connections.

Query response time detection

This detection uses the same logic as cluster connections detection. If the query response time on the current day is more than 20% higher for 80% of the compared data points against each of the last three days, the query response time is considered to have grown. You must check for traffic increases or slow queries.

Note

The growth rate is calculated by using the following formula: (Current metric value - Previous metric value)/Previous metric value × 100%.
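
The rule above translates directly into a short check. The following Python sketch is a minimal illustration of the detection logic as described in this topic, not the actual implementation; the function names and the list-based representation of data points are assumptions:

    def grew(current, previous, growth_threshold=0.20, point_ratio=0.80):
        # A data point counts as grown if its growth rate exceeds the
        # threshold: (current - previous) / previous > 0.20.
        grown = sum(
            1 for cur, prev in zip(current, previous)
            if prev > 0 and (cur - prev) / prev > growth_threshold
        )
        # Growth is flagged when at least 80% of the compared points grew.
        return grown >= point_ratio * len(current)

    def is_abnormal_growth(today, last_three_days):
        # The metric is flagged only if today grew against each of the
        # last three days.
        return all(grew(today, day) for day in last_three_days)

The same check applies to both the number of connections and the query response time; only the input series differs.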

Compute layer detection

Compute layer detection includes abnormal operator detection, node status detection, CPU utilization skew detection, average CPU utilization detection for resource groups, CPU utilization skew detection among resource groups, and correlation analysis of time series metrics.

Abnormal operator detection

  • Operator details

    Abnormal operator detection helps you quickly identify abnormal operators, such as Join, Aggregation, Sort, and Window operators, and displays them in descending order by metric. You can analyze abnormal operators based on fields such as the consumed memory, consumed CPU resources, amount of input data, and amount of output data. You can also click Diagnose in the Actions column to view the execution plan of an operator and determine the SQL statement and source of the operator.

  • Operator summary

    In specific cases, multiple abnormal operators may consume large amounts of resources. Abnormal operator detection aggregates the following metrics: most CPU-consuming operators and most memory-consuming operators. You can use the aggregate results to check the maximum metric values and identify abnormal operators.

Node status detection

If executor nodes need to handle large amounts of data or encounter internal errors, the nodes may become unhealthy and eventually fail to execute queries or respond in a timely manner. Node status detection displays exceptions about unhealthy executor nodes based on the status of the nodes within a specific period of time to help you identify exception causes.

CPU utilization skew detection

This detection checks whether the CPU utilization among executor nodes is balanced. If the maximum CPU utilization of all executor nodes remains more than twice the average CPU utilization of all executor nodes for 80% of a specific period of time, the CPU utilization of executor nodes is considered skewed. We recommend that you pay constant attention to this issue and check for data skew. For more information, see Data modeling diagnostics.

Important

Detection results are displayed at the following levels: NORMAL, WARNING, and CRITICAL. To ensure that your business runs as expected, pay close attention to the detection results at the CRITICAL level.
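
The skew rule above, and the similar threshold rules in the subsections that follow, share one shape: a condition must hold for at least 80% of the data points in the selected time range. The following Python sketch summarizes this family of checks based on the rules described in this topic; it is an illustration, not the product's actual code:

    def holds_for_most(conditions, ratio=0.80):
        # True if the condition holds for at least `ratio` of the points.
        return bool(conditions) and sum(conditions) >= ratio * len(conditions)

    def cpu_skewed(max_cpu, avg_cpu, factor=2.0):
        # Skew rule: maximum utilization exceeds `factor` times the average.
        # factor=2.0 for executor and worker nodes, 2.5 for resource groups.
        return holds_for_most([mx > factor * avg for mx, avg in zip(max_cpu, avg_cpu)])

    def cpu_high(cpu_series, threshold=80.0):
        # High-utilization rule: utilization stays above `threshold` percent.
        # threshold=80 for average utilization, 90 for maximum utilization.
        return holds_for_most([c > threshold for c in cpu_series])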

Average CPU utilization detection for resource groups

This detection checks the average CPU utilization of a resource group within a specific period of time. If the CPU utilization of a resource group remains more than 80% for 80% of a specific period of time, the CPU utilization of the resource group is considered high. We recommend that you evaluate whether to change the amount of resources in the resource group. For more information, see the "Modify a resource group" section of the Create a resource group topic.

Important

Detection results are displayed at the following levels: NORMAL, WARNING, and CRITICAL. To ensure that your business runs as expected, pay close attention to the detection results at the CRITICAL level.

CPU utilization skew detection among resource groups

This detection checks whether the CPU utilization among resource groups is balanced. If the maximum CPU utilization of all resource groups remains more than 2.5 times the average CPU utilization of all resource groups for 80% of a specific period of time, the CPU utilization of resource groups is considered skewed. We recommend that you evaluate whether to change the amount of resources in the resource groups. For more information, see the "Modify a resource group" section of the Create a resource group topic.

Important

Detection results are displayed at the following levels: NORMAL, WARNING, and CRITICAL. To ensure that your business runs as expected, pay close attention to the detection results at the CRITICAL level.

Correlation analysis of time series metrics

If the Average CPU Utilization of Compute Node and Maximum CPU Utilization of Compute Node metrics are abnormal, the system uses correlation analysis of time series metrics to detect the related metrics that cause the anomalies. The related metrics include QPS, BUILD Jobs, Update TPS, Deletion TPS, and Write TPS. You can use the related metrics to identify possible causes of high CPU utilization.

For example, if QPS is detected as a metric related to the changes in the Average CPU Utilization of Compute Node metric, an increase in the average CPU utilization of compute nodes may be caused by a large number of queries.
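
As a rough illustration of this kind of analysis, the following Python sketch ranks candidate metrics by the absolute Pearson correlation of their time series with the CPU utilization series. Pearson correlation is an assumption made for illustration; the actual correlation method that AnalyticDB for MySQL uses is not documented here:

    from statistics import mean, pstdev

    def pearson(xs, ys):
        # Pearson correlation coefficient of two equally sampled series.
        mx, my = mean(xs), mean(ys)
        cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx, sy = pstdev(xs), pstdev(ys)
        return cov / (sx * sy) if sx and sy else 0.0

    def rank_related_metrics(cpu_series, candidates):
        # candidates maps metric names (QPS, Write TPS, ...) to their series.
        scored = [(name, pearson(cpu_series, series))
                  for name, series in candidates.items()]
        # Metrics that move most closely with CPU utilization rank first.
        return sorted(scored, key=lambda kv: abs(kv[1]), reverse=True)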

Important

When you use correlation analysis of time series metrics in diagnostics, we recommend that you specify a time range that includes the monitoring periods of key metric anomalies. This helps you identify the causes of the metric anomalies at the earliest opportunity.

Storage layer detection

Storage layer detection includes table creation and access detection, node status detection, maximum CPU utilization detection, average CPU utilization detection, CPU utilization skew detection, and correlation analysis of time series metrics.

Table creation and access detection

  • Table access detection

    This detection checks the tables that are accessed within a specific period of time and displays the tables in descending order by metric. The following metrics are included: Maximum Amount of Read Data, Average Amount of Read Data, Maximum Read Cost, Average Read Cost, and Table Accesses.

  • Data skew detection

    In AnalyticDB for MySQL partitioned tables, you must select evenly distributed fields as distribution fields to distribute data across worker nodes. If you select unevenly distributed fields, table skew may occur; the sketch after this list illustrates this effect. This detection identifies the tables that have the largest amount of data and checks whether the data is skewed. After you optimize skewed tables, query performance improves and the amount of data is balanced among disks. This prevents disks from being locked.

  • Partition unreasonability detection

    This detection identifies the tables that have the largest amount of data and checks whether data is partitioned in an unreasonable manner.
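
To see why an unevenly distributed field skews data, consider hashing a column's values into shards and measuring the imbalance. The following Python sketch is a simplified illustration; it uses Python's built-in hash() as a stand-in for the actual hash distribution that AnalyticDB for MySQL applies to distribution fields:

    from collections import Counter

    def shard_imbalance(values, shard_count=8):
        # Ratio of the largest shard's row count to the average row count.
        # A ratio close to 1.0 means even distribution; larger means skew.
        counts = Counter(hash(v) % shard_count for v in values)
        return max(counts.values()) / (len(values) / shard_count)

    # A high-cardinality key such as an ID distributes evenly (ratio ~1.0) ...
    print(shard_imbalance(list(range(100_000))))
    # ... while a low-cardinality key such as a status flag skews badly:
    # one shard holds at least 90% of the rows (ratio >= 7.2).
    print(shard_imbalance(["paid"] * 90_000 + ["open"] * 10_000))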

Node status detection

If worker nodes need to handle large amounts of data or encounter internal errors, the nodes may become unhealthy and eventually fail to execute queries or respond in a timely manner. Node status detection displays exceptions about unhealthy worker nodes based on the status of the nodes within a specific period of time to help you identify exception causes.

Maximum CPU utilization detection

This detection checks the maximum CPU utilization of all worker nodes within a specific period of time. If the maximum CPU utilization of all worker nodes remains more than 90% for 80% of a specific period of time, the CPU utilization of worker nodes is considered high. We recommend that you check for slow queries. For more information, see Typical slow queries.

Important

Detection results are displayed at the following levels: NORMAL, WARNING, and CRITICAL. To ensure that your business runs as expected, pay close attention to the detection results at the CRITICAL level.

Average CPU utilization detection

This detection checks the average CPU utilization of all worker nodes within a specific period of time. If the average CPU utilization of all worker nodes remains more than 80% for 80% of a specific period of time, the CPU utilization of worker nodes is considered high. We recommend that you check for slow queries. For more information, see Typical slow queries.

Important

Detection results are displayed at the following levels: NORMAL, WARNING, and CRITICAL. To ensure that your business runs as expected, pay close attention to the detection results at the CRITICAL level.

CPU utilization skew detection

This detection checks whether the CPU utilization among worker nodes is balanced. If the maximum CPU utilization of all worker nodes remains more than twice the average CPU utilization of all worker nodes for 80% of a specific period of time, the CPU utilization of worker nodes is considered skewed. We recommend that you pay constant attention to this issue and check for data skew. For more information, see Data modeling diagnostics.

Important

Detection results are displayed at the following levels: NORMAL, WARNING, and CRITICAL. To ensure that your business runs as expected, pay close attention to the detection results at the CRITICAL level.

Correlation analysis of time series metrics

If the Average CPU Utilization of Storage Node and Maximum CPU Utilization of Storage Node metrics are abnormal, the system uses correlation analysis of time series metrics to detect the related metrics that cause the anomalies. The metrics include QPS, BUILD Jobs, Update TPS, Deletion TPS, and Write TPS. You can use the related metrics to identify possible causes of high CPU utilization.

For example, if QPS is detected as a metric related to the changes in the Average CPU Utilization of Storage Node metric, an increase in the average CPU utilization of storage nodes may be caused by a large number of queries.

Important

When you use correlation analysis of time series metrics in diagnostics, we recommend that you specify a time range that includes the monitoring periods of key metric anomalies. This helps you identify the causes of the metric anomalies at the earliest opportunity.