The diagnostics feature analyzes your AnalyticDB for MySQL Data Warehouse Edition (V3.0) cluster's health across five dimensions: SQL performance, pattern anomalies, business traffic, compute layer, and storage layer. It correlates monitoring data, logs, and database table status to help you pinpoint and resolve performance issues faster.
Limitation: Diagnostics covers a time window of up to 1 hour within the last 14 days.
Run a diagnostic
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters, then click the cluster ID.
In the left-side navigation pane, click Monitoring Information.
Select a time range by dragging your pointer over a metric curve or by choosing a range from the drop-down list, then click Diagnose.
The diagnostic report appears on the next page.
Interpret the diagnostic report
The report covers five detection areas. Detection results are reported at three severity levels: NORMAL, WARNING, and CRITICAL. Address CRITICAL issues promptly to keep your workloads running as expected.
Bad SQL detection
Bad SQL detection surfaces the top 10 queries for each of the following metrics, ranked in descending order:
| Metric | What to investigate |
|---|---|
| Top Stages SQL | Queries with the most execution stages; check for complex query plans |
| Most Memory-Consuming SQL | Queries consuming the most memory; consider adding filters or rewriting joins (see the sketch below) |
| Most CPU-Consuming SQL | Queries consuming the most CPU; look for missing indexes or full table scans |
| Top Data Read SQL | Queries reading the most data; check for data skew or missing partition pruning |
| Top Data Output SQL | Queries producing the most output data; check for unnecessary columns or aggregations |
| Most Time-Consuming SQL | Queries with the longest execution time; the system runs self-diagnostics on each |
For Most Time-Consuming SQL, the system automatically runs self-diagnostics on each query. Click the expand icon to the left of the Actions column to see the results. To view the full execution plan, click Diagnose in the Actions column.
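As an illustration of the kind of rewrite the Most Memory-Consuming SQL metric points toward, the sketch below narrows a join with a pushed-down filter. The table and column names are hypothetical, not taken from your cluster:

```sql
-- Before: the join hashes the entire customers table and scans all orders,
-- so peak memory grows with total data volume.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- After: a selective predicate below the join shrinks both inputs,
-- lowering peak memory and CPU for the same business result.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL 7 DAY;
```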
Abnormal pattern detection
Abnormal pattern detection identifies SQL patterns whose behavior changed significantly during the selected period, such as a spike in submission count or peak memory. The following metrics are monitored:
Peak Memory
Query Duration
Operator Cost
Amount of Output Data
Query Submission
Each detected pattern includes three fields:
| Field | Description |
|---|---|
| SQL Pattern | The normalized SQL template associated with the anomaly |
| Detection Result Details | The reason for the anomaly, such as a comparison with the same period on previous days |
| Other Related Metrics | Supporting metrics (for example, peak memory, operator cost, query duration) to aid analysis |
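For context, a SQL pattern is the template that remains after a query's literal values are parameterized, so many concrete statements map to one pattern. A hypothetical illustration:

```sql
-- Two concrete queries submitted by an application...
SELECT * FROM orders WHERE customer_id = 1001 AND status = 'paid';
SELECT * FROM orders WHERE customer_id = 2417 AND status = 'open';

-- ...normalize to a single pattern, so their metrics are aggregated
-- together and compared against the pattern's own history:
SELECT * FROM orders WHERE customer_id = ? AND status = ?;
```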
Business metric detection
Business metric detection compares activity in the selected period against the same period over the past three days. A metric is flagged as growing when more than 80% of the compared data points show an increase exceeding 20%.
Growth rate formula: (Current value - Previous value) / Previous value × 100%
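If you export metric data yourself, the check translates directly into SQL. A minimal sketch, assuming a hypothetical qps_by_minute table with one row per minute (the service computes this internally):

```sql
-- Share of data points whose QPS grew by more than 20% versus the
-- same minute one day earlier.
SELECT AVG(CASE WHEN (cur.qps - prev.qps) / prev.qps > 0.20
                THEN 1.0 ELSE 0.0 END) AS growing_point_ratio
FROM qps_by_minute cur
JOIN qps_by_minute prev
  ON prev.minute_ts = cur.minute_ts - INTERVAL 1 DAY
WHERE prev.qps > 0;
-- A ratio above 0.8 corresponds to the "growing" flag described above.
```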
Cluster connections detection
If connections are flagged as growing, check for new business traffic or abnormal connection behavior.
Query response time detection
If query response time is flagged as growing, check for traffic increases or slow queries.
Compute layer detection
Compute layer detection monitors executor nodes and resource groups.
Abnormal operator detection
Abnormal operator detection identifies resource-intensive operators (Join, Aggregation, Sort, and Window) and ranks them by each metric in descending order. For each operator, you can inspect consumed memory, consumed CPU, amount of input data, and amount of output data.
The detection includes two views:
Operator details: Pinpoints individual abnormal operators. Click Diagnose in the Actions column to view the execution plan and trace the SQL statement back to its source.
Operator summary: Aggregates the most CPU-consuming and most memory-consuming operators so you can quickly identify the maximum resource impact across multiple abnormal operators.
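Outside the console, you can also inspect a query's operators yourself with EXPLAIN. A generic sketch with a hypothetical table; the exact plan output depends on your cluster version:

```sql
-- Look for the Join, Aggregation, Sort, or Window operators flagged by
-- the detection and compare their position in the plan with the report.
EXPLAIN
SELECT customer_id, COUNT(*) AS order_cnt
FROM orders
GROUP BY customer_id
ORDER BY order_cnt DESC;
```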
Node status detection
If executor nodes handle excessive data volumes or encounter internal errors, they can become unhealthy and fail to execute queries promptly. Node status detection surfaces exceptions based on node status during the selected period.
CPU utilization skew detection
Threshold: Maximum CPU utilization of all executor nodes exceeds twice the average for more than 80% of the selected period.
If skew is detected, investigate data skew as the likely cause.
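If you export node CPU metrics, this threshold is easy to reproduce. A minimal sketch, assuming a hypothetical cpu_samples table with one row per executor node per sampling instant:

```sql
-- Share of sampling instants where the hottest executor node runs at
-- more than twice the average CPU utilization of all executor nodes.
SELECT AVG(CASE WHEN max_cpu > 2 * avg_cpu THEN 1.0 ELSE 0.0 END) AS skew_ratio
FROM (
  SELECT sample_ts,
         MAX(cpu_util) AS max_cpu,
         AVG(cpu_util) AS avg_cpu
  FROM cpu_samples
  GROUP BY sample_ts
) per_instant;
-- skew_ratio > 0.8 matches this detection's threshold. The other CPU
-- thresholds in this topic follow the same shape with a different
-- predicate, for example avg_cpu > 0.8 or max_cpu > 2.5 * avg_cpu.
```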
Average CPU utilization detection for resource groups
Threshold: CPU utilization of a resource group stays above 80% for more than 80% of the selected period.
If utilization is high, evaluate whether to adjust the resource allocation for that group. See Modify a resource group for instructions.
CPU utilization skew detection among resource groups
Threshold: Maximum CPU utilization across all resource groups exceeds 2.5 times the average for more than 80% of the selected period.
If skew is detected, evaluate whether to rebalance resources across resource groups. See Modify a resource group for instructions.
Correlation analysis of time series metrics
When Average CPU Utilization of Compute Node and Maximum CPU Utilization of Compute Node are abnormal, the system identifies correlated metrics that may explain the spike: QPS, BUILD Jobs, Update TPS, Deletion TPS, and Write TPS.
For example, if QPS correlates with an increase in average CPU utilization, a surge in query volume is a likely cause.
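To make "correlates" concrete, the sketch below computes a Pearson coefficient between per-minute QPS and average CPU utilization; metrics_by_minute and its columns are hypothetical stand-ins for exported monitoring data:

```sql
-- Pearson correlation between QPS and average CPU utilization;
-- values near 1 suggest query volume is driving the CPU spike.
SELECT (COUNT(*) * SUM(qps * avg_cpu) - SUM(qps) * SUM(avg_cpu)) /
       (SQRT(COUNT(*) * SUM(qps * qps) - SUM(qps) * SUM(qps)) *
        SQRT(COUNT(*) * SUM(avg_cpu * avg_cpu) - SUM(avg_cpu) * SUM(avg_cpu)))
       AS pearson_r
FROM metrics_by_minute;
```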
Storage layer detection
Storage layer detection monitors worker nodes and table health.
Table creation and access detection
This detection checks the most-accessed tables during the selected period and surfaces three types of issues:
Table access detection: Ranks tables in descending order by Maximum Amount of Read Data, Average Amount of Read Data, Maximum Read Cost, Average Read Cost, and Table Accesses.
Data skew detection: Checks whether the tables with the most data have uneven data distribution across worker nodes. When you create partitioned tables in AnalyticDB for MySQL, choose evenly distributed fields as distribution fields (see the DDL sketch after this list). Fixing skewed tables improves query performance and balances disk usage, which prevents disk contention.
Unreasonable partitioning detection: Checks whether the tables with the most data have partitioning configurations that could hurt performance.
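The data skew and partitioning checks map to clauses of the CREATE TABLE statement. A minimal sketch with hypothetical names, following AnalyticDB for MySQL V3.0 DDL; verify the exact clauses against the CREATE TABLE documentation for your version:

```sql
CREATE TABLE orders (
  order_id    BIGINT NOT NULL,
  customer_id BIGINT NOT NULL,
  order_date  DATETIME NOT NULL,
  amount      DECIMAL(18, 2),
  PRIMARY KEY (order_id, order_date)
)
-- A high-cardinality, evenly distributed field avoids data skew
-- across worker nodes.
DISTRIBUTED BY HASH(order_id)
-- Date-based partitions enable partition pruning; LIFECYCLE caps how
-- many daily partitions are retained.
PARTITION BY VALUE(DATE_FORMAT(order_date, '%Y%m%d')) LIFECYCLE 90;
```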
Node status detection
If worker nodes handle excessive data volumes or encounter internal errors, they can become unhealthy. Node status detection surfaces exceptions based on node status during the selected period.
Maximum CPU utilization detection
Threshold: Maximum CPU utilization across all worker nodes stays above 90% for more than 80% of the selected period.
If utilization is high, investigate slow queries. See Typical slow queries for guidance.
Average CPU utilization detection
Threshold: Average CPU utilization across all worker nodes stays above 80% for more than 80% of the selected period.
If utilization is high, investigate slow queries. See Typical slow queries for guidance.
CPU utilization skew detection
Threshold: Maximum CPU utilization across all worker nodes exceeds twice the average for more than 80% of the selected period.
If skew is detected, investigate data skew as the likely cause.
Correlation analysis of time series metrics
When Average CPU Utilization of Storage Node and Maximum CPU Utilization of Storage Node are abnormal, the system identifies correlated metrics that may explain the spike: QPS, BUILD Jobs, Update TPS, Deletion TPS, and Write TPS.
For example, if QPS correlates with an increase in average CPU utilization on storage nodes, a surge in query volume is a likely cause.