The diagnostics feature analyzes your AnalyticDB for MySQL Data Warehouse Edition (V3.0) cluster's health across five dimensions: SQL performance, pattern anomalies, business traffic, compute layer, and storage layer. It correlates monitoring data, logs, and database table status to help you pinpoint and resolve performance issues faster.
Limitation: Diagnostics covers a time window of up to 1 hour within the last 14 days.
Run a diagnostic
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters, then click the cluster ID.
In the left-side navigation pane, click Monitoring Information.
Select a time range by dragging your pointer over a metric curve or by choosing a range from the drop-down list, then click Diagnose.
The diagnostic report appears on the next page.
Interpret the diagnostic report
The report covers five detection areas. Detection results are reported at three severity levels: NORMAL, WARNING, and CRITICAL. Address CRITICAL issues promptly to keep your workloads running as expected.
Bad SQL detection
Bad SQL detection surfaces the top 10 queries for each of the following metrics, ranked in descending order:
| Metric | What to investigate |
|---|---|
| Top Stages SQL | Queries with the most execution stages; check for complex query plans |
| Most Memory-Consuming SQL | Queries consuming the most memory; consider adding filters or rewriting joins (see the sketch below) |
| Most CPU-Consuming SQL | Queries consuming the most CPU; look for missing indexes or full table scans |
| Top Data Read SQL | Queries reading the most data; check for data skew or missing partition pruning |
| Top Data Output SQL | Queries producing the most output data; check for unnecessary columns or aggregations |
| Most Time-Consuming SQL | Queries with the longest execution time; the system runs self-diagnostics on each |
For Most Time-Consuming SQL, the system automatically runs self-diagnostics on each query. Click the expand icon to the left of the Actions column to see the results. To view the full execution plan, click Diagnose in the Actions column.
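As an illustration of the kind of rewrite the Most Memory-Consuming SQL metric points toward, the sketch below narrows a join with a pushed-down filter. The table and column names are hypothetical, not taken from your cluster:

```sql
-- Before: the join hashes the entire customers table and scans all orders,
-- so peak memory grows with total data volume.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- After: a selective predicate below the join shrinks both inputs,
-- lowering peak memory and CPU for the same business result.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL 7 DAY;
```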
Abnormal pattern detection
Abnormal pattern detection identifies SQL patterns whose behavior changed significantly during the selected period, such as a spike in submission count or peak memory. The following metrics are monitored:
Peak Memory
Query Duration
Operator Cost
Amount of Output Data
Query Submission
Each detected pattern includes three fields:
| Field | Description |
|---|---|
| SQL Pattern | The normalized SQL template associated with the anomaly |
| Detection Result Details | The reason for the anomaly, such as a comparison with the same period on previous days |
| Other Related Metrics | Supporting metrics (for example, peak memory, operator cost, query duration) to aid analysis |
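For context, a SQL pattern is the template that remains after a query's literal values are parameterized, so many concrete statements map to one pattern. A hypothetical illustration:

```sql
-- Two concrete queries submitted by an application...
SELECT * FROM orders WHERE customer_id = 1001 AND status = 'paid';
SELECT * FROM orders WHERE customer_id = 2417 AND status = 'open';

-- ...normalize to a single pattern, so their metrics are aggregated
-- together and compared against the pattern's own history:
SELECT * FROM orders WHERE customer_id = ? AND status = ?;
```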
Business metric detection
Business metric detection compares activity in the selected period against the same period over the past three days. A metric is flagged as growing when more than 80% of the compared data points show an increase exceeding 20%.
Growth rate formula: (Current value - Previous value) / Previous value × 100%
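If you export metric data yourself, the check translates directly into SQL. A minimal sketch, assuming a hypothetical qps_by_minute table with one row per minute (the service computes this internally):

```sql
-- Share of data points whose QPS grew by more than 20% versus the
-- same minute one day earlier.
SELECT AVG(CASE WHEN (cur.qps - prev.qps) / prev.qps > 0.20
                THEN 1.0 ELSE 0.0 END) AS growing_point_ratio
FROM qps_by_minute cur
JOIN qps_by_minute prev
  ON prev.minute_ts = cur.minute_ts - INTERVAL 1 DAY
WHERE prev.qps > 0;
-- A ratio above 0.8 corresponds to the "growing" flag described above.
```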
Cluster connections detection
If connections are flagged as growing, check for new business traffic or abnormal connection behavior.
Query response time detection
If query response time is flagged as growing, check for traffic increases or slow queries.
Compute layer detection
Compute layer detection monitors executor nodes and resource groups.
Abnormal operator detection
Abnormal operator detection identifies resource-intensive operators (Join, Aggregation, Sort, and Window) and ranks them by each metric in descending order. For each operator, you can inspect consumed memory, consumed CPU, amount of input data, and amount of output data.
The detection includes two views:
Operator details: Pinpoints individual abnormal operators. Click Diagnose in the Actions column to view the execution plan and trace the SQL statement back to its source.
Operator summary: Aggregates the most CPU-consuming and most memory-consuming operators so you can quickly identify the maximum resource impact across multiple abnormal operators.
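Outside the console, you can also inspect a query's operators yourself with EXPLAIN. A generic sketch with a hypothetical table; the exact plan output depends on your cluster version:

```sql
-- Look for the Join, Aggregation, Sort, or Window operators flagged by
-- the detection and compare their position in the plan with the report.
EXPLAIN
SELECT customer_id, COUNT(*) AS order_cnt
FROM orders
GROUP BY customer_id
ORDER BY order_cnt DESC;
```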
Node status detection
If executor nodes handle excessive data volumes or encounter internal errors, they can become unhealthy and fail to execute queries promptly. Node status detection surfaces exceptions based on node status during the selected period.
CPU utilization skew detection
Threshold: Maximum CPU utilization of all executor nodes exceeds twice the average for more than 80% of the selected period.
If skew is detected, investigate data skew as the likely cause.
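If you export node CPU metrics, this threshold is easy to reproduce. A minimal sketch, assuming a hypothetical cpu_samples table with one row per executor node per sampling instant:

```sql
-- Share of sampling instants where the hottest executor node runs at
-- more than twice the average CPU utilization of all executor nodes.
SELECT AVG(CASE WHEN max_cpu > 2 * avg_cpu THEN 1.0 ELSE 0.0 END) AS skew_ratio
FROM (
  SELECT sample_ts,
         MAX(cpu_util) AS max_cpu,
         AVG(cpu_util) AS avg_cpu
  FROM cpu_samples
  GROUP BY sample_ts
) per_instant;
-- skew_ratio > 0.8 matches this detection's threshold. The other CPU
-- thresholds in this topic follow the same shape with a different
-- predicate, for example avg_cpu > 0.8 or max_cpu > 2.5 * avg_cpu.
```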
Average CPU utilization detection for resource groups
Threshold: CPU utilization of a resource group stays above 80% for more than 80% of the selected period.
If utilization is high, evaluate whether to adjust the resource allocation for that group. See Modify a resource group for instructions.
CPU utilization skew detection among resource groups
Threshold: Maximum CPU utilization across all resource groups exceeds 2.5 times the average for more than 80% of the selected period.
If skew is detected, evaluate whether to rebalance resources across resource groups. See Modify a resource group for instructions.
Correlation analysis of time series metrics
When Average CPU Utilization of Compute Node and Maximum CPU Utilization of Compute Node are abnormal, the system identifies correlated metrics that may explain the spike: QPS, BUILD Jobs, Update TPS, Deletion TPS, and Write TPS.
For example, if QPS correlates with an increase in average CPU utilization, a surge in query volume is a likely cause.
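To make "correlates" concrete, the sketch below computes a Pearson coefficient between per-minute QPS and average CPU utilization; metrics_by_minute and its columns are hypothetical stand-ins for exported monitoring data:

```sql
-- Pearson correlation between QPS and average CPU utilization;
-- values near 1 suggest query volume is driving the CPU spike.
SELECT (COUNT(*) * SUM(qps * avg_cpu) - SUM(qps) * SUM(avg_cpu)) /
       (SQRT(COUNT(*) * SUM(qps * qps) - SUM(qps) * SUM(qps)) *
        SQRT(COUNT(*) * SUM(avg_cpu * avg_cpu) - SUM(avg_cpu) * SUM(avg_cpu)))
       AS pearson_r
FROM metrics_by_minute;
```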
Storage layer detection
Storage layer detection monitors worker nodes and table health.
Table creation and access detection
This detection checks the most-accessed tables during the selected period and surfaces three types of issues:
Table access detection: Ranks tables in descending order by Maximum Amount of Read Data, Average Amount of Read Data, Maximum Read Cost, Average Read Cost, and Table Accesses.
Data skew detection: Checks whether the tables with the most data have uneven data distribution across worker nodes. When you create partitioned tables in AnalyticDB for MySQL, choose evenly distributed fields as distribution fields (see the DDL sketch after this list). Fixing skewed tables improves query performance and balances disk usage, which prevents disk contention.
Unreasonable partitioning detection: Checks whether the tables with the most data have partitioning configurations that could hurt performance.
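The data skew and partitioning checks map to clauses of the CREATE TABLE statement. A minimal sketch with hypothetical names, following AnalyticDB for MySQL V3.0 DDL; verify the exact clauses against the CREATE TABLE documentation for your version:

```sql
CREATE TABLE orders (
  order_id    BIGINT NOT NULL,
  customer_id BIGINT NOT NULL,
  order_date  DATETIME NOT NULL,
  amount      DECIMAL(18, 2),
  PRIMARY KEY (order_id, order_date)
)
-- A high-cardinality, evenly distributed field avoids data skew
-- across worker nodes.
DISTRIBUTED BY HASH(order_id)
-- Date-based partitions enable partition pruning; LIFECYCLE caps how
-- many daily partitions are retained.
PARTITION BY VALUE(DATE_FORMAT(order_date, '%Y%m%d')) LIFECYCLE 90;
```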
Node status detection
If worker nodes handle excessive data volumes or encounter internal errors, they can become unhealthy. Node status detection surfaces exceptions based on node status during the selected period.
Maximum CPU utilization detection
Threshold: Maximum CPU utilization across all worker nodes stays above 90% for more than 80% of the selected period.
If utilization is high, investigate slow queries. See Typical slow queries for guidance.
Average CPU utilization detection
Threshold: Average CPU utilization across all worker nodes stays above 80% for more than 80% of the selected period.
If utilization is high, investigate slow queries. See Typical slow queries for guidance.
CPU utilization skew detection
Threshold: Maximum CPU utilization across all worker nodes exceeds twice the average for more than 80% of the selected period.
If skew is detected, investigate data skew as the likely cause.
Correlation analysis of time series metrics
When Average CPU Utilization of Storage Node and Maximum CPU Utilization of Storage Node are abnormal, the system identifies correlated metrics that may explain the spike: QPS, BUILD Jobs, Update TPS, Deletion TPS, and Write TPS.
For example, if QPS correlates with an increase in average CPU utilization on storage nodes, a surge in query volume is a likely cause.