AnalyticDB for MySQL: Use stage and task details to analyze queries

Last Updated: Mar 01, 2024

After the frontend access node of an AnalyticDB for MySQL cluster receives a query request, the cluster divides the query into multiple stages, and data is read and computed in a distributed manner on worker nodes and executor nodes. Some stages can be executed in parallel, but stages that depend on each other can be executed only in series. As a result, complex SQL statements may cause slow queries. You can use stage and task details to analyze such issues in the AnalyticDB for MySQL console or by calling API operations. This topic describes how to use stage and task details to analyze queries.

Procedure

  1. Log on to the AnalyticDB for MySQL console.
  2. In the upper-left corner of the page, select a region.
  3. In the left-side navigation pane, click Clusters.
  4. On the Data Warehouse Edition (V3.0) tab, find the cluster that you want to manage and click the Cluster ID.
  5. In the left-side navigation pane, click Diagnostics and Optimization.

  6. On the SQL Queries tab, find the query that you want to diagnose and click Diagnose in the Actions column.

  7. Click the Stages & Tasks tab to view the details of a stage. For information about the stage query results, see the "Stage parameters" section of this topic.

  8. Click the stage ID to view the details of all tasks in the stage. For information about task query results, see the "Task parameters" section of this topic.

    Important

    You can view task details only for queries that take more than 1 second to execute.

Parameters

Stage parameters

Parameter

Description

Stage ID

The unique identifier of the stage, which corresponds to the stage ID in the execution plan tree.

Status

The execution status of the stage. Valid values:

  • Completed: All tasks in the stage are executed.

  • Failed: One or more tasks in the stage failed.

  • Running: Tasks in the stage are running.

  • Canceled: One or more tasks in the stage are canceled.

Rows of Input Data

The number of data rows that are input to the stage.

Amount of Input Data

The amount of data that is input to the stage.

Rows of Output Data

The number of data rows that are output from the stage.

Amount of Output Data

The amount of data that is output from the stage.

Peak Memory

The peak memory usage of the stage.

Cumulative Duration

The total amount of time that is consumed to execute all operators in the stage.

You can use this parameter to identify the stages that take a long time to execute and consume a large amount of CPU resources. When you compare the cumulative duration with the query duration, you must consider the concurrency of stages.
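
When concurrency is factored in, the cumulative duration can be turned into a rough estimate of how much wall-clock time a stage contributes to the query. The following Python sketch is a minimal illustration of that comparison. The sample values and the per-stage concurrency field are assumptions for the example (the console reports concurrency per task, not per stage), not data returned by AnalyticDB for MySQL.

```python
# Minimal sketch: approximate each stage's wall-clock contribution by dividing
# its cumulative duration by its effective concurrency, then compare the result
# with the overall query duration. All values below are illustrative samples.

stages = [
    {"stage_id": 0, "cumulative_duration_s": 3.0,  "concurrency": 1},
    {"stage_id": 1, "cumulative_duration_s": 40.0, "concurrency": 16},
    {"stage_id": 2, "cumulative_duration_s": 22.0, "concurrency": 4},
]

query_duration_s = 6.0  # overall query duration used for comparison

for stage in stages:
    # Approximate wall-clock time of the stage (per-thread average).
    approx_wall_s = stage["cumulative_duration_s"] / stage["concurrency"]
    share = approx_wall_s / query_duration_s
    print(f"Stage {stage['stage_id']}: ~{approx_wall_s:.1f}s of wall-clock time "
          f"(~{share:.0%} of the {query_duration_s:.0f}s query)")
```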

Task parameters

Parameter

Description

Task ID

The unique identifier of the task.

Example: 2.24. In this example, 2 indicates the ID of the stage to which the task belongs, and 24 indicates the ID of the task in the stage.

Status

The execution status of the task. Valid values:

  • Completed

  • Failed

  • Terminated

  • Canceled

Amount of Input Data

The number of data rows that are input to the task and the amount of input data.

You can sort all tasks by the amount of input data to check whether data skew occurs on the input data of the stage (see the sketch after this table). Data skew may be caused by inappropriate field choices in GROUP BY or JOIN clauses. To resolve this issue, trace back to the upstream stage of the stage to which the current task belongs.

Note

If the specified distribution fields are inappropriate, data may be unevenly distributed among worker nodes. This is called data skew.

Amount of Output Data

The number of data rows that are output from the task and the amount of output data.

You can check whether combined fields exist in the GROUP BY or JOIN clause of the SQL statement based on the attributes of the Aggregation or Join node in the operator plan tree of the current stage. For example, a.id = b.id contains only the id field, whereas a.id = b.id AND a.age = b.age contains both the id and age fields.

Peak Memory

The peak memory usage of the task.

The peak memory is proportional to the amount of input data. You can use this parameter to check whether query failures are caused by unbalanced distribution of input data.

Duration of Reading Table Data

The cumulative amount of time that is consumed by all TableScan operators of the stage to read table data.

This parameter is a cumulative value across multiple nodes and threads and cannot be directly compared with the query duration. By comparing it with the cumulative duration, you can determine what proportion of the computing resources of a stage is spent on table scans (see the sketch after this table).

Amount of Read Table Data

The number of data rows and the amount of data that are read by all TableScan operators of the stage.

You can sort all tasks by the amount of read table data to check whether data skew occurs on the source table data. If data skew occurs, you can use the AnalyticDB for MySQL console to check whether the skew is caused by the distribution fields. For more information, see Data modeling diagnostics.

Created At

The time when the task is created.

Queuing Duration

The amount of time that the task waits in a queue before it is executed.

Ended At

The time when the task ends.

Interval Between Start and End Time

The interval between the creation time and end time of the task. For example, if a task is created at 2022-12-12 12:00:00 and ends at 2022-12-12 12:00:04, the start and end time interval is 4 seconds.

If you compare this parameter with the query duration, you can identify the main cause of slow execution. For example, if the query duration is 6 seconds and the start and end time interval is 4 seconds, the current stage is the main reason why the query took a long time to execute. For more information, see the "Examples for calculating task duration and concurrency" section of this topic.

Cumulative Duration

The total amount of time that is consumed by all threads of all the tasks in the stage. For more information, see the "Examples for calculating task duration and concurrency" section of this topic.

Computing Time Ratio

The ratio of data computing duration to the subtask lifecycle.

This parameter can be calculated by using the following formula: Computing time ratio = (Cumulative duration/Subtask concurrency)/Interval between start and end time. In this formula, (Cumulative duration/Subtask concurrency) indicates the average amount of time that is consumed by each thread to compute data. The start and end time interval includes the actual data computing time, subtask queuing duration, and network latency. For more information, see the "Examples for calculating task duration and concurrency" section of this topic.

Note

A smaller computing time ratio indicates that computation accounts for only a small part of the start and end time interval. In this case, you need to focus on issues such as queuing and network latency. A larger computing time ratio indicates that most of the interval is spent on computation. In this case, you need to identify the operators that take a long time to execute.

Subtask Concurrency

The number of threads that concurrently execute the task on a node. For more information, see the "Examples for calculating task duration and concurrency" section of this topic.

Execution Node

The IP address of the node on which the task is executed. If long tail issues occur on the same node for multiple queries, you must check this node.

Note

A long tail issue in AnalyticDB for MySQL refers to a situation in which some tasks take much longer to execute than the other tasks.
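
The data skew check described for Amount of Input Data and the scan-time comparison described for Duration of Reading Table Data can be scripted once the task list of a stage has been exported. The following Python sketch is a minimal illustration that uses assumed field names and sample values; the actual column names in the console or API response may differ.

```python
# Minimal sketch with assumed field names: flag possible data skew by comparing
# each task's input rows against the average, and compute how much of the
# stage's cumulative duration is spent in TableScan operators.

tasks = [  # illustrative sample data for one stage, not real console output
    {"task_id": "2.0", "input_rows": 1_200_000, "scan_duration_s": 1.8, "cumulative_duration_s": 5.0},
    {"task_id": "2.1", "input_rows": 1_150_000, "scan_duration_s": 1.7, "cumulative_duration_s": 5.5},
    {"task_id": "2.2", "input_rows": 9_800_000, "scan_duration_s": 9.6, "cumulative_duration_s": 14.0},
    {"task_id": "2.3", "input_rows": 1_050_000, "scan_duration_s": 1.6, "cumulative_duration_s": 4.8},
]

avg_rows = sum(t["input_rows"] for t in tasks) / len(tasks)

# Sort tasks by input rows, largest first. A task far above the average suggests
# skewed input data, for example because of unsuitable GROUP BY or JOIN fields.
for t in sorted(tasks, key=lambda t: t["input_rows"], reverse=True):
    ratio = t["input_rows"] / avg_rows
    flag = "  <-- possible skew" if ratio > 2 else ""
    print(f"Task {t['task_id']}: {t['input_rows']:>10,} rows ({ratio:.1f}x average){flag}")

# Share of the stage's cumulative duration spent reading table data.
total_scan = sum(t["scan_duration_s"] for t in tasks)
total_cumulative = sum(t["cumulative_duration_s"] for t in tasks)
print(f"TableScan share of cumulative duration: {total_scan / total_cumulative:.0%}")
```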

Examples for calculating task duration and concurrency

In this example, the start and end time interval, cumulative duration, computing time ratio, and subtask concurrency of Task 2.1 are calculated.

Task 2.1 belongs to Stage 2. Assume that Stage 2 consists of four operators: StageOutput, Join, TableScan, and RemoteSource, which form the operator tree of the stage.

These operators are executed concurrently on multiple nodes. On the node whose IP address is 192.168.12.23, Task 2.1 is executed with four concurrent threads. The four threads consume 5, 5, 6, and 6 seconds, respectively, to compute data.

  • Cumulative duration of the task: 5s + 5s + 6s + 6s = 22s.

  • Start and end time interval: 10s.

  • Subtask concurrency: 4 (four concurrent threads on the node).

  • Computing time ratio: (22s/4)/10s = 0.55.
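
For reference, the same numbers can be plugged into the computing time ratio formula from the Task parameters table. The following Python sketch reproduces the Task 2.1 calculation; the function name is chosen for illustration only.

```python
# Reproduce the Task 2.1 calculation with the formula from the table above:
# Computing time ratio = (Cumulative duration / Subtask concurrency) / Interval between start and end time

def computing_time_ratio(cumulative_s: float, concurrency: int, interval_s: float) -> float:
    """Average per-thread computing time divided by the start and end time interval."""
    return (cumulative_s / concurrency) / interval_s

thread_durations = [5, 5, 6, 6]      # seconds of computing time per thread
cumulative = sum(thread_durations)   # cumulative duration: 22 s
concurrency = len(thread_durations)  # subtask concurrency: 4
interval = 10                        # start and end time interval: 10 s

print(computing_time_ratio(cumulative, concurrency, interval))  # 0.55
```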

Related operations

Operation

Description

DescribeDiagnosisSQLInfo

Queries the execution information about an SQL statement.

DescribeDiagnosisTasks

Queries the execution information about distributed tasks in a stage of a query.
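
Both operations can also be called programmatically. The following Python sketch is a hedged example that goes through the generic CommonRequest interface of the Alibaba Cloud Python SDK core. The endpoint, API version, and parameter names are assumptions for illustration and should be confirmed in the API reference for each operation; the cluster ID and process ID are placeholders.

```python
# Hedged sketch: call the diagnostic operations through the generic
# CommonRequest interface of the Alibaba Cloud Python SDK core
# (pip install aliyun-python-sdk-core). The endpoint, API version, and
# parameter names below are assumptions for illustration; confirm them in the
# API reference for DescribeDiagnosisSQLInfo and DescribeDiagnosisTasks.

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient('<access_key_id>', '<access_key_secret>', 'cn-hangzhou')

def call(action: str, **params: str) -> bytes:
    """Send one API request and return the raw JSON response."""
    request = CommonRequest()
    request.set_domain('adb.aliyuncs.com')  # assumed endpoint
    request.set_version('2019-03-15')       # assumed API version
    request.set_action_name(action)
    for key, value in params.items():
        request.add_query_param(key, value)
    return client.do_action_with_exception(request)

# Execution information about one SQL statement (parameter names assumed).
sql_info = call('DescribeDiagnosisSQLInfo',
                DBClusterId='am-bp1**************',
                ProcessId='<process_id_of_the_query>')

# Distributed tasks in one stage of that query (parameter names assumed).
stage_tasks = call('DescribeDiagnosisTasks',
                   DBClusterId='am-bp1**************',
                   ProcessId='<process_id_of_the_query>',
                   StageId='2')

print(sql_info)
print(stage_tasks)
```

The task list returned by DescribeDiagnosisTasks can then be fed into a script such as the skew check sketched after the Task parameters table.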