Spark SQL diagnostics and optimization - AnalyticDB - Alibaba Cloud Documentation Center

AnalyticDB for MySQL provides the Spark SQL diagnostics feature. If your Spark SQL queries have performance issues, you can use diagnostic information to quickly identify, analyze, and resolve performance bottlenecks to optimize Spark SQL queries. This topic describes how to perform Spark SQL diagnostics.

Prerequisites

An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL cluster.
- If you use an Alibaba Cloud account, you must create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
- If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources. For more information, see Perform authorization.

Usage notes

You can perform diagnostics only on Spark SQL queries that are successfully executed in the last 14 days.

Procedure

Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

In the left-side navigation pane, choose Diagnostics and Optimization > Spark SQL Diagnostics and Optimization.

The query list displays information about SQL queries for diagnostics. The following table describes the parameters of SQL queries.

Parameter	Description
Actions	The Diagnose operation that allows you to view execution details such as the SQL statement and diagnostic results.
SQL	The executed SQL statement.
Query ID	The ID of the query.
Execution ID	The sequence ID of the SQL statement in the Spark SQL application.
Status	The execution status of the SQL statement. Valid values: Completed Running Failed
Started At	The time when the query is submitted.
Execution Duration	The amount of time consumed to execute the SQL statement.
Maximum Operator Exclusive Time	The maximum execution time during which an operator uses resources exclusively.
Peak Memory	The peak memory usage of the query.
Scanned Data	The amount of data returned from the storage layer to the compute layer.

Find the SQL query that you want to diagnose and click Diagnose in the Actions column.
On the Query Properties page, click the Execution Plan tab to view the hierarchy chart of the execution plan tree of the query.
Click Query-level Diagnostic Results to the right of the execution plan tree to view the query-level diagnostic results.

Query-level diagnostic results

AnalyticDB for MySQL provides the SQL diagnostics feature to collect query-level statistics on SQL queries for diagnostics and deliver optimization suggestions.

Data bloat in a join

Problem description:
If the number of output rows of a join operator is greater than the number of input rows, data bloat occurs because the join operation is considered unreasonable. As a result, a large number of computing and memory resources are occupied, which slows down the query.
Suggestions:
Optimize the code, such as filtering out data that is not involved in the join operation in advance to reduce the amount of data to be processed.

Skew in the amount of scanned data

Problem description:
If a significant difference exists in the amount of scanned data among multiple tables during the execution of a Spark SQL query, an extended period of time is required to read data due to the long tail effect, which affects the final query performance.
Suggestions:
Select appropriate distribution key columns to reduce the amount of scanned data.