If your Spark application is running slower than expected, experiencing memory spikes, or producing uneven task distribution, the built-in performance diagnostics feature in AnalyticDB for MySQL Data Lakehouse Edition (V3.0) can help you quickly identify the root cause. After you trigger a diagnosis on a completed Spark job, the Diagnostic Optimization Details panel surfaces specific bottlenecks and optimization recommendations.
Supported scenarios
Large-scale data processing: When Spark processes large datasets, diagnostics identifies bottlenecks such as memory spikes and spills that reduce throughput.
Highly concurrent workloads: For workloads with many concurrent applications, diagnostics detects issues like data skew, long-tail tasks, and load imbalance that degrade overall performance.
Limitations
Only Spark applications that ran successfully in the last 14 days can be diagnosed.
Only batch and streaming applications are supported.
Prerequisites
Before you begin, make sure you have:
An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a cluster.
A job resource group with at least 8 AnalyticDB compute units (ACUs) of reserved computing resources. For more information, see Create a resource group.
A Resource Access Management (RAM) user granted the AliyunADBDeveloperAccess permission. For more information, see Manage RAM users and permissions.
A database account for the cluster:
Alibaba Cloud account: create a privileged account. For more information, see Create a privileged account.
RAM user: create both a privileged account and a standard account, then associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
AnalyticDB for MySQL authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources. For more information, see Perform authorization.
Run a diagnosis
Log on to the AnalyticDB for MySQL console. In the upper-left corner, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster and click its cluster ID.
In the left-side navigation pane, choose Job Development > Spark JAR Development.
In the Applications section, find the application to diagnose and choose More > History in the Actions column.
In the Execution History section, find the job and click Diagnose in the Actions column.
Note: After the diagnosis completes, the Diagnostic Optimization Details panel opens. The panel highlights detected performance issues — such as memory spikes, disk spills, data skew, or long-tail tasks — and provides specific optimization recommendations. Apply the recommendations to improve application performance.
Diagnostic examples
The following examples show common issues detected by Spark application performance diagnostics. If your Spark application has one of these issues, you can optimize it based on the corresponding suggestions.
Example 1: Data skew during shuffle
Detection rule
The maximum number of records read during the Spark shuffle is more than five times the median.
Optimization suggestions
Set the spark.sql.shuffle.partitions parameter to 2 to 3 times the number of Spark executor cores so that data is redistributed across tasks.
Modify the business logic to filter abnormal data out of the dataset before the shuffle stage.
Add a random number to the key in the GroupBy or ReduceBy operator to spread data across the map and reduce stages, and then remove the random number in the reduce stage.
Increase the value of the spark.shuffle.file.buffer or spark.reducer.maxSizeInFlight parameter based on your workload to improve shuffle performance.
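The random-key ("salting") technique above can be sketched in plain Python, without Spark, to show the two-stage aggregation. The function name and dataset below are purely illustrative; in a real Spark job the two stages would be two aggregations over an RDD or DataFrame.

```python
import random
from collections import defaultdict

def salted_two_stage_sum(records, num_salts=4):
    """Two-stage aggregation with salted keys (illustrative sketch).

    Stage 1: append a random salt to each key so that a hot key is
    split across num_salts partial groups instead of a single one.
    Stage 2: strip the salt and merge the partial sums.
    """
    # Stage 1: partial aggregation on salted keys.
    partial = defaultdict(int)
    for key, value in records:
        salted_key = (key, random.randrange(num_salts))
        partial[salted_key] += value

    # Stage 2: remove the salt and combine partial results.
    final = defaultdict(int)
    for (key, _salt), value in partial.items():
        final[key] += value
    return dict(final)

# A skewed dataset: "hot" dominates the key distribution.
data = [("hot", 1)] * 1000 + [("cold", 1)] * 10
print(salted_two_stage_sum(data))  # {'hot': 1000, 'cold': 10}
```

Because the salt is removed before the final merge, the result is identical to a plain aggregation; only the intermediate distribution of work changes.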
Example 2: Low CPU utilization
Detection rule
The CPU time that Spark executors spend running tasks is less than 35% of the total executor runtime.
Optimization suggestions
Modify the spark.executor.resourceSpec parameter to use a smaller specification and reduce the number of cores.
Modify the business code to reduce the number of RDD partitions.
If the application involves a shuffle, modify the spark.sql.shuffle.partitions parameter to increase the number of tasks.
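As a hedged sketch, lowering the executor specification is done in the conf section of the Spark application configuration that you submit in the console. The OSS path, class name, and specification value below are placeholders; verify the specification values that your cluster supports.

```json
{
  "file": "oss://<your-bucket>/<your-spark-job>.jar",
  "className": "com.example.YourMainClass",
  "conf": {
    "spark.executor.resourceSpec": "small",
    "spark.executor.instances": 2
  }
}
```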
Example 3: Long JVM garbage collection (GC) time
Detection rule
JVM GC time accounts for more than 20% of the total runtime.
Optimization suggestions
Modify the spark.executor.resourceSpec parameter to use a larger specification.
Tune the relevant GC parameters. For more information, see Garbage Collection Tuning.
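For example, you might combine a larger executor specification with JVM GC options. spark.executor.extraJavaOptions is a standard Spark property, but whether a given JVM flag helps depends on your workload, so treat the values below as illustrative rather than recommended defaults.

```json
{
  "conf": {
    "spark.executor.resourceSpec": "large",
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35"
  }
}
```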
Example 4: A single task processes a large amount of data
Detection rule
A single task in a stage processes more than 200 MB of data.
Optimization suggestions
Increase the value of the spark.default.parallelism or spark.sql.shuffle.partitions parameter to repartition the data and increase the concurrency of the application.
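The 2-to-3-times rule of thumb used in these examples can be expressed as a small helper for picking a partition count. The executor counts and core sizes below are hypothetical; substitute the actual resources of your job resource group.

```python
def recommended_shuffle_partitions(executor_instances, cores_per_executor, factor=3):
    """Suggest a partition count of 2-3x the total executor cores,
    per the tuning guidance above. factor should be 2 or 3."""
    return executor_instances * cores_per_executor * factor

# Example: 4 executors with 4 cores each, using the upper factor of 3.
print(recommended_shuffle_partitions(4, 4))  # 48
```

The resulting value would then be set as spark.default.parallelism or spark.sql.shuffle.partitions in the application configuration.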
Example 5: Excessive data spilled to disk
Detection rule
The amount of data spilled to disk is greater than the amount of data spilled in memory.
Optimization suggestions
Modify the spark.executor.resourceSpec parameter to use a larger specification and reduce disk spills.
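A minimal sketch of the suggested change, assuming your cluster supports the specification value shown (a placeholder): moving to a larger specification gives each executor more memory, so less intermediate data spills to disk.

```json
{
  "conf": {
    "spark.executor.resourceSpec": "large"
  }
}
```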
Example 6: Long-tail tasks during execution
Detection rule
The maximum task runtime in a Spark stage is more than 1.5 times the median.
Optimization suggestions
Set the spark.sql.shuffle.partitions parameter to 2 to 3 times the number of Spark executor cores so that data is redistributed across tasks.
Modify the business logic to filter abnormal data out of the dataset.