When a MaxCompute SQL job fails or runs slower than expected, diagnosing the root cause can be time-consuming. Intelligent diagnostics automates this analysis — it evaluates completed SQL jobs against a set of diagnostic rules, surfaces performance issues and error details, and provides targeted recommendations so you can fix problems faster.
Intelligent diagnostics supports SQL jobs only.
How it works
After a SQL job completes, MaxCompute evaluates it against diagnostic rules. When an anomaly is detected, a colored tag appears in the Intelligent diagnostics column on the Jobs page:
Red tag — job error message diagnostics
Orange tag — job performance diagnostics
Click the tag to open the Job Insights page. On the Job Summary tab, you'll find the diagnostic findings and recommended actions for that job.
Diagnostic types
The following table summarizes the available diagnostic types. Click a type to jump to the detailed description.
| Diagnostic type | Tag color | What it means |
|---|---|---|
| Insufficient resources | Orange | The job received fewer computing resources than requested |
| Data skew | Orange | Workers are processing data unevenly, slowing down the job |
| Data inflation | Orange | Output records far exceed input records in a Fuxi Task |
| Mode fallback | Orange | A query acceleration job fell back to normal mode |
| MAPJOIN small table nearing memory limit | Orange | The MAPJOIN small table is approaching the 512 MB limit |
| Job error message diagnostics | Red | A SQL error was detected and matched to a known error type |
Limitations
Intelligent diagnostics supports SQL jobs only.
Diagnostic results are generated the day after the job completes, not immediately.
Jobs executed before November 1, 2023 have no diagnostic results.
The following regions do not support automatic diagnostics: China (Hong Kong), China East 2 Finance, China North 2 Finance (Preview), China North 2 Ali Gov 1, China South 1 Finance, Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia), UAE (Dubai), SAU (Riyadh - Partner Region). To run diagnostics manually in these regions, click Insights in the Actions column for the target job.
View diagnostic results
Log on to the MaxCompute console and select your region from the upper-left corner.
In the left-side navigation pane, choose Workspace > Jobs.
The default time range for querying jobs is one hour. Adjust it as needed to find the target job.
In the Intelligent diagnostics column, click the tag for the job you want to investigate. The Job Insights page opens.
On the Job Summary tab, review the diagnostic findings and recommendations.
If no tag appears, see Why is the diagnostics column empty?
To trigger diagnostics manually, click Insights in the Actions column for the target job.
Insufficient resources
What happened: The job received less than 95% of its requested computing resources for more than five minutes continuously.
Why it happens:
Pay-as-you-go jobs use a shared resource pool. Resources are not reserved, so jobs compete with each other. High concurrency can cause one job to be preempted by others.
Subscription jobs may face queuing when data volumes are large, resource demands are high, and the job has a lower priority.
What to do: On the Job Insights page for the affected job, click the Resource Consumption tab and view the resource consumption and computing quota allocation at the time of the job run. Based on what you find, adjust job priority, optimize task execution, or manage computing resources to match your workload.
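For example, if resource consumption analysis shows the job is being queued behind lower-value work, raising its scheduling priority may help. The sketch below assumes the `odps.instance.priority` flag (smaller values typically mean higher priority) and a hypothetical `sales_orders` table; confirm the supported priority range and quota type for your MaxCompute version before relying on it.

```sql
-- Raise this job's scheduling priority before the query runs.
-- Assumption: the odps.instance.priority flag, where a smaller value
-- means a higher priority; verify the valid range for your quota type.
SET odps.instance.priority = 1;

SELECT order_id, SUM(amount) AS total
FROM sales_orders  -- hypothetical table
GROUP BY order_id;
```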
Data skew
What happened: One or more workers are processing significantly more data or taking significantly longer than others, stalling overall job progress.
Why it happens: Data skew results from an uneven distribution of data across workers. In distributed computing, some workers finish quickly while others are still running — often causing job progress to appear stuck at 99%.
MaxCompute flags a job as skewed when either of the following is true:
The longest-running worker takes at least 3x the average worker time, and the average time exceeds 30 seconds.
Any worker's input record count is at least 3x the average across all workers.
What to do: MaxCompute identifies the node name of the affected workers. Use LogView to inspect those workers and pinpoint the cause. For common skew patterns and solutions, see Data skew tuning.
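As one illustration of a common mitigation: when the skew comes from a GROUP BY on a hot key, MaxCompute can add a randomized pre-aggregation stage so no single worker handles all records for that key. This sketch assumes the `odps.sql.groupby.skewindata` flag and a hypothetical `page_views` table; see Data skew tuning for the patterns that apply to your job.

```sql
-- Mitigate GROUP BY skew by enabling an extra randomized
-- aggregation stage. Assumption: the odps.sql.groupby.skewindata
-- flag; table and column names are hypothetical.
SET odps.sql.groupby.skewindata = true;

SELECT user_id, COUNT(*) AS pv
FROM page_views
GROUP BY user_id;
```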
Data inflation
What happened: A Fuxi Task's output record count exceeds 10x its input record count.
What to do: MaxCompute identifies the name of the Fuxi Task with data inflation. Use LogView to inspect that task and trace the source of the expansion. For common causes and fixes, see Handle data expansion.
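A typical source of inflation is a join on a low-cardinality key, where each input row matches many rows on the other side. The hypothetical sketch below shows how output can far exceed input (the diagnostic fires when the ratio exceeds 10x):

```sql
-- If each user_id in events matches many rows in user_tags,
-- output records ≈ rows(events) × average tags per user, which can
-- easily exceed 10x the input. Tables are hypothetical.
SELECT a.user_id, a.event_time, b.tag
FROM events a
JOIN user_tags b              -- many tags per user_id
  ON a.user_id = b.user_id;
```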
Mode fallback
What happened: A job that was expected to run in MaxCompute Query Acceleration mode fell back to normal mode, increasing its runtime.
Why it happens: MaxCompute does not guarantee that every job enters query acceleration mode. Jobs with large data volumes, or jobs that do not return query results, can only run in normal mode. When a query acceleration job falls back, it is identified by the Task Rerun sub-status.
What to do: To eliminate the uncertainty of mode selection, add the following line at the start of your job script to force normal mode:

```sql
set odps.service.mode=off;
```

This prevents unexpected fallbacks and makes job runtime more predictable.
MAPJOIN small table nearing memory limit
What happened: The small table used in a MAPJOIN operation is approaching the 512 MB memory limit.
Why it happens: When you include the MAPJOIN hint in a SELECT statement, MaxCompute loads the entire small table into memory during the Map stage. If the table's loaded size exceeds 512 MB, the job fails.
What to do: Choose one of the following options:
Remove the MAPJOIN hint and let MaxCompute choose the join strategy automatically.
Switch to DISTRIBUTED MAPJOIN, which distributes the small table across multiple nodes and is not subject to the 512 MB limit.
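The two options above can be sketched as follows. The MAPJOIN hint syntax is standard MaxCompute SQL; the distributed variant is shown as an assumption using a `DISTMAPJOIN` hint with a `shard_count` parameter, and the table names are hypothetical, so verify the exact syntax for your MaxCompute version.

```sql
-- Original pattern: force the small table d into memory during the
-- Map stage. Fails if d exceeds 512 MB once loaded.
SELECT /*+ MAPJOIN(d) */ f.order_id, d.region
FROM fact_orders f
JOIN dim_region d ON f.region_id = d.region_id;

-- Alternative sketch when d approaches the limit: the distributed
-- variant shards the small table across nodes. Assumption: the
-- DISTMAPJOIN hint with a shard_count parameter.
SELECT /*+ DISTMAPJOIN(d(shard_count=4)) */ f.order_id, d.region
FROM fact_orders f
JOIN dim_region d ON f.region_id = d.region_id;
```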
Job error message diagnostics
What happened: The job failed, and MaxCompute matched the error message to a known SQL error type.
What to do: The Job Summary tab on the Job Insights page shows the error description and recommended solution for the matched error type. If the job has no diagnostic result, look up the error code in the Error code overview.
Only some SQL error types are covered. Failed jobs without a matching diagnostic result will not show a tag.
If you need further help, join the MaxCompute developer community on DingTalk (group number: 11782920) or contact your dedicated support channel.
FAQ
Why is the diagnostics column empty?
An empty Intelligent diagnostics column is normal in several situations:
The job ran successfully with no anomalies detected.
The diagnostic result hasn't been generated yet — results are available the day after the job completes.
The job was executed before November 1, 2023.
The job ran in a region that doesn't support automatic diagnostics (see Limitations).
To check for diagnostics immediately, click Insights in the Actions column for the job.
What's next
Diagnostic cases of LogView — walk through real diagnostic examples using LogView
Optimize SQL statements — general SQL optimization techniques for MaxCompute
Best practices for job-level resource analysis — use the Job Insights feature to analyze resource consumption and reduce runtime