Use EXPLAIN to analyze the execution plan of a SQL query - MaxCompute

The EXPLAIN statement displays the execution plan of a SELECT query in MaxCompute SQL, helping you identify performance bottlenecks in query statements or table structures.

A query maps to one or more jobs, and each job contains one or more tasks. The EXPLAIN statement reveals how these jobs, tasks, and operators relate to each other so you can optimize your SQL.

Syntax

EXPLAIN <query>;

query: Required. A SELECT statement. For more information, see SELECT syntax.

Output structure

The EXPLAIN output contains three sections:

Job dependencies -- Lists all jobs and their execution order. For example, job0 is root job indicates that the query requires a single root job.
Task dependencies -- Lists the tasks within each job and their dependencies. For example: This means job0 contains three tasks: M1, M2, and J3_1_2_Stg1. MaxCompute runs J3_1_2_Stg1 only after both M1 and M2 complete.
```
    In Job job0:
    root Tasks: M1, M2
    J3_1_2_Stg1 depends on: M1, M2
```

Operator details -- Describes the operators within each task and the execution semantics. For example: The Data source line identifies the task input. Each subsequent line represents an operator and its parameters, indented to show the operator pipeline.

    In Task M2:
        Data source: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china
        TS: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china
            FIL: ISNOTNULL(customer_id)
                RS: order: +
                    nullDirection: *
                    optimizeOrderBy: False
                    valueDestLimit: 0
                    dist: HASH
                    keys:
                          customer_id
                    values:
                          customer_id (string)
                          total_price (double)
                    partitions:
                          customer_id

Task naming conventions

Component	Meaning	Example
First letter	Task type: `M` (MapTask), `R` (ReduceTask), `J` (JoinTask), or `L` (LocalWork)	`M2Stg1` is a map task
Digit after the first letter	Task ID, unique within the query	`M2` has task ID 2
Digits separated by underscores	Direct dependencies of the task	`J3_1_2_Stg1` depends on M1 and M2

Operators

Operator	Abbreviation	SQL clause	Description
TableScanOperator	TS	`FROM`	Scans the input table. The output shows the alias of the input table.
SelectOperator	SEL	`SELECT`	Projects columns to the next operator. A column displays as `<alias>.<column_name>`, an expression displays as a function list (for example, `func1(arg1_1, arg1_2, func2(arg2_1, arg2_2))`), and a constant displays its literal value.
FilterOperator	FIL	`WHERE`	Filters rows based on a `WHERE` expression, displayed in a format similar to SelectOperator.
JoinOperator	JOIN	`JOIN`	Joins tables. The output shows which tables are joined and the join method.
GroupByOperator	AGGREGATE	Aggregate functions	Performs aggregation. Appears when the query contains aggregate functions. The output shows the aggregate function content.
ReduceSinkOperator	RS	--	Distributes data between tasks. Appears at the end of a task when its output feeds into another task. The output shows the sort order, distribution keys, values, and hash columns.
FileSinkOperator	FS	`INSERT`	Writes final results to storage. For `INSERT` statements, the output shows the target table name.
LimitOperator	LIM	`LIMIT`	Limits the number of returned rows.
MapjoinOperator	HASHJOIN	`JOIN`	Performs a map-side join on large tables. Similar to JoinOperator.

Limitations

If the query statement is complex and the size of rows in the output results of the EXPLAIN statement exceed 4 MB, the threshold specified by the API of the upper-layer application is reached. As a result, the output results cannot be completely displayed. To address this issue, you can split the query statement into multiple subqueries and execute the EXPLAIN statement on each subquery to obtain the structure of the job.

Examples

Prepare sample data

Create two partitioned tables, sale_detail and sale_detail_jt, and insert sample data.

-- Create two partitioned tables named sale_detail and sale_detail_jt.
CREATE TABLE if NOT EXISTS sale_detail
(
shop_name     STRING,
customer_id   STRING,
total_price   DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);

CREATE TABLE if NOT EXISTS sale_detail_jt
(
shop_name     STRING,
customer_id   STRING,
total_price   DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);

-- Add partitions to the two tables.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china') PARTITION (sale_date='2014', region='shanghai');
ALTER TABLE sale_detail_jt ADD PARTITION (sale_date='2013', region='china');

-- Insert data into the tables.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
INSERT INTO sale_detail_jt PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2);

Verify the data:

SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
| null       | c5          | NULL        | 2014       | shanghai   |
| s6         | c6          | 100.4       | 2014       | shanghai   |
| s7         | c7          | 100.5       | 2014       | shanghai   |
+------------+-------------+-------------+------------+------------+

SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail_jt;

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s5         | c2          | 100.2       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

Create a non-partitioned table for the JOIN operation:

SET odps.sql.allow.fullscan=true;
CREATE TABLE shop AS SELECT shop_name, customer_id, total_price FROM sale_detail;

Example 1: Standard JOIN with aggregation

This example runs EXPLAIN on a query that performs an INNER JOIN between sale_detail_jt and sale_detail, groups the results, and applies ordering with a limit.

Query:

SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;

Run EXPLAIN:

EXPLAIN
SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;

Output:

job0 is root job

In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2

In Task M1:
    Data source: doc_****.default.sale_detail/sale_date=2013/region=china
    TS: doc_****.default.sale_detail/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
        FIL: ISNOTNULL(customer_id)
            Statistics: Num rows: 2.7, Data size: 291.6
            RS: valueDestLimit: 0
                dist: BROADCAST
                keys:
                values:
                      customer_id (string)
                      total_price (double)
                partitions:

                Statistics: Num rows: 2.7, Data size: 291.6

In Task M2_1:
    Data source: doc_****.default.sale_detail_jt/sale_date=2013/region=china
    TS: doc_****.default.sale_detail_jt/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
        FIL: ISNOTNULL(customer_id)
            Statistics: Num rows: 2.7, Data size: 291.6
            HASHJOIN:
                     Filter1 INNERJOIN StreamLineRead1
                     keys:
                         0:customer_id
                         1:customer_id
                     non-equals:
                         0:
                         1:
                     bigTable: Filter1

                Statistics: Num rows: 3.6450000000000005, Data size: 787.32
                RS: order: +
                    nullDirection: *
                    optimizeOrderBy: False
                    valueDestLimit: 0
                    dist: HASH
                    keys:
                          customer_id
                    values:
                          customer_id (string)
                          total_price (double)
                          total_price (double)
                    partitions:
                          customer_id

                    Statistics: Num rows: 3.6450000000000005, Data size: 422.82000000000005

In Task R3_2:
    AGGREGATE: group by:customer_id
     UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
        Statistics: Num rows: 1.0, Data size: 116.0
        RS: order: +
            nullDirection: *
            optimizeOrderBy: True
            valueDestLimit: 10
            dist: HASH
            keys:
                  customer_id
            values:
                  customer_id (string)
                  __agg_0 (double)
                  __agg_1 (bigint)
            partitions:

            Statistics: Num rows: 1.0, Data size: 116.0

In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1
        Statistics: Num rows: 1.0, Data size: 116.0
        SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
            Statistics: Num rows: 1.0, Data size: 216.0
            FS: output: Screen
                schema:
                  ashop (string)
                  ap (double)
                  bp (bigint)

                Statistics: Num rows: 1.0, Data size: 116.0

OK

The execution plan shows four tasks:

M1 -- Scans sale_detail, filters null customer_id values, and broadcasts the data (BROADCAST distribution).
M2_1 -- Scans sale_detail_jt, filters null customer_id values, and performs a map-side hash join (HASHJOIN) with the broadcast data from M1. Results are distributed by HASH on customer_id.
R3_2 -- Aggregates with GROUP BY customer_id, computing SUM(total_price) and COUNT(total_price). Both aggregations run in Complete mode.
R4_3 -- Selects the final columns, aliases them (ashop, ap, bp), and writes the output to screen.

Example 2: Map-side join (MAPJOIN hint)

This example uses a /*+ mapjoin(a) */ hint to force a map-side join with a non-equi join condition (a.total_price < b.total_price).

Query:

SELECT /*+ mapjoin(a) */
       a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt
WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;

Run EXPLAIN:

EXPLAIN
SELECT /*+ mapjoin(a) */
       a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt
WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;

Output:

job0 is root job

In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2

In Task M1:
    Data source: doc_****.sale_detail_jt/sale_date=2013/region=china
    TS: doc_****.sale_detail_jt/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
        RS: valueDestLimit: 0
            dist: BROADCAST
            keys:
            values:
                  customer_id (string)
                  total_price (double)
            partitions:

            Statistics: Num rows: 3.0, Data size: 324.0

In Task M2_1:
    Data source: doc_****.sale_detail/sale_date=2013/region=china
    TS: doc_****.sale_detail/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 24.0
        HASHJOIN:
                 StreamLineRead1 INNERJOIN TableScan2
                 keys:
                     0:
                     1:
                 non-equals:
                     0:
                     1:
                 bigTable: TableScan2

            Statistics: Num rows: 9.0, Data size: 1044.0
            FIL: LT(total_price,total_price)
                Statistics: Num rows: 6.75, Data size: 783.0
                AGGREGATE: group by:customer_id
                 UDAF: SUM(total_price) (__agg_0_sum)[Partial_1],COUNT(total_price) (__agg_1_count)[Partial_1]
                    Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
                    RS: order: +
                        nullDirection: *
                        optimizeOrderBy: False
                        valueDestLimit: 0
                        dist: HASH
                        keys:
                              customer_id
                        values:
                              customer_id (string)
                              __agg_0_sum (double)
                              __agg_1_count (bigint)
                        partitions:
                              customer_id

                        Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069

In Task R3_2:
    AGGREGATE: group by:customer_id
     UDAF: SUM(__agg_0_sum)[Final] __agg_0,COUNT(__agg_1_count)[Final] __agg_1
        Statistics: Num rows: 1.6875, Data size: 195.75
        RS: order: +
            nullDirection: *
            optimizeOrderBy: True
            valueDestLimit: 10
            dist: HASH
            keys:
                  customer_id
            values:
                  customer_id (string)
                  __agg_0 (double)
                  __agg_1 (bigint)
            partitions:

            Statistics: Num rows: 1.6875, Data size: 195.75

In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1
        Statistics: Num rows: 1.6875, Data size: 195.75
        SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
            Statistics: Num rows: 1.6875, Data size: 364.5
            FS: output: Screen
                schema:
                  ashop (string)
                  ap (double)
                  bp (bigint)

                Statistics: Num rows: 1.6875, Data size: 195.75

OK

Key differences from Example 1:

M1 broadcasts sale_detail_jt (the table specified by the mapjoin hint) without applying a filter, since the non-equi join condition is evaluated after the join.
M2_1 performs the HASHJOIN, then applies the filter LT(total_price, total_price) (representing a.total_price < b.total_price), and runs a partial aggregation (Partial_1 phase).
R3_2 completes the aggregation in the Final phase, merging the partial results from M2_1.
R4_3 selects and aliases the final output columns, same as Example 1.