The EXPLAIN statement displays the execution plan of a SELECT query in MaxCompute SQL, helping you identify performance bottlenecks in query statements or table structures.
A query maps to one or more jobs, and each job contains one or more tasks. The EXPLAIN statement reveals how these jobs, tasks, and operators relate to each other so you can optimize your SQL.
Syntax
EXPLAIN <query>;query: Required. A SELECT statement. For more information, see SELECT syntax.
Output structure
The EXPLAIN output contains three sections:
Job dependencies -- Lists all jobs and their execution order. For example,
job0 is root jobindicates that the query requires a single root job.Task dependencies -- Lists the tasks within each job and their dependencies. For example: This means
job0contains three tasks:M1,M2, andJ3_1_2_Stg1. MaxCompute runsJ3_1_2_Stg1only after bothM1andM2complete.In Job job0: root Tasks: M1, M2 J3_1_2_Stg1 depends on: M1, M2Operator details -- Describes the operators within each task and the execution semantics. For example: The
Data sourceline identifies the task input. Each subsequent line represents an operator and its parameters, indented to show the operator pipeline.In Task M2: Data source: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china TS: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china FIL: ISNOTNULL(customer_id) RS: order: + nullDirection: * optimizeOrderBy: False valueDestLimit: 0 dist: HASH keys: customer_id values: customer_id (string) total_price (double) partitions: customer_id
Task naming conventions
| Component | Meaning | Example |
|---|---|---|
| First letter | Task type: M (MapTask), R (ReduceTask), J (JoinTask), or L (LocalWork) | M2Stg1 is a map task |
| Digit after the first letter | Task ID, unique within the query | M2 has task ID 2 |
| Digits separated by underscores | Direct dependencies of the task | J3_1_2_Stg1 depends on M1 and M2 |
Operators
| Operator | Abbreviation | SQL clause | Description |
|---|---|---|---|
| TableScanOperator | TS | FROM | Scans the input table. The output shows the alias of the input table. |
| SelectOperator | SEL | SELECT | Projects columns to the next operator. A column displays as <alias>.<column_name>, an expression displays as a function list (for example, func1(arg1_1, arg1_2, func2(arg2_1, arg2_2))), and a constant displays its literal value. |
| FilterOperator | FIL | WHERE | Filters rows based on a WHERE expression, displayed in a format similar to SelectOperator. |
| JoinOperator | JOIN | JOIN | Joins tables. The output shows which tables are joined and the join method. |
| GroupByOperator | AGGREGATE | Aggregate functions | Performs aggregation. Appears when the query contains aggregate functions. The output shows the aggregate function content. |
| ReduceSinkOperator | RS | -- | Distributes data between tasks. Appears at the end of a task when its output feeds into another task. The output shows the sort order, distribution keys, values, and hash columns. |
| FileSinkOperator | FS | INSERT | Writes final results to storage. For INSERT statements, the output shows the target table name. |
| LimitOperator | LIM | LIMIT | Limits the number of returned rows. |
| MapjoinOperator | HASHJOIN | JOIN | Performs a map-side join on large tables. Similar to JoinOperator. |
Limitations
If the query statement is complex and the size of rows in the output results of the EXPLAIN statement exceed 4 MB, the threshold specified by the API of the upper-layer application is reached. As a result, the output results cannot be completely displayed. To address this issue, you can split the query statement into multiple subqueries and execute the EXPLAIN statement on each subquery to obtain the structure of the job.
Examples
Prepare sample data
Create two partitioned tables, sale_detail and sale_detail_jt, and insert sample data.
-- Create two partitioned tables named sale_detail and sale_detail_jt.
CREATE TABLE if NOT EXISTS sale_detail
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
CREATE TABLE if NOT EXISTS sale_detail_jt
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
-- Add partitions to the two tables.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china') PARTITION (sale_date='2014', region='shanghai');
ALTER TABLE sale_detail_jt ADD PARTITION (sale_date='2013', region='china');
-- Insert data into the tables.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
INSERT INTO sale_detail_jt PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2);Verify the data:
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
| null | c5 | NULL | 2014 | shanghai |
| s6 | c6 | 100.4 | 2014 | shanghai |
| s7 | c7 | 100.5 | 2014 | shanghai |
+------------+-------------+-------------+------------+------------+SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail_jt;+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s5 | c2 | 100.2 | 2013 | china |
+------------+-------------+-------------+------------+------------+Create a non-partitioned table for the JOIN operation:
SET odps.sql.allow.fullscan=true;
CREATE TABLE shop AS SELECT shop_name, customer_id, total_price FROM sale_detail;Example 1: Standard JOIN with aggregation
This example runs EXPLAIN on a query that performs an INNER JOIN between sale_detail_jt and sale_detail, groups the results, and applies ordering with a limit.
Query:
SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;Run EXPLAIN:
EXPLAIN
SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;Output:
job0 is root job
In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2
In Task M1:
Data source: doc_****.default.sale_detail/sale_date=2013/region=china
TS: doc_****.default.sale_detail/sale_date=2013/region=china
Statistics: Num rows: 3.0, Data size: 324.0
FIL: ISNOTNULL(customer_id)
Statistics: Num rows: 2.7, Data size: 291.6
RS: valueDestLimit: 0
dist: BROADCAST
keys:
values:
customer_id (string)
total_price (double)
partitions:
Statistics: Num rows: 2.7, Data size: 291.6
In Task M2_1:
Data source: doc_****.default.sale_detail_jt/sale_date=2013/region=china
TS: doc_****.default.sale_detail_jt/sale_date=2013/region=china
Statistics: Num rows: 3.0, Data size: 324.0
FIL: ISNOTNULL(customer_id)
Statistics: Num rows: 2.7, Data size: 291.6
HASHJOIN:
Filter1 INNERJOIN StreamLineRead1
keys:
0:customer_id
1:customer_id
non-equals:
0:
1:
bigTable: Filter1
Statistics: Num rows: 3.6450000000000005, Data size: 787.32
RS: order: +
nullDirection: *
optimizeOrderBy: False
valueDestLimit: 0
dist: HASH
keys:
customer_id
values:
customer_id (string)
total_price (double)
total_price (double)
partitions:
customer_id
Statistics: Num rows: 3.6450000000000005, Data size: 422.82000000000005
In Task R3_2:
AGGREGATE: group by:customer_id
UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
Statistics: Num rows: 1.0, Data size: 116.0
RS: order: +
nullDirection: *
optimizeOrderBy: True
valueDestLimit: 10
dist: HASH
keys:
customer_id
values:
customer_id (string)
__agg_0 (double)
__agg_1 (bigint)
partitions:
Statistics: Num rows: 1.0, Data size: 116.0
In Task R4_3:
SEL: customer_id,__agg_0,__agg_1
Statistics: Num rows: 1.0, Data size: 116.0
SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
Statistics: Num rows: 1.0, Data size: 216.0
FS: output: Screen
schema:
ashop (string)
ap (double)
bp (bigint)
Statistics: Num rows: 1.0, Data size: 116.0
OKThe execution plan shows four tasks:
M1 -- Scans
sale_detail, filters nullcustomer_idvalues, and broadcasts the data (BROADCAST distribution).M2_1 -- Scans
sale_detail_jt, filters nullcustomer_idvalues, and performs a map-side hash join (HASHJOIN) with the broadcast data from M1. Results are distributed by HASH oncustomer_id.R3_2 -- Aggregates with
GROUP BY customer_id, computingSUM(total_price)andCOUNT(total_price). Both aggregations run in Complete mode.R4_3 -- Selects the final columns, aliases them (
ashop,ap,bp), and writes the output to screen.
Example 2: Map-side join (MAPJOIN hint)
This example uses a /*+ mapjoin(a) */ hint to force a map-side join with a non-equi join condition (a.total_price < b.total_price).
Query:
SELECT /*+ mapjoin(a) */
a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt
WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;Run EXPLAIN:
EXPLAIN
SELECT /*+ mapjoin(a) */
a.customer_id AS ashop, SUM(a.total_price) AS ap,COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt
WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;Output:
job0 is root job
In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2
In Task M1:
Data source: doc_****.sale_detail_jt/sale_date=2013/region=china
TS: doc_****.sale_detail_jt/sale_date=2013/region=china
Statistics: Num rows: 3.0, Data size: 324.0
RS: valueDestLimit: 0
dist: BROADCAST
keys:
values:
customer_id (string)
total_price (double)
partitions:
Statistics: Num rows: 3.0, Data size: 324.0
In Task M2_1:
Data source: doc_****.sale_detail/sale_date=2013/region=china
TS: doc_****.sale_detail/sale_date=2013/region=china
Statistics: Num rows: 3.0, Data size: 24.0
HASHJOIN:
StreamLineRead1 INNERJOIN TableScan2
keys:
0:
1:
non-equals:
0:
1:
bigTable: TableScan2
Statistics: Num rows: 9.0, Data size: 1044.0
FIL: LT(total_price,total_price)
Statistics: Num rows: 6.75, Data size: 783.0
AGGREGATE: group by:customer_id
UDAF: SUM(total_price) (__agg_0_sum)[Partial_1],COUNT(total_price) (__agg_1_count)[Partial_1]
Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
RS: order: +
nullDirection: *
optimizeOrderBy: False
valueDestLimit: 0
dist: HASH
keys:
customer_id
values:
customer_id (string)
__agg_0_sum (double)
__agg_1_count (bigint)
partitions:
customer_id
Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
In Task R3_2:
AGGREGATE: group by:customer_id
UDAF: SUM(__agg_0_sum)[Final] __agg_0,COUNT(__agg_1_count)[Final] __agg_1
Statistics: Num rows: 1.6875, Data size: 195.75
RS: order: +
nullDirection: *
optimizeOrderBy: True
valueDestLimit: 10
dist: HASH
keys:
customer_id
values:
customer_id (string)
__agg_0 (double)
__agg_1 (bigint)
partitions:
Statistics: Num rows: 1.6875, Data size: 195.75
In Task R4_3:
SEL: customer_id,__agg_0,__agg_1
Statistics: Num rows: 1.6875, Data size: 195.75
SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
Statistics: Num rows: 1.6875, Data size: 364.5
FS: output: Screen
schema:
ashop (string)
ap (double)
bp (bigint)
Statistics: Num rows: 1.6875, Data size: 195.75
OKKey differences from Example 1:
M1 broadcasts
sale_detail_jt(the table specified by the mapjoin hint) without applying a filter, since the non-equi join condition is evaluated after the join.M2_1 performs the HASHJOIN, then applies the filter
LT(total_price, total_price)(representinga.total_price < b.total_price), and runs a partial aggregation (Partial_1phase).R3_2 completes the aggregation in the
Finalphase, merging the partial results from M2_1.R4_3 selects and aliases the final output columns, same as Example 1.