EXPLAIN - 云原生大数据计算服务 MaxCompute

在实际开发过程中，通常需要分析查询语句或表结构来分析性能瓶颈，MaxCompute SQL为您提供explain语句实现此功能。本文为您介绍explain的功能、命令格式及使用示例。

功能介绍

explain语句可以显示MaxCompute SQL对应的DML语句执行计划（执行SQL语义的程序）的结构，帮助您了解SQL语句的处理过程，为优化SQL语句提供帮助。一个查询语句作业会对应多个Job，一个Job对应多个Task。

说明如果查询语句足够复杂，explain的结果较多，则会触发API的限制，无法得到完整的EXPLAIN结果。此时您可以拆分查询语句，对各部分分别执行explain语句，以了解Job的结构。

命令格式

explain <dml query>;

dml query：必填。select语句，更多信息请参见SELECT语法。

返回说明

explain的执行结果包含如下信息：

Job间的依赖关系
例如job0 is root job。如果查询只需要一个Job（job0），只会显示一行信息。
Task间的依赖关系
```
In Job job0:
root Tasks: M1_Stg1, M2_Stg1
J3_1_2_Stg1 depends on: M1_Stg1, M2_Stg1
```
job0包含三个Task，M1_Stg1、M2_Stg1和J3_1_2_Stg1。系统会先执行M1_Stg1和M2_Stg1两个Task，执行完成后，再执行J3_1_2_Stg1。
Task的命名规则如下：
- 在MaxCompute中，共有四种Task类型：MapTask、ReduceTask、JoinTask和LocalWork。Task名称的第一个字母表示了当前Task的类型，例如M2Stg1就是一个MapTask。
- 紧跟着第一个字母后的数字，代表了当前Task的ID。这个ID在当前查询对应的所有Task中是唯一的。
- 用下划线（_）分隔的数字代表当前Task的直接依赖，例如J3_1_2_Stg1表示当前Task ID为3，依赖ID为1（M1_Stg1）和ID为2（M2_Stg1）的两个Task。

Task中所有Operator的依赖结构。

Operator串描述了一个Task的执行语义。结构示例如下：

In Task M2:
    Data source: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china  # "Data source"描述了当前Task的输入内容。
    TS: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china           # TableScanOperator
        FIL: ISNOTNULL(customer_id)                                   # FilterOperator
            RS: order: +                                              # ReduceSinkOperator
                nullDirection: *
                optimizeOrderBy: False
                valueDestLimit: 0
                dist: HASH
                keys:
                      customer_id
                values:
                      customer_id (string)
                      total_price (double)
                partitions:
                      customer_id


In Task J3_1_2:
    JOIN:                                                           # JoinOperator
         StreamLineRead1 INNERJOIN StreamLineRead2
         keys:
             0:customer_id
             1:customer_id

        AGGREGATE: group by:customer_id                            # GroupByOperator
         UDAF: SUM(total_price) (__agg_0_sum)[Complete],SUM(total_price) (__agg_1_sum)[Complete]
            RS: order: +
                nullDirection: *
                optimizeOrderBy: True
                valueDestLimit: 10
                dist: HASH
                keys:
                      customer_id
                values:
                      customer_id (string)
                      __agg_0 (double)
                      __agg_1 (double)
                partitions:


In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1                               # SelectOperator
        LIM:limit 10                                               # LimitOperator
            FS: output: Screen                                     # FileSinkOperator
                schema:
                  customer_id (string) AS ashop
                  __agg_0 (double) AS ap
                  __agg_1 (double) AS bp

各Operator的含义如下：

TableScanOperator（TS）：描述查询语句中的from语句块的逻辑。explain结果中会显示输入表的名称（Alias）。
SelectOperator（SEL）：描述查询语句中的select语句块的逻辑。explain结果中会显示向下一个Operator传递的列，多个列由逗号分隔。
- 如果是列的引用，则显示为<alias>.<column_name>。
- 如果是表达式的结果，则显示为函数形式，例如func1(arg1_1, arg1_2, func2(arg2_1, arg2_2))。
- 如果是常量，则直接显示常量值。
FilterOperator（FIL）：描述查询语句中的where语句块的逻辑。explain结果中会显示一个where条件表达式，形式类似SelectOperator的显示规则。
JoinOperator（JOIN）：描述查询语句中的join语句块的逻辑。explain结果中会显示哪些表以哪种方式Join在一起。
GroupByOperator（例如AGGREGATE）：描述聚合操作的逻辑。如果查询中使用了聚合函数，就会出现该结构，explain结果中会显示聚合函数的内容。
ReduceSinkOperator（RS）：描述Task间数据分发操作的逻辑。如果当前Task的结果会传递给另一个Task，则必然需要在当前Task的最后，使用ReduceSinkOperator执行数据分发操作。explain的结果中会显示输出结果的排序方式、分发的Key、Value以及用来求Hash值的列。
FileSinkOperator（FS）：描述最终数据的存储操作。如果查询中有insert语句块，explain结果中会显示目标表名称。
LimitOperator（LIM）：描述查询语句中的limit语句块的逻辑。explain结果中会显示limit数。
MapjoinOperator（HASHJOIN）：类似JoinOperator，描述大表的join操作。

示例数据

为便于理解，本文为您提供源数据，基于源数据提供相关示例。创建表sale_detail和sale_detail_jt，并添加数据，命令示例如下：

--创建分区表sale_detail和sale_detail_jt。
create table if not exists sale_detail
(
shop_name     string,
customer_id   string,
total_price   double
)
partitioned by (sale_date string, region string);
create table if not exists sale_detail_jt
(
shop_name     string,
customer_id   string,
total_price   double
)
partitioned by (sale_date string, region string);

--向源表增加分区。
alter table sale_detail add partition (sale_date='2013', region='china') partition (sale_date='2014', region='shanghai');
alter table sale_detail_jt add partition (sale_date='2013', region='china');

--向源表追加数据。
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
insert into sale_detail_jt partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2);

查询表sale_detail和sale_detail_jt中的数据，命令示例如下：
select * from sale_detail;
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
| null       | c5          | NULL        | 2014       | shanghai   |
| s6         | c6          | 100.4       | 2014       | shanghai   |
| s7         | c7          | 100.5       | 2014       | shanghai   |
+------------+-------------+-------------+------------+------------+
select * from sale_detail_jt;
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s5         | c2          | 100.2       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

--创建做关联的表。
create table shop as select shop_name, customer_id, total_price from sale_detail;

使用示例

基于示例数据，执行命令如下。

--查询语句。
select a.customer_id as ashop, sum(a.total_price) as ap,count(b.total_price) as bp 
from (select * from sale_detail_jt where sale_date='2013' and region='china') a 
inner join (select * from sale_detail where sale_date='2013' and region='china') b 
on a.customer_id=b.customer_id 
group by a.customer_id 
order by a.customer_id 
limit 10;
--获取查询语句语义。
explain 
select a.customer_id as ashop, sum(a.total_price) as ap,count(b.total_price) as bp 
from (select * from sale_detail_jt where sale_date='2013' and region='china') a 
inner join (select * from sale_detail where sale_date='2013' and region='china') b 
on a.customer_id=b.customer_id 
group by a.customer_id 
order by a.customer_id 
limit 10;

返回结果如下。

job0 is root job

In Job job0:
root Tasks: M1

In Task M1_U0:
    TS: doc_test_dev.sale_detail_jt/sale_date=2013/region=china
        FIL: ISNOTNULL(customer_id)
            HASHJOIN:
                     Filter1 INNERJOIN Filter2
                     keys:
                         0:customer_id
                         1:customer_id
                     non-equals:
                         0:
                         1:
                     bigTable: Filter1

                LocalSortBy: order: +
                             nullDirection: *
                             keys:customer_id
                    AGGREGATE: group by:customer_id
                     UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
                        LIM:limit 10
                            FS: output: Screen
                                schema:
                                  customer_id (string) AS ashop
                                  __agg_0 (double) AS ap
                                  __agg_1 (bigint) AS bp


In Task M1_U1:
    TS: doc_test_dev.sale_detail/sale_date=2013/region=china
        FIL: ISNOTNULL(customer_id)
            HASHJOIN:
                     Filter1 INNERJOIN Filter2
                     keys:
                         0:customer_id
                         1:customer_id
                     non-equals:
                         0:
                         1:
                     bigTable: Filter1

                LocalSortBy: order: +
                             nullDirection: *
                             keys:customer_id
                    AGGREGATE: group by:customer_id
                     UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
                        LIM:limit 10
                            FS: output: Screen
                                schema:
                                  customer_id (string) AS ashop
                                  __agg_0 (double) AS ap
                                  __agg_1 (bigint) AS bp

基于示例数据，执行命令如下。

--查询语句。
select /*+ mapjoin(a) */
       a.customer_id as ashop, sum(a.total_price) as ap,count(b.total_price) as bp 
 from (select * from sale_detail_jt 
where sale_date='2013' and region='china') a 
inner join (select * from sale_detail where sale_date='2013' and region='china') b 
on a.total_price<b.total_price 
group by a.customer_id 
order by a.customer_id 
limit 10;
--获取查询语句语义。
explain 
select /*+ mapjoin(a) */
       a.customer_id as ashop, sum(a.total_price) as ap,count(b.total_price) as bp 
 from (select * from sale_detail_jt 
where sale_date='2013' and region='china') a 
inner join (select * from sale_detail where sale_date='2013' and region='china') b 
on a.total_price<b.total_price 
group by a.customer_id 
order by a.customer_id 
limit 10;

返回结果如下。

job0 is root job

In Job job0:
root Tasks: M1

In Task M1_U0:
    TS: doc_test_dev.sale_detail_jt/sale_date=2013/region=china
        HASHJOIN:
                 TableScan1 INNERJOIN TableScan2
                 keys:
                     0:
                     1:
                 non-equals:
                     0:
                     1:
                 bigTable: TableScan2

            FIL: LT(total_price,total_price)
                LocalSortBy: order: +
                             nullDirection: *
                             keys:customer_id
                    AGGREGATE: group by:customer_id
                     UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
                        LIM:limit 10
                            FS: output: Screen
                                schema:
                                  customer_id (string) AS ashop
                                  __agg_0 (double) AS ap
                                  __agg_1 (bigint) AS bp


In Task M1_U1:
    TS: doc_test_dev.sale_detail/sale_date=2013/region=china
        HASHJOIN:
                 TableScan1 INNERJOIN TableScan2
                 keys:
                     0:
                     1:
                 non-equals:
                     0:
                     1:
                 bigTable: TableScan2

            FIL: LT(total_price,total_price)
                LocalSortBy: order: +
                             nullDirection: *
                             keys:customer_id
                    AGGREGATE: group by:customer_id
                     UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
                        LIM:limit 10
                            FS: output: Screen
                                schema:
                                  customer_id (string) AS ashop
                                  __agg_0 (double) AS ap
                                  __agg_1 (bigint) AS bp