This topic describes the definition and operation methods of the subqueries that are supported in MaxCompute.
Basic definition
SELECT
statement. In this case, this SELECT statement constituents a subquery. A subquery
must have an alias.
-- Normal query
select * from sale_detail;
-- Subquery
select * from (select shop_name from sale_detail) a;
select (select a from table1) from table2;
FROM
clause, a subquery can be used as a table. You can join
the subquery with other tables or subqueries.
create table shop as select * from sale_detail;
select a.shop_name, a.customer_id, a.total_price from
(select * from shop) a join sale_detail on a.shop_name = sale_detail.shop_name;
IN SUBQUERY/NOT IN SUBQUERY
- IN SUBQUERY
IN SUBQUERY
is used in a similar manner toLEFT SEMI JOIN
.select * from mytable1 where id in (select id from mytable2); -- Equivalent to the following statement: select * from mytable1 a left semi join mytable2 b on a.id = b.id;
MaxCompute supportsIN SUBQUERY
and CORRELATED conditions.where value = mytable1.value
in the following subquery is a CORRELATED condition. MaxCompute V1.0 does not support expressions that reference source tables from subqueries and outer queries. MaxCompute V2.0 supports such expressions. These filtering conditions are part of theON
condition inSEMI JOIN
operations.select * from mytable1 where id in (select id from mytable2 where value = mytable1.value);
- NOT IN SUBQUERY
NOT IN SUBQUERY
is used in a similar manner toLEFT ANTI JOIN
, but not the same. The following example shows how NOT IN SUBQUERY is equivalent to LEFT ANTI JOIN:
Ifselect * from mytable1 where id not in (select id from mytable2); --If none of the IDs in mytable2 are NULL, this statement is equivalent to the following statement: select * from mytable1 a left anti join mytable2 b on a.id = b.id;
mytable2
contains a column that has a NULL ID, theNOT IN
expression is NULL. In this case, theWHERE
condition is invalid, and no data is returned. This is different fromLEFT ANTI JOIN
.MaxCompute V1.0 supportsIN SUBQUERY
or NOT IN SUBQUERY that does not serve asJOIN
conditions. For example, a non-WHERE
clause uses IN SUBQUERY or NOT IN SUBQUERY or aWHERE
clause uses IN SUBQUERY or NOT IN SUBQUERY that cannot be converted to aJOIN
condition. MaxCompute V2.0 also supports this feature. However, In MaxCompute V2.0, IN SUBQUERY or NOT IN SUBQUERY cannot be converted toSEMI JOIN
, and a separate job must be started to run subqueries. Therefore, IN SUBQUERY or NOT IN SUBQUERY does not support CORRELATED conditions. The following statement provides an example:
IN SUBQUERY or NOT IN SUBQUERY cannot be converted toselect * from mytable1 where id in (select id from mytable2) or value > 0;
SEMI JOIN
because theWHERE
clause includes anOR
operator. A separate job is started to run the subquery.Partitioned tables also require special processing. The following statement provides an example:
IfSELECT * from sales_detail where ds in (select dt from sales_date);
ds
is a partition key column,select dt from sales_date
is not converted toSEMI JOIN
but starts a separate job. After the subquery is run, the results are compared withds
one after one. If ads
value insales_detail
is not in the returned results, the partition is not read. This ensures that partition pruning is still valid.
EXISTS SUBQUERY/NOT EXISTS SUBQUERY
For an EXISTS SUBQUERY
clause, if the subquery returns at least one data row, True is returned. If the subquery
returns no data rows, False is returned. For a NOT EXISTS SUBQUERY
clause, if the subquery returns no data rows, True is returned. If the subquery returns
at least one data row, False is returned.
MaxCompute supports only subqueries that contain correlated WHERE
conditions. EXISTS SUBQUERY
or NOT EXISTS SUBQUERY
is converted into LEFT SEMI JOIN
or LEFT ANTI JOIN
for implementation.
SELECT * from mytable1 where exists (select * from mytable2 where id = mytable1.id);
-- Equivalent to the following statement:
SELECT * from mytable1 a left semi join mytable2 b on a.id = b.id;
select * from mytable1 where not exists (select * from mytable2 where id = mytable1.id);
-- Equivalent to the following statement:
select * from mytable1 a left anti join mytable2 b on a.id = b.id;
SCALAR SUBQUERY
select * from t1 where (select count(*) from t2 where t1.a = t2.a) > 1;
-- Equivalent to the following statement:
select t1.* from t1 left semi join (select a, count(*) from t2 group by a having count(*) > 1) t2 on t1.a = t2.a;
The output of select count(*) from t2 where t1.a = t2.a;
is a row set. This indicates that the output of the statement has only one row and
one column. Therefore, the subquery can be used as a scalar for the scalar operation
(>). However, in implementation, the subquery is converted to a JOIN
query, such as the equivalent statement in the preceding example.
- The
SELECT
list of the subquery uses aggregate functions that are not included in the parameter list of table-valued functions. - The query layer that contains aggregate functions in the subquery does not have the
GROUP BY
clause.
- Scalar subqueries can reference columns from outer queries. If the scalar subquery
has multiple layers of nesting, only the outermost column can be referenced.
select * from t1 where (select count(*) from t2 where t1.a = t2.a) = 3; -- This operation is allowed. select * from t1 where (select count(*) from t2 where (select count(*) from t3 where t3.a = t1.a) = 2) = 3; -- This operation is not allowed. The column of an outer query cannot be referenced within a subquery of a subquery.
- Scalar subqueries can be used only in the
WHERE
clause.select * from t1 where (select t1.b + count(*) from t2) = 3; -- This cannot be referenced in the SELECT statement of a subquery. select (select count(*) from t2 where t2.a = t1.a) from t1; -- This cannot be referenced in the SELECT statement of an outer query.
You can rewrite all subqueries that return single-row, single-column outputs based
on the preceding examples. If the query result contains only one row, a MAX
or MIN
operation that is nested on the outer layer of the subquery does not change the result.