This topic describes the definition and operation methods of the subqueries that are supported in MaxCompute.

Basic definition

A normal query reads data from tables. However, a query object can also be a SELECT statement. In this case, this SELECT statement constituents a subquery. A subquery must have an alias.
-- Normal query
select * from sale_detail;
-- Subquery
select * from (select shop_name from sale_detail) a;
The following statement has the equivalent effect, but it can return only one record.
select (select a from table1) from table2;
In a FROM clause, a subquery can be used as a table. You can join the subquery with other tables or subqueries.
create table shop as select * from sale_detail;
select a.shop_name, a.customer_id, a.total_price from
(select * from shop) a join sale_detail on a.shop_name = sale_detail.shop_name;

IN SUBQUERY/NOT IN SUBQUERY

  • IN SUBQUERY
    IN SUBQUERY is used in a similar manner to LEFT SEMI JOIN.
    select * from mytable1 where id in (select id from mytable2);
    -- Equivalent to the following statement:
    select * from mytable1 a left semi join mytable2 b on a.id = b.id;
    MaxCompute supports IN SUBQUERY and CORRELATED conditions. where value = mytable1.value in the following subquery is a CORRELATED condition. MaxCompute V1.0 does not support expressions that reference source tables from subqueries and outer queries. MaxCompute V2.0 supports such expressions. These filtering conditions are part of the ON condition in SEMI JOIN operations.
    select * from mytable1 where id in (select id from mytable2 where value = mytable1.value);
  • NOT IN SUBQUERY
    NOT IN SUBQUERY is used in a similar manner to LEFT ANTI JOIN, but not the same. The following example shows how NOT IN SUBQUERY is equivalent to LEFT ANTI JOIN:
    select * from mytable1 where id not in (select id from mytable2);
    --If none of the IDs in mytable2 are NULL, this statement is equivalent to the following statement:
    select * from mytable1 a left anti join mytable2 b on a.id = b.id;
    If mytable2 contains a column that has a NULL ID, the NOT IN expression is NULL. In this case, the WHERE condition is invalid, and no data is returned. This is different from LEFT ANTI JOIN.
    MaxCompute V1.0 supports IN SUBQUERY or NOT IN SUBQUERY that does not serve as JOIN conditions. For example, a non-WHERE clause uses IN SUBQUERY or NOT IN SUBQUERY or a WHERE clause uses IN SUBQUERY or NOT IN SUBQUERY that cannot be converted to a JOIN condition. MaxCompute V2.0 also supports this feature. However, In MaxCompute V2.0, IN SUBQUERY or NOT IN SUBQUERY cannot be converted to SEMI JOIN, and a separate job must be started to run subqueries. Therefore, IN SUBQUERY or NOT IN SUBQUERY does not support CORRELATED conditions. The following statement provides an example:
    select * from mytable1 where id in (select id from mytable2) or value > 0;
    IN SUBQUERY or NOT IN SUBQUERY cannot be converted to SEMI JOIN because the WHERE clause includes an OR operator. A separate job is started to run the subquery.
    Partitioned tables also require special processing. The following statement provides an example:
    SELECT * from sales_detail where ds in (select dt from sales_date);
    If ds is a partition key column, select dt from sales_date is not converted to SEMI JOIN but starts a separate job. After the subquery is run, the results are compared with ds one after one. If a ds value in sales_detail is not in the returned results, the partition is not read. This ensures that partition pruning is still valid.

EXISTS SUBQUERY/NOT EXISTS SUBQUERY

For an EXISTS SUBQUERY clause, if the subquery returns at least one data row, True is returned. If the subquery returns no data rows, False is returned. For a NOT EXISTS SUBQUERY clause, if the subquery returns no data rows, True is returned. If the subquery returns at least one data row, False is returned.

MaxCompute supports only subqueries that contain correlated WHERE conditions. EXISTS SUBQUERY or NOT EXISTS SUBQUERY is converted into LEFT SEMI JOIN or LEFT ANTI JOIN for implementation.

Example 1
SELECT * from mytable1 where exists (select * from mytable2 where id = mytable1.id);
-- Equivalent to the following statement:
SELECT * from mytable1 a left semi join mytable2 b on a.id = b.id;
Example 2
select * from mytable1 where not exists (select * from mytable2 where id = mytable1.id);
-- Equivalent to the following statement:
select * from mytable1 a left anti join mytable2 b on a.id = b.id;

SCALAR SUBQUERY

If the output of a subquery is a single row in a single column, the subquery can be used as a scalar.
select * from t1 where (select count(*) from t2 where t1.a = t2.a) > 1; 
-- Equivalent to the following statement:
select t1.* from t1 left semi join (select a, count(*) from t2 group by a having count(*) > 1) t2 on t1.a = t2.a;
The output of select count(*) from t2 where t1.a = t2.a; is a row set. This indicates that the output of the statement has only one row and one column. Therefore, the subquery can be used as a scalar for the scalar operation (>). However, in implementation, the subquery is converted to a JOIN query, such as the equivalent statement in the preceding example.
Note Subqueries can be used as scalars only if they can be confirmed to return single-row, single-column outputs during the compilation phase. If you use a scalar subquery that can only be confirmed to produce single-row data during the running phase, the compiler reports an error.
The compiler can accept statements that meet the following requirements:
  • The SELECT list of the subquery uses aggregate functions that are not included in the parameter list of table-valued functions.
  • The query layer that contains aggregate functions in the subquery does not have the GROUP BY clause.
When you use scalar subqueries, note the following limits:
  • Scalar subqueries can reference columns from outer queries. If the scalar subquery has multiple layers of nesting, only the outermost column can be referenced.
    select * from t1 where (select count(*) from t2 where t1.a = t2.a) = 3; -- This operation is allowed.
    select * from t1 where (select count(*) from t2 where (select count(*) from t3 where t3.a = t1.a) = 2) = 3; -- This operation is not allowed. The column of an outer query cannot be referenced within a subquery of a subquery.
  • Scalar subqueries can be used only in the WHERE clause.
    select * from t1 where (select t1.b + count(*) from t2) = 3; -- This cannot be referenced in the SELECT statement of a subquery.
    select (select count(*) from t2 where t2.a = t1.a) from t1; -- This cannot be referenced in the SELECT statement of an outer query.

You can rewrite all subqueries that return single-row, single-column outputs based on the preceding examples. If the query result contains only one row, a MAX or MIN operation that is nested on the outer layer of the subquery does not change the result.