Subqueries can be used when you want to perform a further query based on the query results of a SELECT statement. This topic describes the definition and usage of subqueries that are supported in MaxCompute.

Description

Subqueries are nested inside a SELECT statement to perform complex data queries. MaxCompute supports the following types of subqueries:

Sample data

This topic provides source data and sample statements for generating source data. This helps you understand how to prepare source data. Sample statements:
-- Create a partitioned table named sale_detail.
create table if not exists sale_detail
(
shop_name     string,
customer_id   string,
total_price   double
)
partitioned by (sale_date string, region string);

-- Add partitions to the sale_detail table.
alter table sale_detail add partition (sale_date='2013', region='china') partition (sale_date='2014', region='shanghai');

-- Insert data into the sale_detail table.
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);

Basic subquery

The object of a common query is a destination table. You can also use a SELECT statement as a query object. In this case, the SELECT statement is considered a subquery. A subquery in a FROM clause can be used as a table. You can join the subquery with other tables or subqueries. For more information about JOIN operations, see JOIN.
  • Syntax
    • Syntax 1
      select <select_expr> from (<select_statement>) <sq_alias_name>;
    • Syntax 2
      select (<select_statement>) from <table_name>;
  • Parameters
    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates a common column or partition key column that you want to query, or a regular expression that is used for a query.
    • select_statement: required. This parameter specifies a subquery clause. If select_statement uses Syntax 2, the subquery result can have only one row. For more information about the syntax, see SELECT syntax.
    • sq_alias_name: required. This parameter specifies the alias of a subquery.
    • table_name: required. This parameter specifies the name of a destination table.
  • Examples
    • Example 1: Use Syntax 1. Sample statement:
      select * from (select shop_name from sale_detail) a;
      The following result is returned:
      +------------+
      | shop_name  |
      +------------+
      | s1         |
      | s2         |
      | s3         |
      | null       |
      | s6         |
      | s7         |
      +------------+
    • Example 2: Use Syntax 2. Sample statement:
      select (select * from sale_detail where shop_name='s1') from sale_detail;
      The following result is returned:
      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      +------------+-------------+-------------+------------+------------+
    • Example 3: Use Syntax 1. In this example, a subquery in a FROM clause is used as a table, and the subquery is joined with other tables or subqueries. Sample statements:
      -- Create a table and join the table with a subquery.
      create table shop as select shop_name,customer_id,total_price from sale_detail;
      select a.shop_name, a.customer_id, a.total_price from
      (select * from shop) a join sale_detail on a.shop_name = sale_detail.shop_name;
      The following result is returned:
      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | null       | c5          | NULL        |
      | s6         | c6          | 100.4       |
      | s7         | c7          | 100.5       |
      | s1         | c1          | 100.1       |
      | s2         | c2          | 100.2       |
      | s3         | c3          | 100.3       |
      +------------+-------------+-------------+

IN SUBQUERY

The usage of IN SUBQUERY is similar to that of SEMI JOIN.

  • Syntax
    • Syntax 1
      select <select_expr1> from <table_name1> where <select_expr2> in (select <select_expr2> from <table_name2>);
      -- The preceding statement is equivalent to the following statement with LEFT SEMI JOIN:
      select <select_expr1> from <table_name1> <alias_name1> left semi join <table_name2> <alias_name2> on <alias_name1>.<select_expr1> = <alias_name2>.<select_expr2>;
      Note If select_expr2 specifies partition key columns, select <select_expr2> from <table_name2> is not converted into SEMI JOIN. A separate job is started to run a subquery. MaxCompute compares the subquery results with the columns specified in select_expr2 in sequence. If the partitions of the table specified by table_name1 contain the columns in select_expr2 and these columns are not included in the results, MaxCompute does not read data from these partitions. This ensures that partition pruning is still valid.
    • Syntax 2
      MaxCompute supports IN SUBQUERY and correlated conditions. where <table_name2_colname> = <table_name1>.<colname> is a correlated condition. MaxCompute V1.0 does not support expressions that reference source tables from both subqueries and outer queries. MaxCompute V2.0 supports such expressions. These filter conditions are part of the ON condition in SEMI JOIN operations.
      select <select_expr1> from <table_name1> where <select_expr2> in (select <select_expr2> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>);
      Note MaxCompute supports IN SUBQUERY that does not serve as a JOIN condition. For example, a non-WHERE clause uses IN SUBQUERY, or a WHERE clause uses IN SUBQUERY that cannot be converted into a JOIN condition. In this case, IN SUBQUERY cannot be converted into SEMI JOIN. A separate job must be started to run a subquery. Correlated conditions are not supported.
  • Parameters
    • select_expr1: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates a common column or partition key column that you want to query, or a regular expression that is used for a query.
    • table_name1 and table_name2: required. These parameters specify the names of tables.
    • select_expr2: required. This parameter specifies the names of the columns in the tables specified by table_name1 and table_name2. Columns in the two tables are mapped to each other.
    • col_name: required. This parameter specifies the name of a column in the table.
  • Examples
    • Example 1: Use Syntax 1. Sample statement:
      select * from sale_detail where shop_name in (select shop_name from shop);
      The following result is returned:
      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | null       | c5          | NULL        | 2014       | shanghai   |
      | s6         | c6          | 100.4       | 2014       | shanghai   |
      | s7         | c7          | 100.5       | 2014       | shanghai   |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s2         | c2          | 100.2       | 2013       | china      |
      | s3         | c3          | 100.3       | 2013       | china      |
      +------------+-------------+-------------+------------+------------+
    • Example 2: Use Syntax 2. Sample statement:
      select * from shop where shop_name in (select shop_name from sale_detail where customer_id = shop.customer_id);
      The following result is returned:
      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s3         | c3          | 100.3       |
      | s7         | c7          | 100.5       |
      | s2         | c2          | 100.2       |
      | null       | c5          | NULL        |
      | s6         | c6          | 100.4       |
      | s1         | c1          | 100.1       |
      +------------+-------------+-------------+
    • Example 3: IN SUBQUERY does not serve as a JOIN condition. Sample statement:
      select * from sale_detail where shop_name in (select shop_name from shop) and total_price > 100.3;

      IN SUBQUERY cannot be converted into SEMI JOIN. This is because the WHERE clause includes an AND operator. A separate job is started to run the subquery.

      The following result is returned:
      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | s6         | c6          | 100.4       | 2014       | shanghai   |
      | s7         | c7          | 100.5       | 2014       | shanghai   |
      +------------+-------------+-------------+------------+------------+

NOT IN SUBQUERY

The usage of NOT IN SUBQUERY is similar to that of SEMI JOIN. However, if the values of a row are NULL for a specified column in a destination table, the value of the expression in NOT IN SUBQUERY is NULL. In this case, the WHERE condition is invalid, and no data is returned. This is different from LEFT ANTI JOIN.

  • Syntax
    • Syntax 1
      select <select_expr1> from <table_name1> where <select_expr2> not in (select <select_expr2> from <table_name2>);
      -- The preceding statement is equivalent to the following statement with LEFT SEMI JOIN.
      select <select_expr1> from <table_name1> <alias_name1> left anti join <table_name2> <alias_name2> on <alias_name1>.<select_expr1> = <alias_name2>.<select_expr2>;
      Note If select_expr2 specifies partition key columns, select <select_expr2> from <table_name2> is not converted into ANTI JOIN. A separate job is started to run a subquery. MaxCompute compares the subquery results with the columns specified in select_expr2 in sequence. If the partitions of the table specified by table_name1 contain the columns in select_expr2 and these columns are not included in the results, MaxCompute does not read data from these partitions. This ensures that partition pruning is still valid.
    • Syntax 2
      MaxCompute supports NOT IN SUBQUERY and correlated conditions. where <table_name2_colname> = <table_name1>.<colname> in a subquery is a correlated condition. MaxCompute V1.0 does not support expressions that reference source tables from both subqueries and outer queries. MaxCompute V2.0 supports such expressions. These filter conditions are part of the ON condition in ANTI JOIN operations.
      select <select_expr1> from <table_name1> where <select_expr2> in (select <select_expr2> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>);
      Note MaxCompute supports NOT IN SUBQUERY that does not serve as a JOIN condition. For example, a non-WHERE clause uses NOT IN SUBQUERY, or a WHERE clause uses NOT IN SUBQUERY that cannot be converted into a JOIN condition. In this case, NOT IN SUBQUERY cannot be converted into ANTI JOIN. A separate job must be started to run a subquery. Correlated conditions are not supported.
  • Parameters
    • select_expr1: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates a common column or partition key column that you want to query, or a regular expression that is used for a query.
    • table_name1 and table_name2: required. These parameters specify the names of tables.
    • select_expr2: required. This parameter specifies the names of the columns in the tables specified by table_name1 and table_name2. Columns in the two tables are mapped to each other.
    • col_name: required. This parameter specifies the name of a column in the table.
  • Examples
    • Example 1: Use Syntax 1. Sample statements:
      -- Create the shop1 table and insert data into the table.
      create table shop1 as select shop_name,customer_id,total_price from sale_detail;
      insert into shop1 values ('s8','c1',100.1);
      
      select * from shop2 where shop_name not in (select shop_name from sale_detail);
      The following result is returned:
      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s8         | c1          | 100.1       |
      +------------+-------------+-------------+
    • Example 2: Use Syntax 2. Sample statement:
      select * from shop1 where shop_name not in (select shop_name from sale_detail where customer_id = shop1.customer_id);
      The following result is returned:
      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s8         | c1          | 100.1       |
      +------------+-------------+-------------+
    • Example 3: NOT IN SUBQUERY does not serve as a JOIN condition. Sample statement:
      select * from shop1 where shop_name not in (select shop_name from sale_detail) and total_price < 100.3;

      NOT IN SUBQUERY cannot be converted into ANTI JOIN. This is because the WHERE clause includes an AND operator. A separate job is started to run a subquery.

      The following result is returned:
      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s8         | c1          | 100.1       |
      +------------+-------------+-------------+
    • Example 4: If the values of any column in the table from which you want to query data are NULL, no data is returned. Sample statements:
      -- Create the sale table and insert data into the table.
      create table if not exists sale
      (
      shop_name     string,
      customer_id   string,
      total_price   double
      )
      partitioned by (sale_date string, region string);
      alter table sale add partition (sale_date='2013', region='china');
      insert into sale partition (sale_date='2013', region='china') values ('null','null',null),('s2','c2',100.2),('s3','c3',100.3);
      
      select * from sale where shop_name not in (select shop_name from sale_detail);
      The following result is returned:
      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      +------------+-------------+-------------+------------+------------+

EXISTS SUBQUERY

When you use an EXISTS SUBQUERY clause, if the subquery returns at least one row of data, True is returned. If the subquery does not return data, False is returned.

MaxCompute supports only the WHERE subqueries that have correlated conditions. To use an EXISTS SUBQUERY clause, you must convert this clause into LEFT SEMI JOIN.

  • Syntax
    select <select_expr> from <table_name1> where exists (select <select_expr> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>);
  • Parameters
    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates a common column or partition key column that you want to query, or a regular expression that is used for a query.
    • table_name1 and table_name2: required. These parameters specify the names of tables.
    • col_name: required. This parameter specifies the name of a column in the table.
  • Example
    select * from sale_detail where exists (select * from shop where customer_id = sale_detail.customer_id);
    -- The preceding statement is equivalent to the following statement:
    select * from sale_detail a left semi join shop b on a.customer_id = b.customer_id;
    The following result is returned:
    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s6         | c6          | 100.4       | 2014       | shanghai   |
    | s7         | c7          | 100.5       | 2014       | shanghai   |
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    +------------+-------------+-------------+------------+------------+

NOT EXISTS SUBQUERY

When you use a NOT EXISTS SUBQUERY clause, if no data is returned, True is returned. Otherwise, False is returned.

MaxCompute supports only the WHERE subqueries that have correlated conditions. To use a NOT EXISTS SUBQUERY clause, you must convert this clause into LEFT ANTI JOIN.

  • Syntax
    select <select_expr> from <table_name1> where not exists (select <select_expr> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>);
  • Parameters
    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates a common column or partition key column that you want to query, or a regular expression that is used for a query.
    • table_name1 and table_name2: required. These parameters specify the names of tables.
    • col_name: required. This parameter specifies the name of a column in the table.
  • Example
    select * from sale_detail where not exists (select * from shop where shop_name = sale_detail.shop_name);
    -- The preceding statement is equivalent to the following statement:
    select * from sale_detail a left semi join shop b on a.shop_name = b.shop_name;
    The following result is returned:
    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    +------------+-------------+-------------+------------+------------+

SCALAR SUBQUERY

If the output result of a SCALAR SUBQUERY clause contains only one row and one column of data, the result can be used as a scalar for data computations.

All SCALAR SUBQUERY clauses whose output result contains only one row and one column of data can be rewritten based on the following syntax. If the output result of SCALAR SUBQUERY contains only one row of data and one MAX or MIN operator is nested outside SCALAR SUBQUERY, the result does not change.

  • Syntax
    select <select_expr> from <table_name1> where (<select count(*) from <table_name2> where <table_name2_colname> = <table_name1>.<colname>) <Scalar operator> <scalar_value>;
    -- The preceding statement is equivalent to the following statement:
    select <table_name1>.<select_expr> from <table_name1> left semi join (select <colname>, count(*) from <table_name2> group by <colname> having count(*) <Scalar operator> <scalar_value>) <table_name2> on <table_name1>.<colname> = <table_name2>.<colname>;
    Note
    • The output result of select count(*) from <table_name2> where <table_name2_colname> = <table_name1>.<colname> is a row set. The output contains only one row and one column of data. In this case, the result can be used as a scalar. In practical application, SCALAR SUBQUERY is converted into JOIN as much as possible.
    • The output result of SCALAR SUBQUERY can be used as a scalar only if you can confirm in the compilation phase that SCALAR SUBQUERY returns only one row and one column of data. If you cannot make this confirmation until the running phase, the compiler reports an error. The compiler can compile the statements that meet the following requirements:
      • The SELECT list of SCALAR SUBQUERY uses aggregate functions that are not included in the parameters of a specified user-defined table-valued function (UDTF).
      • SCALAR SUBQUERY that uses aggregate functions does not include a GROUP BY clause.
  • Parameters
    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates a common column or partition key column that you want to query, or a regular expression that is used for a query.
    • table_name1 and table_name2: required. These parameters specify the names of tables.
    • col_name: required. This parameter specifies the name of a column in the table.
    • Scalar operator: required. The operators include greater than (>), less than (<), equal to (=), greater than or equal to (≥), and less than or equal to (≤).
    • scalar_value: required. This parameter specifies a scalar value.
  • Limits
    • SCALAR SUBQUERY can reference columns from outer queries. If SCALAR SUBQUERY uses multiple-level nesting, only the outermost column can be referenced.
      -- Sample statement that you can execute:
      select * from t1 where (select count(*) from t2 where t1.a = t2.a) = 3; 
      -- Sample statement that you cannot execute. This is because the SELECT list of SCALAR SUBQUERY cannot reference the columns from outer queries.
      select * from t1 where (select count(*) from t2 where (select count(*) from t3 where t3.a = t1.a) = 2) = 3; 
    • SCALAR SUBQUERY can be used only in a WHERE clause.
      -- SCALAR SUBQUERY cannot be referenced in a SELECT clause for a subquery.
      select * from t1 where (select t1.b + count(*) from t2) = 3; 
      -- SCALAR SUBQUERY cannot be referenced in a SELECT clause for an outer query.
      select (select count(*) from t2 where t2.a = t1.a) from t1; 
  • Example
    select * from shop where (select count(*) from sale_detail where sale_detail.shop_name = shop.shop_name) >= 1;
    The following result is returned:
    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    | s6         | c6          | 100.4       | 2014       | shanghai   |
    | s7         | c7          | 100.5       | 2014       | shanghai   |
    +------------+-------------+-------------+------------+------------+