This topic describes how to perform the following operations on datasets of query results in MaxCompute: INTERSECT, INTERSECT ALL, INTERSECT DISTINCT, UNION, UNION ALL, UNION DISTINCT, EXCEPT, EXCEPT ALL, EXCEPT DISTINCT, MINUS, MINUS ALL, and MINUS DISTINCT.

Description

MaxCompute supports the following operations on datasets:
  • INTERSECT: returns the intersection of two datasets. The intersection includes the values that are contained in both datasets.
  • UNION: returns the union of two datasets. The union is a dataset that is obtained by combining the two datasets.
  • EXCEPT and MINUS: returns distinct values from one of two datasets. These values are not contained in the other dataset.

Limits

The following limits are imposed on INTERSECT, UNION, EXCEPT, and MINUS:

  • MaxCompute allows you to perform INTERSECT, UNION, EXCEPT, or MINUS operations on a maximum of 256 datasets at the same time. If the number of datasets exceeds 256, an error is returned.
  • The number of columns in the left and right tables must be the same.

Usage notes

When you perform INTERSECT, UNION, EXCEPT, or MINUS operations on two datasets, take note of the following items:

  • The results of INTERSECT, UNION, EXCEPT, or MINUS operations may not be sorted in a specific order.
  • If data types of the datasets are not consistent, MaxCompute implicitly converts the data types before you perform INTERSECT, UNION, EXCEPT, or MINUS operations. For more information about implicit conversions, see Data types. To prevent compatibility issues, MaxCompute disables implicit conversions between data of the STRING type and data of other types for INTERSECT, UNION, EXCEPT, or MINUS operations.

INTERSECT

  • Syntax
    -- Obtain an intersection that contains duplicate values.
    <select_statement1> intersect all <select_statement2>;
    -- Obtain an intersection that does not contain duplicate values. The usage of INTERSECT is equivalent to that of INTERSECT DISTINCT.
    <select_statement1> intersect [distinct] <select_statement2>;
  • Parameters
    • select_statement1 and select_statement2: required. These parameters specify the SELECT clauses. For more information about the syntax of the clauses, see SELECT syntax.
    • distinct: optional. This parameter is used to remove duplicate values from the intersection of two datasets.
  • Examples
    • Example 1: Obtain the intersection of two datasets. The intersection contains duplicate values. Sample statement:
      select * from values (1, 2), (1, 2), (3, 4), (5, 6) t(a, b) 
      intersect all 
      select * from values (1, 2), (1, 2), (3, 4), (5, 7) t(a, b);
      The following result is returned:
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 2          |
      | 1          | 2          |
      | 3          | 4          |
      +------------+------------+
    • Example 2: Obtain the intersection of two datasets. The intersection does not contain duplicate values. Sample statements:
      select * from values (1, 2), (1, 2), (3, 4), (5, 6) t(a, b) 
      intersect distinct 
      select * from values (1, 2), (1, 2), (3, 4), (5, 7) t(a, b);
      -- The preceding statement is equivalent to the following statement:
      select distinct * from 
      (select * from values (1, 2), (1, 2), (3, 4), (5, 6) t(a, b) 
      intersect all 
      select * from values (1, 2), (1, 2), (3, 4), (5, 7) t(a, b)) t;
      The following result is returned:
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 2          |
      | 3          | 4          |
      +------------+------------+

UNION

  • Syntax
    -- Obtain a union that contains duplicate values.
    <select_statement1> union all <select_statement2>;
    -- Obtain a union that does not contain duplicate values.
    <select_statement1> union [distinct] <select_statement2>;
  • Usage notes
    • If multiple UNION ALL operations exist, use parentheses () to specify the priorities of the UNION ALL operations.
    • If UNION is followed by a CLUSTER BY, DISTRIBUTE BY, SORT BY, ORDER BY, or LIMIT clause and odps.sql.type.system.odps2 is set to false, the clause works only on the last select_statement of UNION. If odps.sql.type.system.odps2 is set to true, the clause works on the results of all UNION operations.
  • Parameters
    • select_statement1 and select_statement2: required. These parameters specify the SELECT clauses. For more information about the syntax of the clauses, see SELECT syntax.
    • distinct: optional. This parameter is used to remove duplicate values from the union of two datasets.
  • Examples
    • Example 1: Obtain the union of two datasets. The union contains duplicate values. Sample statement:
      select * from values (1, 2), (1, 2), (3, 4) t(a, b)
      union all
      select * from values (1, 2), (1, 4) t(a, b);
      The following result is returned:
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 2          |
      | 1          | 2          |
      | 3          | 4          |
      | 1          | 2          |
      | 1          | 4          |
      +------------+------------+
    • Example 2: Obtain the union of two datasets. The union does not contain duplicate values. Sample statements:
      select * from values (1, 2), (1, 2), (3, 4) t(a, b)
      union distinct 
      select * from values (1, 2), (1, 4) t(a, b);
      -- The preceding statement is equivalent to the following statement:
      select distinct * from (
      select * from values (1, 2), (1, 2), (3, 4) t(a, b) 
      union all 
      select * from values (1, 2), (1, 4) t(a, b));
      The following result is returned:
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 2          |
      | 1          | 4          |
      | 3          | 4          |
      +------------+------------+
    • Example 3: Use parentheses () to specify the priorities of UNION ALL operations. Sample statement:
      select * from values (1, 2), (1, 2), (5, 6) t(a, b)
      union all
      (select * from values (1, 2), (1, 2), (3, 4) t(a, b)
      union all
      select * from values (1, 2), (1, 4) t(a, b));
      The following result is returned:
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 2          |
      | 1          | 2          |
      | 5          | 6          |
      | 1          | 2          |
      | 1          | 2          |
      | 3          | 4          |
      | 1          | 2          |
      | 1          | 4          |
      +------------+------------+
    • Example 4: Use UNION that is followed by a CLUSTER BY, DISTRIBUTE BY, SORT BY, ORDER BY, or LIMIT clause in a SELECT statement and set odps.sql.type.system.odps2 to true. Sample statements:
      set odps.sql.type.system.odps2=true;
      select explode(array(3, 1)) as (a) union all select explode(array(0, 4, 2)) as (a) order by a limit 3;
      The following result is returned:
      +------------+
      | a          |
      +------------+
      | 0          |
      | 1          |
      | 2          |
      +------------+
    • Example 5: Use UNION that is followed by a CLUSTER BY, DISTRIBUTE BY, SORT BY, ORDER BY, or LIMIT clause in a SELECT statement and set odps.sql.type.system.odps2 to false. Sample statements:
      set odps.sql.type.system.odps2=false;
      select explode(array(3, 1)) as (a) union all select explode(array(0, 4, 2)) as (a) order by a limit 3;
      The following result is returned:
      +------------+
      | a          |
      +------------+
      | 3          |
      | 1          |
      | 0          |
      | 2          |
      | 4          |
      +------------+

EXCEPT and MINUS

  • Syntax
    -- Obtain the supplementary set of two datasets. The supplementary set contains duplicate values.
    <select_statement1> except all <select_statement2>;
    <select_statement1> minus all <select_statement2>;
    -- Obtain the supplementary set of two datasets. The supplementary set does not contain duplicate values.
    <select_statement1> except [distinct] <select_statement2>;
    <select_statement1> minus [distinct] <select_statement2>;
    Note The usage of EXCEPT is equivalent to that of MINUS.
  • Parameters
    • select_statement1 and select_statement2: required. These parameters specify the SELECT clauses. For more information about the syntax of the clauses, see SELECT syntax.
    • distinct: optional. This parameter is used to remove duplicate values from the supplementary set of two datasets.
  • Examples
    • Example 1: Obtain the supplementary set of two datasets. The supplementary set contains duplicate values. Sample statements:
      select * from values (1, 2), (1, 2), (3, 4), (3, 4), (5, 6), (7, 8) t(a, b)
      except all
      select * from values (3, 4), (5, 6), (5, 6), (9, 10) t(a, b);
      -- The preceding statement is equivalent to the following statement:
      select * from values (1, 2), (1, 2), (3, 4), (3, 4), (5, 6), (7, 8) t(a, b)
      minus all 
      select * from values (3, 4), (5, 6), (5, 6), (9, 10) t(a, b);
      The following result is returned:
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 2          |
      | 1          | 2          |
      | 3          | 4          |
      | 7          | 8          |
      +------------+------------+
    • Example 2: Obtain the supplementary set of two datasets. The supplementary set does not contain duplicate values. Sample statements:
      select * from values (1, 2), (1, 2), (3, 4), (3, 4), (5, 6), (7, 8) t(a, b)
      except distinct 
      select * from values (3, 4), (5, 6), (5, 6), (9, 10) t(a, b);
      -- The preceding statement is equivalent to the following statements:
      select * from values (1, 2), (1, 2), (3, 4), (3, 4), (5, 6), (7, 8) t(a, b)
      minus distinct 
      select * from values (3, 4), (5, 6), (5, 6), (9, 10) t(a, b);
      -- The preceding statements are equivalent to the following statement:
      select distinct * from values (1, 2), (1, 2), (3, 4), (3, 4), (5, 6), (7, 8) t(a, b) except all select * from values (3, 4), (5, 6), (5, 6), (9, 10) t(a, b);
      The following result is returned:
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 2          |
      | 7          | 8          |
      +------------+------------+