This topic describes the SELECT syntax in MaxCompute and the precautions for using SELECT statements to perform operations such as nested query, sorting, and group query.

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY order_condition]
[DISTRIBUTE BY distribute_condition [SORT BY sort_condition] ]
[LIMIT number]

Limits

  • A SELECT statement can return a maximum of 10,000 rows of results. No such limits are imposed when SELECT is used as a clause. If SELECT is used as a clause, the clause returns all results in response to the query from the upper layer.
  • You are not allowed to perform full table scans on partitioned tables by executing SELECT statements.

    By default, you are not allowed to perform full table scans on partitioned tables in your project by using SQL statements if your project is created after 20:00 January 10, 2018. To query data in partitioned tables, you must specify the partitions to scan. This reduces unnecessary I/O of SQL statements and saves computing resources. This can also reduce your computing costs if the pay-as-you-go billing method is used.

    To perform a full table scan on a partitioned table, insert the set odps.sql.allow.fullscan=true; statement before the SQL statement that is used for the full table scan. Then, commit and run the inserted statement along with the SQL statement. Assume that sale_detail is a partitioned table. To perform a full table scan on it, execute the following statements:
    set odps.sql.allow.fullscan=true;
    select * from sale_detail;

Column expression select_expr

To use a SELECT statement to read data from a table, you can specify the columns to read by using one of the following methods:
  • Specify the names of specific columns. For example, to read the shop_name column from the sale_detail table, execute the following statement:
    select shop_name from sale_detail;
  • Use an asterisk (*) to represent all columns. For example, to read all columns from the sale_detail table, execute the following statement:
    select * from sale_detail;
    You can use a WHERE clause to specify filter conditions.
    select * from sale_detail where shop_name like 'hang%';
  • Use a regular expression. Examples:
    • SELECT `abc. *` FROM t;: Select all columns whose names start with abc from the t table.
    • SELECT `(ds)? +. +` FROM t;: Select all columns whose names are not ds from the t table.
    • SELECT `(ds|pt)? +. +` FROM t;: Select all columns except the columns whose names are ds and pt from the t table.
    • SELECT `(d. *)? +. +` FROM t;: Select all columns except the columns whose names start with d from the t table.
    Note

    When you attempt to exclude multiple columns, if the name of col2 is the prefix of the name of col1, make sure that the name of col1 is written before that of col2. That is, the longer one is placed in the front. For example, if you want to exclude partitions ds and dshh from your query of a table, execute SELECT `(dshh|ds)? +. +` FROM tbl; instead of SELECT `(ds|dshh)? +. +` FROM tbl; .

  • Use DISTINCT before the name of a column to eliminate duplicate values from the column and return distinct values. If you use ALL before the name of a column, all values, including the duplicate values, in the columns are returned. If this option is not specified, ALL is assumed.
    Example:
    -- Query data from the region column in the sale_detail table and return distinct values.
    select distinct region from sale_detail;
    +------------+
    | region     |
    +------------+
    | shanghai   |
    +------------+
    -- If you specify multiple columns after the DISTINCT option in a SELECT statement, the DISTINCT option takes effect on all the specified columns instead of a single column.
    select distinct region, sale_date from sale_detail;
    +------------+------------+
    | region     | sale_date  |
    +------------+------------+
    | shanghai   | 20191110   |
    +------------+------------+

TABLE_REFERENCE

table_reference specifies the table that you want to query. You can set table_reference to the name of the table to query or the condition of a nested subquery. Example:
select * from (select region from sale_detail) t where region = 'shanghai';

WHERE

The following table describes the filter conditions that the WHERE clause supports.
Filter condition Description
>, <, =, >=, <=, and <> Relational operators.
like and rlike Make sure that the source and pattern parameters of like and rlike are of the STRING type.
in and not in If a subquery is added after the condition in or not in, the values in only one column can be returned for the subquery and the number of returned values cannot exceed 1,000.
between…and The condition that specifies the query range.
Examples
  • In the WHERE clause of a SELECT statement, you can specify a range of partitions to scan in a table. Example:
    SELECT sale_detail. * 
    FROM sale_detail
    WHERE sale_detail.sale_date >= '2008'
    AND sale_detail.sale_date <= '2014';
    Note You can check whether partition pruning takes effect by executing the EXPLAIN SELECT statement. A common user-defined function (UDF) or the partition condition settings of JOIN may cause the failure for partition pruning to take effect. For more information, see Check whether partition pruning is effective.
  • UDFs support partition pruning. UDFs are executed as small jobs and then replaced with the execution results. You can enable UDF-based partition pruning by using one of the following methods:
    • Add an annotation to the UDF class when you write a UDF.
      @com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
      Note The UDF annotation com.aliyun.odps.udf.annotation.UdfProperty is defined in the odps-sdk-udf.jar file. You must update the version of the referenced odps-sdk-udf to V0.30.x or later.
    • Insert set odps.sql.udf.ppr.deterministic = true; before the SQL statements to execute. Then, all UDFs in the SQL statements are treated as deterministic UDFs. The preceding SET statement backfills partitions with execution results. A maximum of 1,000 partitions can be backfilled. If you add annotations to UDF classes, an error may be returned indicating that more than 1,000 backfilling results have been produced. To ignore this error, you can execute the set odps.sql.udf.ppr.to.subquery = false; statement. After you execute the set odps.sql.udf.ppr.to.subquery = false; statement, UDF-based partition pruning is disabled.
  • The following example shows how to use the between...and condition to filter data:
    SELECT sale_detail. * 
    FROM sale_detail 
    WHERE sale_detail.sale_date BETWEEN '2008' AND '2014';

GROUP BY

Generally, GROUP BY is used together with aggregate functions. If a SELECT statement contains aggregate functions, the following rules must be observed:
  • During the parsing of SQL statements, GROUP BY precedes SELECT. Therefore, the key of GROUP BY must be the names or expressions of columns in the input table of the SELECT statement. You are not allowed to specify the alias of an output column of the SELECT statement as the key of GROUP BY.
  • The key of GROUP BY may be both the name or expression of a column in the input table and the alias of an output column of the SELECT statement. In this case, the name or expression of the column in the input table is used as the key of GROUP BY.
  • If set hive.groupby.position.alias=true; is inserted before a SELECT statement, the constants of the INTEGER type in GROUP BY are treated as column numbers in the SELECT statement.
    set hive.groupby.position.alias=true;-- Execute it together with the subsequent SELECT statement.
    select region, sum(total_price) from sale_detail group by 1;-- 1 indicates region, which is the first column read by the SELECT statement. This statement groups the table data based on the values in the region column and returns the distinct region value and total sales of each group.
Examples
-- The statement uses the name of the region column in the input table as the key of GROUP BY and groups the table data based on the values in the region column.
select region from sale_detail group by region;
-- The statement groups the table data based on the values in the region column and returns the total sales of each group.
select sum(total_price) from sale_detail group by region;
-- The statement groups the table data based on the values in the region column and returns the distinct region value and total sales of each group.
select region, sum(total_price) from sale_detail group by region;
-- An error is returned because the alias of the column in the SELECT statement is used as the key of the GROUP BY clause.
select region as r from sale_detail group by r;
-- A complete expression of the column is required.
select 2 + total_price as r from sale_detail group by 2 + total_price;
-- An error is returned because a column that does not use aggregate functions in the SELECT statement is absent from the GROUP BY clause.
select region, total_price from sale_detail group by region;
-- The statement can be executed because all the columns that do not use aggregate functions exist in the GROUP BY clause.
select region, total_price from sale_detail group by region, total_price;     

ORDER BY/SORT BY/DISTRIBUTE BY

  • DISTRIBUTE BY

    DISTRIBUTE BY is used to shard data based on the hash values of specified columns. You must specify the alias of an output column of the SELECT statement as the key of DISTRIBUTE BY.

    Example:
    -- The statement queries values in the region column of the sale_detail table and shards data based on the hash values of the region column.
    select region from sale_detail distribute by region;
    -- The statement can be properly executed because the column name is an alias.
    select region as r from sale_detail distribute by region;
    The preceding statement is equivalent to the following statement:
    select region as r from sale_detail distribute by r;
  • ORDER BY
    ORDER BY is used to sort all data based on the specified columns.
    • To sort records in descending order, use the desc keyword. By default, records are sorted in ascending order.
    • In an ORDER BY operation, NULL is considered smaller than any other value. This rule is consistent with MySQL, but different from Oracle.
    • ORDER BY must be followed by the alias of a column in the SELECT statement. If a column is included in a SELECT statement but you do not specify an alias for the column, the name of the column is treated as its alias.
    • If set hive.orderby.position.alias=true; is inserted before a SELECT statement, the constants of the INTEGER type in ORDER BY are treated as column numbers in the SELECT statement.
      --Set the flag.
      set hive.orderby.position.alias=true;
      -- Create the src table.
      create table src(key BIGINT,value BIGINT);
      -- Query the src table and sort the first 100 records that are returned in ascending order by value.
      SELECT * FROM src ORDER BY 2 limit 100;
      The preceding statement is equivalent to the following statement:
      SELECT * FROM src ORDER BY value limit 100;
    • You can use OFFSET together with the ORDER BY LIMIT clause to skip a specific number of rows.
      -- Sort the rows of the src table in ascending order by key, and return the eleventh to thirtieth rows. OFFSET 10 indicates that the first 10 rows are skipped, and LIMIT 20 indicates that a maximum of 20 rows can be returned.
      SELECT * FROM src ORDER BY key LIMIT 20 OFFSET 10;
    Note
    • The key of ORDER BY, SORT BY, or DISTRIBUTE BY must be the alias of an output column of the SELECT statement. The column alias can be Chinese.
    • During the parsing of SQL statements in MaxCompute, SELECT precedes ORDER BY, SORT BY, and DISTRIBUTE BY. Therefore, only the names of output columns of SELECT statements can be used as keys.
    Example:
    -- Query the sale_detail table and sort the first 100 records that are returned in ascending order by region.
    select * from sale_detail order by region limit 100;
    -- An error is returned because ORDER BY is not used together with LIMIT.
    select * from sale_detail order by region;
    -- ORDER BY is followed by a column alias.
    select region as r from sale_detail order by region limit 100;
    select region as r from sale_detail order by r limit 100;

LIMIT NUMBER

The number in a LIMIT clause is a constant that limits the number of output rows. A SELECT statement without the LIMIT clause can return a maximum of 10,000 rows. The maximum number of rows that can be returned varies based on projects. You can specify the maximum number by using the setproject command.