This topic describes the SELECT syntax in MaxCompute and the precautions that you must take when you execute SELECT statements to perform operations such as nested queries, sorting, and grouping.

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY order_condition]
[DISTRIBUTE BY distribute_condition [SORT BY sort_condition] ]
[LIMIT number]
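
The following is a brief sketch that combines several of these clauses, using the sale_detail table that appears throughout this topic; the column names are taken from the later examples:
SELECT region, SUM(total_price) AS total
FROM sale_detail
WHERE sale_date >= '2013'
GROUP BY region
ORDER BY total
LIMIT 100;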

Limits

  • A SELECT statement can return a maximum of 10,000 rows of results. This limit does not apply to a SELECT clause that serves as a subquery. A subquery returns all of its results to the upper-layer query.
  • You are not allowed to perform full scans on partitioned tables by using SELECT statements.

    If your project was created after 20:00 on January 10, 2018, you are not allowed to perform full scans on partitioned tables in your project by using SQL statements, unless otherwise specified. To query data in partitioned tables, you must specify the partitions that you want to scan, as shown in the second example below. This reduces unnecessary I/O and saves computing resources. It also reduces your computing costs if you use the pay-as-you-go billing method.

    To perform a full scan on a partitioned table, add the set odps.sql.allow.fullscan=true; command before the SQL statement that is used for the full table scan. Then, commit and run the added command with the SQL statement. Assume that the sale_detail table is a partitioned table. To perform a full scan on this table, execute the following statements:
    set odps.sql.allow.fullscan=true;
    SELECT * FROM sale_detail;
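    For comparison, the following sketch assumes that sale_detail is partitioned by the sale_date column. If you specify the partitions to scan in the WHERE clause, the full-scan flag is not required:
    SELECT * FROM sale_detail WHERE sale_date = '201312';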

select_expr

Each select_expr expression indicates a column from which you want to read data. You can use the following select_expr expressions to read data from a table.
  • Specify the names of the columns from which you want to read data. For example, to read the shop_name column from the sale_detail table, execute the following statement:
    SELECT shop_name FROM sale_detail;
  • Use an asterisk (*) to represent all columns. For example, to read all columns from the sale_detail table, execute the following statement:
    SELECT * FROM sale_detail;
    You can also use a WHERE clause to specify filter conditions.
    SELECT * FROM sale_detail WHERE shop_name LIKE 'hang%';
  • Use a regular expression. Examples:
    • SELECT `abc.*` FROM t;: Select all columns whose names start with abc from the t table.
    • SELECT `(ds)?+.+` FROM t;: Select all columns except the ds column from the t table.
    • SELECT `(ds|pt)?+.+` FROM t;: Select all columns except the ds and pt columns from the t table.
    • SELECT `(d.*)?+.+` FROM t;: Select all columns except the columns whose names start with d from the t table.
    Note

    If you want to exclude multiple columns and the name of one column is a prefix of the name of another, place the longer name before the shorter name in the regular expression. For example, if you want to exclude the ds and dshh partitions from your query of a table, execute SELECT `(dshh|ds)?+.+` FROM tbl; instead of SELECT `(ds|dshh)?+.+` FROM tbl;.

  • Use DISTINCT before the name of a column to filter out duplicate values from the column and return only distinct values. If you use ALL before the name of a column, all values, including the duplicate values, in the column are returned. If DISTINCT is not used, ALL is used by default.
    Examples:
    -- Query data from the region column in the sale_detail table and return only distinct values.
    SELECT DISTINCT region FROM sale_detail;
    +------------+
    | region     |
    +------------+
    | shanghai   |
    +------------+
    -- If you use DISTINCT in a SELECT statement where multiple columns are specified, DISTINCT applies to all the columns you specified, instead of a single column.
    SELECT DISTINCT region, sale_date FROM sale_detail;
    +------------+------------+
    | region     | sale_date  |
    +------------+------------+
    | shanghai   | 20191110   |
    +------------+------------+
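    The following sketch shows the ALL keyword against the same sample data; unlike DISTINCT, it keeps duplicate values:
    -- Query data from the region column in the sale_detail table and return all values, including duplicates.
    SELECT ALL region FROM sale_detail;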

table_reference

table_reference specifies the table that you want to query. You can also use table_reference to specify a nested subquery. Example:
SELECT * FROM (SELECT region FROM sale_detail) t WHERE region = 'shanghai';

where_condition

The following operators can be used as filter conditions in where_condition:
  • >, <, =, >=, <=, and <>: relational operators.
  • LIKE and RLIKE: the source and pattern parameters of LIKE and RLIKE must be of the STRING type.
  • IN and NOT IN: if a subquery follows the IN or NOT IN operator, the subquery can return values from only one column, and the number of returned values cannot exceed 1,000. See the last example in the Examples section below.
  • BETWEEN…AND: specifies a query range.
Examples
  • In where_condition of a SELECT statement, you can specify the partitions that you want to scan in a table. Example:
    SELECT sale_detail.*
    FROM sale_detail
    WHERE sale_detail.sale_date >= '2008'
    AND sale_detail.sale_date <= '2014';
    Note You can execute the EXPLAIN SELECT statement to check whether partition pruning takes effect. A common user-defined function (UDF) or the partition condition settings of JOIN may cause partition pruning to fail. For more information, see Check whether partition pruning is effective.
  • UDFs support partition pruning. UDFs are executed as small jobs and then replaced with the execution results. You can enable UDF-based partition pruning by using one of the following methods:
    • Add an annotation to the UDF class when you write a UDF.
      @com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
      Note The UDF annotation com.aliyun.odps.udf.annotation.UdfProperty is defined in the odps-sdk-udf.jar file. You must update the version of the referenced odps-sdk-udf to 0.30.X or later.
    • Add set odps.sql.udf.ppr.deterministic=true; before the SQL statements to execute. This way, all UDFs in the SQL statements are considered deterministic UDFs. This SET statement backfills partitions with execution results, and a maximum of 1,000 partitions can be backfilled. If you add an annotation to a UDF class, an error may be returned if more than 1,000 partitions are backfilled with execution results. To ignore this error, you can run the set odps.sql.udf.ppr.to.subquery=false; command. After you run this command, UDF-based partition pruning is no longer in effect.
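    The following sketch shows the second method; the my_normalize_date UDF is hypothetical and is assumed to return a STRING value:
    -- Treat all UDFs in this statement as deterministic so that partition pruning can be applied to the sale_date partition column.
    set odps.sql.udf.ppr.deterministic=true;
    SELECT * FROM sale_detail WHERE my_normalize_date(sale_date) >= '2013';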
  • Use the BETWEEN…AND operator to filter data. Example:
    SELECT sale_detail.*
    FROM sale_detail 
    WHERE sale_detail.sale_date BETWEEN '2008' AND '2014';
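  • Use the IN operator with a subquery to filter data. As described in the operator list above, the subquery can return values from only one column, and the number of returned values cannot exceed 1,000. The following sketch assumes a sale_detail_backup table that contains a region column:
    SELECT sale_detail.*
    FROM sale_detail
    WHERE sale_detail.region IN (SELECT region FROM sale_detail_backup);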

GROUP BY

GROUP BY is typically used with aggregate functions. If a SELECT statement contains aggregate functions, the following rules apply:
  • During the parsing of SQL statements, GROUP BY precedes SELECT. Therefore, the key of GROUP BY must be the name or expression of a column in the input table in the SELECT statement. You are not allowed to specify the alias of an output column in the SELECT statement as the key of GROUP BY.
  • If the key of GROUP BY matches both the name or expression of a column in the input table and the alias of an output column in the SELECT statement, the name or expression of the column in the input table is used as the key of GROUP BY.
  • If set hive.groupby.position.alias=true; is added before a SELECT statement, the integer constants in GROUP BY are considered column IDs in the SELECT statement.
    set hive.groupby.position.alias=true;-- Run this command with the next SELECT statement.
    SELECT region, SUM(total_price) FROM sale_detail GROUP BY 1;-- 1 indicates the region column, which is the first column read by the SELECT statement. This statement groups the table data based on the values in the region column and returns distinct region values and total sales of each group.
Examples
-- The statement uses the name of the region column in the input table as the key of GROUP BY and groups the table data based on the values in the region column.
SELECT region FROM sale_detail GROUP BY region;
-- The statement groups the table data based on the values in the region column and returns the total sales of each group.
SELECT SUM(total_price) FROM sale_detail GROUP BY region;
-- The statement groups the table data based on the values in the region column and returns distinct values and total sales of each group.
SELECT region, SUM(total_price) FROM sale_detail GROUP BY region;
-- An error is returned when the alias of the column in the SELECT statement is used as the key of the GROUP BY clause.
SELECT region AS r FROM sale_detail GROUP BY r;
-- A complete expression of the column is required.
SELECT 2 + total_price AS r FROM sale_detail GROUP BY 2 + total_price;
-- An error is returned when a column that does not use aggregate functions in the SELECT statement is not included in the GROUP BY clause.
SELECT region, total_price FROM sale_detail GROUP BY region;
-- The SELECT statement can be executed when all the columns that do not use aggregate functions are included in the GROUP BY clause.
SELECT region, total_price FROM sale_detail GROUP BY region, total_price;     

ORDER BY|DISTRIBUTE BY|SORT BY

  • ORDER BY
    ORDER BY is used to sort all data records based on the specified columns.
    • To sort data records in descending order, use the DESC keyword. By default, data records are sorted in ascending order.
    • If you use ORDER BY to sort data records, NULL is considered the smallest value. This is also the case in MySQL, but not in Oracle.
    • ORDER BY must be followed by the alias of a specific column in the SELECT statement. If a column is included in the SELECT statement but you do not specify an alias for the column, the name of the column is considered its alias.
    • If set hive.orderby.position.alias=true; is added before a SELECT statement, the integer constants in ORDER BY are considered column IDs in the SELECT statement.
      -- Set the flag.
      set hive.orderby.position.alias=true;
      -- Create the src table.
      CREATE table src(key BIGINT, value BIGINT);
      -- Query the src table, sort the records in ascending order by value, and return the first 100 records.
      SELECT * FROM src ORDER BY 2 LIMIT 100;
      The preceding statement is equivalent to the following statement:
      SELECT * FROM src ORDER BY value LIMIT 100;
    • You can use an OFFSET clause with an ORDER BY LIMIT clause to skip a specific number of rows.
      -- Sort the rows of the src table in ascending order by key, and return the eleventh to thirtieth rows. OFFSET 10 indicates that the first 10 rows are skipped, and LIMIT 20 indicates that a maximum of 20 rows can be returned.
      SELECT * FROM src ORDER BY key LIMIT 20 OFFSET 10;
    Examples
    -- Query data records in the sale_detail table, sort them in ascending order by region, and return the first 100 records.
    SELECT * FROM sale_detail ORDER BY region LIMIT 100;
    -- By default, ORDER BY must be used together with LIMIT. If ORDER BY is used without LIMIT, an error is returned. If you want to remove this limit, see Remove the limit on simultaneous execution of ORDER BY and LIMIT.
    SELECT * FROM sale_detail ORDER BY region;
    -- ORDER BY is followed by the alias of a column.
    SELECT region AS r FROM sale_detail ORDER BY region LIMIT 100;
    SELECT region AS r FROM sale_detail ORDER BY r LIMIT 100;
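    The following sketch shows the DESC keyword against the same table. Because NULL is treated as the smallest value, NULL values, if any, appear last in descending order:
    SELECT * FROM sale_detail ORDER BY region DESC LIMIT 100;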
  • DISTRIBUTE BY

    DISTRIBUTE BY is used to shard data based on the hash values of specified columns. You must specify the alias of an output column of the SELECT statement as the key of DISTRIBUTE BY.

    Examples
    -- The statement queries values in the region column of the sale_detail table and shards data based on the hash values of the region column.
    SELECT region FROM sale_detail DISTRIBUTE BY region;
    -- The statement can be executed because the column name region is considered an alias of the column.
    SELECT region AS r FROM sale_detail DISTRIBUTE BY region;
    The preceding statement is equivalent to the following statement:
    SELECT region AS r FROM sale_detail DISTRIBUTE BY r;
  • SORT BY
    SORT BY is used to perform a partial (local) sort of data records. You can add the ASC or DESC keyword to specify the sort order. If ASC is used, data records are sorted in ascending order. If DESC is used, data records are sorted in descending order. If SORT BY is not followed by a keyword, data records are sorted in ascending order by default.
    • If SORT BY is preceded by DISTRIBUTE BY, SORT BY sorts the results of DISTRIBUTE BY within each shard. DISTRIBUTE BY determines how the map output values are distributed among reducers. If you want to prevent duplicate output values from being distributed to different reducers, or if you want to process the same group of data together, you can use DISTRIBUTE BY to ensure that the same group of data is distributed to the same reducer, and then use SORT BY to sort the data within the group. Example:
      -- The statement queries values in the region column of the sale_detail table, shards data based on the hash values of the region column, and then sorts the sharded data.
      SELECT region FROM sale_detail DISTRIBUTE BY region SORT BY region;
    • If SORT BY is not preceded by DISTRIBUTE BY, SORT BY sorts the data of each reducer. This is a partial (local) sort, which ensures that the output of each reducer is ordered and increases the storage compression ratio. If data is filtered during data reading, this method also reduces the amount of data read from disks and improves the efficiency of subsequent global sorting.
    Example
    SELECT region FROM sale_detail SORT BY region;
Note
  • The value of ORDER BY, DISTRIBUTE BY, or SORT BY must be the alias of an output column in the SELECT statement. The column alias can contain Chinese characters.
  • During the parsing of MaxCompute SQL statements, the ORDER BY, DISTRIBUTE BY, or SORT BY clause is executed after the SELECT statement. Therefore, the value of ORDER BY, DISTRIBUTE BY, or SORT BY must be the alias of an output column in the SELECT statement.
  • ORDER BY cannot be used with DISTRIBUTE BY or SORT BY. Similarly, GROUP BY cannot be used with DISTRIBUTE BY or SORT BY.

LIMIT

The number in the LIMIT clause is a constant that limits the number of rows returned.

Note LIMIT scans and filters data based on a distributed system. When you use LIMIT, the amount of data that is scanned and processed is not reduced. Therefore, computing costs are not reduced.
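
A minimal sketch, using the sale_detail table from the preceding examples:
-- Return at most 10 rows.
SELECT * FROM sale_detail LIMIT 10;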

Remove the limit on simultaneous execution of ORDER BY and LIMIT

ORDER BY sorts all data of a single execution node. By default, ORDER BY is used with LIMIT to prevent a single node from processing large amounts of data. You can remove the limit on simultaneous execution of ORDER BY with LIMIT from a project or session.
  • To remove the limit from a project, run the setproject odps.sql.validate.orderby.limit=false; command.
  • To remove the limit from a session, commit and run the set odps.sql.validate.orderby.limit=false; command with the SQL statement, as shown in the example after this list.
    Note After the limit is removed, if a single execution node has large amounts of data to sort, more resources and time are consumed.
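
The following sketch shows the session-level method, using the sale_detail table from the preceding examples. The flag is committed together with a query that uses ORDER BY without LIMIT:
set odps.sql.validate.orderby.limit=false;
SELECT * FROM sale_detail ORDER BY region;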

Limits on the number of rows displayed on the screen

If you execute a SELECT statement without a LIMIT clause, or if the number in the LIMIT clause exceeds the maximum number of rows that can be displayed on the screen, at most the maximum number of rows is displayed on the screen.

The maximum number of rows that can be displayed on the screen varies based on projects. You can use one of the following methods to control the maximum number:
  • If project data protection is disabled, modify the odpscmd config.ini file.

    Set use_instance_tunnel to true in the odpscmd config.ini file. If the instance_tunnel_max_record parameter is not configured, the number of rows displayed on the screen is not limited. Otherwise, the number of rows displayed is limited by the instance_tunnel_max_record parameter. The maximum value of the instance_tunnel_max_record parameter is 10000. For more information about InstanceTunnel, see Tunnel command usage.
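    A minimal sketch of the relevant lines in config.ini, with instance_tunnel_max_record set to its documented maximum of 10000; other settings in the file are omitted:
    use_instance_tunnel=true
    instance_tunnel_max_record=10000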

  • If project data protection is enabled, the number of rows that can be displayed on the screen is limited by READ_TABLE_MAX_ROW. The maximum value of this parameter is 10000.
Note You can run the show SecurityConfiguration; command to view the value of ProjectProtection. If ProjectProtection is set to true, you can determine whether to disable project data protection as required. You can run the set ProjectProtection=false; command to disable project data protection. By default, ProjectProtection is set to false. For more information about project data protection, see Project data protection.