This topic describes the SELECT syntax in MaxCompute and the precautions for using SELECT statements to perform operations such as nested query, sorting, and group query.

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY order_condition]
[DISTRIBUTE BY distribute_condition [SORT BY sort_condition] ]
[LIMIT number]
For more information about the precautions for using SELECT statements, see the following sections of this topic.

Column expression

The SELECT statements retrieve data from tables. The columns to be read can be specified by column names. You can use an asterisk (*) to represent all columns. The following is an example of a simple SELECT statement:
select * from sale_detail;
To retrieve only the shop_name column from the sale_detail table, run the following statement:
select shop_name from sale_detail;
You can use a WHERE clause to set filter conditions.
select * from sale_detail where shop_name like 'hang%';

Currently, the SELECT statements can only return up to 10,000 rows of results. No such restriction is imposed when SELECT is used as a clause. In this case, the SELECT clause returns all results in response to the query from the upper layer.

Usage notes:
  • Full table scan is not allowed for partitioned tables when you run SELECT statements.

    For a new project created after 20:00 January 10, 2018, full table scan is not allowed for the partitioned tables in the project when SQL statements are run. To perform a full table scan on partitioned tables, you must set partition conditions to specify the partitions to scan. This reduces unnecessary I/O of SQL statements and saves computing resources. In addition, computing costs can be saved when the pay-as-you-go billing method is used, where the data input volume is one of the billing parameters.

    For example, a table is defined as t1(c1,c2) partitioned by(ds). In a new project, the following statements are disabled and an error is returned.
    select * from t1 where c1=1;
    select * from t1 where (ds='20180202' or c2=3);
    select * from t1 left outer join t2 on a.id =b.id and a.ds=b.ds and b.ds='20180101');  
    -- When join is performed, partition pruning takes effect if a partition pruning condition resides in a WHERE clause. If it resides in the ON clause, partition pruning takes effect in the secondary table while a full table scan is performed on the primary table.

    To perform a full table scan on a partitioned table, you can insert set odps.sql.allow.fullscan=true; before the SQL statement used for the full table scan, and commit and run it along with the SQL statement. Assume that sale_detail is a partitioned table. To perform a full table scan on it, commit and run the following statements:

    set odps.sql.allow.fullscan=true;
    select * from sale_detail;
  • Nested subqueries are supported in the table_reference part. The following is an example:
    select * from (select region from sale_detail) t where region = 'shanghai';
    Note Currently, the SELECT statements can only return up to 10,000 rows of results. No such restriction is imposed when SELECT is used as a clause. In this case, the SELECT clause returns all results in response to the query from the upper layer.

DISTINCT

You can use DISTINCT before a field to eliminate its duplicate values and return only one value. If you use ALL before a field, all duplicate values in the field are returned. If this option is not specified, ALL is assumed.
With DISTINCT specified, only one row of record is returned. The following are some examples:
select distinct region from sale_detail;
select distinct region, sale_date from sale_detail;
-- The DISTINCT option applies to multiple columns. The option takes effect on all columns of a SELECT statement, instead of a single column.

SELECT_EXPR

MaxCompute SQL allows you to use the regular expression select_expr to select columns.

The regular expression select_expr can be used in the following methods:
  • SELECT `abc. *` FROM t;: Select all columns whose names start with abc from the t table.
  • SELECT `(ds)? +. +` FROM t;: Select all columns whose names are not ds from the t table.
  • SELECT `(ds|pt)? +. +` FROM t;: Select all columns except the columns whose names are ds and pt from the t table.
  • SELECT `(d. *)? +. +` FROM t;: Select all columns except the columns whose names start with d from the t table.
Note

When you attempt to exclude multiple columns, if col2 is the prefix of col1, make sure that col1 is written before col2. That is, the longer one is placed in the front. For example, if you do not want to select the two partitions ds and dshh in a table, run SELECT `(dshh|ds)? +. +` FROM tbl; instead of SELECT `(ds|dshh)? +. +` FROM tbl; .

WHERE

The following table lists the filter conditions supported by the WHERE clause.
Filter condition Description
>, <, =, >=, <=, and <> Relational operators.
like and rlike Make sure that the source and pattern parameters of like and rlike are of the string type.
in and not in If a subquery is added after the condition in or not in, the result of only one column can be returned for the subquery and the number of returned values cannot exceed 1,000.
In the WHERE clause of a SELECT statement, you can specify a partition range in a table to scan only the specified partitions. Take the following statement as an example:
SELECT sale_detail. * 
FROM sale_detail
WHERE sale_detail.sale_date >= '2008'
AND sale_detail.sale_date <= '2014';
Note You can check whether partition filtering takes effect by running the explain select statement. A common UDF or the partition condition settings of join may cause partition pruning to fail to take effect. For more information, see Reasonableness evaluation of partition pruning.
UDFs are run as small jobs and then replaced with the execution results. To enable UDF-based partition pruning, perform either of the following operations:
  • Add an annotation to the UDF class when writing a UDF.
    @com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
    Note The UDF annotation com.aliyun.odps.udf.annotation.UdfProperty is defined in the odps-sdk-udf.jar file. You need to update the version of the referenced odps-sdk-udf to 0.30.x or later.
  • Insert set odps.sql.udf.ppr.deterministic = true; before SQL statements. Then all UDFs in subsequent SQL statements are treated as deterministic. This backfills partitions with execution results. A maximum of 1,000 partitions can be backfilled. If you add annotations to UDF classes, an error may be returned indicating that more than 1,000 backfilling results have been produced. To ignore this error, you can disable this feature globally by running set odps.sql.udf.ppr.to.subquery = false;. After this feature is disabled, UDF-based partition pruning is also disabled.

The WHERE clause in MaxCompute SQL statements supports the between...and conditional query.

SELECT sale_detail. * 
FROM sale_detail 
WHERE sale_detail.sale_date BETWEEN '2008' AND '2014';

GROUP BY

Generally, GROUP BY is used together with aggregate functions. The following rules must be observed when a SELECT statement contains aggregate functions:
  1. A key of GROUP BY can be the name of a column in an input table.
  2. It can also be an expression that is constituted by the names of columns in the input table. It cannot be the alias of an output column in a SELECT statement.
  3. The first rule is given a higher priority than the second one. The first rule prevails if the two rules conflict, that is, the key of GROUP BY is both a column or expression of an input table and the alias of an output column in a SELECT statement.
  4. If set hive.groupby.position.alias=true; is inserted before the SQL statement, the integer constants in GROUP BY are treated as the column numbers of SELECT.
    set hive.groupby.position.alias=true;-- Run it with the next SQL statement.
    select region, sum(total_price) from sale_detail group by 1;-- 1 indicates region, which is the first column of the SELECT statement. By grouping region values, this function returns the unique region value and total sales of each group.
Examples
select region from sale_detail group by region;
-- Run successfully with a column in the input table directly used as the GROUP BY column.
select sum(total_price) from sale_detail group by region;
-- Run successfully with the table grouped by the region value and return the total sales of each group.
select region, sum(total_price) from sale_detail group by region;
-- Run successfully with the table grouped by the region value and return the unique region value and total sales of each group.
select region as r from sale_detail group by r;
 -- Run with the alias of the SELECT column and return an error.
select 2 + total_price as r from sale_detail group by 2 + total_price;
-- Require a complete expression of the column.
select region, total_price from sale_detail group by region;
-- Return an error. All columns not using an aggregate function in the SELECT statement must exist in GROUP BY.
select region, total_price from sale_detail group by region, total_price;
-- Run successfully.

Generally, GROUP BY precedes SELECT during the parsing of SQL statements. Therefore, GROUP BY only uses the column names or expressions in the input table as keys.

Note
  • For more information about aggregate functions, see Aggregate functions.
  • ORDER BY sorts all data in sequence globally based on specified columns. To sort records in descending order, use the desc keyword. ORDER BY must be used together with LIMIT because records are sorted globally. In the ORDER BY operation, NULL is considered smaller than any value. This is consistent with MySQL, but different from Oracle.
    Different from GROUP BY, ORDER BY must be followed by the alias of a column in SELECT. When a column is included in a SELECT statement, its name is treated as its alias if you do not specify an alias for the column.
    select * from sale_detail order by region;
    -- Return an error because ORDER BY is not used together with LIMIT.
    select * from sale_detail order by region limit 100;
    select region as r from sale_detail order by region limit 100;
    -- Return an error because ORDER BY is not followed by a column alias.
    select region as r from sale_detail order by r limit 100;

LIMIT NUMBER

The parameter number in LIMIT NUMBER is a constant that limits the number of output rows. A maximum of 10,000 rows can be displayed on the screen if a SELECT statement without the LIMIT option is run. The upper limit of records displayed on the screen varies with projects. It can be set by using the setproject command.

ORDER BY, SORT BY, and DISTRIBUTE BY

  • DISTRIBUTE BY shards data based on hash values of specified columns, where the alias of an output column in a SELECT statement must be used.
    select region from sale_detail distribute by region;
    -- Run successfully because the column name is an alias.
    select region as r from sale_detail distribute by region;
    -- Return an error because DISTRIBUTE BY is not followed by a column alias.
    select region as r from sale_detail distribute by r;
  • SORT BY is used for partial ordering, which must be preceded with DISTRIBUTE BY. In practice, SORT BY partially sorts the results of DISTRIBUTE BY. It must use the alias of an output column in the SELECT statement.
    select region from sale_detail distribute by region sort by region;
    select region as r from sale_detail sort by region;
    -- Return an error and exit because the statement does not follow a DISTRIBUTE BY statement.
  • ORDER BY and GROUP BY cannot be used together with DISTRIBUTE BY or SORT BY, and must use the alias of an output column in a SELECT statement.
  • If set hive.orderby.position.alias=true; is inserted before the SQL statement, the integer constants in ORDER BY are treated as the column numbers of SELECT.
    --set flag
    set hive.orderby.position.alias=true;
    -- Create the src table.
    creat table src(key BIGINT,value BIGINT);
    -- Group by value and return all information about the src table.
    SELECT * FROM src ORDER BY 2 limit 100;
    is equivalent to:
    SELECT * FROM src ORDER BY value limit 100;
  • OFFSET can be used together with the ORDER BY LIMIT clause to skip a number of rows whose quantity is specified by OFFSET.
    -- Sort the rows of the src table in ascending order by key, and return the 11th to 30th rows. OFFSET 10 indicates that the first 10 rows are skipped, and LIMIT 20 indicates that a maximum number of 20 rows can be returned.
    SELECT * FROM src ORDER BY key LIMIT 20 OFFSET 10;
Note
  • The key of ORDER BY, SORT BY, or DISTRIBUTE BY must be the alias of an output column in a SELECT statement. The column alias can be Chinese.
  • Generally, SELECT precedes ORDER BY, SORT BY, and DISTRIBUTE BY during the parsing of SQL statements in MaxCompute. Therefore, only the column names of SELECT can be used as keys.