This topic describes the SELECT syntax in MaxCompute and the precautions for using SELECT statements to perform operations such as nested query, sorting, and group query.
For more information about the precautions for using
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ORDER BY order_condition] [DISTRIBUTE BY distribute_condition [SORT BY sort_condition] ] [LIMIT number]
SELECTstatements, see the following sections of this topic.
SELECTstatements retrieve data from tables. The columns to be read can be specified by column names. You can use an asterisk (
*) to represent all columns. The following is an example of a simple
select * from sale_detail;
shop_namecolumn from the
sale_detailtable, run the following statement:
select shop_name from sale_detail;
WHEREclause to set filter conditions.
select * from sale_detail where shop_name like 'hang%';
SELECT statements can only return up to 10,000 rows of results. No such restriction is imposed
SELECT is used as a clause. In this case, the
SELECT clause returns all results in response to the query from the upper layer.
- Full table scan is not allowed for partitioned tables when you run
For a new project created after 20:00 January 10, 2018, full table scan is not allowed for the partitioned tables in the project when SQL statements are run. To perform a full table scan on partitioned tables, you must set partition conditions to specify the partitions to scan. This reduces unnecessary I/O of SQL statements and saves computing resources. In addition, computing costs can be saved when the pay-as-you-go billing method is used, where the data input volume is one of the billing parameters.For example, a table is defined as
t1(c1,c2) partitioned by(ds). In a new project, the following statements are disabled and an error is returned.
select * from t1 where c1=1; select * from t1 where (ds='20180202' or c2=3); select * from t1 left outer join t2 on a.id =b.id and a.ds=b.ds and b.ds='20180101'); -- When join is performed, partition pruning takes effect if a partition pruning condition resides in a WHERE clause. If it resides in the ON clause, partition pruning takes effect in the secondary table while a full table scan is performed on the primary table.
To perform a full table scan on a partitioned table, you can insert
set odps.sql.allow.fullscan=true;before the SQL statement used for the full table scan, and commit and run it along with the SQL statement. Assume that
sale_detailis a partitioned table. To perform a full table scan on it, commit and run the following statements:
set odps.sql.allow.fullscan=true; select * from sale_detail;
- Nested subqueries are supported in the
table_referencepart. The following is an example:
select * from (select region from sale_detail) t where region = 'shanghai';Note Currently, the
SELECTstatements can only return up to 10,000 rows of results. No such restriction is imposed when
SELECTis used as a clause. In this case, the
SELECTclause returns all results in response to the query from the upper layer.
DISTINCTYou can use
DISTINCTbefore a field to eliminate its duplicate values and return only one value. If you use
ALLbefore a field, all duplicate values in the field are returned. If this option is not specified,
DISTINCTspecified, only one row of record is returned. The following are some examples:
select distinct region from sale_detail; select distinct region, sale_date from sale_detail; -- The DISTINCT option applies to multiple columns. The option takes effect on all columns of a SELECT statement, instead of a single column.
MaxCompute SQL allows you to use the regular expression
select_expr to select columns.
select_exprcan be used in the following methods:
SELECT `abc. *` FROM t;: Select all columns whose names start with
SELECT `(ds)? +. +` FROM t;: Select all columns whose names are not
SELECT `(ds|pt)? +. +` FROM t;: Select all columns except the columns whose names are
SELECT `(d. *)? +. +` FROM t;: Select all columns except the columns whose names start with
When you attempt to exclude multiple columns, if col2 is the prefix of col1, make
sure that col1 is written before col2. That is, the longer one is placed in the front.
For example, if you do not want to
select the two partitions
dshh in a table, run
SELECT `(dshh|ds)? +. +` FROM tbl; instead of
SELECT `(ds|dshh)? +. +` FROM tbl; .
WHEREThe following table lists the filter conditions supported by the
|>, <, =, >=, <=, and <>||Relational operators.|
|like and rlike||Make sure that the source and pattern parameters of like and rlike are of the string type.|
|in and not in||If a subquery is added after the condition in or not in, the result of only one column can be returned for the subquery and the number of returned values cannot exceed 1,000.|
WHEREclause of a
SELECTstatement, you can specify a partition range in a table to scan only the specified partitions. Take the following statement as an example:
SELECT sale_detail. * FROM sale_detail WHERE sale_detail.sale_date >= '2008' AND sale_detail.sale_date <= '2014';
explain selectstatement. A common UDF or the partition condition settings of
joinmay cause partition pruning to fail to take effect. For more information, see Reasonableness evaluation of partition pruning.
- Add an annotation to the UDF class when writing a UDF.
@com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)Note The UDF annotation
com.aliyun.odps.udf.annotation.UdfPropertyis defined in the odps-sdk-udf.jar file. You need to update the version of the referenced odps-sdk-udf to 0.30.x or later.
set odps.sql.udf.ppr.deterministic = true;before SQL statements. Then all UDFs in subsequent SQL statements are treated as
deterministic. This backfills partitions with execution results. A maximum of 1,000 partitions can be backfilled. If you add annotations to UDF classes, an error may be returned indicating that more than 1,000 backfilling results have been produced. To ignore this error, you can disable this feature globally by running
set odps.sql.udf.ppr.to.subquery = false;. After this feature is disabled, UDF-based partition pruning is also disabled.
WHERE clause in MaxCompute SQL statements supports the
between...and conditional query.
SELECT sale_detail. * FROM sale_detail WHERE sale_detail.sale_date BETWEEN '2008' AND '2014';
GROUP BYis used together with aggregate functions. The following rules must be observed when a
SELECTstatement contains aggregate functions:
- A key of
GROUP BYcan be the name of a column in an input table.
- It can also be an expression that is constituted by the names of columns in the input
table. It cannot be the alias of an output column in a
- The first rule is given a higher priority than the second one. The first rule prevails
if the two rules conflict, that is, the key of
GROUP BYis both a column or expression of an input table and the alias of an output column in a
set hive.groupby.position.alias=true;is inserted before the SQL statement, the integer constants in GROUP BY are treated as the column numbers of SELECT.
set hive.groupby.position.alias=true;-- Run it with the next SQL statement. select region, sum(total_price) from sale_detail group by 1;-- 1 indicates region, which is the first column of the SELECT statement. By grouping region values, this function returns the unique region value and total sales of each group.
select region from sale_detail group by region; -- Run successfully with a column in the input table directly used as the GROUP BY column. select sum(total_price) from sale_detail group by region; -- Run successfully with the table grouped by the region value and return the total sales of each group. select region, sum(total_price) from sale_detail group by region; -- Run successfully with the table grouped by the region value and return the unique region value and total sales of each group. select region as r from sale_detail group by r; -- Run with the alias of the SELECT column and return an error. select 2 + total_price as r from sale_detail group by 2 + total_price; -- Require a complete expression of the column. select region, total_price from sale_detail group by region; -- Return an error. All columns not using an aggregate function in the SELECT statement must exist in GROUP BY. select region, total_price from sale_detail group by region, total_price; -- Run successfully.
GROUP BY precedes
SELECT during the parsing of SQL statements. Therefore,
GROUP BY only uses the column names or expressions in the input table as keys.
- For more information about aggregate functions, see Aggregate functions.
ORDER BYsorts all data in sequence globally based on specified columns. To sort records in descending order, use the
ORDER BYmust be used together with
LIMITbecause records are sorted globally. In the
ORDER BYoperation, NULL is considered smaller than any value. This is consistent with MySQL, but different from Oracle.Different from
ORDER BYmust be followed by the alias of a column in
SELECT. When a column is included in a
SELECTstatement, its name is treated as its alias if you do not specify an alias for the column.
select * from sale_detail order by region; -- Return an error because ORDER BY is not used together with LIMIT. select * from sale_detail order by region limit 100; select region as r from sale_detail order by region limit 100; -- Return an error because ORDER BY is not followed by a column alias. select region as r from sale_detail order by r limit 100;
LIMIT NUMBERThe parameter
LIMIT NUMBERis a constant that limits the number of output rows. A maximum of 10,000 rows can be displayed on the screen if a
SELECTstatement without the
LIMIToption is run. The upper limit of records displayed on the screen varies with projects. It can be set by using the
ORDER BY, SORT BY, and DISTRIBUTE BY
DISTRIBUTE BYshards data based on hash values of specified columns, where the alias of an output column in a
SELECTstatement must be used.
select region from sale_detail distribute by region; -- Run successfully because the column name is an alias. select region as r from sale_detail distribute by region; -- Return an error because DISTRIBUTE BY is not followed by a column alias. select region as r from sale_detail distribute by r;
SORT BYis used for partial ordering, which must be preceded with
DISTRIBUTE BY. In practice,
SORT BYpartially sorts the results of
DISTRIBUTE BY. It must use the alias of an output column in the
select region from sale_detail distribute by region sort by region; select region as r from sale_detail sort by region; -- Return an error and exit because the statement does not follow a DISTRIBUTE BY statement.
GROUP BYcannot be used together with
SORT BY, and must use the alias of an output column in a
set hive.orderby.position.alias=true;is inserted before the SQL statement, the integer constants in ORDER BY are treated as the column numbers of SELECT.
--set flag set hive.orderby.position.alias=true; -- Create the src table. creat table src(key BIGINT,value BIGINT); -- Group by value and return all information about the src table. SELECT * FROM src ORDER BY 2 limit 100; is equivalent to: SELECT * FROM src ORDER BY value limit 100;
- OFFSET can be used together with the ORDER BY LIMIT clause to skip a number of rows
whose quantity is specified by OFFSET.
-- Sort the rows of the src table in ascending order by key, and return the 11th to 30th rows. OFFSET 10 indicates that the first 10 rows are skipped, and LIMIT 20 indicates that a maximum number of 20 rows can be returned. SELECT * FROM src ORDER BY key LIMIT 20 OFFSET 10;
- The key of
ORDER BY, SORT BY, or DISTRIBUTE BYmust be the alias of an output column in a
SELECTstatement. The column alias can be Chinese.
ORDER BY, SORT BY, and DISTRIBUTE BYduring the parsing of SQL statements in MaxCompute. Therefore, only the column names of
SELECTcan be used as keys.