MaxCompute lets you query data using SELECT statements. This topic describes the syntax of SELECT commands and how to perform operations such as nested queries, group queries, and sorting.
Before you execute SELECT statements, make sure that you have the `Select` permission on the destination table. For more information, see MaxCompute permissions.
You can execute the statements on the following platforms:
Features
SELECT statements are used to query data that meets the specified conditions from a table. The following table describes the query operations that can be performed in different scenarios.
Type | Feature |
Performs a further query on the results of a previous query. | |
Performs intersection, union, or complement operations on query result datasets. | |
Uses a | |
Filters data in the left table using the right table. The result set contains only data from the left table. | |
To improve query performance for a | |
When a JOIN operation on two tables has a hot spot, a long-tail issue can occur. To improve JOIN efficiency, extract the hot spot keys, process the hot spot data and non-hot spot data separately, and then merge the results. | |
Use Lateral View with a user-defined table-valued function (UDTF) to split a single row of data into multiple rows. | |
Aggregates and analyzes data from multiple dimensions. | |
| |
Modify the Split Size to control the concurrency level. | |
Delta tables support:
|
Limits
When a
SELECTstatement is executed, a maximum of 10,000 rows of results are displayed, and the size of the returned results cannot exceed 10 MB. This limit does not apply when aSELECTstatement is used as a subquery. In this case, theSELECTclause returns all results to the parent query.When you use a
SELECTstatement to query a partitioned table, full table scans are prohibited by default.For projects created after 20:00:00 on January 10, 2018, you cannot perform a full table scan on a partitioned table in the project by default. When you query data from a partitioned table, you must specify a partition. This practice reduces unnecessary I/O operations and computing resource consumption. It also reduces unnecessary computing costs if you use the pay-as-you-go billing method.
To perform a full table scan on a partitioned table, you can add the
SET odps.sql.allow.fullscan=true;command before the SQL statement and submit them together for execution. For example, if thesale_detailtable is a partitioned table, you must execute the following statements together to perform a full table query:SET odps.sql.allow.fullscan=true; SELECT * from sale_detail;When you query a clustered table, the current version optimizes bucket pruning only when 400 or fewer partitions are scanned in a single table. If bucket pruning does not take effect, more data is scanned. If you use the pay-as-you-go billing method, your costs increase. If you use the subscription billing method, the SQL computing performance decreases.
Syntax
[WITH <cte>[, ...] ]
SELECT [ALL | DISTINCT] <SELECT_expr>[, <EXCEPT_expr>][, <REPLACE_expr>] ...
FROM <TABLE_reference>
[WHERE <WHERE_condition>]
[GROUP BY {<col_list>|ROLLUP(<col_list>)}]
[HAVING <HAVING_condition>]
[WINDOW <WINDOW_clause>]
[ORDER BY <ORDER_condition>]
[DISTRIBUTE BY <DISTRIBUTE_condition> [SORT BY <SORT_condition>]|[ CLUSTER BY <CLUSTER_condition>] ]
[LIMIT <number>]For more information about the execution order of clauses in a SELECT statement, see Execution order of clauses in a SELECT statement.
Sample data
This topic provides source data and related examples to help you understand how to use the commands. The following sample commands create the `sale_detail` table and add data to it.
-- Create a partitioned table named sale_detail.
CREATE TABLE IF NOT EXISTS sale_detail
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
-- Add partitions to the sale_detail table.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china');
-- Insert data into the sale_detail table.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);The following example shows a command to query data in the partitioned table `sale_detail`:
SELECT * FROM sale_detail;
--The following result is returned.
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
+------------+-------------+-------------+------------+------------+WITH clause (CTE)
Optional. The `WITH` clause contains one or more common table expressions (CTEs). A CTE acts as a temporary table in the current runtime environment. You can reference this table in later queries. The following rules apply to CTEs:
CTEs within the same `WITH` clause must have unique names.
A CTE can be referenced only by other CTEs defined within the same `WITH` clause.
For example, assume that A is the first CTE and B is the second CTE in the same `WITH` clause:
The 'A references A' condition is invalid. The following is an example of an incorrect command.
WITH A AS (SELECT 1 FROM A) SELECT * FROM A;The following result is returned:
FAILED: ODPS-0130161:[1,6] Parse exception - recursive cte A is invalid, it must have an initial_part and a recursive_part, which must be connected by UNION ALLCircular references are not supported. For example, A cannot reference B if B references A. Incorrect command example:
WITH A AS (SELECT * FROM B ), B AS (SELECT * FROM A ) SELECT * FROM B;The following result is returned:
FAILED: ODPS-0130071:[1,26] Semantic analysis exception - while resolving view B - [1,51]recursive function call is not supported, cycle is A->B->AThe following example shows a correct command.
WITH A AS (SELECT 1 AS C), B AS (SELECT * FROM A) SELECT * FROM B;The following result is returned.
+---+ | c | +---+ | 1 | +---+
Column expression (SELECT_expr)
`select_expr` is required. SELECT_expr uses the format col1_name, col2_name, column expression,.... This expression specifies the columns, partition key columns, or regular expressions that you want to query. The following rules apply to `select_expr`:
You can specify the columns to read by name.
The following statement reads the
shop_namecolumn from thesale_detailtable.SELECT shop_name FROM sale_detail;The following result is returned.
+------------+ | shop_name | +------------+ | s1 | | s2 | | s3 | +------------+You can use an asterisk (
*) to represent all columns. You can also use it with aWHEREclause to specify filter conditions.The following command reads all columns from the
sale_detailtable.-- Enable a full table scan only for the current session. SET odps.sql.allow.fullscan=true; SELECT * FROM sale_detail;The following result is returned.
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s3 | c3 | 100.3 | 2013 | china | +------------+-------------+-------------+------------+------------+You can specify filter conditions in the
WHEREclause. The following statement provides an example.SELECT * FROM sale_detail WHERE shop_name='s1';The following result is returned.
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | +------------+-------------+-------------+------------+------------+
You can use regular expressions.
The following command selects all columns from the
sale_detailtable whose names start withsh.SELECT `sh.*` FROM sale_detail;The following result is returned.
+------------+ | shop_name | +------------+ | s1 | | s2 | | s3 | +------------+The following statement selects all columns whose names are not
shop_namefrom thesale_detailtable.SELECT `(shop_name)?+.+` FROM sale_detail;The following result is returned.
+-------------+-------------+------------+------------+ | customer_id | total_price | sale_date | region | +-------------+-------------+------------+------------+ | c1 | 100.1 | 2013 | china | | c2 | 100.2 | 2013 | china | | c3 | 100.3 | 2013 | china | +-------------+-------------+------------+------------+The following command selects all columns from the
sale_detailtable except forshop_nameandcustomer_id. The following statement provides an example.SELECT `(shop_name|customer_id)?+.+` FROM sale_detail;The following result is returned.
+-------------+------------+------------+ | total_price | sale_date | region | +-------------+------------+------------+ | 100.1 | 2013 | china | | 100.2 | 2013 | china | | 100.3 | 2013 | china | +-------------+------------+------------+The following command selects all columns from the
sale_detailtable except for those whose names start witht. The following command provides an example.SELECT `(t.*)?+.+` FROM sale_detail;The following result is returned.
+------------+-------------+------------+------------+ | shop_name | customer_id | sale_date | region | +------------+-------------+------------+------------+ | s1 | c1 | 2013 | china | | s2 | c2 | 2013 | china | | s3 | c3 | 2013 | china | +------------+-------------+------------+------------+NoteIf you want to exclude multiple columns and one column name is a prefix of another, you must place the longer column name before the shorter one in the regular expression. For example, a table has two partitions that you do not need to query. One partition is named
ds, and the other is nameddshh. Because `ds` is a prefix of `dshh`, the correct expression isSELECT `(dshh|ds)?+.+` FROM t;. The incorrect expression isSELECT `(ds|dshh)?+.+` FROM t;.
You can use the
DISTINCTkeyword before a column name to remove duplicate values and return only unique values. TheALLkeyword returns all values in the column, including duplicates. If you do not specify a keyword,ALLis used by default.The following example shows a command that queries the `region` column in the `sale_detail` table. If duplicate values exist, only one is displayed.
SELECT DISTINCT region FROM sale_detail;The following result is returned.
+------------+ | region | +------------+ | china | +------------+When you remove duplicates from multiple columns, the
DISTINCTkeyword applies to the entire set of columns in theSELECTclause, not to individual columns. The following statement is an example.SELECT DISTINCT region, sale_date FROM sale_detail;The following result is returned.
+------------+------------+ | region | sale_date | +------------+------------+ | china | 2013 | +------------+------------+You can use DISTINCT with a window function. In this case, DISTINCT removes duplicate results that are returned by the window function. The following statement provides an example:
SET odps.sql.allow.fullscan=true; SELECT DISTINCT sale_date, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_price) AS rn FROM sale_detail;The following result is returned.
+-----------+------------+ | sale_date | rn | +-----------+------------+ | 2013 | 1 | +-----------+------------+You cannot use DISTINCT and GROUP BY in the same query. For example, if you execute the following command, an error is returned.
SELECT DISTINCT shop_name FROM sale_detail GROUP BY shop_name; -- The error message "GROUP BY cannot be used with SELECT DISTINCT" is returned.
Exclude columns (EXCEPT_expr)
The EXCEPT_expr expression is optional and uses the EXCEPT(col1_name, col2_name, ...) format. You can use this expression to read data from most columns in a table while excluding a few specific columns. For example, the SELECT * EXCEPT(col1_name, col2_name, ...) FROM ...; statement reads data from all columns except for the `col1` and `col2` columns.
The following example shows a sample command.
-- Read data from all columns, except the region column, in the sale_detail table.
SELECT * EXCEPT(region) FROM sale_detail;The following result is returned.
+-----------+-------------+-------------+-----------+
| shop_name | customer_id | total_price | sale_date |
+-----------+-------------+-------------+-----------+
| s1 | c1 | 100.1 | 2013 |
| s2 | c2 | 100.2 | 2013 |
| s3 | c3 | 100.3 | 2013 |
+-----------+-------------+-------------+-----------+Replace columns (REPLACE_expr)
`replace_expr` is optional. REPLACE_expr uses the REPLACE(exp1 [as] col1_name, exp2 [as] col2_name, ...) format. You can use `replace_expr` to read data from most columns in a table and modify the data of a few columns. For example, you can execute the SELECT * REPLACE(exp1 as col1_name, exp2 as col2_name, ...) FROM ...; statement to replace the data in the `col1` column with the result of `exp1` and the data in the `col2` column with the result of `exp2`.
The following example shows a command that reads data from the `sale_detail` table and modifies the data in the `total_price` and `region` columns.
SELECT * REPLACE(total_price+100 AS total_price, 'shanghai' AS region) FROM sale_detail;The following result is returned.
+-----------+-------------+-------------+-----------+--------+
| shop_name | customer_id | total_price | sale_date | region |
+-----------+-------------+-------------+-----------+--------+
| s1 | c1 | 200.1 | 2013 | shanghai |
| s2 | c2 | 200.2 | 2013 | shanghai |
| s3 | c3 | 200.3 | 2013 | shanghai |
+-----------+-------------+-------------+-----------+--------+Target table information (TABLE_reference)
`table_reference` is required. TABLE_reference specifies the table that you want to query. The following rules apply to `table_reference`:
You can directly specify the target table name. The following example shows a sample command.
SELECT customer_id FROM sale_detail;The following result is returned.
+-------------+ | customer_id | +-------------+ | c1 | | c2 | | c3 | +-------------+You can use a nested subquery. The following example shows a sample command.
SELECT * FROM (SELECT region,sale_date FROM sale_detail) t WHERE region = 'china';The following result is returned.
+------------+------------+ | region | sale_date | +------------+------------+ | china | 2013 | | china | 2013 | | china | 2013 | +------------+------------+
WHERE clause (WHERE_condition)
`where_condition` is optional. The WHERE clause specifies a filter condition. If the `WHERE` clause is used for a partitioned table, partition pruning can be performed. The following rules apply to `where_condition`:
You can use relational operators to filter data that meets specified conditions. Relational operators include the following:
>,<,=,>=,<=, and<>LIKEandRLIKEINandNOT INBETWEEN…AND
For more information, see Relational operators.
In the
WHEREclause, you can specify a partition range to scan only a specific portion of the table and avoid a full table scan. The following command provides an example.SELECT * FROM sale_detail WHERE sale_date >= '2008' AND sale_date <= '2014'; -- The preceding statement is equivalent to the following statement: SELECT * FROM sale_detail WHERE sale_date BETWEEN '2008' AND '2014';The following result is returned.
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s3 | c3 | 100.3 | 2013 | china | +------------+-------------+-------------+------------+------------+NoteYou can run the EXPLAIN statement to check whether partition pruning takes effect. A user-defined function (UDF) or the partition condition settings of a JOIN operation may cause partition pruning to fail. For more information, see Check whether partition pruning is effective.
You can use a UDF to implement partition pruning. The UDF is first run as a small job. Then, the result of the job is used to replace the UDF in the original statement.
Implementation methods
When you write a UDF, you can add an annotation to the UDF class.
@com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)Notecom.aliyun.odps.udf.annotation.UdfPropertyis defined in the odps-sdk-udf.jar file. You must upgrade the version of the referenced odps-sdk-udf to 0.30.x or later.You can add
SET odps.sql.udf.ppr.deterministic = true;before the SQL statement that you want to execute. Then, all UDFs in the SQL statement are considereddeterministicUDFs. The preceding `SET` command backfills partitions with the job results. A maximum of 1,000 partitions can be backfilled with the job results. If you add an annotation to a UDF class, an error that indicates more than 1,000 partitions are backfilled may be returned. To ignore this error, you can run theSET odps.sql.udf.ppr.to.subquery = false;command. After you run this command, UDF-based partition pruning is no longer in effect.
Notes
For UDF-based partition pruning to take effect, the UDF must be in the
WHEREclause.The following example shows how to correctly use a UDF for partition pruning.
--The UDF must be placed in the WHERE clause of the source table: SELECT key, value FROM srcp WHERE udf(ds) = 'xx';The following example shows how to incorrectly use a UDF for partition pruning.
--Partition pruning does not take effect for conditions in the JOIN ON clause. SELECT A.c1, A.c2 FROM srcp1 A JOIN srcp2 B ON A.c1 = B.c1 AND udf(A.ds) ='xx';
If a column in a Column expression (SELECT_expr) uses a function and is renamed using an alias, the alias cannot be referenced in the
WHEREclause. Incorrect example:SELECT task_name ,inst_id ,settings ,GET_JSON_OBJECT(settings, '$.SKYNET_ID') AS skynet_id ,GET_JSON_OBJECT(settings, '$.SKYNET_NODENAME') AS user_agent FROM Information_Schema.TASKS_HISTORY WHERE ds = '20211215' AND skynet_id IS NOT NULL LIMIT 10;
GROUP BY clause for grouped queries (col_list)
`GROUP BY` is optional. In most cases, GROUP BY is used with aggregate functions to group rows based on specified columns, partition key columns, or regular expressions. The following rules apply to GROUP BY:
The
GROUP BYclause is executed before theSELECTclause. Therefore, the columns in theGROUP BYclause can be specified by the column names of the input table for theSELECTstatement, or by an expression that is formed from the columns of that table. When you use the GROUP BY clause, note the following points:If you use a regular expression in the
GROUP BYclause, you must use the complete expression for the columns.The
GROUP BY ALLoperation is supported. To use this operation, setodps.sql.bigquery.compatible=true;to enable the BigQuery compatible mode.All columns in the
SELECTlist that are not encapsulated in an aggregate function must be included in theGROUP BYclause.
Usage examples:
You can directly use the `region` column in the
GROUP BYclause to group data by the values in this column. The following statement provides an example.SELECT region FROM sale_detail GROUP BY region;The following result is returned.
+------------+ | region | +------------+ | china | +------------+The following command groups data by `region` and returns the total sales for each group. The following example shows a sample command.
SELECT SUM(total_price) FROM sale_detail GROUP BY region;The following result is returned.
+------------+ | _c0 | +------------+ | 300.6 | +------------+The following command groups data by `region` and returns the unique `region` value and the total sales for each group. The following example shows a sample command.
SELECT region, SUM (total_price) FROM sale_detail GROUP BY region;The following result is returned.
+------------+------------+ | region | _c1 | +------------+------------+ | china | 300.6 | +------------+------------+The following command provides an example of grouping by the alias of a column in a
SELECTstatement.SELECT region AS r FROM sale_detail GROUP BY r; -- The preceding statement is equivalent to the following statement: SELECT region AS r FROM sale_detail GROUP BY region;The following result is returned.
+------------+ | r | +------------+ | china | +------------+You can group by a column expression. The following example shows a sample command.
SELECT 2 + total_price AS r FROM sale_detail GROUP BY 2 + total_price;The following result is returned.
+------------+ | r | +------------+ | 102.1 | | 102.2 | | 102.3 | +------------+All columns in the
SELECTlist that are not used in an aggregate function must be included in theGROUP BYclause. Otherwise, an error is returned. Incorrect command example:-- Error: FAILED: ODPS-0130071:[1,16] Semantic analysis exception - column reference sale_detail.total_price should appear in GROUP BY key SELECT region, total_price FROM sale_detail GROUP BY region;The following example shows a correct command.
SELECT region, total_price FROM sale_detail GROUP BY region, total_price;The following result is returned.
+------------+-------------+ | region | total_price | +------------+-------------+ | china | 100.1 | | china | 100.2 | | china | 100.3 | +------------+-------------+When you set
odps.sql.bigquery.compatible=true;to enable BigQuery compatible mode, you can use theGROUP BY ALLsyntax to automatically group by all query fields.-- Explicitly list all fields for grouping. SELECT shop_name, customer_id, sale_date, region, SUM(total_price) AS total_sales FROM sale_detail GROUP BY shop_name, customer_id, sale_date, region; -- Use GROUP BY ALL to group by all fields, which is equivalent to explicitly listing all fields for grouping. SET odps.sql.bigquery.compatible=true; SELECT shop_name, customer_id, sale_date, region, SUM(total_price) AS total_sales FROM sale_detail GROUP BY ALL;The following result is returned.
+-----------+-------------+-----------+--------+-------------+ | shop_name | customer_id | sale_date | region | total_sales | +-----------+-------------+-----------+--------+-------------+ | s1 | c1 | 2013 | china | 100.1 | | s2 | c2 | 2013 | china | 100.2 | | s3 | c3 | 2013 | china | 100.3 | +-----------+-------------+-----------+--------+-------------+
If you run the
SET odps.sql.groupby.position.alias=true;command, integer constants in theGROUP BYclause are treated as ordinal numbers of columns in theSELECTlist. Sample statement:-- Run this command with the following SELECT statement. SET odps.sql.groupby.position.alias=true; -- 1 indicates the region column, which is the first column read by the following SELECT statement. This statement groups table data based on the values of the region column and returns distinct values of the region column and total sales of each group. SELECT region, SUM(total_price) FROM sale_detail GROUP BY 1;The following result is returned.
+------------+------------+ | region | _c1 | +------------+------------+ | china | 300.6 | +------------+------------+
HAVING clause (HAVING_condition)
The `HAVING` clause is optional. The HAVING clause is typically used with aggregate functions to filter grouped data. Sample statement:
-- Insert data into the sale_detail table to display the data rendering effect.
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
-- Use having_condition with aggregate functions to filter data.
SELECT region,SUM(total_price) FROM sale_detail
GROUP BY region
HAVING SUM(total_price)<305;The following result is returned.
+------------+------------+
| region | _c1 |
+------------+------------+
| china | 300.6 |
| shanghai | 200.9 |
+------------+------------+ORDER BY for global sorting (ORDER_condition)
`order_condition` is optional. ORDER BY sorts all data records based on a specified column, partition key column, or constant. The following rules apply to ORDER BY:
By default, data is sorted in ascending order. To sort data in descending order, you must use the
DESCkeyword.By default, the
ORDER BYclause must be used with aLIMITclause to restrict the number of output rows. If you do not include aLIMITclause, an error is returned. To remove the restriction that theORDER BYclause must be used with theLIMITclause, see Restricting output rows with LIMIT > Removing the LIMIT requirement for ORDER BY.The following command queries data from the `sale_detail` table and sorts the results by `total_price` in ascending order, returning the top two rows. The following example shows a sample command.
SELECT * FROM sale_detail ORDER BY total_price LIMIT 2;The following result is returned.
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | +------------+-------------+-------------+------------+------------+The following command queries data from the `sale_detail` table and sorts the results by `total_price` in descending order, returning the top two rows. The following example shows a sample command.
SELECT * FROM sale_detail ORDER BY total_price DESC LIMIT 2;The following result is returned.
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s3 | c3 | 100.3 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | +------------+-------------+-------------+------------+------------+
When you use
ORDER BYto sort data, `NULL` values are treated as the smallest values. This behavior is consistent with MySQL but different from Oracle.The following command queries data from the `sale_detail` table and sorts the results by `total_price` in ascending order, returning the top two rows. The following example shows a sample command.
SELECT * FROM sale_detail ORDER BY total_price LIMIT 2;The following result is returned.
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | +------------+-------------+-------------+------------+------------+The
ORDER BYclause can reference the alias of a column from theSELECTlist. If you do not specify an alias for a column in theSELECTlist, the column name is used as its alias.The
ORDER BYclause can be followed by a column alias. The following statement provides an example:SELECT total_price AS t FROM sale_detail ORDER BY total_price LIMIT 3; -- The preceding statement is equivalent to the following statement: SELECT total_price AS t FROM sale_detail ORDER BY t LIMIT 3;The following result is returned.
+------------+ | t | +------------+ | 100.1 | | 100.2 | | 100.3 | +------------+If you run the
SET odps.sql.orderby.position.alias=true;command, integer constants in theORDER BYclause are treated as ordinal numbers of columns in theSELECTlist. The following statement provides an example:-- Run this command with the following SELECT statement. SET odps.sql.orderby.position.alias=true; SELECT * FROM sale_detail ORDER BY 3 LIMIT 3;The following result is returned.
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s3 | c3 | 100.3 | 2013 | china | +------------+-------------+-------------+------------+------------+The
OFFSETclause can be used with theORDER BY...LIMITclause to specify the number of rows to skip. The syntax isORDER BY...LIMIT m OFFSET n, which can be abbreviated asORDER BY...LIMIT n, m. In this syntax,LIMIT mspecifies that `m` rows are returned, andOFFSET nspecifies that the first `n` rows of the sorted result are skipped.OFFSET 0has the same effect as omitting the `OFFSET` clause.The following command sorts the `sale_detail` table by `total_price` in ascending order and then outputs three rows, starting from the third row. The following example shows a sample command.
SELECT customer_id,total_price FROM sale_detail ORDER BY total_price LIMIT 3 OFFSET 2; -- The preceding statement is equivalent to the following statement: SELECT customer_id,total_price FROM sale_detail ORDER BY total_price LIMIT 2, 3;The following result is returned.
+-------------+-------------+ | customer_id | total_price | +-------------+-------------+ | c3 | 100.3 | +-------------+-------------+Because only one row remains after skipping the first two rows, the result contains only one row, even though the limit is 3.
You can use Range Clustering to accelerate global sorting. In typical `ORDER BY` scenarios, all data must be merged and processed on a single instance to ensure global order. This approach prevents you from taking advantage of parallel processing. You can use the `PARTITION` step of Range Clustering to perform concurrent global sorting. This process first samples the data and divides it into ranges, and then sorts each range concurrently to produce a globally sorted result. For more information, see Global sorting acceleration.
DISTRIBUTE BY for hash sharding (DISTRIBUTE_condition)
`distribute_condition` is optional. DISTRIBUTE BY performs hash sharding on data based on the values of specified columns.
DISTRIBUTE BY controls how the output of a map task is partitioned among reduce tasks. If you want to process data from the same group together or prevent content from overlapping in reducers, you can use DISTRIBUTE BY to ensure that rows with the same key are sent to the same reducer.
You must specify the alias of an output column from the SELECT statement. If you execute a SELECT statement to query data of a column and the alias of this column is not specified, the column name is used as the alias. Sample statement:
-- The following statement queries the values of the region column from the sale_detail table and performs hash sharding on data based on the values of the region column.
SELECT region FROM sale_detail DISTRIBUTE BY region;
-- The preceding statement is equivalent to the following statements:
SELECT region AS r FROM sale_detail DISTRIBUTE BY region;
SELECT region AS r FROM sale_detail DISTRIBUTE BY r;The following result is returned.
+------------+
| r |
+------------+
| china |
| china |
| china |
+------------+SORT BY for local sorting (SORT_condition)
This optional clause is typically used with DISTRIBUTE BY. The following rules apply to SORT BY:
By default,
SORT BYsorts data in ascending order. If you want to sort data in descending order, theDESCkeyword is required.If
SORT BYis used withDISTRIBUTE BY,SORT BYsorts the data within each partition created byDISTRIBUTE BY.The following command queries the values of the `region` and `total_price` columns from the `sale_detail` table, performs hash sharding on the results based on the `region` values, and then locally sorts the hash-sharded results by `total_price` in ascending order. The following example shows a sample command.
-- Insert data into the sale_detail table to display the data rendering effect. INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5); ---- Modify the number of Workers for each Reduce stage. SET odps.stage.reducer.num=2; SELECT region,total_price FROM sale_detail DISTRIBUTE BY region SORT BY total_price;The following result is returned.
+------------+-------------+ | region | total_price | +------------+-------------+ | shanghai | NULL | | shanghai | 100.4 | | shanghai | 100.5 | | china | 100.1 | | china | 100.2 | | china | 100.3 | +------------+-------------+The following command queries the values of the `region` and `total_price` columns from the `sale_detail` table, performs hash sharding on the results based on the `region` values, and then locally sorts the hash-sharded results by `total_price` in descending order. The following example shows a sample command.
-- Change the number of Workers for each Reduce stage. SET odps.stage.reducer.num=2; SELECT region,total_price FROM sale_detail DISTRIBUTE BY region SORT BY total_price DESC;The following result is returned.
+------------+-------------+ | region | total_price | +------------+-------------+ | shanghai | 100.5 | | shanghai | 100.4 | | shanghai | NULL | | china | 100.3 | | china | 100.2 | | china | 100.1 | +------------+-------------+
If
SORT BYis used withoutDISTRIBUTE BY,SORT BYsorts the data within each reducer.This ensures that the output data of each reducer is ordered, which can improve the storage compression ratio. It can also reduce the amount of data read from the disk during filtering and improve the efficiency of subsequent global sorting operations. The following example shows a sample command.
-- Set the number of Workers for each Reduce stage. SET odps.stage.reducer.num=2; SELECT region,total_price FROM sale_detail SORT BY total_price DESC;The following result is returned.
+------------+-------------+ | region | total_price | +------------+-------------+ | shanghai | 100.5 | | shanghai | 100.4 | | china | 100.3 | | china | 100.2 | | china | 100.1 | | shanghai | NULL | +------------+-------------+
Columns in an
ORDER BY, DISTRIBUTE BY, or SORT BYclause must be specified by the aliases of the output columns in aSELECTstatement. Column aliases can be specified in Chinese.The
ORDER BY,DISTRIBUTE BY, and `SORT BY` clauses are executed after the `SELECT` operation. Therefore, the columns in these clauses must be output columns from theSELECTstatement.You cannot use
ORDER BYtogether withDISTRIBUTE BYorSORT BY. Similarly, you cannot useGROUP BYtogether withDISTRIBUTE BYorSORT BY.
LIMIT to restrict the number of output rows (number)
`LIMIT ` is optional. The number in LIMIT is a constant that restricts the number of output rows. The value of `number` is a 32-bit integer, and the maximum value is 2,147,483,647.
The LIMIT clause filters data after a distributed scan. Therefore, using LIMIT does not reduce the amount of scanned data or lower computing costs.
If you encounter the following scenarios, you can refer to the corresponding solutions:
The
ORDER BYclause can now be used without theLIMITclause.Because
ORDER BYperforms a global sort on a single execution node, aLIMITrestriction is enforced by default. This prevents misuse that can cause a single node to process a large amount of data. If your scenario requires you to useORDER BYwithout theLIMITrestriction, you can use one of the following methods:Project level: Run the
SETPROJECT odps.sql.validate.orderby.limit=false;command to disable the restriction thatORDER BYmust be used withLIMIT.Session level: Set
SET odps.sql.validate.orderby.limit=false;to disable the requirement thatORDER BYclauses must be used withLIMITclauses. Submit this command together with the SQL statement.NoteIf you disable the requirement to use
ORDER BYwithLIMIT, sorting large amounts of data on a single execution node will consume more resources and take longer.
Removing the screen display limit
If you run a
SELECTstatement without aLIMITclause, or if theNUMBERspecified in theLIMITclause exceeds the configured display limit (n), the results window displays a maximum of n rows.The screen display limit can vary for each project. You can control this limit using the following methods:
If data protection is disabled for the project, you must modify the `odpscmd_config.ini` file.
Set
use_instance_tunnel=truein the `odpscmd_config.ini` file. If theinstance_tunnel_max_recordparameter is not configured, the number of rows that can be displayed is unlimited. Otherwise, the number of rows that can be displayed is limited by the value of theinstance_tunnel_max_recordparameter. The maximum value of theinstance_tunnel_max_recordparameter is 10,000 rows. For more information about Instance Tunnel, see Usage notes.If data protection is enabled for the project, the number of displayable rows is limited by the
READ_TABLE_MAX_ROWparameter. The maximum value of this parameter is 10,000.
NoteYou can run the
SHOW SecurityConfiguration;command to view the configuration of theProjectProtectionproperty. IfProjectProtection=true, you must determine whether to disable the data protection mechanism based on your project's data protection requirements. To disable the mechanism, run theSET ProjectProtection=false;command. By default, theProjectProtectionproperty is not enabled. For more information about the project data protection mechanism, see Data protection mechanism.
WINDOW clause (window_clause)
For more information about the window clause, see Window function syntax.