Introduction to SELECT syntax - MaxCompute - Alibaba Cloud Documentation Center

MaxCompute lets you query data using SELECT statements. This topic describes the syntax of SELECT commands and how to perform operations such as nested queries, group queries, and sorting.

Before you execute SELECT statements, make sure that you have the `Select` permission on the destination table. For more information, see MaxCompute permissions.

You can execute the statements on the following platforms:

Features

SELECT statements are used to query data that meets the specified conditions from a table. The following table describes the query operations that can be performed in different scenarios.

Type	Feature
Subqueries	Performs a further query on the results of a previous query.
INTERSECT, UNION, and EXCEPT	Performs intersection, union, or complement operations on query result datasets.
JOIN	Uses a `JOIN` operation to join tables and return data that meets the join and query conditions.
SEMI JOIN	Filters data in the left table using the right table. The result set contains only data from the left table.
MAPJOIN HINT	To improve query performance for a `JOIN` operation on a large table and one or more small tables, use the `SELECT` statement to explicitly specify a `MAPJOIN` hint.
SKEWJOIN HINT	When a JOIN operation on two tables has a hot spot, a long-tail issue can occur. To improve JOIN efficiency, extract the hot spot keys, process the hot spot data and non-hot spot data separately, and then merge the results.
LATERAL VIEW	Use Lateral View with a user-defined table-valued function (UDTF) to split a single row of data into multiple rows.
GROUPING SETS	Aggregates and analyzes data from multiple dimensions.
SELECT TRANSFORM	`SELECT TRANSFORM` syntax starts a specified child process, sends formatted input data to the process through standard input, and parses the standard output from the process to retrieve the output data.
Split Size hint	Modify the Split Size to control the concurrency level.
Time travel queries and incremental queries	Delta tables support: Time Travel queries to query historical snapshots of the source table from a specific point in time or version. Incremental queries to query historical incremental data from the source table within a specified time or version range.

Limits

When a SELECT statement is executed, a maximum of 10,000 rows of results are displayed, and the size of the returned results cannot exceed 10 MB. This limit does not apply when a SELECT statement is used as a subquery. In this case, the SELECT clause returns all results to the parent query.
When you use a SELECT statement to query a partitioned table, full table scans are prohibited by default.
For projects created after 20:00:00 on January 10, 2018, you cannot perform a full table scan on a partitioned table in the project by default. When you query data from a partitioned table, you must specify a partition. This practice reduces unnecessary I/O operations and computing resource consumption. It also reduces unnecessary computing costs if you use the pay-as-you-go billing method.
To perform a full table scan on a partitioned table, you can add the SET odps.sql.allow.fullscan=true; command before the SQL statement and submit them together for execution. For example, if the sale_detail table is a partitioned table, you must execute the following statements together to perform a full table query:
```
SET odps.sql.allow.fullscan=true;
SELECT * from sale_detail;
```
When you query a clustered table, the current version optimizes bucket pruning only when 400 or fewer partitions are scanned in a single table. If bucket pruning does not take effect, more data is scanned. If you use the pay-as-you-go billing method, your costs increase. If you use the subscription billing method, the SQL computing performance decreases.

Syntax

[WITH <cte>[, ...] ]
SELECT [ALL | DISTINCT] <SELECT_expr>[, <EXCEPT_expr>][, <REPLACE_expr>] ...
       FROM <TABLE_reference>
       [WHERE <WHERE_condition>]
       [GROUP BY {<col_list>|ROLLUP(<col_list>)}]
       [HAVING <HAVING_condition>]
       [WINDOW <WINDOW_clause>]
       [ORDER BY <ORDER_condition>]
       [DISTRIBUTE BY <DISTRIBUTE_condition> [SORT BY <SORT_condition>]|[ CLUSTER BY <CLUSTER_condition>] ]
       [LIMIT <number>]

For more information about the execution order of clauses in a SELECT statement, see Execution order of clauses in a SELECT statement.

Sample data

This topic provides source data and related examples to help you understand how to use the commands. The following sample commands create the `sale_detail` table and add data to it.

-- Create a partitioned table named sale_detail.
CREATE TABLE IF NOT EXISTS sale_detail
(
shop_name     STRING,
customer_id   STRING,
total_price   DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);

-- Add partitions to the sale_detail table.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china');

-- Insert data into the sale_detail table.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);

The following example shows a command to query data in the partitioned table `sale_detail`:

SELECT * FROM sale_detail;
--The following result is returned.
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

WITH clause (CTE)

Optional. The `WITH` clause contains one or more common table expressions (CTEs). A CTE acts as a temporary table in the current runtime environment. You can reference this table in later queries. The following rules apply to CTEs:

CTEs within the same `WITH` clause must have unique names.
A CTE can be referenced only by other CTEs defined within the same `WITH` clause.
For example, assume that A is the first CTE and B is the second CTE in the same `WITH` clause:
- The 'A references A' condition is invalid. The following is an example of an incorrect command.
```
WITH 
A AS (SELECT 1 FROM A) 
SELECT * FROM A;
```
  The following result is returned:
```
FAILED: ODPS-0130161:[1,6] Parse exception - recursive cte A is invalid, it must have an initial_part and a recursive_part, which must be connected by UNION ALL
```
- Circular references are not supported. For example, A cannot reference B if B references A. Incorrect command example:
```
WITH 
A AS (SELECT * FROM B ), 
B AS (SELECT * FROM A ) 
SELECT * FROM B;
```
  The following result is returned:
```
FAILED: ODPS-0130071:[1,26] Semantic analysis exception - while resolving view B - [1,51]recursive function call is not supported, cycle is A->B->A
```
- The following example shows a correct command.
```
WITH 
A AS (SELECT 1 AS C),
B AS (SELECT * FROM A) 
SELECT * FROM B;
```
  The following result is returned.
```
+---+
| c |
+---+
| 1 |
+---+
```

Column expression (SELECT_expr)

`select_expr` is required. SELECT_expr uses the format col1_name, col2_name, column expression,.... This expression specifies the columns, partition key columns, or regular expressions that you want to query. The following rules apply to `select_expr`:

You can specify the columns to read by name.
The following statement reads the shop_name column from the sale_detail table.
```
SELECT shop_name FROM sale_detail;
```
The following result is returned.
```
+------------+
| shop_name  |
+------------+
| s1         |
| s2         |
| s3         |
+------------+
```

You can use an asterisk (*) to represent all columns. You can also use it with a WHERE clause to specify filter conditions.

The following command reads all columns from the sale_detail table.

-- Enable a full table scan only for the current session.
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;

The following result is returned.

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

You can specify filter conditions in the WHERE clause. The following statement provides an example.

SELECT * FROM sale_detail WHERE shop_name='s1';

The following result is returned.

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

You can use regular expressions.

The following command selects all columns from the sale_detail table whose names start with sh.

SELECT `sh.*` FROM sale_detail;

The following result is returned.

+------------+
| shop_name  |
+------------+
| s1         |
| s2         |
| s3         |
+------------+

The following statement selects all columns whose names are not shop_name from the sale_detail table.

SELECT `(shop_name)?+.+` FROM sale_detail;

The following result is returned.

+-------------+-------------+------------+------------+
| customer_id | total_price | sale_date  | region     |
+-------------+-------------+------------+------------+
| c1          | 100.1       | 2013       | china      |
| c2          | 100.2       | 2013       | china      |
| c3          | 100.3       | 2013       | china      |
+-------------+-------------+------------+------------+

The following command selects all columns from the sale_detail table except for shop_name and customer_id. The following statement provides an example.

SELECT `(shop_name|customer_id)?+.+` FROM sale_detail;

The following result is returned.

+-------------+------------+------------+
| total_price | sale_date  | region     |
+-------------+------------+------------+
| 100.1       | 2013       | china      |
| 100.2       | 2013       | china      |
| 100.3       | 2013       | china      |
+-------------+------------+------------+

The following command selects all columns from the sale_detail table except for those whose names start with t. The following command provides an example.
```
SELECT `(t.*)?+.+` FROM sale_detail;
```
The following result is returned.
```
+------------+-------------+------------+------------+
| shop_name  | customer_id | sale_date  | region     |
+------------+-------------+------------+------------+
| s1         | c1          | 2013       | china      |
| s2         | c2          | 2013       | china      |
| s3         | c3          | 2013       | china      |
+------------+-------------+------------+------------+
```
Note
If you want to exclude multiple columns and one column name is a prefix of another, you must place the longer column name before the shorter one in the regular expression. For example, a table has two partitions that you do not need to query. One partition is named ds, and the other is named dshh. Because `ds` is a prefix of `dshh`, the correct expression is SELECT `(dshh|ds)?+.+` FROM t;. The incorrect expression is SELECT `(ds|dshh)?+.+` FROM t; .

You can use the DISTINCT keyword before a column name to remove duplicate values and return only unique values. The ALL keyword returns all values in the column, including duplicates. If you do not specify a keyword, ALL is used by default.
- The following example shows a command that queries the `region` column in the `sale_detail` table. If duplicate values exist, only one is displayed.
```
SELECT DISTINCT region FROM sale_detail;
```
  The following result is returned.
```
+------------+
| region     |
+------------+
| china      |
+------------+
```
- When you remove duplicates from multiple columns, the DISTINCT keyword applies to the entire set of columns in the SELECT clause, not to individual columns. The following statement is an example.
```
SELECT DISTINCT region, sale_date FROM sale_detail;
```
  The following result is returned.
```
+------------+------------+
| region     | sale_date  |
+------------+------------+
| china      | 2013       |
+------------+------------+
```
- You can use DISTINCT with a window function. In this case, DISTINCT removes duplicate results that are returned by the window function. The following statement provides an example:
```
SET odps.sql.allow.fullscan=true;
SELECT DISTINCT sale_date, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_price) AS rn FROM sale_detail;
```
  The following result is returned.
```
+-----------+------------+
| sale_date | rn         |
+-----------+------------+
| 2013      | 1          |
+-----------+------------+
```
  You cannot use DISTINCT and GROUP BY in the same query. For example, if you execute the following command, an error is returned.
```
SELECT DISTINCT shop_name FROM sale_detail GROUP BY shop_name;
-- The error message "GROUP BY cannot be used with SELECT DISTINCT" is returned.
```

Exclude columns (EXCEPT_expr)

The EXCEPT_expr expression is optional and uses the EXCEPT(col1_name, col2_name, ...) format. You can use this expression to read data from most columns in a table while excluding a few specific columns. For example, the SELECT * EXCEPT(col1_name, col2_name, ...) FROM ...; statement reads data from all columns except for the `col1` and `col2` columns.

The following example shows a sample command.

-- Read data from all columns, except the region column, in the sale_detail table.
SELECT * EXCEPT(region) FROM sale_detail;

The following result is returned.

+-----------+-------------+-------------+-----------+
| shop_name | customer_id | total_price | sale_date |
+-----------+-------------+-------------+-----------+
| s1        | c1          | 100.1       | 2013      |
| s2        | c2          | 100.2       | 2013      |
| s3        | c3          | 100.3       | 2013      |
+-----------+-------------+-------------+-----------+

Replace columns (REPLACE_expr)

`replace_expr` is optional. REPLACE_expr uses the REPLACE(exp1 [as] col1_name, exp2 [as] col2_name, ...) format. You can use `replace_expr` to read data from most columns in a table and modify the data of a few columns. For example, you can execute the SELECT * REPLACE(exp1 as col1_name, exp2 as col2_name, ...) FROM ...; statement to replace the data in the `col1` column with the result of `exp1` and the data in the `col2` column with the result of `exp2`.

The following example shows a command that reads data from the `sale_detail` table and modifies the data in the `total_price` and `region` columns.

SELECT * REPLACE(total_price+100 AS total_price, 'shanghai' AS region) FROM sale_detail;

The following result is returned.

+-----------+-------------+-------------+-----------+--------+
| shop_name | customer_id | total_price | sale_date | region |
+-----------+-------------+-------------+-----------+--------+
| s1        | c1          | 200.1       | 2013      | shanghai |
| s2        | c2          | 200.2       | 2013      | shanghai |
| s3        | c3          | 200.3       | 2013      | shanghai |
+-----------+-------------+-------------+-----------+--------+

Target table information (TABLE_reference)

`table_reference` is required. TABLE_reference specifies the table that you want to query. The following rules apply to `table_reference`:

You can directly specify the target table name. The following example shows a sample command.

SELECT customer_id FROM sale_detail;

The following result is returned.

+-------------+
| customer_id |
+-------------+
| c1          |
| c2          |
| c3          |
+-------------+

You can use a nested subquery. The following example shows a sample command.

SELECT * FROM (SELECT region,sale_date FROM sale_detail) t WHERE region = 'china';

The following result is returned.

+------------+------------+
| region     | sale_date  |
+------------+------------+
| china      | 2013       |
| china      | 2013       |
| china      | 2013       |
+------------+------------+

WHERE clause (WHERE_condition)

`where_condition` is optional. The WHERE clause specifies a filter condition. If the `WHERE` clause is used for a partitioned table, partition pruning can be performed. The following rules apply to `where_condition`:

You can use relational operators to filter data that meets specified conditions. Relational operators include the following:
- >, <, =, >=, <=, and <>
- LIKE and RLIKE
- IN and NOT IN
- BETWEEN…AND
For more information, see Relational operators.
In the WHERE clause, you can specify a partition range to scan only a specific portion of the table and avoid a full table scan. The following command provides an example.
```
SELECT * 
FROM sale_detail
WHERE sale_date >= '2008' AND sale_date <= '2014';

-- The preceding statement is equivalent to the following statement:
SELECT * 
FROM sale_detail 
WHERE sale_date BETWEEN '2008' AND '2014';
```
The following result is returned.
```
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+
```
Note
You can run the EXPLAIN statement to check whether partition pruning takes effect. A user-defined function (UDF) or the partition condition settings of a JOIN operation may cause partition pruning to fail. For more information, see Check whether partition pruning is effective.
You can use a UDF to implement partition pruning. The UDF is first run as a small job. Then, the result of the job is used to replace the UDF in the original statement.
- Implementation methods
  - When you write a UDF, you can add an annotation to the UDF class.
```
@com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
```
    Note
    com.aliyun.odps.udf.annotation.UdfProperty is defined in the odps-sdk-udf.jar file. You must upgrade the version of the referenced odps-sdk-udf to 0.30.x or later.
  - You can add SET odps.sql.udf.ppr.deterministic = true; before the SQL statement that you want to execute. Then, all UDFs in the SQL statement are considered deterministic UDFs. The preceding `SET` command backfills partitions with the job results. A maximum of 1,000 partitions can be backfilled with the job results. If you add an annotation to a UDF class, an error that indicates more than 1,000 partitions are backfilled may be returned. To ignore this error, you can run the SET odps.sql.udf.ppr.to.subquery = false; command. After you run this command, UDF-based partition pruning is no longer in effect.
- Notes
  For UDF-based partition pruning to take effect, the UDF must be in the WHERE clause.
  - The following example shows how to correctly use a UDF for partition pruning.
```
--The UDF must be placed in the WHERE clause of the source table:
SELECT key, value FROM srcp WHERE udf(ds) = 'xx';
```
  - The following example shows how to incorrectly use a UDF for partition pruning.
```
--Partition pruning does not take effect for conditions in the JOIN ON clause.
SELECT A.c1, A.c2 FROM srcp1 A  JOIN srcp2  B ON A.c1 = B.c1 AND udf(A.ds) ='xx';
```

If a column in a Column expression (SELECT_expr) uses a function and is renamed using an alias, the alias cannot be referenced in the WHERE clause. Incorrect example:

SELECT  task_name
        ,inst_id
        ,settings
        ,GET_JSON_OBJECT(settings, '$.SKYNET_ID') AS skynet_id
        ,GET_JSON_OBJECT(settings, '$.SKYNET_NODENAME') AS user_agent
FROM    Information_Schema.TASKS_HISTORY
WHERE   ds = '20211215' AND skynet_id IS NOT NULL
LIMIT 10;

GROUP BY clause for grouped queries (col_list)

`GROUP BY` is optional. In most cases, GROUP BY is used with aggregate functions to group rows based on specified columns, partition key columns, or regular expressions. The following rules apply to GROUP BY:

The GROUP BY clause is executed before the SELECT clause. Therefore, the columns in the GROUP BY clause can be specified by the column names of the input table for the SELECT statement, or by an expression that is formed from the columns of that table. When you use the GROUP BY clause, note the following points:

If you use a regular expression in the GROUP BY clause, you must use the complete expression for the columns.
The GROUP BY ALL operation is supported. To use this operation, set odps.sql.bigquery.compatible=true; to enable the BigQuery compatible mode.
All columns in the SELECT list that are not encapsulated in an aggregate function must be included in the GROUP BY clause.

Usage examples:

You can directly use the `region` column in the GROUP BY clause to group data by the values in this column. The following statement provides an example.
```
SELECT region FROM sale_detail GROUP BY region;
```
The following result is returned.
```
+------------+
| region     |
+------------+
| china      |
+------------+
```
The following command groups data by `region` and returns the total sales for each group. The following example shows a sample command.
```
SELECT SUM(total_price) FROM sale_detail GROUP BY region;
```
The following result is returned.
```
+------------+
| _c0        |
+------------+
| 300.6      |
+------------+
```

The following command groups data by `region` and returns the unique `region` value and the total sales for each group. The following example shows a sample command.

SELECT region, SUM (total_price) FROM sale_detail GROUP BY region;

The following result is returned.

+------------+------------+
| region     | _c1        |
+------------+------------+
| china      | 300.6      |
+------------+------------+

The following command provides an example of grouping by the alias of a column in a SELECT statement.

SELECT region AS r FROM sale_detail GROUP BY r;

-- The preceding statement is equivalent to the following statement:
SELECT region AS r FROM sale_detail GROUP BY region;

The following result is returned.

+------------+
| r          |
+------------+
| china      |
+------------+

You can group by a column expression. The following example shows a sample command.

SELECT 2 + total_price AS r FROM sale_detail GROUP BY 2 + total_price;

The following result is returned.

+------------+
| r          |
+------------+
| 102.1      |
| 102.2      |
| 102.3      |
+------------+

All columns in the SELECT list that are not used in an aggregate function must be included in the GROUP BY clause. Otherwise, an error is returned. Incorrect command example:

-- Error: FAILED: ODPS-0130071:[1,16] Semantic analysis exception - column reference sale_detail.total_price should appear in GROUP BY key
SELECT region, total_price FROM sale_detail GROUP BY region;

The following example shows a correct command.

SELECT region, total_price FROM sale_detail GROUP BY region, total_price;

The following result is returned.

+------------+-------------+
| region     | total_price |
+------------+-------------+
| china      | 100.1       |
| china      | 100.2       |
| china      | 100.3       |
+------------+-------------+

When you set odps.sql.bigquery.compatible=true; to enable BigQuery compatible mode, you can use the GROUP BY ALL syntax to automatically group by all query fields.

-- Explicitly list all fields for grouping.
SELECT 
    shop_name, 
    customer_id, 
    sale_date, 
    region, 
    SUM(total_price) AS total_sales
FROM sale_detail
GROUP BY shop_name, customer_id, sale_date, region;

-- Use GROUP BY ALL to group by all fields, which is equivalent to explicitly listing all fields for grouping.
SET odps.sql.bigquery.compatible=true; 
SELECT 
    shop_name, 
    customer_id, 
    sale_date, 
    region, 
    SUM(total_price) AS total_sales
FROM sale_detail
GROUP BY ALL;

The following result is returned.

+-----------+-------------+-----------+--------+-------------+
| shop_name | customer_id | sale_date | region | total_sales |
+-----------+-------------+-----------+--------+-------------+
| s1        | c1          | 2013      | china  | 100.1       |
| s2        | c2          | 2013      | china  | 100.2       |
| s3        | c3          | 2013      | china  | 100.3       |
+-----------+-------------+-----------+--------+-------------+

If you run the SET odps.sql.groupby.position.alias=true; command, integer constants in the GROUP BY clause are treated as ordinal numbers of columns in the SELECT list. Sample statement:

-- Run this command with the following SELECT statement. 
SET odps.sql.groupby.position.alias=true;
-- 1 indicates the region column, which is the first column read by the following SELECT statement. This statement groups table data based on the values of the region column and returns distinct values of the region column and total sales of each group. 
SELECT region, SUM(total_price) FROM sale_detail GROUP BY 1;

The following result is returned.

+------------+------------+
| region     | _c1        |
+------------+------------+
| china      | 300.6      |
+------------+------------+

HAVING clause (HAVING_condition)

The `HAVING` clause is optional. The HAVING clause is typically used with aggregate functions to filter grouped data. Sample statement:

-- Insert data into the sale_detail table to display the data rendering effect.
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);

-- Use having_condition with aggregate functions to filter data.
SELECT region,SUM(total_price) FROM sale_detail 
GROUP BY region 
HAVING SUM(total_price)<305;

The following result is returned.

+------------+------------+
| region     | _c1        |
+------------+------------+
| china      | 300.6      |
| shanghai   | 200.9      |
+------------+------------+

ORDER BY for global sorting (ORDER_condition)

`order_condition` is optional. ORDER BY sorts all data records based on a specified column, partition key column, or constant. The following rules apply to ORDER BY:

By default, data is sorted in ascending order. To sort data in descending order, you must use the DESC keyword.

By default, the ORDER BY clause must be used with a LIMIT clause to restrict the number of output rows. If you do not include a LIMIT clause, an error is returned. To remove the restriction that the ORDER BY clause must be used with the LIMIT clause, see Restricting output rows with LIMIT > Removing the LIMIT requirement for ORDER BY.

The following command queries data from the `sale_detail` table and sorts the results by `total_price` in ascending order, returning the top two rows. The following example shows a sample command.

SELECT * FROM sale_detail ORDER BY total_price LIMIT 2;

The following result is returned.

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

The following command queries data from the `sale_detail` table and sorts the results by `total_price` in descending order, returning the top two rows. The following example shows a sample command.

SELECT * FROM sale_detail ORDER BY total_price DESC LIMIT 2;

The following result is returned.

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s3         | c3          | 100.3       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

When you use ORDER BY to sort data, `NULL` values are treated as the smallest values. This behavior is consistent with MySQL but different from Oracle.

The following command queries data from the `sale_detail` table and sorts the results by `total_price` in ascending order, returning the top two rows. The following example shows a sample command.

SELECT * FROM sale_detail ORDER BY total_price LIMIT 2;

The following result is returned.

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

The ORDER BY clause can reference the alias of a column from the SELECT list. If you do not specify an alias for a column in the SELECT list, the column name is used as its alias.

The ORDER BY clause can be followed by a column alias. The following statement provides an example:

SELECT total_price AS t FROM sale_detail ORDER BY total_price LIMIT 3;

-- The preceding statement is equivalent to the following statement:
SELECT total_price AS t FROM sale_detail ORDER BY t LIMIT 3;

The following result is returned.

+------------+
| t          |
+------------+
| 100.1      |
| 100.2      |
| 100.3      |
+------------+

If you run the SET odps.sql.orderby.position.alias=true; command, integer constants in the ORDER BY clause are treated as ordinal numbers of columns in the SELECT list. The following statement provides an example:

-- Run this command with the following SELECT statement.
SET odps.sql.orderby.position.alias=true;
SELECT * FROM sale_detail ORDER BY 3 LIMIT 3;

The following result is returned.

+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

The OFFSET clause can be used with the ORDER BY...LIMIT clause to specify the number of rows to skip. The syntax is ORDER BY...LIMIT m OFFSET n, which can be abbreviated as ORDER BY...LIMIT n, m. In this syntax, LIMIT m specifies that `m` rows are returned, and OFFSET n specifies that the first `n` rows of the sorted result are skipped. OFFSET 0 has the same effect as omitting the `OFFSET` clause.
The following command sorts the `sale_detail` table by `total_price` in ascending order and then outputs three rows, starting from the third row. The following example shows a sample command.
```
SELECT customer_id,total_price FROM sale_detail ORDER BY total_price LIMIT 3 OFFSET 2;

-- The preceding statement is equivalent to the following statement:
SELECT customer_id,total_price FROM sale_detail ORDER BY total_price LIMIT 2, 3;
```
The following result is returned.
```
+-------------+-------------+
| customer_id | total_price |
+-------------+-------------+
| c3          | 100.3       |
+-------------+-------------+
```
Because only one row remains after skipping the first two rows, the result contains only one row, even though the limit is 3.
You can use Range Clustering to accelerate global sorting. In typical `ORDER BY` scenarios, all data must be merged and processed on a single instance to ensure global order. This approach prevents you from taking advantage of parallel processing. You can use the `PARTITION` step of Range Clustering to perform concurrent global sorting. This process first samples the data and divides it into ranges, and then sorts each range concurrently to produce a globally sorted result. For more information, see Global sorting acceleration.

DISTRIBUTE BY for hash sharding (DISTRIBUTE_condition)

`distribute_condition` is optional. DISTRIBUTE BY performs hash sharding on data based on the values of specified columns.

DISTRIBUTE BY controls how the output of a map task is partitioned among reduce tasks. If you want to process data from the same group together or prevent content from overlapping in reducers, you can use DISTRIBUTE BY to ensure that rows with the same key are sent to the same reducer.

You must specify the alias of an output column from the SELECT statement. If you execute a SELECT statement to query data of a column and the alias of this column is not specified, the column name is used as the alias. Sample statement:

-- The following statement queries the values of the region column from the sale_detail table and performs hash sharding on data based on the values of the region column.
SELECT region FROM sale_detail DISTRIBUTE BY region;

-- The preceding statement is equivalent to the following statements:
SELECT region AS r FROM sale_detail DISTRIBUTE BY region;
SELECT region AS r FROM sale_detail DISTRIBUTE BY r;

The following result is returned.

+------------+
| r          | 
+------------+
| china      | 
| china      | 
| china      | 
+------------+

SORT BY for local sorting (SORT_condition)

This optional clause is typically used with DISTRIBUTE BY. The following rules apply to SORT BY:

By default, SORT BY sorts data in ascending order. If you want to sort data in descending order, the DESC keyword is required.

If SORT BY is used with DISTRIBUTE BY, SORT BY sorts the data within each partition created by DISTRIBUTE BY.

The following command queries the values of the `region` and `total_price` columns from the `sale_detail` table, performs hash sharding on the results based on the `region` values, and then locally sorts the hash-sharded results by `total_price` in ascending order. The following example shows a sample command.

-- Insert data into the sale_detail table to display the data rendering effect.
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);

---- Modify the number of Workers for each Reduce stage.
SET odps.stage.reducer.num=2;
SELECT region,total_price FROM sale_detail DISTRIBUTE BY region SORT BY total_price;

The following result is returned.

+------------+-------------+
| region     | total_price | 
+------------+-------------+
| shanghai   | NULL        | 
| shanghai   | 100.4       | 
| shanghai   | 100.5       | 
| china      | 100.1       | 
| china      | 100.2       | 
| china      | 100.3       | 
+------------+-------------+

The following command queries the values of the `region` and `total_price` columns from the `sale_detail` table, performs hash sharding on the results based on the `region` values, and then locally sorts the hash-sharded results by `total_price` in descending order. The following example shows a sample command.

-- Change the number of Workers for each Reduce stage.
SET odps.stage.reducer.num=2;
SELECT region,total_price FROM sale_detail DISTRIBUTE BY region SORT BY total_price DESC;

The following result is returned.

+------------+-------------+
| region     | total_price | 
+------------+-------------+
| shanghai   | 100.5       | 
| shanghai   | 100.4       | 
| shanghai   | NULL        | 
| china      | 100.3       | 
| china      | 100.2       | 
| china      | 100.1       | 
+------------+-------------+

If SORT BY is used without DISTRIBUTE BY, SORT BY sorts the data within each reducer.

This ensures that the output data of each reducer is ordered, which can improve the storage compression ratio. It can also reduce the amount of data read from the disk during filtering and improve the efficiency of subsequent global sorting operations. The following example shows a sample command.

-- Set the number of Workers for each Reduce stage.
SET odps.stage.reducer.num=2;
SELECT region,total_price FROM sale_detail SORT BY total_price DESC;

The following result is returned.

+------------+-------------+
| region     | total_price | 
+------------+-------------+
| shanghai   | 100.5       | 
| shanghai   | 100.4       | 
| china      | 100.3       | 
| china      | 100.2       | 
| china      | 100.1       | 
| shanghai   | NULL        | 
+------------+-------------+

Note

Columns in an ORDER BY, DISTRIBUTE BY, or SORT BY clause must be specified by the aliases of the output columns in a SELECT statement. Column aliases can be specified in Chinese.
The ORDER BY, DISTRIBUTE BY, and `SORT BY` clauses are executed after the `SELECT` operation. Therefore, the columns in these clauses must be output columns from the SELECT statement.
You cannot use ORDER BY together with DISTRIBUTE BY or SORT BY. Similarly, you cannot use GROUP BY together with DISTRIBUTE BY or SORT BY.

LIMIT to restrict the number of output rows (number)

`LIMIT ` is optional. The number in LIMIT is a constant that restricts the number of output rows. The value of `number` is a 32-bit integer, and the maximum value is 2,147,483,647.

Note

The LIMIT clause filters data after a distributed scan. Therefore, using LIMIT does not reduce the amount of scanned data or lower computing costs.

If you encounter the following scenarios, you can refer to the corresponding solutions:

The ORDER BY clause can now be used without the LIMIT clause.
Because ORDER BY performs a global sort on a single execution node, a LIMIT restriction is enforced by default. This prevents misuse that can cause a single node to process a large amount of data. If your scenario requires you to use ORDER BY without the LIMIT restriction, you can use one of the following methods:
- Project level: Run the SETPROJECT odps.sql.validate.orderby.limit=false; command to disable the restriction that ORDER BY must be used with LIMIT.
- Session level: Set SET odps.sql.validate.orderby.limit=false; to disable the requirement that ORDER BY clauses must be used with LIMIT clauses. Submit this command together with the SQL statement.
  Note
  If you disable the requirement to use ORDER BY with LIMIT, sorting large amounts of data on a single execution node will consume more resources and take longer.
Removing the screen display limit
If you run a SELECT statement without a LIMIT clause, or if the NUMBER specified in the LIMIT clause exceeds the configured display limit (n), the results window displays a maximum of n rows.
The screen display limit can vary for each project. You can control this limit using the following methods:
- If data protection is disabled for the project, you must modify the `odpscmd_config.ini` file.
  Set use_instance_tunnel=true in the `odpscmd_config.ini` file. If the instance_tunnel_max_record parameter is not configured, the number of rows that can be displayed is unlimited. Otherwise, the number of rows that can be displayed is limited by the value of the instance_tunnel_max_record parameter. The maximum value of the instance_tunnel_max_record parameter is 10,000 rows. For more information about Instance Tunnel, see Usage notes.
- If data protection is enabled for the project, the number of displayable rows is limited by the READ_TABLE_MAX_ROW parameter. The maximum value of this parameter is 10,000.
Note
You can run the SHOW SecurityConfiguration; command to view the configuration of the ProjectProtection property. If ProjectProtection=true, you must determine whether to disable the data protection mechanism based on your project's data protection requirements. To disable the mechanism, run the SET ProjectProtection=false; command. By default, the ProjectProtection property is not enabled. For more information about the project data protection mechanism, see Data protection mechanism.

WINDOW clause (window_clause)

For more information about the window clause, see Window function syntax.