SELECT statements query data from tables. This topic covers the full SELECT syntax in MaxCompute, including clauses for filtering, grouping, sorting, and distributing data.
Before running SELECT statements, make sure you have the Select permission on the target table. For more information, see MaxCompute permissions.
Query operations
The following query operations are supported in SELECT statements.
| Operation | Description |
|---|---|
| Subqueries | Perform further queries based on the result of a query |
| INTERSECT, UNION, and EXCEPT | Get the intersection, union, or difference (complement) of two datasets |
| JOIN | Join tables based on join and query conditions |
| SEMI JOIN | Filter the left table using the right table; only data from the left table is returned |
| MAPJOIN HINT | Specify MAPJOIN hints to optimize JOIN performance on one large table and one or more small tables |
| SKEWJOIN HINT | Handle hot key values and long tail issues in JOIN operations |
| Lateral View | Use LATERAL VIEW with a user-defined table-valued function (UDTF) to split one row into multiple rows |
| GROUPING SETS | Aggregate and analyze data across multiple dimensions |
| SELECT TRANSFORM | Start a child process and use standard input/output for data I/O |
| Split Size hint | Modify the split size to control subtask parallelism |
| Time travel queries and incremental queries | Query historical data (time travel) or historical incremental data (incremental) on Delta tables |
Limitations
SELECT result display: A maximum of 10,000 rows can be displayed, and the returned result size must be less than 10 MB. This limit does not apply to SELECT clauses used inside larger queries — those return all results to the upper layer.
Full table scan on partitioned tables: Projects created after 20:00:00 on January 10, 2018 cannot perform a full table scan on partitioned tables. Always specify the partitions to scan. This reduces unnecessary I/O, conserves computing resources, and lowers costs when you use pay-as-you-go billing. To enable a full table scan for a specific query, commit the following flag together with the query:
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;
Clustered table bucket pruning: Bucket pruning takes effect only when a single table scan covers 400 or fewer partitions. When pruning does not apply, more data is scanned, which increases costs under pay-as-you-go billing and degrades performance under subscription billing.
Syntax
[WITH <cte>[, ...] ]
SELECT [ALL | DISTINCT] <SELECT_expr>[, <EXCEPT_expr>][, <REPLACE_expr>] ...
FROM <TABLE_reference>
[WHERE <WHERE_condition>]
[GROUP BY {<col_list>|ROLLUP(<col_list>)}]
[HAVING <HAVING_condition>]
[WINDOW <WINDOW_clause>]
[ORDER BY <ORDER_condition>]
[DISTRIBUTE BY <DISTRIBUTE_condition> [SORT BY <SORT_condition>]|[ CLUSTER BY <CLUSTER_condition>] ]
[LIMIT <number>]
For the order in which clauses execute, see Sequence for executing clauses in a SELECT statement.
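As a quick illustration, the following sketch combines several of these clauses against the sale_detail sample table defined in the next section (the values and thresholds are illustrative only):

```sql
-- Illustrative only: combines WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT.
SELECT region, SUM(total_price) AS total
FROM sale_detail
WHERE sale_date = '2013'
GROUP BY region
HAVING SUM(total_price) > 0
ORDER BY total DESC
LIMIT 10;
```

Note that ORDER BY references the output alias total, as required by the rules described later in this topic.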
Sample data
The examples in this topic use the sale_detail table. Run the following statements to create the table and insert sample data:
-- Create a partitioned table named sale_detail.
CREATE TABLE IF NOT EXISTS sale_detail
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
-- Add a partition.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china');
-- Insert sample data.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
Query all columns to confirm the data:
SELECT * FROM sale_detail;
Result:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
+------------+-------------+-------------+------------+------------+
WITH clause (CTE)
The WITH clause defines one or more common table expressions (CTEs) as temporary tables that subsequent queries can reference.
Rules:
CTE names must be unique within a WITH clause.
A CTE can only reference other CTEs defined in the same WITH clause — not itself, and not in a cycle.
Recursive self-reference (invalid):
WITH
A AS (SELECT 1 FROM A)
SELECT * FROM A;
Error:
FAILED: ODPS-0130161:[1,6] Parse exception - recursive cte A is invalid, it must have an initial_part and a recursive_part, which must be connected by UNION ALL
Circular reference (invalid):
WITH
A AS (SELECT * FROM B),
B AS (SELECT * FROM A)
SELECT * FROM B;
Error:
FAILED: ODPS-0130071:[1,26] Semantic analysis exception - while resolving view B - [1,51]recursive function call is not supported, cycle is A->B->A
Valid usage:
WITH
A AS (SELECT 1 AS C),
B AS (SELECT * FROM A)
SELECT * FROM B;
Result:
+---+
| c |
+---+
| 1 |
+---+
Column expression (SELECT_expr)
SELECT_expr (required) specifies the columns or expressions to return. The format is col1_name, col2_name, column expression, ....
Specify column names
Read specific columns by name:
SELECT shop_name FROM sale_detail;
Result:
+------------+
| shop_name |
+------------+
| s1 |
| s2 |
| s3 |
+------------+
Use * for all columns
* selects all columns. Combine with WHERE to filter:
-- Full table scan must be explicitly enabled for partitioned tables.
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;
Result:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
+------------+-------------+-------------+------------+------------+
With a WHERE filter:
SELECT * FROM sale_detail WHERE shop_name='s1';
Result:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
+------------+-------------+-------------+------------+------------+
Use regular expressions
Wrap a regular expression in backticks to match column names. The backtick content is interpreted as a regex pattern, not a column name — this is MaxCompute-specific syntax.
Select all columns whose names start with sh:
SELECT `sh.*` FROM sale_detail;
Result:
+------------+
| shop_name |
+------------+
| s1 |
| s2 |
| s3 |
+------------+
Exclude shop_name (select all other columns):
SELECT `(shop_name)?+.+` FROM sale_detail;
Result:
+-------------+-------------+------------+------------+
| customer_id | total_price | sale_date | region |
+-------------+-------------+------------+------------+
| c1 | 100.1 | 2013 | china |
| c2 | 100.2 | 2013 | china |
| c3 | 100.3 | 2013 | china |
+-------------+-------------+------------+------------+
Exclude multiple columns (shop_name and customer_id):
SELECT `(shop_name|customer_id)?+.+` FROM sale_detail;
Result:
+-------------+------------+------------+
| total_price | sale_date | region |
+-------------+------------+------------+
| 100.1 | 2013 | china |
| 100.2 | 2013 | china |
| 100.3 | 2013 | china |
+-------------+------------+------------+
Exclude all columns whose names start with t:
SELECT `(t.*)?+.+` FROM sale_detail;
Result:
+------------+-------------+------------+------------+
| shop_name | customer_id | sale_date | region |
+------------+-------------+------------+------------+
| s1 | c1 | 2013 | china |
| s2 | c2 | 2013 | china |
| s3 | c3 | 2013 | china |
+------------+-------------+------------+------------+
When excluding multiple columns where one name is a prefix of another, put the longer name first. For example, to exclude both ds and dshh, use `(dshh|ds)?+.+`, not `(ds|dshh)?+.+`.
Use DISTINCT and ALL
DISTINCT removes duplicate rows. ALL (the default) returns all rows including duplicates.
Return distinct values of region:
SELECT DISTINCT region FROM sale_detail;
Result:
+------------+
| region |
+------------+
| china |
+------------+
When DISTINCT is applied to multiple columns, it deduplicates based on the combination of all specified columns:
SELECT DISTINCT region, sale_date FROM sale_detail;
Result:
+------------+------------+
| region | sale_date |
+------------+------------+
| china | 2013 |
+------------+------------+
DISTINCT can be used with window functions to deduplicate the computed results:
SET odps.sql.allow.fullscan=true;
SELECT DISTINCT sale_date, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_price) AS rn FROM sale_detail;
Result:
+-----------+------------+
| sale_date | rn |
+-----------+------------+
| 2013 | 1 |
+-----------+------------+
DISTINCT cannot be used with GROUP BY. The following statement returns an error:
SELECT DISTINCT shop_name FROM sale_detail GROUP BY shop_name;
-- Error: GROUP BY cannot be used with SELECT DISTINCT
Column exclusion (EXCEPT)
EXCEPT (optional) excludes specific columns when reading all other columns. Format: EXCEPT(col1_name, col2_name, ...).
Read all columns except region:
SELECT * EXCEPT(region) FROM sale_detail;
Result:
+-----------+-------------+-------------+-----------+
| shop_name | customer_id | total_price | sale_date |
+-----------+-------------+-------------+-----------+
| s1 | c1 | 100.1 | 2013 |
| s2 | c2 | 100.2 | 2013 |
| s3 | c3 | 100.3 | 2013 |
+-----------+-------------+-------------+-----------+
Column modification (REPLACE)
REPLACE (optional) replaces specified columns with expression results while reading all other columns unchanged. Format: REPLACE(exp1 [AS] col1_name, exp2 [AS] col2_name, ...).
Read all columns, replacing total_price with total_price + 100 and region with a fixed string:
SELECT * REPLACE(total_price+100 AS total_price, 'shanghai' AS region) FROM sale_detail;
Result:
+-----------+-------------+-------------+-----------+--------+
| shop_name | customer_id | total_price | sale_date | region |
+-----------+-------------+-------------+-----------+--------+
| s1 | c1 | 200.1 | 2013 | shanghai |
| s2 | c2 | 200.2 | 2013 | shanghai |
| s3 | c3 | 200.3 | 2013 | shanghai |
+-----------+-------------+-------------+-----------+--------+
Table reference (TABLE_reference)
TABLE_reference (required) specifies the table to query.
Query a table by name:
SELECT customer_id FROM sale_detail;
Result:
+-------------+
| customer_id |
+-------------+
| c1 |
| c2 |
| c3 |
+-------------+
Use a nested subquery as the table source:
SELECT * FROM (SELECT region, sale_date FROM sale_detail) t WHERE region = 'china';
Result:
+------------+------------+
| region | sale_date |
+------------+------------+
| china | 2013 |
| china | 2013 |
| china | 2013 |
+------------+------------+
WHERE clause
WHERE (optional) filters rows. For partitioned tables, specifying partition columns in WHERE enables partition pruning, which reduces the amount of data scanned.
Supported relational operators:
>, <, =, >=, <=, <>, LIKE, RLIKE, IN, NOT IN, BETWEEN...AND
For more information, see Relational operators.
Filter by partition range to avoid a full table scan:
SELECT *
FROM sale_detail
WHERE sale_date >= '2008' AND sale_date <= '2014';
-- Equivalent to:
SELECT *
FROM sale_detail
WHERE sale_date BETWEEN '2008' AND '2014';
Result:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
+------------+-------------+-------------+------------+------------+
Use the EXPLAIN statement to verify that partition pruning is working. UDFs or JOIN conditions may prevent pruning. For details, see Check whether partition pruning is effective.
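For example, a quick way to inspect the plan is to prefix the query with EXPLAIN (a sketch; the exact plan output varies by MaxCompute version):

```sql
-- If partition pruning works, the plan lists only the matching partitions
-- of sale_detail instead of all partitions.
EXPLAIN
SELECT * FROM sale_detail WHERE sale_date = '2013';
```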
Column aliases in WHERE
A column alias defined using a function cannot be referenced in WHERE. The following statement returns an error:
SELECT task_name
,inst_id
,settings
,GET_JSON_OBJECT(settings, '$.SKYNET_ID') AS skynet_id
,GET_JSON_OBJECT(settings, '$.SKYNET_NODENAME') AS user_agent
FROM Information_Schema.TASKS_HISTORY
WHERE ds = '20211215' AND skynet_id IS NOT NULL
LIMIT 10;
UDF-based partition pruning
When you use UDFs in WHERE clauses, MaxCompute can execute the UDFs as small jobs to determine partition ranges. Two methods are available:
Method 1: Annotate the UDF class
Add the following annotation to your UDF class:
@com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
The com.aliyun.odps.udf.annotation.UdfProperty annotation requires odps-sdk-udf version 0.30.X or later.
Method 2: Use a SET command
SET odps.sql.udf.ppr.deterministic = true;
This treats all UDFs in the statement as deterministic. Up to 1,000 partitions can be backfilled. To suppress errors when more than 1,000 partitions are affected (and disable UDF-based pruning):
SET odps.sql.udf.ppr.to.subquery = false;
Placement requirement: The UDF must appear in the WHERE clause of the source table query. Placing it in a JOIN ON condition does not trigger pruning.
Correct:
-- UDF in the WHERE clause of the source table query.
SELECT key, value FROM srcp WHERE udf(ds) = 'xx';
Incorrect (UDF in a JOIN ON condition; pruning does not apply):
SELECT A.c1, A.c2 FROM srcp1 A JOIN srcp2 B ON A.c1 = B.c1 AND udf(A.ds) = 'xx';
GROUP BY
GROUP BY (optional) groups rows by specified columns or expressions, typically used with aggregate functions.
Rules:
GROUP BY is evaluated before SELECT. Column references in GROUP BY can be input table column names, expressions, or output column aliases from SELECT.
All non-aggregate columns in SELECT must appear in GROUP BY.
If columns in GROUP BY are specified by a regular expression, use the complete expression.
Group by column name:
SELECT region FROM sale_detail GROUP BY region;
Result:
+------------+
| region |
+------------+
| china |
+------------+
Group by column name with aggregate:
SELECT region, SUM(total_price) FROM sale_detail GROUP BY region;
Result:
+------------+------------+
| region | _c1 |
+------------+------------+
| china | 300.6 |
+------------+------------+
Group by output column alias:
SELECT region AS r FROM sale_detail GROUP BY r;
-- Equivalent to:
SELECT region AS r FROM sale_detail GROUP BY region;
Result:
+------------+
| r |
+------------+
| china |
+------------+
Group by column expression:
SELECT 2 + total_price AS r FROM sale_detail GROUP BY 2 + total_price;
Result:
+------------+
| r |
+------------+
| 102.1 |
| 102.2 |
| 102.3 |
+------------+
Non-aggregate columns in SELECT must all be in GROUP BY (incorrect vs. correct):
-- Incorrect: total_price is not in GROUP BY.
SELECT region, total_price FROM sale_detail GROUP BY region;
-- Correct:
SELECT region, total_price FROM sale_detail GROUP BY region, total_price;
Result of the correct statement:
+------------+-------------+
| region | total_price |
+------------+-------------+
| china | 100.1 |
| china | 100.2 |
| china | 100.3 |
+------------+-------------+
Position alias in GROUP BY
Run SET odps.sql.groupby.position.alias=true; (or SET hive.groupby.position.alias=true;) before the SELECT statement to treat integer constants in GROUP BY as column positions in SELECT:
SET odps.sql.groupby.position.alias=true;
-- 1 refers to the first column (region) in the SELECT list.
SELECT region, SUM(total_price) FROM sale_detail GROUP BY 1;
Result:
+------------+------------+
| region | _c1 |
+------------+------------+
| china | 300.6 |
+------------+------------+
HAVING clause
HAVING (optional) filters grouped results using aggregate functions.
-- Insert additional data to demonstrate HAVING filtering.
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
-- Return only groups where total sales are less than 305.
SELECT region, SUM(total_price) FROM sale_detail
GROUP BY region
HAVING SUM(total_price)<305;
Result:
+------------+------------+
| region | _c1 |
+------------+------------+
| china | 300.6 |
| shanghai | 200.9 |
+------------+------------+
ORDER BY
ORDER BY (optional) sorts all rows by the specified column or constant.
Default order: ascending. Add DESC for descending.
ORDER BY must be followed by LIMIT <number> by default. See LIMIT to remove this requirement.
Columns in ORDER BY must reference output column aliases from SELECT. If no alias is specified, the column name is used.
ORDER BY cannot be used with DISTRIBUTE BY or SORT BY simultaneously.
Sort ascending, return the first 2 rows:
SELECT * FROM sale_detail ORDER BY total_price LIMIT 2;
Result:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
+------------+-------------+-------------+------------+------------+
Sort descending, return the first 2 rows:
SELECT * FROM sale_detail ORDER BY total_price DESC LIMIT 2;
Result:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s3 | c3 | 100.3 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
+------------+-------------+-------------+------------+------------+
Use an output column alias in ORDER BY:
SELECT total_price AS t FROM sale_detail ORDER BY total_price LIMIT 3;
-- Equivalent to:
SELECT total_price AS t FROM sale_detail ORDER BY t LIMIT 3;
Result:
+------------+
| t |
+------------+
| 100.1 |
| 100.2 |
| 100.3 |
+------------+
NULL sort behavior
NULL is treated as the smallest value in ORDER BY — the same behavior as MySQL, not Oracle. As a result:
Ascending order (ASC): NULL rows appear first.
Descending order (DESC): NULL rows appear last.
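With the extra rows inserted in the HAVING example (which include a row whose total_price is NULL), this behavior can be observed directly:

```sql
-- The row with NULL total_price sorts before all numeric values
-- in ascending order.
SELECT shop_name, total_price FROM sale_detail ORDER BY total_price LIMIT 6;
```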
OFFSET
Skip rows with ORDER BY...LIMIT m OFFSET n. LIMIT m returns m rows; OFFSET n skips the first n rows. The shorthand LIMIT n, m is equivalent.
Return 3 rows starting from the 3rd row (skip the first 2):
SELECT customer_id, total_price FROM sale_detail ORDER BY total_price LIMIT 3 OFFSET 2;
-- Equivalent to:
SELECT customer_id, total_price FROM sale_detail ORDER BY total_price LIMIT 2, 3;
Result:
+-------------+-------------+
| customer_id | total_price |
+-------------+-------------+
| c3 | 100.3 |
+-------------+-------------+
This example assumes the original three sample rows: skipping the first 2 leaves only the 3rd row.
Position alias in ORDER BY
Run SET odps.sql.orderby.position.alias=true; (or SET hive.orderby.position.alias=true;) before the SELECT statement to treat integer constants in ORDER BY as column positions:
SET odps.sql.orderby.position.alias=true;
-- 3 refers to the third column (total_price) in the SELECT list.
SELECT * FROM sale_detail ORDER BY 3 LIMIT 3;
Result:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
+------------+-------------+-------------+------------+------------+
Global sorting acceleration
In standard ORDER BY queries, all data is routed to a single instance for sorting, which limits parallelism. Range clustering enables concurrent global sorting by sampling data, dividing it into ranges, sorting each range in parallel, and combining the results. For details, see Global sorting acceleration.
DISTRIBUTE BY
DISTRIBUTE BY (optional) performs hash partitioning on data based on specified column values. It controls how mapper output is distributed among reducers, ensuring that rows with the same column values go to the same reducer.
Column references in DISTRIBUTE BY must be output column aliases from SELECT. If no alias is specified, the column name is used as the alias.
Hash-partition by region:
-- All three statements below are equivalent.
SELECT region FROM sale_detail DISTRIBUTE BY region;
SELECT region AS r FROM sale_detail DISTRIBUTE BY region;
SELECT region AS r FROM sale_detail DISTRIBUTE BY r;
SORT BY
SORT BY (optional) is typically used with DISTRIBUTE BY.
Default order: ascending. Add DESC for descending.
SORT BY cannot be used with GROUP BY simultaneously.
With DISTRIBUTE BY: Sorts rows within each partition produced by DISTRIBUTE BY.
Hash-partition by region, then sort by total_price ascending within each partition:
-- Insert data for the shanghai partition.
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
SELECT region, total_price FROM sale_detail DISTRIBUTE BY region SORT BY total_price;
Result:
+------------+-------------+
| region | total_price |
+------------+-------------+
| shanghai | NULL |
| china | 100.1 |
| china | 100.2 |
| china | 100.3 |
| shanghai | 100.4 |
| shanghai | 100.5 |
+------------+-------------+
Sort descending within each partition:
SELECT region, total_price FROM sale_detail DISTRIBUTE BY region SORT BY total_price DESC;
Result:
+------------+-------------+
| region | total_price |
+------------+-------------+
| shanghai | 100.5 |
| shanghai | 100.4 |
| china | 100.3 |
| china | 100.2 |
| china | 100.1 |
| shanghai | NULL |
+------------+-------------+
Without DISTRIBUTE BY: Sorts data within each reducer independently. This improves output compression and reduces data read during downstream sorting:
SELECT region, total_price FROM sale_detail SORT BY total_price DESC;
Result:
+------------+-------------+
| region | total_price |
+------------+-------------+
| china | 100.3 |
| china | 100.2 |
| china | 100.1 |
| shanghai | 100.5 |
| shanghai | 100.4 |
| shanghai | NULL |
+------------+-------------+
Usage notes
Columns in ORDER BY, DISTRIBUTE BY, and SORT BY must use output column aliases from SELECT. Column aliases can be Chinese characters.
ORDER BY, DISTRIBUTE BY, and SORT BY are evaluated after SELECT, so they must reference SELECT output aliases.
ORDER BY cannot be used simultaneously with DISTRIBUTE BY or SORT BY.
GROUP BY cannot be used simultaneously with DISTRIBUTE BY or SORT BY.
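CLUSTER BY, shown in the syntax summary earlier, is shorthand for DISTRIBUTE BY and SORT BY on the same columns. A sketch against the sample table:

```sql
-- These two statements are equivalent: hash-partition by region,
-- then sort by region within each partition.
SELECT region FROM sale_detail CLUSTER BY region;
SELECT region FROM sale_detail DISTRIBUTE BY region SORT BY region;
```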
LIMIT <number>
LIMIT <number> (optional) restricts the number of rows returned. The value must be a 32-bit integer with a maximum of 2,147,483,647.
LIMIT scans and filters data in the distributed query engine but does not reduce computing costs.
Syntax
LIMIT <number>
LIMIT <offset>, <count> -- shorthand for LIMIT count OFFSET offset
LIMIT <count> OFFSET <offset> -- skip offset rows, then return count rows
For OFFSET examples, see OFFSET in the ORDER BY section.
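For example, against the sample table:

```sql
-- Return at most 2 rows. Without ORDER BY, which 2 rows are
-- returned is not deterministic.
SELECT * FROM sale_detail LIMIT 2;
```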
ORDER BY with LIMIT
ORDER BY requires LIMIT by default to prevent a single node from sorting large datasets. To remove this requirement:
Project level: Run SETPROJECT odps.sql.validate.orderby.limit=false;
Session level: Run SET odps.sql.validate.orderby.limit=false; together with the query.
Removing this limit may consume significantly more resources and time when a single node has large amounts of data to sort.
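For example, at the session level (a sketch; note the resource caveat above):

```sql
-- Commit the flag and the query together so the ORDER BY
-- no longer requires a LIMIT clause.
SET odps.sql.validate.orderby.limit=false;
SELECT * FROM sale_detail ORDER BY total_price;
```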
Maximum rows displayed
When no LIMIT is specified, or when LIMIT exceeds the display maximum, the number of rows shown is capped:
Project data protection disabled: Controlled by instance_tunnel_max_record in the odpscmd config.ini file (maximum: 10,000). Set use_instance_tunnel=true in config.ini. If instance_tunnel_max_record is not configured, no row limit applies.
Project data protection enabled: Controlled by the READ_TABLE_MAX_ROW parameter (maximum: 10,000).
Check whether project data protection is enabled:
SHOW SecurityConfiguration;
Disable project data protection (default: disabled):
SET ProjectProtection=false;
For more information, see Project data protection.
Window clause
For window clause syntax and usage, see Syntax.