This topic describes the SELECT syntax in MaxCompute and the precautions when you execute SELECT statements to perform operations such as nested queries, sorting, and queries by group.
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY order_condition]
[DISTRIBUTE BY distribute_condition [SORT BY sort_condition] ]
[LIMIT number]
Limits
- A SELECT statement can return a maximum of 10,000 rows of results. This limit does not apply to SELECT clauses. SELECT clauses return all results in response to the query from the upper layer.
- You are not allowed to perform full scans on partitioned tables by using SELECT statements.
If your project was created after 20:00 on January 10, 2018, you are not allowed to perform full scans on partitioned tables in your project by using SQL statements, unless otherwise specified. To query data in partitioned tables, you must specify the partitions that you want to scan. This reduces unnecessary I/O and saves computing resources. This also reduces your computing costs if you use the pay-as-you-go billing method.
To perform a full scan on a partitioned table, add theset odps.sql.allow.fullscan=true;
command before the SQL statement that is used for the full table scan. Then, commit and run the added command with the SQL statement. Assume that thesale_detail
table is a partitioned table. To perform a full scan on this table, execute the following statements:set odps.sql.allow.fullscan=true; SELECT * FROM sale_detail;
select_expr
- Specify the names of the columns from which you want to read data. For example, to
read the
shop_name
column from thesale_detail
table, execute the following statement:SELECT shop_name FROM sale_detail;
- Use an asterisk (
*
) to represent all columns. For example, to read all columns from thesale_detail
table, execute the following statement:SELECT * FROM sale_detail;
You can also use aWHERE
clause to specify filter conditions.SELECT * FROM sale_detail WHERE shop_name LIKE 'hang%';
- Use a regular expression. Examples:
SELECT `abc. *` FROM t;
: Select all columns whose names start withabc
from thet
table.SELECT `(ds)? +. +` FROM t;
: Select all columns whose names are notds
from thet
table.SELECT `(ds|pt)? +. +` FROM t;
: Select all columns except the columns whose names areds
andpt
from thet
table.SELECT `(d. *)? +. +` FROM t;
: Select all columns except the columns whose names start withd
from thet
table.
NoteIf the name of col2 is the prefix of the name of col1 and you want to exclude multiple columns, make sure that the name of col1 is placed before that of col2. The longer column name is placed before the shorter column name. For example, if you want to exclude partitions
ds
anddshh
from your query of a table, executeSELECT `(dshh|ds)? +. +` FROM tbl;
instead ofSELECT `(ds|dshh)? +. +` FROM tbl;
. - Use
DISTINCT
before the name of a column to filter out duplicate values from the column and return only distinct values. If you useALL
before the name of a column, all values, including the duplicate values, in the column are returned. If DISTINCT is not used,ALL
is used by default.Examples:-- Query data from the region column in the sale_detail table and return only distinct values. SELECT DISTINCT region FROM sale_detail; +------------+ | region | +------------+ | shanghai | +------------+ -- If you use DISTINCT in a SELECT statement where multiple columns are specified, DISTINCT applies to all the columns you specified, instead of a single column. SELECT DISTINCT region, sale_date FROM sale_detail; +------------+------------+ | region | sale_date | +------------+------------+ | shanghai | 20191110 | +------------+------------+
table_reference
table_reference
specifies the table that you want to query. You can also use table_reference to specify
a nested subquery. Example: SELECT * FROM (SELECT region FROM sale_detail) t WHERE region = 'shanghai';
where_condition
The following table describes the operators that are used as filter conditions supported bywhere_condition
.
Operator | Description |
---|---|
>, <, =, >=, <=, and <> | The relational operators. |
LIKE and RLIKE | The source and pattern parameters of LIKE and RLIKE must be of the STRING type. |
IN and NOT IN | If a subquery is added after the IN or NOT IN operator, the values in only one column can be returned for the subquery and the number of return values cannot exceed 1,000. |
BETWEEN…AND | The operator that specifies the query range. |
- In
where_condition
of a SELECT statement, you can specify the partitions that you want to scan in a table. Example:SELECT sale_detail. * FROM sale_detail WHERE sale_detail.sale_date >= '2008' AND sale_detail.sale_date <= '2014';
Note You can execute theEXPLAIN SELECT
statement to check whether partition pruning takes effect. A common user-defined function (UDF) or the partition condition settings ofJOIN
may cause partition pruning to fail. For more information, see Check whether partition pruning is effective. - UDFs support partition pruning. UDFs are executed as small jobs and then replaced
with the execution results. You can enable UDF-based partition pruning by using one
of the following methods:
- Add an annotation to the UDF class when you write a UDF.
@com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
Note The UDF annotationcom.aliyun.odps.udf.annotation.UdfProperty
is defined in the odps-sdk-udf.jar file. You must update the version of the referenced odps-sdk-udf to 0.30.X or later. - Add
set odps.sql.udf.ppr.deterministic = true;
before the SQL statements to execute. This way, all UDFs in the SQL statements are considereddeterministic
UDFs. The preceding SET statement backfills partitions with execution results. A maximum of 1,000 partitions can be backfilled with execution results. If you add an annotation to a UDF class, an error may be returned. This indicates that more than 1,000 partitions are backfilled with execution results. To ignore this error, you can run theset odps.sql.udf.ppr.to.subquery = false;
command. After you run this command, UDF-based partition pruning is no longer in effect.
- Add an annotation to the UDF class when you write a UDF.
- Use the
BETWEEN…AND
operator to filter data. Example:SELECT sale_detail. * FROM sale_detail WHERE sale_detail.sale_date BETWEEN '2008' AND '2014';
GROUP BY
GROUP BY
is typically used with aggregate functions. If a SELECT
statement contains aggregate functions, the following rules apply:
- During the parsing of SQL statements,
GROUP BY
precedesSELECT
. Therefore, the key ofGROUP BY
must be the name or expression of a column in the input table in theSELECT
statement. You are not allowed to specify the alias of an output column in theSELECT
statement as the key of GROUP BY. - The key of
GROUP BY
may be both the name or expression of a column in the input table and the alias of an output column in theSELECT
statement. In this case, the name or expression of the column in the input table is used as the key of GROUP BY. - If
set hive.groupby.position.alias=true;
is added before a SELECT statement, the integer constants inGROUP BY
are considered column IDs in the SELECT statement.set hive.groupby.position.alias=true;-- Run this command with the next SELECT statement. SELECT region, SUM(total_price) FROM sale_detail GROUP BY 1;-- 1 indicates the region column, which is the first column read by the SELECT statement. This statement groups the table data based on the values in the region column and returns distinct region values and total sales of each group.
-- The statement uses the name of the region column in the input table as the key of GROUP BY and groups the table data based on the values in the region column.
SELECT region FROM sale_detail GROUP BY region;
-- The statement groups the table data based on the values in the region column and returns the total sales of each group.
SELECT SUM(total_price) FROM sale_detail GROUP BY region;
-- The statement groups the table data based on the values in the region column and returns distinct values and total sales of each group.
SELECT region, SUM(total_price) FROM sale_detail GROUP BY region;
-- An error is returned when the alias of the column in the SELECT statement is used as the key of the GROUP BY clause.
SELECT region AS r FROM sale_detail GROUP BY r;
-- A complete expression of the column is required.
SELECT 2 + total_price AS r FROM sale_detail GROUP BY 2 + total_price;
-- An error is returned when a column that does not use aggregate functions in the SELECT statement is not included in the GROUP BY clause.
SELECT region, total_price FROM sale_detail GROUP BY region;
-- The SELECT statement can be executed when all the columns that do not use aggregate functions are included in the GROUP BY clause.
SELECT region, total_price FROM sale_detail GROUP BY region, total_price;
ORDER BY|DISTRIBUTE BY|SORT BY
ORDER BY
ORDER BY is used to sort all data records based on the specified columns.- To sort data records in descending order, use the
DESC
keyword. By default, data records are sorted in ascending order. - If you use
ORDER BY
to sort data records, NULL is considered the smallest value. This is also the case in MySQL, but not in Oracle. ORDER BY
must be followed by the alias of a specific column in theSELECT
statement. If a column is included in theSELECT
statement but you do not specify an alias for the column, the name of the column is considered its alias.- If
set hive.orderby.position.alias=true;
is added before a SELECT statement, the integer constants in ORDER BY are considered column IDs in the SELECT statement.-- Set the flag. set hive.orderby.position.alias=true; -- Create the src table. CREATE table src(key BIGINT, value BIGINT); -- Query the src table and sort the first 100 records that are returned in ascending order by value. SELECT * FROM src ORDER BY 2 LIMIT 100; The preceding statement is equivalent to the following statement: SELECT * FROM src ORDER BY value LIMIT 100;
- You can use an OFFSET clause with an ORDER BY LIMIT clause to skip a specific number
of rows.
-- Sort the rows of the src table in ascending order by key, and return the eleventh to thirtieth rows. OFFSET 10 indicates that the first 10 rows are skipped, and LIMIT 20 indicates that a maximum of 20 rows can be returned. SELECT * FROM src ORDER BY key LIMIT 20 OFFSET 10;
Examples-- Query data records in the sale_detail table and sort the first 100 records that are returned in ascending order by region. SELECT * FROM sale_detail ORDER BY region LIMIT 100; -- By default, the LIMIT clause is used with the ORDER BY clause. If only the ORDER BY clause is used, an error is returned. If you want to remove this limit, see Remove the limit on simultaneous execution of ORDER BY and LIMIT. SELECT * FROM sale_detail ORDER BY region; -- ORDER BY is followed by the alias of a column. SELECT region AS r FROM sale_detail ORDER BY region LIMIT 100; SELECT region AS r FROM sale_detail ORDER BY r LIMIT 100;
- To sort data records in descending order, use the
DISTRIBUTE BY
DISTRIBUTE BY is used to shard data based on the hash values of specified columns. You must specify the alias of an output column of the
SELECT
statement as the key of DISTRIBUTE BY.Examples-- The statement queries values in the region column of the sale_detail table and shards data based on the hash values of the region column. SELECT region FROM sale_detail DISTRIBUTE BY region; -- The statement can be executed when the column name is an alias of the column. SELECT region AS r FROM sale_detail DISTRIBUTE BY region; The preceding statement is equivalent to the following statement: SELECT region AS r FROM sale_detail DISTRIBUTE BY r;
SORT BY
SORT BY is used to sort specific data records. You can add a keyword, such as ASC or DESC, to sort specific data records in order. If ASC is used, data records are sorted in ascending order. If DESC is used, data records are sorted in descending order. If SORT BY is not followed by a keyword, data records are automatically sorted in ascending order.- If
SORT BY
is preceded byDISTRIBUTE BY
,SORT BY
sorts specific results ofDISTRIBUTE BY
.DISTRIBUTE BY
determines how the map output values are distributed among reducers. If you want to prevent duplicate output values being distributed to different reducers or you want to process the same group of data together, you can useDISTRIBUTE BY
to ensure that the same group of data is distributed to the same reducer. Then, useSORT BY
to sort the group of data. Example:-- The statement queries values in the region column of the sale_detail table, shards data based on the hash values of the region column, and then sorts the sharded data. SELECT region FROM sale_detail DISTRIBUTE BY region SORT BY region;
- If
SORT BY
is not preceded byDISTRIBUTE BY
,SORT BY
sorts the data of each reducer. This process sorts specific data records. This ensures that the output values of each reducer are sorted in order and increases the storage compression ratio. If data is filtered during data reading, this method reduces the amount of data read from disks and improves the efficiency of subsequent global sorting.
ExampleSELECT region FROM sale_detail SORT BY region;
- If
- The value of ORDER BY, DISTRIBUTE BY, or SORT BY must be the alias of an output column
in the
SELECT
statement. The column alias can be Chinese. - During the parsing of MaxCompute SQL statements, the ORDER BY, DISTRIBUTE BY, or SORT
BY clause is executed after the
SELECT
statement. Therefore, the value of ORDER BY, DISTRIBUTE BY, or SORT BY must be the alias of an output column in theSELECT
statement. ORDER BY
cannot be used withDISTRIBUTE BY
orSORT BY
. Similarly,GROUP BY
cannot be used withDISTRIBUTE BY
orSORT BY
.
LIMIT
The number
in the LIMIT
clause is a constant that limits the number of rows returned.
LIMIT
scans and filters data based on a distributed system. When you use LIMIT
, the amount of data returned is not reduced. Therefore, computing costs are not reduced.
Remove the limit on simultaneous execution of ORDER BY
and LIMIT
ORDER BY
sorts all data of a single execution node. By default, ORDER BY is used with LIMIT
to prevent a single node from processing large amounts of data. You can remove the
limit on simultaneous execution of ORDER BY
with LIMIT
from a project or session.
- To remove the limit from a project, run the
setproject odps.sql.validate.orderby.limit=false;
command. - To remove the limit from a session, commit and run the
setproject odps.sql.validate.orderby.limit=false;
command with the SQL statement.Note After the limit is removed, if a single execution node has large amounts of data to sort, more resources and time are consumed.
Limits on the number of rows displayed on the screen
If you use a SELECT
statement without LIMIT
or the number in the LIMIT
clause exceeds the maximum number of rows that can be displayed on the screen, you
can view only the rows whose quantity does not reach the maximum on the screen.
- If project data protection is disabled, modify the odpscmd config.ini file.
Set
use_instance_tunnel
to true in the odpscmd config.ini file. If theinstance_tunnel_max_record
parameter is not configured, the number of rows displayed on the screen is not limited. Otherwise, the number of rows displayed is limited by theinstance_tunnel_max_record
parameter. The maximum value of theinstance_tunnel_max_record
parameter is 10000. For more information about InstanceTunnel, see Tunnel command usage. - If project data protection is enabled, the number of rows that can be displayed on
the screen is limited by
READ_TABLE_MAX_ROW
. The maximum value of this parameter is 10000.
show SecurityConfiguration;
command to view the value of ProjectProtection
. If ProjectProtection
is set to true, you can determine whether to disable project data protection as required.
You can run the set ProjectProtection=false;
command to disable project data protection. By default, ProjectProtection
is set to false. For more information about project data protection, see Project data protection.