
MaxCompute: SELECT syntax

Last Updated: Jan 15, 2024

MaxCompute allows you to query data by using SELECT statements. This topic describes the syntax of SELECT statements in MaxCompute. This topic also describes how to use SELECT statements to perform operations, such as nested queries, sorting, and queries by group.

Before you execute SELECT statements, make sure that you are granted the Select permission on the destination table. For more information, see MaxCompute permissions.

Overview

SELECT statements are used to query data that meets the specified conditions from a table. The following table describes the query operations that can be performed in different scenarios.

Query operation

Description

Subqueries

Allows you to perform further queries based on the result of a query.

INTERSECT, UNION, and EXCEPT

Allows you to obtain the intersection, union, or difference of two datasets.

JOIN

Allows you to join tables and obtain the data that meets both the join condition and the query condition.

SEMI JOIN

Allows you to filter data in the left table by using the right table. Only data from the left table appears in the result set.

MAPJOIN HINT

Allows you to explicitly specify MAPJOIN hints in SELECT statements when you perform JOIN operations on one large table and one or more small tables. This improves query performance.

SKEWJOIN HINT

Allows you to mitigate the long tail issue that may occur when the tables that you want to join contain hot key values. Hot key values are extracted from the two tables, the join results of hot key values and non-hot key values are calculated separately, and the results are then merged.

Lateral View

Allows you to use LATERAL VIEW with a user-defined table-valued function (UDTF) to split one row of data into multiple rows.

GROUPING SETS

Allows you to aggregate and analyze data from multiple dimensions.

SELECT TRANSFORM

Allows you to start a specified child process, feed input data to the process in the required format through standard input, and then parse the standard output of the child process to obtain the output data.

Split Size Hint

Allows you to modify the split size to control the parallelism of subtasks.

Time travel queries and incremental queries

In MaxCompute, you can perform time travel queries and incremental queries on Transaction Table 2.0 tables.

  • If you perform a time travel query, you can query historical data based on the snapshot of the source table at a historical point in time or of a historical version.

  • If you perform an incremental query, you can query historical incremental data within a historical period of time or between two versions of a source table.

    Note

    Transaction Table 2.0 is in invitational preview. By default, you cannot directly use Transaction Table 2.0. To use this feature, submit a ticket on the trial application page of MaxCompute. After the feature is enabled, you can run the related commands. For more information, see Overview of Transaction Table 2.0.
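The following statements sketch the general shape of these queries. The table name mf_tt and the timestamps are hypothetical, and the exact supported syntax may differ; see Overview of Transaction Table 2.0 for details.

-- Time travel query: read the snapshot of the hypothetical Transaction Table 2.0 table mf_tt at a historical point in time. 
select * from mf_tt timestamp as of '2024-01-01 00:00:00';
-- Incremental query: read only the incremental data between two historical points in time. 
select * from mf_tt timestamp between '2024-01-01 00:00:00' and '2024-01-02 00:00:00';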

Limits

  • After a SELECT statement is executed, a maximum of 10,000 rows of results can be displayed, and the size of the returned result must be less than 10 MB. This limit does not apply to subqueries: a subquery returns its full result set to the outer query.

  • If you execute a SELECT statement to query data from a partitioned table, you cannot perform a full table scan on the table.

    If your project was created after 20:00:00 on January 10, 2018, you cannot perform a full table scan on a partitioned table in the project. This limit applies when you execute a SELECT statement to query data from the table. To query data from a partitioned table, you must specify the partitions that you want to scan. This reduces unnecessary I/O and conserves computing resources. This also reduces your computing costs if you use the pay-as-you-go billing method.

    To perform a full table scan on a partitioned table, add the set odps.sql.allow.fullscan=true; command before the SQL statement that is used for the full table scan. Then, commit and run the added command with the SQL statement. For example, if you want to perform a full table scan on the sale_detail partitioned table, execute the following statement:

    set odps.sql.allow.fullscan=true;
    select * from sale_detail;
  • If you query clustered tables, a SELECT statement can perform bucket pruning only when the number of partitions that can be scanned in a single table is less than or equal to 400. If bucket pruning does not take effect, more data records are scanned. This increases your costs if you use the pay-as-you-go billing method, and deteriorates SQL computing performance if you use the subscription billing method.
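    Bucket pruning can be sketched as follows. The sale_detail_hashcluster table and the bucket count here are hypothetical; pruning can take effect when the query specifies the partitions and filters on the cluster key with an equality condition.

    -- Create a hypothetical hash-clustered table. 
    create table if not exists sale_detail_hashcluster
    (
    shop_name     string,
    customer_id   string,
    total_price   double
    )
    partitioned by (sale_date string, region string)
    clustered by (customer_id) sorted by (customer_id) into 10 buckets;

    -- Specify the partitions and filter on the cluster key so that bucket pruning can take effect. 
    select * from sale_detail_hashcluster
    where sale_date='2013' and region='china' and customer_id='c1';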

Syntax

[with <cte>[, ...] ]
select [all | distinct] <select_expr>[, <except_expr>][, <replace_expr>] ...
       from <table_reference>
       [where <where_condition>]
       [group by {<col_list>|rollup(<col_list>)}]
       [having <having_condition>]
       [window <window_clause>]
       [order by <order_condition>]
       [distribute by <distribute_condition> [sort by <sort_condition>]|[ cluster by <cluster_condition>] ]
       [limit <number>]

For more information about the sequence to execute clauses in a SELECT statement, see Sequence for executing clauses in a SELECT statement.

Sample data

This topic provides sample source data for the example statements. The following statements show how to create the sale_detail partitioned table and insert data into it.

-- Create a partitioned table named sale_detail. 
create table if not exists sale_detail
(
shop_name     string,
customer_id   string,
total_price   double
)
partitioned by (sale_date string, region string);

-- Add partitions to the sale_detail table. 
alter table sale_detail add partition (sale_date='2013', region='china');

-- Insert data into the sale_detail table. 
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);

Query data in the sale_detail table. Sample statement:

select * from sale_detail;
-- The following result is returned: 
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
+------------+-------------+-------------+------------+------------+

WITH clause (CTE)

The WITH clause is optional. The WITH clause contains one or more common table expressions (CTEs). A CTE is used as a temporary table in the runtime environment. You can reference the temporary table in subsequent queries. When you use a CTE, you must comply with the following rules:

  • The name of a CTE must be unique in a WITH clause.

  • A CTE that is defined in a WITH clause can be referenced by CTEs that are defined later in the same WITH clause and by the final query. Forward references and circular references are not supported.

    For example, A is the first CTE in a WITH clause and B is the second CTE in the same WITH clause.

    • If A references A, the reference is invalid. Incorrect usage of CTEs:

      with 
      A as (select 1 from A) 
      select * from A;
    • If A references B and B references A, the references are invalid. Circular reference is not supported. Incorrect usage of CTEs:

      with 
      A as (select * from B ), 
      B as (select * from A ) 
      select * from B;

Sample statement of correct usage:

with 
A as (select 1 as C),
B as (select * from A) 
select * from B;

The following result is returned:

+---+
| c |
+---+
| 1 |
+---+

Column expression (select_expr)

select_expr is required. It is a comma-separated list in the format col1_name, col2_name, <column expression>, .... Each item specifies a common column or partition key column that you want to query, or a regular expression that matches column names. When you use select_expr, you must comply with the following rules:

  • Specify the names of the columns from which you want to read data.

    The following statement reads data of the shop_name column from the sale_detail table. Sample statement:

    select shop_name from sale_detail;

    The following result is returned:

    +------------+
    | shop_name  |
    +------------+
    | s1         |
    | s2         |
    | s3         |
    +------------+
  • Use an asterisk (*) to represent all columns. You can also use an asterisk (*) with where_condition to specify filter conditions.

    • The following statement queries data of all columns from the sale_detail table. Sample statement:

      -- Enable a full table scan only for the current session. 
      set odps.sql.allow.fullscan=true;
      select * from sale_detail;

      The following result is returned:

      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | s1         | c1          | 100.1       | 2013       | china      |
      | s2         | c2          | 100.2       | 2013       | china      |
      | s3         | c3          | 100.3       | 2013       | china      |
      +------------+-------------+-------------+------------+------------+
    • The following statement uses an asterisk (*) with where_condition to specify filter conditions. Sample statement:

      select * from sale_detail where shop_name='s1';

      The following result is returned:

      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | s1         | c1          | 100.1       | 2013       | china      |
      +------------+-------------+-------------+------------+------------+
  • Use a regular expression.

    • The following statement queries data of all columns whose names start with sh from the sale_detail table. Sample statement:

      select `sh.*` from sale_detail;

      The following result is returned:

      +------------+
      | shop_name  |
      +------------+
      | s1         |
      | s2         |
      | s3         |
      +------------+
    • The following statement queries data of all columns whose names are not shop_name from the sale_detail table. Sample statement:

      select `(shop_name)?+.+` from sale_detail;

      The following result is returned:

      +-------------+-------------+------------+------------+
      | customer_id | total_price | sale_date  | region     |
      +-------------+-------------+------------+------------+
      | c1          | 100.1       | 2013       | china      |
      | c2          | 100.2       | 2013       | china      |
      | c3          | 100.3       | 2013       | china      |
      +-------------+-------------+------------+------------+
    • The following statement queries data of all columns except the columns whose names are shop_name and customer_id from the sale_detail table. Sample statement:

      select `(shop_name|customer_id)?+.+` from sale_detail;

      The following result is returned:

      +-------------+------------+------------+
      | total_price | sale_date  | region     |
      +-------------+------------+------------+
      | 100.1       | 2013       | china      |
      | 100.2       | 2013       | china      |
      | 100.3       | 2013       | china      |
      +-------------+------------+------------+
    • The following statement queries data of all columns except the columns whose names start with t from the sale_detail table. Sample statement:

      select `(t.*)?+.+` from sale_detail;

      The following result is returned:

      +------------+-------------+------------+------------+
      | shop_name  | customer_id | sale_date  | region     |
      +------------+-------------+------------+------------+
      | s1         | c1          | 2013       | china      |
      | s2         | c2          | 2013       | china      |
      | s3         | c3          | 2013       | china      |
      +------------+-------------+------------+------------+
      Note

      If the name of col2 is a prefix of the name of col1 and you want to exclude both columns, you must place the name of col1 before that of col2. That is, the longer column name is placed before the shorter one. For example, assume that a partitioned table has two partition key columns, ds and dshh, that do not need to be queried. The name of the ds partition is a prefix of the name of the dshh partition. Therefore, the select `(dshh|ds)?+.+` from t; expression is correct, but the select `(ds|dshh)?+.+` from t; expression is incorrect.

  • Use DISTINCT before a column name to remove duplicate values from the result and return only distinct values. Use ALL before a column name to return all values of the column, including duplicates. If neither is specified, ALL is used by default.

    • The following statement queries data of the region column from the sale_detail table and returns distinct values. In this example, only one distinct value exists. Sample statement:

      select distinct region from sale_detail;

      The following result is returned:

      +------------+
      | region     |
      +------------+
      | china      |
      +------------+
    • The following statement specifies multiple columns after the DISTINCT option. The DISTINCT option takes effect on all the specified columns instead of a single column. Sample statement:

      select distinct region, sale_date from sale_detail;

      The following result is returned:

      +------------+------------+
      | region     | sale_date  |
      +------------+------------+
      | china      | 2013       |
      +------------+------------+
    • In the following statement, DISTINCT is used together with a window function. DISTINCT is used to deduplicate the computing results of the window function. Sample statement:

      set odps.sql.allow.fullscan=true;
      select distinct sale_date, row_number() over (partition by customer_id order by total_price) as rn from sale_detail;

      The following result is returned:

      +-----------+------------+
      | sale_date | rn         |
      +-----------+------------+
      | 2013      | 1          |
      +-----------+------------+

      DISTINCT cannot be used together with GROUP BY. If you execute the following statement, an error is returned.

      select distinct shop_name from sale_detail group by shop_name;
      -- The error message "GROUP BY cannot be used with SELECT DISTINCT" is returned.
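      Because DISTINCT and GROUP BY express the same deduplication in this case, either clause alone returns the expected result:

      select distinct shop_name from sale_detail;
      -- The preceding statement is equivalent to the following statement: 
      select shop_name from sale_detail group by shop_name;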

Column exclusion expression (except_expr)

except_expr is optional. except_expr is in the except(col1_name, col2_name, ...) format. You can use except_expr to read data from most columns in a table and exclude data from a small number of columns in the table. For example, you can execute the select * except(col1_name, col2_name, ...) from ...; statement to read data from all columns except the col1 and col2 columns.

Sample statement:

-- Read data from all columns, except the region column, in the sale_detail table. 
select * except(region) from sale_detail;

The following result is returned:

+-----------+-------------+-------------+-----------+
| shop_name | customer_id | total_price | sale_date |
+-----------+-------------+-------------+-----------+
| s1        | c1          | 100.1       | 2013      |
| s2        | c2          | 100.2       | 2013      |
| s3        | c3          | 100.3       | 2013      |
+-----------+-------------+-------------+-----------+

Column modification expression (replace_expr)

replace_expr is optional. replace_expr is in the replace(exp1 [as] col1_name, exp2 [as] col2_name, ...) format. You can use replace_expr to read data from most columns in a table and modify data of a small number of columns in the table. For example, you can execute the select * replace(exp1 as col1_name, exp2 as col2_name, ...) from ...; statement to replace the data of the col1 column with the calculation result of exp1, and replace the data of the col2 column with the calculation result of exp2 when you read data from a table.

Sample statement:

-- Read data from the sale_detail table and modify the data in the total_price and region columns. 
select * replace(total_price+100 as total_price, 'shanghai' as region) from sale_detail;

The following result is returned:

+-----------+-------------+-------------+-----------+----------+
| shop_name | customer_id | total_price | sale_date | region   |
+-----------+-------------+-------------+-----------+----------+
| s1        | c1          | 200.1       | 2013      | shanghai |
| s2        | c2          | 200.2       | 2013      | shanghai |
| s3        | c3          | 200.3       | 2013      | shanghai |
+-----------+-------------+-------------+-----------+----------+

Destination table information (table_reference)

table_reference is required. table_reference specifies the table that you want to query. When you use table_reference, you must comply with the following rules:

  • Specify the name of a destination table. Sample statement:

    select customer_id from sale_detail;

    The following result is returned:

    +-------------+
    | customer_id |
    +-------------+
    | c1          |
    | c2          |
    | c3          |
    +-------------+
  • Use a nested subquery. Sample statement:

    select * from (select region,sale_date from sale_detail) t where region = 'china';

    The following result is returned:

    +------------+------------+
    | region     | sale_date  |
    +------------+------------+
    | china      | 2013       |
    | china      | 2013       |
    | china      | 2013       |
    +------------+------------+

WHERE clause (where_condition)

where_condition is optional. where_condition specifies a filter condition. If where_condition is used for a partitioned table, partition pruning can be performed. When you use where_condition, you must comply with the following rules:

  • Use where_condition with relational operators to obtain the data that meets the specified conditions. Relational operators include:

    • >, <, =, >=, <=, and <>

    • LIKE and RLIKE

    • IN and NOT IN

    • BETWEEN…AND

    For more information, see Relational operators.

    The following statement specifies the partitions that you want to scan in where_condition. This prevents a full table scan. Sample statement:

    select * 
    from sale_detail
    where sale_date >= '2008' and sale_date <= '2014';
    -- The preceding statement is equivalent to the following statement: 
    select * 
    from sale_detail 
    where sale_date between '2008' and '2014';

    The following result is returned:

    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    +------------+-------------+-------------+------------+------------+
    Note

    You can execute the EXPLAIN statement to check whether partition pruning takes effect. A common user-defined function (UDF) or the partition condition settings of JOIN may cause partition pruning to fail. For more information, see Check whether partition pruning is effective.

  • Use UDF-based partition pruning. If you use UDFs, MaxCompute executes the UDFs as small jobs and backfills partitions with the results of these jobs.

    • Implementation methods

      • Add an annotation to a UDF class when you write a UDF.

        @com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
        Note

        The UDF annotation com.aliyun.odps.udf.annotation.UdfProperty is defined in the odps-sdk-udf.jar file. To use this annotation, you must update the version of the referenced odps-sdk-udf to 0.30.X or later.

      • Add set odps.sql.udf.ppr.deterministic = true; before the SQL statement that you want to execute. All UDFs in that statement are then considered deterministic UDFs, and partitions are backfilled with the results of the jobs. A maximum of 1,000 partitions can be backfilled this way. If more than 1,000 partitions are backfilled, an error may be returned even if you added the annotation to the UDF class. To ignore this error, you can run the set odps.sql.udf.ppr.to.subquery = false; command. After you run this command, UDF-based partition pruning no longer takes effect.

    • Precautions

      UDF-based partition pruning takes effect only when the UDF is placed in the WHERE clause of the query on the source table.

      • Sample statement of correct usage:

        --Place a UDF in the WHERE clause that is used to query the source table.
        select key, value from srcp where udf(ds) = 'xx';
      • Sample statement of incorrect usage:

        --Place a UDF after the ON condition in the JOIN clause. UDF-based partition pruning does not take effect.
        select A.c1, A.c2 from srcp1 A  join srcp2  B on A.c1 = B.c1 and udf(A.ds) ='xx';
  • In a column expression (select_expr), if a column alias is assigned to the result of a function, the alias cannot be referenced in the WHERE clause. Sample statement of incorrect usage:

    select  task_name
            ,inst_id
            ,settings
            ,GET_JSON_OBJECT(settings, '$.SKYNET_ID') as skynet_id
            ,GET_JSON_OBJECT(settings, '$.SKYNET_NODENAME') as user_agent
    from    Information_Schema.TASKS_HISTORY
    where   ds = '20211215' and skynet_id is not null
    limit 10;
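    A common workaround, shown here as a sketch, is to compute the function in a subquery so that the alias can be referenced in the outer WHERE clause:

    select  task_name
            ,inst_id
            ,settings
            ,skynet_id
            ,user_agent
    from    (
                select  task_name
                        ,inst_id
                        ,settings
                        ,GET_JSON_OBJECT(settings, '$.SKYNET_ID') as skynet_id
                        ,GET_JSON_OBJECT(settings, '$.SKYNET_NODENAME') as user_agent
                from    Information_Schema.TASKS_HISTORY
                where   ds = '20211215'
            ) t
    where   skynet_id is not null
    limit 10;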

GROUP BY (col_list)

GROUP BY is optional. In most cases, GROUP BY is used with aggregate functions to group columns based on the specified common columns, partition key columns, or regular expressions. When you use GROUP BY, you must comply with the following rules:

  • GROUP BY is executed before SELECT. Therefore, columns in GROUP BY must be specified by column names of the input table of SELECT or by expressions that are formed from columns of the input table. When you use GROUP BY, take note of the following points:

    • If columns in GROUP BY are specified by a regular expression, the complete expression must be used.

    • The columns that do not use aggregate functions in a SELECT statement must be specified in GROUP BY.

    Examples:

    • The following statement groups table data by the column name region. In this case, data is grouped based on the values of the region column. Sample statement:

      select region from sale_detail group by region;

      The following result is returned:

      +------------+
      | region     |
      +------------+
      | china      |
      +------------+
    • The following statement groups table data based on the values of the region column and returns the total sales of each group. Sample statement:

      select sum(total_price) from sale_detail group by region;

      The following result is returned:

      +------------+
      | _c0        |
      +------------+
      | 300.6      |
      +------------+
    • The following statement groups table data based on the values of the region column and returns distinct values and total sales of each group. Sample statement:

      select region, sum (total_price) from sale_detail group by region;

      The following result is returned:

      +------------+------------+
      | region     | _c1        |
      +------------+------------+
      | china      | 300.6      |
      +------------+------------+
    • The following statement groups table data based on the alias of an output column in a SELECT statement. Sample statement:

      select region as r from sale_detail group by r;
      -- The preceding statement is equivalent to the following statement: 
      select region as r from sale_detail group by region;

      The following result is returned:

      +------------+
      | r          |
      +------------+
      | china      |
      +------------+
    • The following statement groups table data based on the column expression. Sample statement:

      select 2 + total_price as r from sale_detail group by 2 + total_price;

      The following result is returned:

      +------------+
      | r          |
      +------------+
      | 102.1      |
      | 102.2      |
      | 102.3      |
      +------------+
    • If some columns in a SELECT statement do not use aggregate functions, these columns must be specified in GROUP BY. Otherwise, an error is returned. Sample statement of incorrect usage:

      select region, total_price from sale_detail group by region;

      Sample statement of correct usage:

      select region, total_price from sale_detail group by region, total_price;

      The following result is returned:

      +------------+-------------+
      | region     | total_price |
      +------------+-------------+
      | china      | 100.1       |
      | china      | 100.2       |
      | china      | 100.3       |
      +------------+-------------+
  • If you add the set odps.sql.groupby.position.alias=true; command before a SELECT statement, integer constants in the GROUP BY clause are considered column numbers in the SELECT statement. Sample statement:

    -- Run this command with the following SELECT statement. 
    set odps.sql.groupby.position.alias=true;
    -- 1 indicates the region column, which is the first column read by the following SELECT statement. This statement groups table data based on the values of the region column and returns distinct values of the region column and total sales of each group. 
    select region, sum(total_price) from sale_detail group by 1;

    The following result is returned:

    +------------+------------+
    | region     | _c1        |
    +------------+------------+
    | china      | 300.6      |
    +------------+------------+

HAVING clause (having_condition)

having_condition is optional. In most cases, having_condition is used with aggregate functions to filter data. Sample statement:

-- Insert data into the sale_detail table to display the data rendering effect. 
insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
-- Use having_condition with aggregate functions to filter data. 
select region,sum(total_price) from sale_detail 
group by region 
having sum(total_price)<305;

The following result is returned:

+------------+------------+
| region     | _c1        |
+------------+------------+
| china      | 300.6      |
| shanghai   | 200.9      |
+------------+------------+

ORDER BY (order_condition)

order_condition is optional. ORDER BY is used to sort all data records based on a specified common column or partition key column. ORDER BY can also be used to sort all data records based on a specified constant. When you use ORDER BY, you must comply with the following rules:

  • By default, data is sorted in ascending order. If you want to sort data in descending order, the DESC keyword is required.

  • ORDER BY must be followed by LIMIT <number> to limit the number of data rows that are displayed in the output. If ORDER BY is not followed by LIMIT <number>, an error is returned. You can work around this limit. For more information, see LIMIT.

    • The following statement queries data from the sale_detail table, sorts data records in ascending order based on the values of the total_price column, and then displays the first two records. Sample statement:

      select * from sale_detail order by total_price limit 2;

      The following result is returned:

      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | s1         | c1          | 100.1       | 2013       | china      |
      | s2         | c2          | 100.2       | 2013       | china      |
      +------------+-------------+-------------+------------+------------+
    • The following statement queries data from the sale_detail table, sorts data records in descending order based on the values of the total_price column, and then displays the first two records. Sample statement:

      select * from sale_detail order by total_price desc limit 2;

      The following result is returned:

      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | s3         | c3          | 100.3       | 2013       | china      |
      | s2         | c2          | 100.2       | 2013       | china      |
      +------------+-------------+-------------+------------+------------+
  • NULL is treated as the smallest value when you use ORDER BY to sort data. This is also the case in MySQL, but not in Oracle.

    The following statement queries data from the sale_detail table, which contains the row with a NULL total_price that was inserted in the HAVING clause example, sorts data records in ascending order based on the values of the total_price column, and then displays the first two records. The NULL value is returned first. Sample statement:

    select * from sale_detail order by total_price limit 2;

    The following result is returned:

    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s1         | c1          | 100.1       | 2013       | china      |
    +------------+-------------+-------------+------------+------------+
  • ORDER BY can be followed by the alias of an output column of the SELECT statement. If you do not specify an alias for an output column, the column name is used as its alias.

    The following statement adds the alias of an output column after ORDER BY. Sample statement:

    select total_price as t from sale_detail order by total_price limit 3;
    -- The preceding statement is equivalent to the following statement: 
    select total_price as t from sale_detail order by t limit 3;

    The following result is returned:

    +------------+
    | t          |
    +------------+
    | 100.1      |
    | 100.2      |
    | 100.3      |
    +------------+
  • If you add the set odps.sql.orderby.position.alias=true; command before a SELECT statement, integer constants in the ORDER BY clause are considered column numbers in the SELECT statement. Sample statement:

    -- Run this command with the following SELECT statement. 
    set odps.sql.orderby.position.alias=true;
    select * from sale_detail order by 3 limit 3;

    The following result is returned:

    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    +------------+-------------+-------------+------------+------------+
  • An OFFSET clause can be used together with an ORDER BY...LIMIT clause to skip a number of rows before results are returned. In the format ORDER BY...LIMIT m OFFSET n, LIMIT m returns at most m rows and OFFSET n skips the first n rows. This clause can be abbreviated as ORDER BY...LIMIT n, m. If you do not want to skip any rows, specify OFFSET 0 or omit the OFFSET clause.

    The following statement sorts the data of the sale_detail table in ascending order based on the values of the total_price column and displays three rows of data starting from the third row. Sample statement:

    select customer_id,total_price from sale_detail order by total_price limit 3 offset 2;
    -- The preceding statement is equivalent to the following statement: 
    select customer_id,total_price from sale_detail order by total_price limit 2, 3;

    The following result is returned:

    +-------------+-------------+
    | customer_id | total_price |
    +-------------+-------------+
    | c3          | 100.3       |
    +-------------+-------------+

    The query result contains only three rows in total. After the first two rows are skipped, only the third row remains to be returned.
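The skip-then-return behavior of LIMIT m OFFSET n can be sketched in ordinary code. The following Python snippet is an illustration only, using hypothetical in-memory rows that mirror the sale_detail sample; it is not MaxCompute code.

```python
# Illustration of ORDER BY ... LIMIT m OFFSET n semantics.
# The rows mirror the sale_detail sample above; this is not MaxCompute code.
rows = [
    {"customer_id": "c1", "total_price": 100.1},
    {"customer_id": "c2", "total_price": 100.2},
    {"customer_id": "c3", "total_price": 100.3},
]

def order_by_limit_offset(rows, key, m, n=0):
    """Sort ascending by `key`, skip the first n rows, return at most m rows."""
    ordered = sorted(rows, key=lambda r: r[key])
    return ordered[n:n + m]

# LIMIT 3 OFFSET 2 (equivalent to LIMIT 2, 3): two rows are skipped,
# so only the third row remains.
print(order_by_limit_offset(rows, "total_price", m=3, n=2))
# → [{'customer_id': 'c3', 'total_price': 100.3}]
```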

  • Range clustering can also be used for global sorting acceleration. In common scenarios in which ORDER BY is used, all sorted data is distributed to the same instance to ensure global sorting. However, concurrent processing cannot be fully utilized in these scenarios. You can use the partitioning step of range clustering to implement concurrent global sorting. If you want to perform global sorting, you must sample data and divide the data into ranges, sort data in each range in parallel, and then obtain the result of global sorting. For more information, see Global sorting acceleration.
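The idea behind global sorting acceleration can be sketched as follows: divide the value domain into non-overlapping ranges, sort each range independently (in parallel in a real engine), and concatenate the results. The Python snippet below is a conceptual illustration with hypothetical range boundaries, not the MaxCompute implementation.

```python
# Sketch of sorting via range partitioning: values are bucketed into
# non-overlapping ranges, each range is sorted independently (in parallel
# in a real engine), and the sorted ranges concatenate into a globally
# sorted result. Illustration only, not the MaxCompute implementation.

def range_partitioned_sort(values, boundaries):
    """boundaries: ascending split points, e.g. [100.3] defines two ranges."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        i = sum(1 for b in boundaries if v >= b)  # index of the range v falls into
        buckets[i].append(v)
    # Each bucket could be sorted by a separate instance; because the
    # ranges do not overlap, concatenation yields a global sort.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result

data = [100.3, 100.1, 100.5, 100.2, 100.4]
print(range_partitioned_sort(data, boundaries=[100.3]))
# → [100.1, 100.2, 100.3, 100.4, 100.5]
```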

DISTRIBUTE BY hash partitioning (distribute_condition)

distribute_condition is optional. DISTRIBUTE BY is used to perform hash partitioning on data based on the values of specific columns.

DISTRIBUTE BY controls how the output of a mapper is distributed among reducers. To prevent the same data from being distributed to different reducers, you can use DISTRIBUTE BY. This ensures that the same group of data is distributed to the same reducer.

The column in the DISTRIBUTE BY clause must be specified by the alias of an output column of the SELECT statement. If you do not specify an alias for a column, the column name is used as its alias. Sample statement:

-- The following statement queries the values of the region column from the sale_detail table and performs hash partitioning on data based on the values of the region column. 
select region from sale_detail distribute by region;
-- The preceding statement is equivalent to the following statements: 
select region as r from sale_detail distribute by region;
select region as r from sale_detail distribute by r;
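Conceptually, DISTRIBUTE BY routes each row to a reducer based on a hash of the distribute key, so rows with equal keys always land on the same reducer. The following Python sketch illustrates this idea with an arbitrary hash function (CRC32) and an arbitrary reducer count; both are illustrative assumptions, not MaxCompute internals.

```python
import zlib

# Conceptual sketch of DISTRIBUTE BY region: each row is routed to a
# reducer chosen from a hash of the distribute key, so equal keys always
# end up on the same reducer. CRC32 and the reducer count are illustrative
# assumptions, not MaxCompute internals.
NUM_REDUCERS = 2

def route(row, num_reducers=NUM_REDUCERS):
    """Pick a reducer index from a deterministic hash of the key."""
    return zlib.crc32(row["region"].encode()) % num_reducers

rows = [{"region": "china"}, {"region": "shanghai"}, {"region": "china"}]
partitions = {}
for r in rows:
    partitions.setdefault(route(r), []).append(r)

# Every "china" row lands in the same partition; same for "shanghai"
# (possibly the same partition, depending on the hash values).
```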

SORT BY (sort_condition)

sort_condition is optional. In most cases, SORT BY is used with DISTRIBUTE BY. When you use SORT BY, you must comply with the following rules:

  • By default, data is sorted in ascending order. If you want to sort data in descending order, the DESC keyword is required.

  • If SORT BY is preceded by DISTRIBUTE BY, SORT BY sorts the result of DISTRIBUTE BY based on the values of a specified column.

    • The following statements query the values of the region and total_price columns from the sale_detail table, perform hash partitioning on the query results based on the values of the region column, and then sort the partitioning results in ascending order based on the values of the total_price column. Sample statement:

      -- Insert more data into the sale_detail table to demonstrate the sorting effect. 
      insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
      select region,total_price from sale_detail distribute by region sort by total_price;

      The following result is returned:

      +------------+-------------+
      | region     | total_price |
      +------------+-------------+
      | shanghai   | NULL        |
      | china      | 100.1       |
      | china      | 100.2       |
      | china      | 100.3       |
      | shanghai   | 100.4       |
      | shanghai   | 100.5       |
      +------------+-------------+
    • The following statement queries the values of the region and total_price columns from the sale_detail table, performs hash partitioning on the query results based on the values of the region column, and then sorts the partitioning results in descending order based on the values of the total_price column. Sample statement:

      select region,total_price from sale_detail distribute by region sort by total_price desc;

      The following result is returned:

      +------------+-------------+
      | region     | total_price |
      +------------+-------------+
      | shanghai   | 100.5       |
      | shanghai   | 100.4       |
      | china      | 100.3       |
      | china      | 100.2       |
      | china      | 100.1       |
      | shanghai   | NULL        |
      +------------+-------------+
  • If SORT BY is not preceded by DISTRIBUTE BY, SORT BY sorts the data that is distributed to each reducer.

    This ensures that the output of each reducer is ordered, which increases the storage compression ratio. If data is filtered during reading, this also reduces the amount of data that is read from disks and improves the efficiency of subsequent global sorting. Sample statement:

    select region,total_price from sale_detail sort by total_price desc;

    The following result is returned:

    +------------+-------------+
    | region     | total_price |
    +------------+-------------+
    | china      | 100.3       |
    | china      | 100.2       |
    | china      | 100.1       |
    | shanghai   | 100.5       |
    | shanghai   | 100.4       |
    | shanghai   | NULL        |
    +------------+-------------+
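The difference between SORT BY and a global ORDER BY can be sketched as follows: each reducer sorts only its own partition, so the concatenated output is ordered within each group but not globally. The Python snippet below is a conceptual illustration with made-up rows.

```python
# Conceptual sketch of DISTRIBUTE BY region SORT BY total_price:
# each reducer sorts only its own partition, so the concatenated output
# is ordered within each region group but not globally ordered.
rows = [("china", 100.2), ("china", 100.5), ("shanghai", 100.1)]

# Step 1: distribute rows by the key (here, simply group by region).
partitions = {}
for region, price in rows:
    partitions.setdefault(region, []).append((region, price))

# Step 2: each reducer sorts its own partition independently.
output = []
for part in partitions.values():
    output.extend(sorted(part, key=lambda r: r[1]))

print(output)
# → [('china', 100.2), ('china', 100.5), ('shanghai', 100.1)]
```

Note that the result is sorted within each region group, but 100.1 appears last, so the output as a whole is not globally sorted.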
Note
  • In MaxCompute, ORDER BY, DISTRIBUTE BY, and SORT BY clauses are executed after the SELECT clause. Therefore, the columns in these clauses must be specified by the aliases of the output columns of the SELECT statement. Column aliases can contain Chinese characters.

  • ORDER BY cannot be used at the same time with DISTRIBUTE BY or SORT BY. Similarly, GROUP BY cannot be used at the same time with DISTRIBUTE BY or SORT BY.

LIMIT <number>

LIMIT <number> is optional. The number in LIMIT <number> is a constant that restricts the number of rows that are displayed. The value of number must be a 32-bit integer; the maximum value is 2,147,483,647.

Note

In a distributed query system, LIMIT <number> restricts only the number of rows that are displayed. The query still scans and filters the full dataset, so the amount of data that is processed is not reduced and computing costs are not lowered.

The following sections describe the limits of LIMIT <number> and how to work around these limits.

  • ORDER BY must be used with LIMIT <number>.

    ORDER BY sorts all data of a single node. By default, ORDER BY is used with LIMIT <number> to prevent a single node from processing large amounts of data. You can work around this limit by using the following methods:

    • To work around the limit for a project, run the setproject odps.sql.validate.orderby.limit=false; command.

    • To work around the limit for a session, commit and run the set odps.sql.validate.orderby.limit=false; command with the SQL statement that you want to execute.

      Note

      After you work around this limit, if a single node has large amounts of data to sort, more resources and time are consumed.

  • Limited rows are displayed.

    If you execute a SELECT statement without a LIMIT <number> clause or the number specified in the LIMIT <number> clause exceeds the maximum number (n) of rows that can be displayed, a maximum of n rows can be displayed.

    The maximum number of rows that can be displayed varies based on projects. You can use one of the following methods to control the maximum number:

    • If project data protection is disabled, modify the odpscmd config.ini file.

      Set use_instance_tunnel to true in the odpscmd config.ini file. If the instance_tunnel_max_record parameter is not configured, the number of rows that can be displayed is not limited. Otherwise, the number of rows that can be displayed is limited by the instance_tunnel_max_record parameter. The maximum value of the instance_tunnel_max_record parameter is 10000. For more information about InstanceTunnel, see Usage notes.

    • If project data protection is enabled, the number of rows that can be displayed is limited by the READ_TABLE_MAX_ROW parameter. The maximum value of this parameter is 10000.

    Note

    You can run the SHOW SecurityConfiguration; command to view the value of ProjectProtection. If ProjectProtection is set to true, you can determine whether to disable project data protection based on your business requirements. You can run the set ProjectProtection=false; command to disable project data protection. By default, ProjectProtection is set to false. For more information about project data protection, see Project data protection.

Window clause (window_clause)

For more information about the window clause, see Syntax.

Split size hint

The split size hint is optional. You can modify the split size to control the parallelism of subtasks and adjust the computing performance. Split sizes can be applied to tables. The unit of a split size is MB. The default value is 256 MB.

  • Scenarios

    • If a large number of subtasks are waiting for resources in a job but no resources can be allocated to the subtasks, you can increase the split size to reduce the parallelism of subtasks. This way, the time for starting and stopping subtasks is reduced.

    • If the parallelism of subtasks is low, such as hundreds of subtasks, and the current subtask does not return results within an expected period of time, check whether some resources in the resource pool are available. If sufficient resources are available, you can decrease the split size to increase the parallelism of subtasks. This can reduce the duration for which the job runs.

  • Precautions

    • If you use the split size hint for a clustered table on which bucketing operations are performed to optimize computing performance, the hint does not take effect.

    • You can change the value of the split size to a value that is a factor or multiple of 256 MB, such as 128 MB or 512 MB.

    • If data in a table is read multiple times in an SQL statement, the smallest split size is used for splitting. For example, the src table is read twice:

      • If one split size is set to 1 MB and the other split size is set to 10 MB, 1 MB is used for splitting.

      • If one split size is set to 1 MB and the other split size is not configured, 1 MB is used for splitting.

  • Sample statement:

    -- Set the split size to 1 MB. This setting indicates that a job is split into subtasks based on a size of 1 MB when data in the src table is read.
    select a.key from src a /*+split_size(1)*/ join src2 b on a.key=b.key;
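As a rough illustration of how the split size controls parallelism, the number of subtasks is approximately the table size divided by the split size. The following Python sketch uses this simplified model; real scheduling involves more factors than shown here.

```python
import math

# Simplified model: one subtask per split, so parallelism is roughly
# the table size divided by the split size. 256 MB is the documented
# default split size; real scheduling involves more factors.
def estimated_subtasks(table_size_mb, split_size_mb=256):
    return math.ceil(table_size_mb / split_size_mb)

# A 10 GB table at the default split size vs. with /*+split_size(1)*/:
print(estimated_subtasks(10 * 1024))     # → 40
print(estimated_subtasks(10 * 1024, 1))  # → 10240
```

Decreasing the split size increases parallelism (more, smaller subtasks); increasing it reduces parallelism and the overhead of starting and stopping subtasks, as described in the scenarios above.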