You can use window functions in MaxCompute SQL to flexibly analyze and process data.

Notice
  • Window functions can only be included in the SELECT clause.
  • A window function cannot contain nested window or aggregate functions.
  • Window functions cannot be used with aggregate functions of the same level.
  • A maximum of five window functions can be used in a MaxCompute SQL statement.

Syntax of window functions

The syntax of a window function is as follows:
window_func() over (partition by [col1, col2…]
[order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • PARTITION BY specifies partitioning columns.
  • Rows with the same partitioning column value are considered in the same window. A window can contain a maximum of 100 million rows of data. However, we recommend that the number of rows in a window not exceed 5 million. If the number of rows exceeds the upper limit, an error is returned. This limit applies to all built-in window functions.
    Note For example, PARTITION BY UID indicates that records with the same UID are considered in the same window. Therefore, the records with the same UID can contain a maximum of 100 million rows of data. However, we recommend that the number of rows in a window not exceed 5 million.
  • ORDER BY specifies the rule for sorting data in a window.
    Note If some values in ORDER BY are the same, the sorting result is unstable. To reduce randomness, keep each value in ORDER BY unique.
  • You can use ROWS in windowing_clause to specify a window range, as shown in the example after this list. Two forms are supported:
    • rows between x preceding|following and y preceding|following: indicates a window range from the xth row preceding or following the current row to the yth row preceding or following the current row.
    • rows x preceding|following: indicates a window range from the xth row preceding or following the current row to the current row.
    Note
    • x and y must be integer constants greater than or equal to 0. Their values range from 0 to 10000. 0 indicates the current row. You must specify ORDER BY before you use ROWS to specify a window range.
    • Only the following window functions support ROWS to specify a window range: AVG, COUNT, MAX, MIN, STDDEV, and SUM.
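
The following statement is a minimal sketch of a window range that is specified by using ROWS. It assumes a test_src table that contains the user_id and price columns, the same table that is used in the COUNT example below.
select user_id, price,
       sum(price) over (partition by user_id order by price rows 2 preceding) as moving_sum
from test_src;
-- For each row, moving_sum is the sum of price over the current row and the two rows that precede it (if they exist) in the same window.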

COUNT

The syntax, description, and example of the COUNT function are as follows:
  • Syntax
    Bigint count([distinct] expr) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the number of rows in which the value of expr is not NULL.

  • Parameters
    • expr: a value of any data type. If the value for a row is NULL, the row is not included in the calculation. If the distinct keyword is specified, this parameter indicates that only distinct values are counted.
    • partition by [col1, col2…]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the number of rows in which expr is not NULL in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the number of such rows from the starting row to the current row in the current window is returned.
    Note ORDER BY cannot be used if the distinct keyword is specified.
  • Return value

    Returns a value of the BIGINT type.

  • Example
    Assume that the test_src table contains the user_id column of the BIGINT type and the price column that is used in the following examples.
    select user_id,count(user_id) over (partition by user_id) as count
    from test_src;
    +---------+------------+
    | user_id |  count     |
    +---------+------------+
    | 1       | 3          |
    | 1       | 3          |
    | 1       | 3          |
    | 2       | 1          |
    | 3       | 1          |
    +---------+------------+    
    -- If ORDER BY is not specified, the number of non-NULL user_id values in the current window is returned.
    select user_id,count(user_id) over (partition by user_id order by user_id) as count
    from test_src;
    +---------+------------+
    | user_id | count      |
    +---------+------------+
    | 1       | 1          |      -- This row is the starting row of this window.
    | 1       | 2          |      -- Two records exist from the starting row to the current row. The number 2 is returned.
    | 1       | 3          |
    | 2       | 1          |
    | 3       | 1          |
    +---------+------------+
    -- If ORDER BY is specified, the number of rows from the starting row to the current row in the current window is returned.
If the ORDER BY column contains duplicate values, the processing method varies based on whether MaxCompute is compatible with Hive.
  • If MaxCompute is incompatible with Hive, the return value is the value of COUNT for each row.
    set odps.sql.hive.compatible=false;
    select user_id, price, count(price) over
      (partition by user_id order by price) as count from test_src;
    
    +------------+------------+------------+
    | user_id    | price      | count      |
    +------------+------------+------------+
    | 1          | 4.5        | 1          |  -- This row is the starting row of this window.
    | 1          | 5.5        | 2          |  -- The value of COUNT for the second row is 2.
    | 1          | 5.5        | 3          |  -- The value of COUNT for the third row is 3.
    | 1          | 6.5        | 4          |
    | 2          | NULL       | 0          |
    | 2          | 3.0        | 1          |
    | 3          | NULL       | 0          |
    | 3          | 4.0        | 1          |
    +------------+------------+------------+
  • If MaxCompute is compatible with Hive, the return value is the value of COUNT for the last row of the rows that have the same ORDER BY value.
    set odps.sql.hive.compatible=true;
    select user_id, price, count(price) over
      (partition by user_id order by price) as count from test_src;
    
    +------------+------------+------------+
    | user_id    | price      | count      |
    +------------+------------+------------+
    | 1          | 4.5        | 1          | -- This row is the starting row of this window.
    | 1          | 5.5        | 3          | -- The value of COUNT for the second row is the value of COUNT for the third row because the prices of the two rows are the same.
    | 1          | 5.5        | 3          | -- The value of COUNT for the third row is 3.
    | 1          | 6.5        | 4          |
    | 2          | NULL       | 0          |
    | 2          | 3.0        | 1          |
    | 3          | NULL       | 0          |
    | 3          | 4.0        | 1          |
    +------------+------------+------------+

AVG

The syntax, description, and example of the AVG function are as follows:
  • Syntax
    avg([distinct] expr) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the average value of input values.

  • Parameters
    • distinct: If the distinct keyword is specified, this parameter indicates that the average value of distinct values is calculated.
    • expr: a value of the DOUBLE or DECIMAL type.
      • If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If it is of another data type, an error is returned.
      • If the value for a row is NULL, the row is not included in the calculation.
      • The values of the BOOLEAN type are not included in the calculation.
    • partition by [col1, col2...] specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the average value of all values in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the average value from the starting row to the current row in the current window is returned.
    Note ORDER BY cannot be used if the distinct keyword is specified.
  • Return value

    Returns a value of the DOUBLE type.

If the ORDER BY column contains duplicate values, the processing method varies based on whether MaxCompute is compatible with Hive.
  • If MaxCompute is incompatible with Hive, the return value is the average value for each row.
    set odps.sql.hive.compatible=false;
    select user_id, price, avg(price) over
      (partition by user_id order by price) from test_src;
    
    +------------+------------+-------------------+
    | user_id    | price      | _c2               |
    +------------+------------+-------------------+
    | 1          | 4.5        | 4.5               |    -- This row is the starting row of this window.
    | 1          | 5.5        | 5.0               |    -- The return value is the average value of the values for the first and second rows.
    | 1          | 5.5        | 5.166666666666667 |    -- The return value is the average value of the values from the first row to the third row.
    | 1          | 6.5        | 5.5               |
    | 2          | NULL       | NULL              |
    | 2          | 3.0        | 3.0               |
    | 3          | NULL       | NULL              |
    | 3          | 4.0        | 4.0               |
    +------------+------------+-------------------+
  • If MaxCompute is compatible with Hive, the return value is the average value for the last row of the rows that have the same ORDER BY value.
    set odps.sql.hive.compatible=true;
    select user_id, price, avg(price) over
      (partition by user_id order by price) from test_src;
    
    +------------+------------+-------------------+
    | user_id    | price      | _c2               |
    +------------+------------+-------------------+
    | 1          | 4.5        | 4.5               |  -- This row is the starting row of this window.
    | 1          | 5.5        | 5.166666666666667 |  -- The return value is the average value of values from the first row to the third row.
    | 1          | 5.5        | 5.166666666666667 |  -- The return value is the average value of values from the first row to the third row.
    | 1          | 6.5        | 5.5               |
    | 2          | NULL       | NULL              |
    | 2          | 3.0        | 3.0               |
    | 3          | NULL       | NULL              |
    | 3          | 4.0        | 4.0               |
    +------------+------------+-------------------+

MAX

  • Syntax
    max([distinct] expr) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the maximum value of input values.

  • Parameters
    • expr: a value of any data type except BOOLEAN. If the value for a row is NULL, the row is not included in the calculation. If the distinct keyword is specified, the maximum value of distinct values is calculated. Specifying distinct does not change the result of MAX.
    • partition by [col1, col2…]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the maximum value of all values in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the maximum value of values from the starting row to the current row in the current window is returned.
    Note ORDER BY cannot be used if the distinct keyword is specified.
  • Return value

    Returns values of the same data type as expr.
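
  • Example
    The following statement is a minimal sketch. It assumes the test_src table (user_id and price columns) that is used in the COUNT example above.
    select user_id, price, max(price) over (partition by user_id) as max_price
    from test_src;
    -- For each row, max_price is the maximum price among the rows that have the same user_id.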

MIN

  • Syntax
    min([distinct] expr) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the minimum value of input values.

  • Parameters
    • expr: a value of any data type except BOOLEAN. If the value for a row is NULL, the row is not included in the calculation. If the distinct keyword is specified, the minimum value of distinct values is calculated. Specifying distinct does not change the result of MIN.
    • partition by [col1, col2…]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the minimum value of all values in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the minimum value of values from the starting row to the current row in the current window is returned.
    Note ORDER BY cannot be used if the distinct keyword is specified.
  • Return value

    Returns values of the same data type as expr.
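
  • Example
    The following statement is a minimal sketch. It assumes the test_src table (user_id and price columns) that is used in the COUNT example above.
    select user_id, price, min(price) over (partition by user_id) as min_price
    from test_src;
    -- For each row, min_price is the minimum price among the rows that have the same user_id.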

MEDIAN

  • Syntax
    Double median(Double number1,number2...) over(partition by [col1, col2…])
    Decimal median(Decimal number1,number2...) over(partition by [col1,col2…])
  • Description

    Calculates the median of the input values.

  • Parameters
    • number1, number2…: numbers of the DOUBLE or DECIMAL type. You can enter 1 to 255 numbers.
      • If the input value is of the DOUBLE type, it is converted into an array of the DOUBLE type for calculation by default.
      • If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If it is of another data type, an error is returned.
      • If the input value is NULL, NULL is returned.
    • partition by [col1, col2…]: specifies partitioning columns.
  • Return value

    Returns a value of the DOUBLE or DECIMAL type.
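
  • Example
    The following statement is a minimal sketch. It assumes the test_src table (user_id and price columns) that is used in the COUNT example above.
    select user_id, price, median(price) over (partition by user_id) as median_price
    from test_src;
    -- For each row, median_price is the median of price among the rows that have the same user_id.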

STDDEV

The syntax, description, and example of the STDDEV function are as follows:
  • Syntax
    Double stddev([distinct] expr) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
    Decimal stddev([distinct] expr) over(partition by [col1, col2…] 
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the population standard deviation.

  • Parameters
    • expr: a value of the DOUBLE or DECIMAL type.
      • If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If it is of another data type, an error is returned.
      • If the value for a row is NULL, the row is not included in the calculation.
      • If the distinct keyword is specified, this parameter indicates that the population standard deviation of distinct values is calculated.
      Note
      • ORDER BY cannot be used if the distinct keyword is specified.
      • STDDEV has the alias STDDEV_POP, which is used in the same way as STDDEV.
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the population standard deviation of the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the population standard deviation from the starting row to the current row in the current window is returned.
  • Return value

    If the input value is of the DECIMAL type, a value of the DECIMAL type is returned. Otherwise, a value of the DOUBLE type is returned.

  • Example
    select window, seq, stddev_pop('1\01') over (partition by window order by seq) from dual;
If the ORDER BY column contains duplicate values, the processing method varies based on whether MaxCompute is compatible with Hive.
  • If MaxCompute is incompatible with Hive, the return value is the value of STDDEV for each row.
    set odps.sql.hive.compatible=false;
    select user_id, price, stddev(price) over
      (partition by user_id order by price) from test_src;
    
    +------------+------------+--------------------+
    | user_id    | price      | _c2                |
    +------------+------------+--------------------+
    | 1          | 4.5        | 0.0                |  -- This row is the starting row of this window.
    | 1          | 5.5        | 0.5                |  -- The return value is the value of STDDEV for the first and second rows.
    | 1          | 5.5        | 0.4714045207910316 |  -- The return value is the value of STDDEV from the first row to the third row.
    | 1          | 6.5        | 0.7071067811865475 |
    | 2          | NULL       | NULL               |
    | 2          | 3.0        | 0.0                |
    | 3          | NULL       | NULL               |
    | 3          | 4.0        | 0.0                |
    +------------+------------+--------------------+
  • If MaxCompute is compatible with Hive, the return value is the value of STDDEV for the last row of the rows that have the same ORDER BY value.
    set odps.sql.hive.compatible=true;
    select user_id, price, stddev(price) over
      (partition by user_id order by price) from test_src;
    
    +------------+------------+--------------------+
    | user_id    | price      | _c2                |
    +------------+------------+--------------------+
    | 1          | 4.5        | 0.0                | -- This row is the starting row of this window.
    | 1          | 5.5        | 0.4714045207910316 | -- The return value is the value of STDDEV from the first row to the third row.
    | 1          | 5.5        | 0.4714045207910316 | -- The return value is the value of STDDEV from the first row to the third row.
    | 1          | 6.5        | 0.7071067811865475 |
    | 2          | NULL       | NULL               |
    | 2          | 3.0        | 0.0                |
    | 3          | NULL       | NULL               |
    | 3          | 4.0        | 0.0                |
    +------------+------------+--------------------+

STDDEV_SAMP

  • Syntax
    Double stddev_samp([distinct] expr) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
    Decimal stddev_samp([distinct] expr) over(partition by [col1, col2…] 
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the sample standard deviation.

  • Parameters
    • expr: a value of the DOUBLE or DECIMAL type.
      • If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If it is of another data type, an error is returned.
      • If the value for a row is NULL, the row is not included in the calculation.
      • If the distinct keyword is specified, this parameter indicates that the sample standard deviation of distinct values is calculated.
      Note ORDER BY cannot be used if the distinct keyword is specified.
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the sample standard deviation in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the sample standard deviation from the starting row to the current row in the current window is returned.
  • Return value

    If the input value is of the DECIMAL type, a value of the DECIMAL type is returned. Otherwise, a value of the DOUBLE type is returned.
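
  • Example
    The following statement is a minimal sketch. It assumes the test_src table (user_id and price columns) that is used in the STDDEV example above.
    select user_id, price, stddev_samp(price) over
      (partition by user_id order by price) as std_samp from test_src;
    -- Because ORDER BY is specified, each row returns the sample standard deviation of price from the starting row to the current row in the current window.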

SUM

The syntax, description, and example of the SUM function are as follows:
  • Syntax
    sum([distinct] expr) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the sum of input values.

  • Parameters
    • expr: a value of the DOUBLE, DECIMAL, or BIGINT type.
      • If the input value is of the STRING type, it is implicitly converted into the DOUBLE type before calculation. If it is of another data type, an error is returned.
      • If the value for a row is NULL, the row is not included in the calculation.
      • If the distinct keyword is specified, this parameter indicates that the sum of distinct values is calculated.
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the sum of expr values in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the sum of values from the starting row to the current row in the current window is returned.
    Note ORDER BY cannot be used if the distinct keyword is specified.
  • Return value
    • If the input value is of the BIGINT type, a value of the BIGINT type is returned.
    • If the input value is of the DECIMAL type, a value of the DECIMAL type is returned.
    • If the input value is of the DOUBLE or STRING type, a value of the DOUBLE type is returned.
If the ORDER BY column contains duplicate values, the processing method varies based on whether MaxCompute is compatible with Hive.
  • If MaxCompute is incompatible with Hive, the return values for rows that have the same ORDER BY value are different. Each row returns the cumulative sum up to that row.
    set odps.sql.hive.compatible=false;
    select user_id, price, sum(price) over
      (partition by user_id order by price) from test_src;
    
    +------------+------------+------------+
    | user_id    | price      | _c2        |
    +------------+------------+------------+
    | 1          | 4.5        | 4.5        |  -- This row is the starting row of this window.
    | 1          | 5.5        | 10.0       |  -- The return value is the sum of values for the first and second rows.
    | 1          | 5.5        | 15.5       |  -- The return value is the sum of values from the first row to the third row.
    | 1          | 6.5        | 22.0       |
    | 2          | NULL       | NULL       |
    | 2          | 3.0        | 3.0        |
    | 3          | NULL       | NULL       |
    | 3          | 4.0        | 4.0        |
    +------------+------------+------------+
  • If MaxCompute is compatible with Hive, the return values for rows that have the same ORDER BY value are the same. The return value is the sum of values up to the last row of the rows that have the same value.
    set odps.sql.hive.compatible=true;
    select user_id, price, sum(price) over
      (partition by user_id order by price) from test_src;
    +------------+------------+------------+
    | user_id    | price      | _c2        |
    +------------+------------+------------+
    | 1          | 4.5        | 4.5        |  -- This row is the starting row of this window.
    | 1          | 5.5        | 15.5       |  -- The return value is the sum of values from the first row to the third row.
    | 1          | 5.5        | 15.5       |  -- The return value is the sum of values from the first row to the third row.
    | 1          | 6.5        | 22.0       |
    | 2          | NULL       | NULL       |
    | 2          | 3.0        | 3.0        |
    | 3          | NULL       | NULL       |
    | 3          | 4.0        | 4.0        |
    +------------+------------+------------+

DENSE_RANK

  • Syntax
    Bigint dense_rank() over(partition by [col1, col2…]
    order by [col1[asc|desc], col2[asc|desc]…])
  • Description

    Calculates consecutive rankings. Rows that have the same value in the ORDER BY columns receive the same rank, and rank numbers are not skipped.

  • Parameters
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: specifies the field based on which data is ranked.
  • Return value

    Returns a value of the BIGINT type.

  • Example
    Assume that table emp has the following data:
    | empno | ename | job | mgr | hiredate| sal| comm | deptno |
    7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
    7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
    7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
    7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
    7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
    7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
    7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
    7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
    7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
    7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
    7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
    7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
    7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
    7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
    7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
    7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
    7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10
    Group employees by department, sort the employees in each group in descending order of sal, and then obtain the sequence numbers of employees in each group.
    -- deptno that specifies a department is used as a partitioning column. sal that specifies the salary is used as the value to be sorted in the returned result.
    SELECT deptno, ename, sal, DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS nums
        FROM emp;
    -- The result is as follows:
    +------------+-------+------------+------------+
    | deptno     | ename | sal        | nums       |
    +------------+-------+------------+------------+
    | 10         | JACCKA | 5000.0     | 1          |
    | 10         | KING  | 5000.0     | 1          |
    | 10         | CLARK | 2450.0     | 2          |
    | 10         | WELAN | 2450.0     | 2          |
    | 10         | TEBAGE | 1300.0     | 3          |
    | 10         | MILLER | 1300.0     | 3          |
    | 20         | SCOTT | 3000.0     | 1          |
    | 20         | FORD  | 3000.0     | 1          |
    | 20         | JONES | 2975.0     | 2          |
    | 20         | ADAMS | 1100.0     | 3          |
    | 20         | SMITH | 800.0      | 4          |
    | 30         | BLAKE | 2850.0     | 1          |
    | 30         | ALLEN | 1600.0     | 2          |
    | 30         | TURNER | 1500.0     | 3          |
    | 30         | MARTIN | 1250.0     | 4          |
    | 30         | WARD  | 1250.0     | 4          |
    | 30         | JAMES | 950.0      | 5          |
    +------------+-------+------------+------------+

RANK

  • Syntax
    Bigint rank() over(partition by [col1, col2…]
    order by [col1[asc|desc], col2[asc|desc]…])
  • Description

    Returns the ranking of each row in a window. Rows that have the same value in the ORDER BY columns receive the same rank, and the subsequent rank numbers are skipped accordingly. As a result, rank numbers may be discontinuous.

  • Parameters
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: specifies the field based on which data is ranked.
  • Return value

    Returns a value of the BIGINT type.

  • Example
    Assume that table emp has the following data:
    | empno | ename | job | mgr | hiredate| sal| comm | deptno |
    7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
    7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
    7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
    7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
    7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
    7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
    7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
    7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
    7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
    7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
    7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
    7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
    7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
    7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
    7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
    7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
    7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10
    Group employees by department, sort the employees in each group in descending order of sal, and then obtain the sequence numbers of employees in each group.
    -- deptno that specifies a department is used as a partitioning column. sal that specifies the salary is used as the value to be sorted in the returned result.
    SELECT deptno,ename,sal, RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS nums
    FROM emp;
    -- The result is as follows:
    +------------+-------+------------+------------+
    | deptno     | ename | sal        | nums       |
    +------------+-------+------------+------------+
    | 10         | JACCKA | 5000.0     | 1          |
    | 10         | KING  | 5000.0     | 1          |
    | 10         | CLARK | 2450.0     | 3          |
    | 10         | WELAN | 2450.0     | 3          |
    | 10         | TEBAGE | 1300.0     | 5          |
    | 10         | MILLER | 1300.0     | 5          |
    | 20         | SCOTT | 3000.0     | 1          |
    | 20         | FORD  | 3000.0     | 1          |
    | 20         | JONES | 2975.0     | 3          |
    | 20         | ADAMS | 1100.0     | 4          |
    | 20         | SMITH | 800.0      | 5          |
    | 30         | BLAKE | 2850.0     | 1          |
    | 30         | ALLEN | 1600.0     | 2          |
    | 30         | TURNER | 1500.0     | 3          |
    | 30         | MARTIN | 1250.0     | 4          |
    | 30         | WARD  | 1250.0     | 4          |
    | 30         | JAMES | 950.0      | 6          |
    +------------+-------+------------+------------+

LAG

  • Syntax
    lag(expr, Bigint offset, default) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]])
  • Description

    Returns the value for the row at a given offset preceding the current row. If the current row number is rn, the value of the row numbered rn - offset is returned.

  • Parameters
    • expr: a value of any data type.
    • offset: a constant of the BIGINT type. The value of the offset must be greater than 0. If the input value is of the STRING or DOUBLE type, it is implicitly converted into the BIGINT type before calculation.
    • default: the value that is returned if the row specified by offset does not exist in the window. This parameter is a constant. The default value is NULL.
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: specifies the field based on which data is ranked.
  • Return value

    Returns a value of the same data type as expr.

  • Example
    select seq, lag(seq+100, 1) over (partition by window order by seq) as r from sliding_window;
    -- The result is as follows:
    +------------+------------+
    | seq        | r          |
    +------------+------------+
    | 0          | NULL       |
    | 1          | 100        |
    | 2          | 101        |
    | 3          | 102        |
    | 4          | 103        |
    | 5          | 104        |
    | 6          | 105        |
    | 7          | 106        |
    | 8          | 107        |
    | 9          | 108        |
    +------------+------------+

LEAD

  • Syntax
    lead(expr, Bigint offset, default) over(partition by [col1, col2…]
    [order by [col1[asc|desc], col2[asc|desc]…]])
  • Description

    Retrieves the value for the row at a given offset following the current row. If the current row number is rn, the value of the row numbered rn + offset is returned.

  • Parameters
    • expr: a value of any data type.
    • offset: a constant of the BIGINT type. The value of the offset must be greater than 0. If the input value is of the STRING or DOUBLE type, it is implicitly converted into the BIGINT type before calculation.
    • default: the value that is returned if the row specified by offset does not exist in the window. This parameter is a constant. The default value is NULL.
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: specifies the field based on which data is ranked.
  • Return value

    Returns a value of the same data type as expr.

  • Example
    select c_Double_a,c_String_b,c_int_a,lead(c_int_a,1) over(partition by c_Double_a order by c_String_b) from dual;
    select c_String_a,c_time_b,c_Double_a,lead(c_Double_a,1) over(partition by c_String_a order by c_time_b) from dual;
    select c_String_in_fact_num,c_String_a,c_int_a,lead(c_int_a) over(partition by c_String_in_fact_num order by c_String_a) from dual;

PERCENT_RANK

  • Syntax
    percent_rank() over(partition by [col1, col2…]
    order by [col1[asc|desc], col2[asc|desc]…])
  • Description

    Calculates the relative ranking of a row in a group of data.

  • Parameters
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: specifies the field based on which data is ranked.
  • Return value

    Returns a value of the DOUBLE type. Value range: [0, 1]. The relative ranking is calculated by using the following formula: (Rank - 1)/(Number of rows - 1).
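
  • Example
    The following statement is a minimal sketch. It assumes the emp table that is used in the DENSE_RANK example above.
    SELECT deptno, ename, sal,
           PERCENT_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS pct_rank
    FROM emp;
    -- Within each department, pct_rank equals (Rank - 1)/(Number of rows in the department - 1). The row or rows that are ranked first return 0.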

ROW_NUMBER

  • Syntax
    row_number() over(partition by [col1, col2…]
    order by [col1[asc|desc], col2[asc|desc]…])
  • Description

    Calculates the row number, starting from 1.

  • Parameters
    • partition by [col1, col2..]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: specifies the field based on which data is ranked.
  • Return value

    Returns a value of the BIGINT type.

  • Example
    Assume that table emp has the following data:
    | empno | ename | job | mgr | hiredate| sal| comm | deptno |
    7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
    7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
    7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
    7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
    7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
    7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
    7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
    7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
    7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
    7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
    7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
    7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
    7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
    7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
    7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
    7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
    7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10
    Group employees by department, sort the employees in each group in descending order of sal, and then obtain the sequence numbers of employees in each group.
    SELECT deptno,ename,sal,ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC) AS nums -- deptno that specifies a department is used as a partitioning column. sal that specifies the salary is used as the value to be sorted in the returned result.
        FROM emp;
    -- The result is as follows:
    +------------+-------+------------+------------+
    | deptno     | ename | sal        | nums       |
    +------------+-------+------------+------------+
    | 10         | JACCKA | 5000.0     | 1          |
    | 10         | KING  | 5000.0     | 2          |
    | 10         | CLARK | 2450.0     | 3          |
    | 10         | WELAN | 2450.0     | 4          |
    | 10         | TEBAGE | 1300.0     | 5          |
    | 10         | MILLER | 1300.0     | 6          |
    | 20         | SCOTT | 3000.0     | 1          |
    | 20         | FORD  | 3000.0     | 2          |
    | 20         | JONES | 2975.0     | 3          |
    | 20         | ADAMS | 1100.0     | 4          |
    | 20         | SMITH | 800.0      | 5          |
    | 30         | BLAKE | 2850.0     | 1          |
    | 30         | ALLEN | 1600.0     | 2          |
    | 30         | TURNER | 1500.0     | 3          |
    | 30         | MARTIN | 1250.0     | 4          |
    | 30         | WARD  | 1250.0     | 5          |
    | 30         | JAMES | 950.0      | 6          |
    +------------+-------+------------+------------+

CLUSTER_SAMPLE

  • Syntax
    boolean cluster_sample([Bigint x, Bigint y])
    over(partition by [col1, col2..])
  • Description

    Performs group sampling on data in a window.

  • Parameters
    • x: a constant of the BIGINT type. The value of x must be greater than or equal to 1. If y is specified, x indicates that a window is divided into x portions. Otherwise, x indicates that x rows are extracted from the window, and True is returned for the extracted rows. If x is NULL, NULL is returned.
    • y: a constant of the BIGINT type. The value of y must be greater than or equal to 1 and less than or equal to x. y indicates that y of the x portions are extracted from the window, and True is returned for the rows in the extracted portions. If y is NULL, NULL is returned.
    • partition by [col1, col2]: specifies partitioning columns.
  • Return value

    Returns a value of the BOOLEAN type.

  • Example
    Assume that the test_tbl table contains the key and value columns. key is the grouping field, and its values are groupa and groupb. value contains the numeric values to be sampled, as shown in the following data:
    +------------+--------------------+
    | key        | value              |
    +------------+--------------------+
    | groupa     | -1.34764165478145  |
    | groupa     | 0.740212609046718  |
    | groupa     | 0.167537127858695  |
    | groupa     | 0.630314566185241  |
    | groupa     | 0.0112401388646925 |
    | groupa     | 0.199165745875297  |
    | groupa     | -0.320543343353587 |
    | groupa     | -0.273930924365012 |
    | groupa     | 0.386177958942063  |
    | groupa     | -1.09209976687047  |
    | groupb     | -1.10847690938643  |
    | groupb     | -0.725703978381499 |
    | groupb     | 1.05064697475759   |
    | groupb     | 0.135751224393789  |
    | groupb     | 2.13313102040396   |
    | groupb     | -1.11828960785008  |
    | groupb     | -0.849235511508911 |
    | groupb     | 1.27913806620453   |
    | groupb     | -0.330817716670401 |
    | groupb     | -0.300156896191195 |
    | groupb     | 2.4704244205196    |
    | groupb     | -1.28051882084434  |
    +------------+--------------------+
    If you want to extract a sample of 10% of the values in each group, execute the following MaxCompute SQL statement:
    select key, value
        from (
            select key, value, cluster_sample(10, 1) over(partition by key) as flag
            from test_tbl
            ) sub
        where flag = true;
    -- The result is as follows:
    +-----+------------+
    | key | value      |
    +-----+------------+
    | groupa | 0.167537127858695 |
    | groupb | 0.135751224393789 |
    +-----+------------+

CUME_DIST

  • Syntax
    cume_dist() over(partition by col1[, col2…] order by col1 [asc|desc][, col2[asc|desc]…])
  • Description

    Calculates the cumulative distribution. The cumulative distribution is the ratio of rows whose values are greater than or equal to the current value to all rows in a group.

  • Parameters

    The ORDER BY column specifies the values to be compared.

  • Return value

    Returns the ratio of rows whose values are greater than or equal to the current value to all rows in a group.

  • Example
    Assume that table emp has the following data:
    | empno | ename | job | mgr | hiredate| sal| comm | deptno |
    7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
    7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
    7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
    7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
    7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
    7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
    7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
    7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
    7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
    7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
    7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
    7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
    7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
    7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
    7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
    7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
    7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10
    Group all employees by department and obtain the cumulative distribution of sal for each group.
    SELECT deptno
    , ename
    , sal
    , concat(round(cume_dist() OVER(PARTITION BY deptno ORDER BY sal desc)*100,2),'%') as cume_dist
    FROM emp;
    -- The result is as follows:
    +------------+-------+------------+-----------+
    | deptno     | ename | sal        | cume_dist |
    +------------+-------+------------+-----------+
    | 10         | JACCKA | 5000.0     | 33.33%    |
    | 10         | KING  | 5000.0     | 33.33%    |
    | 10         | CLARK | 2450.0     | 66.67%    |
    | 10         | WELAN | 2450.0     | 66.67%    |
    | 10         | TEBAGE | 1300.0     | 100.0%    |
    | 10         | MILLER | 1300.0     | 100.0%    |
    | 20         | SCOTT | 3000.0     | 40.0%     |
    | 20         | FORD  | 3000.0     | 40.0%     |
    | 20         | JONES | 2975.0     | 60.0%     |
    | 20         | ADAMS | 1100.0     | 80.0%     |
    | 20         | SMITH | 800.0      | 100.0%    |
    | 30         | BLAKE | 2850.0     | 16.67%    |
    | 30         | ALLEN | 1600.0     | 33.33%    |
    | 30         | TURNER | 1500.0     | 50.0%     |
    | 30         | MARTIN | 1250.0     | 83.33%    |
    | 30         | WARD  | 1250.0     | 83.33%    |
    | 30         | JAMES | 950.0      | 100.0%    |
    +------------+-------+------------+-----------+

FIRST_VALUE

The syntax, description, and example of the FIRST_VALUE function are as follows:
  • Syntax
    first_value(expr) over(partition by col1[, col2…]
            [order by col1 [asc|desc][, col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the first value of input values.

  • Parameters
    • expr: a value of any basic data type.
    • partition by col1[, col2…]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the value of expr of the starting row in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the value of expr of the starting row in the current window is returned.
  • Return value

    Returns a value of the same data type as expr.

  • Example
    select user_id, price, first_value(price) over
      (partition by user_id) as first_value from test_src;
    
    +------------+------------+-------------+
    | user_id    | price      | first_value |
    +------------+------------+-------------+
    | 1          | 5.5        | 5.5         | -- This row is the starting row of this window.
    | 1          | 4.5        | 5.5         |
    | 1          | 6.5        | 5.5         |
    | 1          | 5.5        | 5.5         |
    | 2          | NULL       | NULL        | -- This row is the starting row of this window.
    | 2          | 3.0        | NULL        |
    | 3          | 4.0        | 4.0         | -- This row is the starting row of this window.
    | 3          | NULL       | 4.0         |
    +------------+------------+-------------+
    -- If ORDER BY is not specified, rows from the first row to the last row belong to the current window. The value of the starting row in the current window is returned.
    
    
    select user_id, price, first_value(price) over
      (partition by user_id order by price) as first_value from test_src;
    
    +------------+------------+-------------+
    | user_id    | price      | first_value |
    +------------+------------+-------------+
    | 1          | 4.5        | 4.5         | -- This row is the starting row of this window.
    | 1          | 5.5        | 4.5         |
    | 1          | 5.5        | 4.5         |
    | 1          | 6.5        | 4.5         |
    | 2          | NULL       | NULL        | -- This row is the starting row of this window.
    | 2          | 3.0        | NULL        |
    | 3          | NULL       | NULL        | -- This row is the starting row of this window.
    | 3          | 4.0        | NULL        |
    +------------+------------+-------------+
    -- If ORDER BY is specified, rows from the first row to the current row belong to the current window. The value of the starting row in the current window is returned.

LAST_VALUE

The syntax, description, and example of the LAST_VALUE function are as follows:
  • Syntax
    last_value(expr) over(partition by col1[, col2…]
            [order by col1 [asc|desc][, col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the last value of input values.

  • Parameters
    • expr: a value of any basic data type.
    • partition by col1[, col2…]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the value of expr of the last row in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the value of expr of the current row in the current window is returned.
  • Return value

    Returns a value of the same data type as expr.

  • Example
    select user_id, price, last_value(price) over
      (partition by user_id) as last_value from test_src;
    
    +------------+------------+------------+
    | user_id    | price      | last_value |
    +------------+------------+------------+
    | 1          | 5.5        | 5.5        |
    | 1          | 4.5        | 5.5        |
    | 1          | 6.5        | 5.5        |
    | 1          | 5.5        | 5.5        | -- This row is the last row in this window.
    | 2          | NULL       | 3.0        |
    | 2          | 3.0        | 3.0        | -- This row is the last row in this window.
    | 3          | 4.0        | NULL       |
    | 3          | NULL       | NULL       | -- This row is the last row in this window.
    +------------+------------+------------+
    -- If ORDER BY is not specified, the rows from the first row to the last row belong to the current window, and the value of the last row in the current window is returned.
    
    
    select user_id, price, last_value(price) over
      (partition by user_id order by price) as last_value from test_src;
    
    +------------+------------+------------+
    | user_id    | price      | last_value |
    +------------+------------+------------+
    | 1          | 4.5        | 4.5        | -- This row is the current row in the current window.
    | 1          | 5.5        | 5.5        | -- This row is the current row in the current window.
    | 1          | 5.5        | 5.5        | -- This row is the current row in the current window.
    | 1          | 6.5        | 6.5        | -- This row is the current row in the current window.
    | 2          | NULL       | NULL       | -- This row is the current row in the current window.
    | 2          | 3.0        | 3.0        | -- This row is the current row in the current window.
    | 3          | NULL       | NULL       | -- This row is the current row in the current window.
    | 3          | 4.0        | 4.0        | -- This row is the current row in the current window.
    +------------+------------+------------+
    -- If ORDER BY is specified, rows from the first row to the current row belong to the current window. The value of the current row in the current window is returned.

NTH_VALUE

The syntax, description, and example of the NTH_VALUE function are as follows:
  • Syntax
    nth_value(expr, number [, skipNull]) over(partition by col1[, col2…]
        [order by col1 [asc|desc][, col2[asc|desc]…]] [windowing_clause])
  • Description

    Calculates the nth value of the input values. If the value of n exceeds the total number of rows in the window, NULL is returned.

  • Parameters
    • expr: a value of any basic data type.
    • number: an integer greater than or equal to 1.
    • skipNull: specifies whether to skip NULL values when the nth value is calculated. The default value is false.
    • partition by col1[, col2…]: specifies partitioning columns.
    • order by col1[asc|desc], col2[asc|desc]: If ORDER BY is not specified, the value of expr for the nth row in the current window is returned. If ORDER BY is specified, the returned results are sorted in the specified order and the value of expr for the nth row from the starting row to the current row in the current window is returned.
  • Return value

    Returns a value of the same data type as expr.

  • Example
    select user_id, price, nth_value(price, 2) over
      (partition by user_id) as nth_value from test_src;
    
    +------------+------------+------------+
    | user_id    | price      | nth_value  |
    +------------+------------+------------+
    | 1          | 5.5        | 4.5        |
    | 1          | 4.5        | 4.5        | -- This row is the second row in the current window.
    | 1          | 6.5        | 4.5        |
    | 1          | 5.5        | 4.5        |
    | 2          | NULL       | 3.0        |
    | 2          | 3.0        | 3.0        | -- This row is the second row in the current window.
    | 3          | 4.0        | NULL       |
    | 3          | NULL       | NULL       | -- This row is the second row in the current window.
    +------------+------------+------------+
    -- If ORDER BY is not specified, rows from the first row to the last row belong to the current window. The value of the second row is returned.
    
    select user_id, price, nth_value(price, 3) over
      (partition by user_id) as nth_value from test_src;
    +------------+------------+------------+
    | user_id    | price      | nth_value  |
    +------------+------------+------------+
    | 1          | 5.5        | 6.5        |
    | 1          | 4.5        | 6.5        |
    | 1          | 6.5        | 6.5        | -- This row is the third row in the current window.
    | 1          | 5.5        | 6.5        |
    | 2          | NULL       | NULL       |
    | 2          | 3.0        | NULL       |
    | 3          | 4.0        | NULL       |
    | 3          | NULL       | NULL       |
    +------------+------------+------------+
    -- If ORDER BY is not specified, rows from the first row to the last row belong to the current window. The value of the third row is returned.
    -- The second and third windows have only two rows.
    
    
    select user_id, price, nth_value(price, 2) over
      (partition by user_id order by price) as nth_value from test_src;
    +------------+------------+------------+
    | user_id    | price      | nth_value  |
    +------------+------------+------------+
    | 1          | 4.5        | NULL       | -- The current window has only one row. The second row exceeds the window length.
    | 1          | 5.5        | 5.5        |
    | 1          | 5.5        | 5.5        |
    | 1          | 6.5        | 5.5        |
    | 2          | NULL       | NULL       |
    | 2          | 3.0        | 3.0        |
    | 3          | NULL       | NULL       |
    | 3          | 4.0        | 4.0        |
    +------------+------------+------------+
    -- If ORDER BY is specified, rows from the first row to the current row belong to the current window. The value of the second row is returned.
    
    
    select user_id, price, nth_value(price, 1, true) over
      (partition by user_id) as nth_value from test_src;
    +------------+------------+------------+
    | user_id    | price      | nth_value  |
    +------------+------------+------------+
    | 1          | 5.5        | 5.5        |
    | 1          | 4.5        | 5.5        |
    | 1          | 6.5        | 5.5        |
    | 1          | 5.5        | 5.5        |
    | 2          | NULL       | 3.0        | -- The value of the first row is NULL, and therefore this row is skipped.
    | 2          | 3.0        | 3.0        |
    | 3          | 4.0        | 4.0        |
    | 3          | NULL       | 4.0        |
    +------------+------------+------------+
    -- If ORDER BY is not specified, rows from the first row to the last row belong to the current window.
    -- The value of the first row is returned. skipNull is set to true.