Aggregate functions group multiple input records together to form a single output record. You can use an aggregate function together with a GROUP BY clause of MaxCompute SQL. This topic describes the syntax, parameters, and examples of aggregate functions that are supported by MaxCompute SQL. It guides you through data development by using aggregate functions.

The following table describes the aggregate functions that are supported by MaxCompute SQL.

Function Description
AVG Returns the average value of a column.
COUNT Returns the number of records that match the specified criteria.
COUNT_IF Returns the number of records whose expr value is True.
MAX Returns the maximum value of a column.
MIN Returns the minimum value of a column.
MEDIAN Returns the median value of a column.
STDDEV Returns the population standard deviation of all input values.
STDDEV_SAMP Returns the sample standard deviation of all input values.
SUM Returns the sum of a column.
WM_CONCAT Concatenates strings with a specified delimiter.
ANY_VALUE Returns a non-deterministic value from a specified column.
APPROX_DISTINCT Returns the approximate number of distinct input values in a specified column.
ARG_MAX Returns the column value of the row that corresponds to the maximum value of a specified column.
ARG_MIN Returns the column value of the row that corresponds to the minimum value of a specified column.
COLLECT_LIST Aggregates the specified columns into an array.
COLLECT_SET Aggregates the specified columns into an array with only distinct values.
NUMERIC_HISTOGRAM Returns the approximate histogram of a specified column.
PERCENTILE_APPROX Returns approximate percentiles. This function applies to scenarios in which a large amount of data is calculated.

Syntax

Syntax of an aggregate function:

<aggregate_name>(<expression>[,...]) [within group (order by <col1>[,<col2>…])] [filter (where <where_condition>)]
  • <aggregate_name>(<expression>[,...]): a built-in aggregate function or user-defined aggregate functions (UDAF). The format of an aggregate function is based on its syntax.
  • within group (order by <col1>[,<col2>…]): If the syntax of an aggregate function includes this expression, the system automatically sorts the input data of <col1>[,<col2>…] in ascending order. To sort the input data in descending order, use the expression within group (order by <col1>[,<col2>…] [desc]).
    Before you use this expression, take note of the following limits:
    • You can use this expression only for WM_CONCAT, COLLECT_LIST, COLLECT_SET, and UDAFs.
    • If multiple aggregate functions of a SELECT statement include the expression within group (order by <col1>[,<col2>…]), order by <col1>[,<col2>…] must be the same for these functions.
    • If the parameters of an aggregate function include the DISTINCT keyword, columns with distinct values must be specified in the expression order by <col1>[,<col2>…].
    Examples:
    -- Example 1: Sort the input data in ascending order and return the output data. 
    select 
      x,
      wm_concat(',', y) within group (order by y)
    from values('k', 1),('k', 3),('k', 2) as t(x, y)
    group by x;
    The following result is returned: 
    +------------+------------+
    | x          | _c1        |
    +------------+------------+
    | k          | 1,2,3      |
    +------------+------------+
    
    -- Example 2: Sort the input data in descending order and return the output data. 
    select 
      x,
      wm_concat(',', y) within group (order by y desc)
    from values('k', 1),('k', 3),('k', 2) as t(x, y)
    group by x;
    -- The following result is returned. 
    +------------+------------+
    | x          | _c1        |
    +------------+------------+
    | k          | 3,2,1      |
    +------------+------------+
  • [filter (where <where_condition>)]: If an aggregate function includes this expression, the aggregate function processes only the data that meets the condition specified by <where_condition>. For more information about <where_condition>, see where_condition.
    Before you use this expression, take note of the following limits:
    • Only built-in aggregate functions support this expression. UDAFs do not support this expression.
    • count(*) does not support this expression. To add filter conditions to count(*), you can use the COUNT_IF function.
    • The COUNT_IF function does not support this expression.
    Examples:
    -- Example 1: Filter and aggregate data. 
    select
      sum(x),
      sum(x) filter (where y > 1),
      sum(x) filter (where y > 2)
      from values(null, 1),(1, 2),(2, 3),(3, null) as t(x, y);
    -- The following result is returned. 
    +------------+------------+------------+
    | _c0        | _c1        | _c2        |
    +------------+------------+------------+
    | 6          | 3          | 2          |
    +------------+------------+------------+
    
    -- Example 2: Use multiple aggregate functions to filter and aggregate data. 
    select
      count_if(x > 2),
      sum(x) filter (where y > 1),
      sum(x) filter (where y > 2)
      from values(null, 1),(1, 2),(2, 3),(3, null) as t(x, y);
    -- The following result is returned. 
    +------------+------------+------------+
    | _c0        | _c1        | _c2        |
    +------------+------------+------------+
    | 1          | 3          | 2          |
    +------------+------------+------------+

Sample data

This section provides sample source data for you to understand how to use the functions. Create a table named emp and insert the sample data into it. Example:
create table if not exists emp
   (empno bigint,
    ename string,
    job string,
    mgr bigint,
    hiredate datetime,
    sal bigint,
    comm bigint,
    deptno bigint);
tunnel upload emp.txt emp;
The emp.txt file contains the following sample data:
7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10

Filter expressions

  • Limits
    • Only built-in aggregate functions of MaxCompute support filter expressions. UDAFs do not support filter expressions.
    • COUNT(*) cannot be used with filter expressions. Use the COUNT_IF function with filter expressions.
  • Syntax
    <aggregate_name>(<expression>[,...]) [filter (where <where_condition>)]
  • Description

    All aggregate functions support filter expressions. If you specify a filter condition, only the row data that meets the filter condition can be passed to the related aggregate function for data processing.

  • Parameters
    • aggregate_name: required. The name of the aggregate function. Select an aggregate function that is described in this topic based on your requirements.
    • expression: required. The parameters of the aggregate function that you select. Specify this parameter based on the description of the aggregate function that you select.
    • where_condition: optional. The filter condition. For more information about where_condition, see where_condition.
  • Return value

    See the description for the return value of each aggregate function.

  • Sample statement:
    select sum(sal) filter (where deptno=10), sum(sal) filter (where deptno=20), sum(sal) filter (where deptno=30) from emp;
    The following result is returned:
    +------------+------------+------------+
    | _c0        | _c1        | _c2        |
    +------------+------------+------------+
    | 17500      | 10875      | 9400       |
    +------------+------------+------------+

AVG

  • Syntax
    double avg(double <colname>)
    decimal avg(decimal <colname>)
  • Description

    Returns the average value of a column.

  • Parameter

    colname: required. The name of a column, which must be of the DOUBLE or DECIMAL type. If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If the input value is of another data type, an error is returned. If the input value is NULL, the row that contains this value is not used for the calculation. The values of the BOOLEAN type are not used for the calculation.

  • Return value

    If the input value is of the DECIMAL type, a value of the DECIMAL type is returned. If the input value is of another valid data type, a value of the DOUBLE type is returned.

  • Examples
    • Example 1: Calculate the average salary (sal) of all employees. Sample statement:
      select avg(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 2222.0588235294117 |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the average salary (sal) of employees in each department. Sample statement:
      select deptno, avg(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 2916.6666666666665 |
      | 20         | 2175.0     |
      | 30         | 1566.6666666666667 |
      +------------+------------+

COUNT

  • Syntax
    bigint count([distinct|all] <value>)
  • Description

    Returns the number of records that match the specified criteria.

  • Parameters
    • distinct|all: optional. This parameter specifies whether to remove duplicates during the counting. The default value is all, which indicates that all records are counted. If this parameter is set to distinct, only records with distinct values are counted.
    • colname: required. The name of a column, which can be of any data type. If the input value is NULL, the row that contains this value is not used for the calculation. The value of colname can be an asterisk (*). count(*) indicates that the number of all rows is returned.
  • Return value

    This function returns a value of the BIGINT type.

  • Examples
    • Example 1: Calculate the total number of employees in all departments. Sample statement:
      select count(*) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 17         |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the number of employees in each department. Sample statement:
      select deptno, count(*) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 6          |
      | 20         | 5          |
      | 30         | 6          |
      +------------+------------+
    • Example 3: Remove duplicates when you calculate the number of departments. Sample statement:
      select count(distinct deptno) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 3          |
      +------------+

COUNT_IF

  • Syntax
    bigint count_if(boolean <expr>)
  • Description

    Returns the number of records whose expr value is True.

  • Parameter

    expr: required. A BOOLEAN expression. If the value of the expr parameter is False or the value of a specified column in expr is NULL, the row that contains this value is not used for the calculation.

  • Return value

    This function returns a value of the BIGINT type.

  • Sample statement:
    select count_if(sal > 1000), count_if(sal <=1000) from emp;
    The following result is returned:
    +------------+------------+
    | _c0        | _c1        |
    +------------+------------+
    | 15         | 2          |
    +------------+------------+

MAX

  • Syntax
    max(<colname>)
  • Description

    Returns the maximum value of a column.

  • Parameter

    colname: required. The name of a column, which can be of any data type. If the input value is NULL, the row that contains this value is not used for the calculation. The values of the BOOLEAN type are not used for the calculation.

  • Return value

    The type of the return value is the same as the type of the colname parameter.

  • Examples
    • Example 1: Calculate the highest salary (sal) of all employees. Sample statement:
      select max(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 5000       |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the highest salary of employees in each department. Sample statement:
      select deptno, max(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 5000       |
      | 20         | 3000       |
      | 30         | 2850       |
      +------------+------------+

MIN

  • Syntax
    min(<colname>)
  • Description

    Returns the minimum value of a column.

  • Parameter

    colname: required. The name of a column, which can be of any data type. If the input value is NULL, the row that contains this value is not used for the calculation. The values of the BOOLEAN type are not used for the calculation.

  • Return value

    The type of the return value is the same as the type of the colname parameter.

  • Examples
    • Example 1: Calculate the lowest salary (sal) of all employees. Sample statement:
      select min(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 800        |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the lowest salary of employees in each department. Sample statement:
      select deptno, min(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 1300       |
      | 20         | 800        |
      | 30         | 950        |
      +------------+------------+

MEDIAN

  • Syntax
    double median(double <colname>)
    decimal median(decimal <colname>)
  • Description

    Returns the median value of a column.

  • Parameter

    colname: required. The name of a column, which must be of the DOUBLE or DECIMAL type. If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If the input value is NULL, the row that contains this value is not used for the calculation. If the input value is of another data type, an error is returned.

  • Return value

    A value of the DOUBLE or DECIMAL type is returned.

  • Examples
    • Example 1: Calculate the median value of salaries (sal) of all employees. Sample statement:
      select median(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 1600.0     |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the median value of salaries of employees in each department. Sample statement:
      select deptno, median(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 2450.0     |
      | 20         | 2975.0     |
      | 30         | 1375.0     |
      +------------+------------+

STDDEV

  • Syntax
    double stddev(double <colname>)
    decimal stddev(decimal <colname>)
  • Description

    Returns the population standard deviation of all input values.

  • Parameter

    colname: required. A value of the DOUBLE or DECIMAL type. If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If the input value is NULL, the row that contains this value is not used for the calculation. If the input value is of another data type, an error is returned.

  • Return value

    A value of the DOUBLE or DECIMAL type is returned.

  • Examples
    • Example 1: Calculate the population standard deviation of salaries (sal) of all employees. Sample statement:
      select stddev(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 1262.7549932628976 |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the population standard deviation of salaries of employees in each department. Sample statement:
      select deptno, stddev(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 1546.1421524412158 |
      | 20         | 1004.7387720198718 |
      | 30         | 610.1001739241043 |
      +------------+------------+

STDDEV_SAMP

  • Syntax
    double stddev_samp(double <colname>)
    decimal stddev_samp(decimal <colname>)
  • Description

    Returns the sample standard deviation of all input values.

  • Parameter

    colname: required. The name of a column, which must be of the DOUBLE or DECIMAL type. If the input value is of the STRING or BIGINT type, it is implicitly converted into the DOUBLE type before calculation. If the input value is NULL, the row that contains this value is not used for the calculation. If the input value is of another data type, an error is returned.

  • Return value

    A value of the DOUBLE or DECIMAL type is returned.

  • Examples
    • Example 1: Calculate the sample standard deviation of salaries (sal) of all employees. Sample statement:
      select stddev_samp(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 1301.6180541247609 |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the sample standard deviation of salaries of employees in each department. Sample statement:
      select deptno, stddev_samp(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 1693.7138680032901 |
      | 20         | 1123.3320969330487 |
      | 30         | 668.3312551921141 |
      +------------+------------+

SUM

  • Syntax
    sum(<colname>)
  • Description

    Returns the sum of a column.

  • Parameter

    colname: required. A value of the DOUBLE, DECIMAL, or BIGINT type, which is the name of a column. If the input value is of the STRING type, it is implicitly converted into the DOUBLE type before calculation. If the input value is NULL, the row that contains this value is not used for the calculation. The values of the BOOLEAN type are not used for the calculation.

  • Return value

    If the input parameter is of the BIGINT type, a value of the BIGINT type is returned. If the input parameter is of the DOUBLE or STRING type, a value of the DOUBLE type is returned.

  • Examples
    • Example 1: Calculate the sum of salaries (sal) of all employees. Sample statement:
      select sum(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 37775      |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the sum of salaries (sal) of employees in each department. Sample statement:
      select deptno, sum(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 17500      |
      | 20         | 10875      |
      | 30         | 9400       |
      +------------+------------+

WM_CONCAT

  • Syntax
    string wm_concat(string <separator>, string <colname>)
  • Description

    Concatenates values in colname with a delimiter specified by separator.

  • Parameters
    • separator: required. The delimiter, which is a constant of the STRING type If the input value is of another data type or is not a constant, an error is returned.
    • colname: required. A value of the STRING type. If the input value is of the BIGINT, DOUBLE, or DATETIME type, it is implicitly converted into the STRING type before calculation. If the input value is of another data type, an error is returned.
  • Return value
    A value of the STRING type.
    Note If table_name in the select wm_concat(',', name) from table_name; statement is an empty set, NULL is returned.
  • Examples
    • Example 1: Concatenate the names (ename) of all employees. Sample statement:
      select wm_concat(',', ename) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | SMITH,ALLEN,WARD,JONES,MARTIN,BLAKE,CLARK,SCOTT,KING,TURNER,ADAMS,JAMES,FORD,MILLER,JACCKA,WELAN,TEBAGE |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and concatenate the names (ename) of employees in each department. Sample statement:
      select deptno, wm_concat(',', ename) from emp group by deptno order by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | CLARK,KING,MILLER,JACCKA,WELAN,TEBAGE |
      | 20         | SMITH,JONES,SCOTT,ADAMS,FORD |
      | 30         | ALLEN,WARD,MARTIN,BLAKE,TURNER,JAMES |
      +------------+------------+
    • Example 3: Use this function with GROUP BY to group all employees by department (deptno) and concatenate the salary values of the employees in each department after duplicates are removed. Sample statement:
      select deptno, wm_concat(distinct ',', sal) from emp group by deptno order by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 1300,2450,5000 |
      | 20         | 1100,2975,3000,800 |
      | 30         | 1250,1500,1600,2850,950 |
      +------------+------------+

Additional functions of MaxCompute V2.0

MaxCompute V2.0 provides additional functions. If the functions you use involve new data types, execute the following SET statement to enable these data types. The new data types include TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY.
  • Session level: To use a new data type, you must insert set odps.sql.type.system.odps2=true; before the SQL statement, and commit and execute them together.
  • Project level: The project owner can configure new data types for the project as required. It takes 10 to 15 minutes for the settings to take effect. Run the following command:
    setproject odps.sql.type.system.odps2=true;
    For more information about setproject, see Project operations. For the precautions you must take when you enable data types at the project level, see Data type editions.

ANY_VALUE

  • Syntax
    any_value(<colname>)
  • Description

    Returns a non-deterministic value from a specified column. This function is an additional function of MaxCompute V2.0.

  • Parameter

    colname: required. The name of a column, which can be of any data type. If the input value is NULL, the row that contains this value is not used for the calculation.

  • Return value

    The type of the return value is the same as the type of the colname parameter.

  • Examples
    • Example 1: Select one of the employees. Sample statement:
      select any_value(ename) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | SMITH      |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and select one employee from each group. Sample statement:
      select deptno, any_value(ename) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | CLARK      |
      | 20         | SMITH      |
      | 30         | ALLEN      |
      +------------+------------+

ARG_MAX

  • Syntax
    arg_max(<valueToMaximize>, <valueToReturn>)
  • Description

    Finds the row in which the value of valueToMaximize is included and returns the value of valueToReturn in the row. This function is an additional function of MaxCompute V2.0.

  • Parameters
    • valueToMaximize: required. A value of any data type.
    • valueToReturn: required. A value of any data type.
  • Return value

    The type of the return value is the same as the type of valueToReturn.

  • Examples
    • Example 1: Return the name of the employee with the highest salary. Sample statement:
      select arg_max(sal, ename) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | KING       |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and return the name of the employee with the highest salary in each department. Sample statement:
      select deptno, arg_max(sal, ename) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | KING       |
      | 20         | SCOTT      |
      | 30         | BLAKE      |
      +------------+------------+

ARG_MIN

  • Syntax
    arg_min(<valueToMinimize>, <valueToReturn>)
  • Description

    Finds the row in which the value of valueToMinimize is included and returns the value of valueToReturn in the row. This function is an additional function of MaxCompute V2.0.

  • Parameters
    • valueToMinimize: required. A value of any data type.
    • valueToReturn: required. A value of any data type.
  • Return value

    The type of the return value is the same as the type of valueToReturn.

  • Examples
    • Example 1: Return the name of the employee with the lowest salary. Sample statement:
      select arg_min(sal, ename) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | SMITH      |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and return the name of the employee with the lowest salary in each department. Sample statement:
      select deptno, arg_min(sal, ename) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | MILLER     |
      | 20         | SMITH      |
      | 30         | JAMES      |
      +------------+------------+

APPROX_DISTINCT

  • Syntax
    approx_distinct(<colname>)
  • Description

    Returns the approximate number of distinct input values in a specified column. This function is an additional function of MaxCompute V2.0.

  • Parameter

    colname: required. The column from which duplicates need to be removed.

  • Return value

    A value of the BIGINT type is returned. If all input values are empty, 0 is returned. This function produces a standard error of 5%.

  • Examples
    • Example 1: Calculate the approximate number of distinct values in the sal column. Sample statement:
      select approx_distinct(sal) from emp;
      The following result is returned:
      +-------------------+
      | numdistinctvalues |
      +-------------------+
      | 12                |
      +-------------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the approximate number of distinct values in the sal column. Sample statement:
      select deptno, approx_distinct(sal) from emp group by deptno;
      The following result is returned:
      +------------+-------------------+
      | deptno     | numdistinctvalues |
      +------------+-------------------+
      | 10         | 3                 |
      | 20         | 4                 |
      | 30         | 5                 |
      +------------+-------------------+

COLLECT_LIST

  • Syntax
    array collect_list(<colname>)
  • Description

    Aggregates the values specified by colname into an array. This function is an additional function of MaxCompute V2.0.

  • Parameter

    colname: required. The name of a column, which can be of any data type.

  • Return value

    A value of the ARRAY type is returned.

  • Examples
    • Example 1: Aggregate the salaries (sal) of all employees into an array. Sample statement:
      select collect_list(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | [800,1600,1250,2975,1250,2850,2450,3000,5000,1500,1100,950,3000,1300,5000,2450,1300] |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and aggregate the salaries (sal) of the employees in each department. Sample statement:
      select deptno, collect_list(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | [2450,5000,1300,5000,2450,1300] |
      | 20         | [800,2975,3000,1100,3000] |
      | 30         | [1600,1250,1250,2850,1500,950] |
      +------------+------------+
    • Example 3: Use this function with GROUP BY to group all employees by department (deptno) and aggregate the salaries (sal) of the employees in each department after duplicates are removed. Sample statement:
      select deptno, collect_list(distinct sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | [1300,2450,5000] |
      | 20         | [800,1100,2975,3000] |
      | 30         | [950,1250,1500,1600,2850] |
      +------------+------------+

COLLECT_SET

  • Syntax
    array collect_set(<colname>)
  • Description

    Aggregates the values specified by colname into an array with only distinct values. This function is an additional function of MaxCompute V2.0.

  • Parameter

    colname: required. The name of a column, which can be of any data type.

  • Return value

    A value of the ARRAY type is returned.

  • Examples
    • Example 1: Aggregate the salaries (sal) of all employees into an array with only distinct values. Sample statement:
      select collect_set(sal) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | [800,950,1100,1250,1300,1500,1600,2450,2850,2975,3000,5000] |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and aggregate the salaries (sal) of the employees in each department into an array with only distinct values. Sample statement:
      select deptno, collect_set(sal) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | [1300,2450,5000] |
      | 20         | [800,1100,2975,3000] |
      | 30         | [950,1250,1500,1600,2850] |
      +------------+------------+

NUMERIC_HISTOGRAM

  • Syntax
    map<double key, double value> numeric_histogram(bigint <buckets>, double <colname>)
  • Description

    Returns the approximate histogram of a specified column. This function is an additional function of MaxCompute V2.0.

  • Parameters
    • buckets: required. A value of the BIGINT type. This parameter specifies the maximum number of buckets in the column whose approximate histogram is returned.
    • colname: required. A value of the DOUBLE type. This parameter specifies the columns whose approximate histograms need to be calculated.
  • Return value

    A value of the map<double key, double value> type is returned. In the return value, key indicates the X-axis of the approximate histogram, and value indicates the approximate height of the Y-axis of the approximate histogram.

  • Example
    Return the approximate histogram of the sal column. Sample statement:
    select numeric_histogram(5, sal) from emp;
    The following result is returned:
    +------------+
    | _c0        |
    +------------+
    | {"1328.5714285714287":7.0,"2450.0":2.0,"5000.0":2.0,"875.0":2.0,"2956.25":4.0} |
    +------------+

PERCENTILE_APPROX

  • Syntax
    double percentile_approx (double <colname>, <p> [, <B>])) 
    -- Returns multiple approximate percentiles as an array. 
    array<double> percentile_approx (double <colname>, array(<p1> [, <p2>...]) [, <B>])
  • Description

    Returns approximate percentiles. This function applies to scenarios in which a large amount of data is calculated. This function sorts the values in a specified column in ascending order and then returns the approximate percentile of the column at the p th percentage. The value of percentile_approx starts from 1. For example, if a column contains values of 100, 200, and 300, the sequence numbers of these values are 1, 2, and 3. If you calculate the 0.6th percentile of the column, the PERCENTILE_APPROX function returns 180, which is calculated by using the following formula: 100 + (200 - 100) × 0.8. This value corresponds to 1.8, which is calculated by using the formula 3 × 0.6 and is between sequence numbers 1 and 2. This function is an additional function of MaxCompute V2.0.

    Note PERCENTILE_APPROX differs from PERCENTILE in the following aspects:
    • PERCENTILE_APPROX is used to calculate the approximate percentile, and PERCENTILE is used to calculate the exact percentile. If the amount of data is large, PERCENTILE may fail to run due to memory limits, but PERCENTILE_APPROX does not have this issue.
    • The implementation of PERCENTILE_APPROX is consistent with that of PERCENTILE_APPROX in Hive, but the calculation algorithm of PERCENTILE_APPROX is different from that of percentile. If the amount of data is small, the execution result of PERCENTILE_APPROX is different from that of PERCENTILE.
  • Parameters
    • colname: required. A value of the DOUBLE type, which indicates the name of a column.
    • p: required. The approximate percentile. Valid values: [0.0, 1.0].
    • B: the accuracy of the return value. A higher accuracy indicates a more accurate value. If you do not specify this parameter, 10000 is used. If the value of colname is less than the value of B, the exact percentile is returned.
  • Return value

    A value of the DOUBLE or ARRAY type is returned.

  • Examples
    • Example1: Calculate the 0.3rd percentile in the sal column. Sample statement:
      select percentile_approx(sal, 0.3) from emp;
      The following result is returned:
      +------------+
      | _c0        |
      +------------+
      | 1252.5     |
      +------------+
    • Example 2: Use this function with GROUP BY to group all employees by department (deptno) and calculate the 0.3rd percentile of employees in each department in the sal column. Sample statement:
      select deptno, percentile_approx(sal, 0.3) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | 1300.0     |
      | 20         | 950.0      |
      | 30         | 1070.0     |
      +------------+------------+
    • Example 3: Use this function with GROUP BY to group all employees by department (deptno) and calculate the 0.3rd, 0.5th, and 0.8th percentiles of employees in each department in the sal column. Sample statement:
      select deptno, percentile_approx(sal, array(0.3, 0.5, 0.8), 1000) from emp group by deptno;
      The following result is returned:
      +------------+------------+
      | deptno     | _c1        |
      +------------+------------+
      | 10         | [1300.0,1875.0,3470.000000000001] |
      | 20         | [950.0,2037.5,2987.5] |
      | 30         | [1070.0,1250.0,1580.0] |
      +------------+------------+