Aggregate functions group multiple input records together to form a single output record. The input and output records have a many-to-one relationship. Aggregate functions can be used together with the GROUP BY clause of SQL. This topic describes the syntax, features, parameters, and return value of each aggregate function that MaxCompute supports. This topic also provides examples that show how to use these aggregate functions.

COUNT

  • Syntax
    BIGINT COUNT([distinct|all] value)
  • Description

    This function counts the number of records that match the specified criteria.

  • Parameters
    • distinct|all: specifies whether to remove duplicate records during counting. The default value is all, which indicates that all records are counted. If this parameter is set to distinct, only records with distinct values are counted.
    • value: the column for the function to count the number of records. The column can be of any data type. If a value in the specified column is NULL, the row that contains this value is not included in the calculation. You can set value to an asterisk (*). This means that you can run count(*) to count all records.
  • Return value

    This function returns a value of the BIGINT type.

  • Examples
    -- Assume that the tbla table has a col1 column whose data type is BIGINT.
    +------+
    | COL1 |
    +------+
    | 1 |
    +------+
    | 2 |
    +------+
    | NULL |
    +------+
    SELECT COUNT(*) FROM tbla; -- The return value is 3.
    SELECT COUNT(col1) FROM tbla; -- The return value is 2.
    The COUNT function can be used together with the GROUP BY clause. Assume that the test_src table has two columns: key of the STRING type and value of the DOUBLE type.
    -- The test_src table contains the following data:
    +-----+-------+
    | key | value |
    +-----+-------+
    | a | 2.0 |
    +-----+-------+
    | a | 4.0 |
    +-----+-------+
    | b | 1.0 |
    +-----+-------+
    | b | 3.0 |
    +-----+-------+
    -- Execute the following statement:
    SELECT key, COUNT(value) AS count FROM test_src GROUP BY key;
    +-----+-------+
    | key | count |
    +-----+-------+
    | a | 2 |
    +-----+-------+
    | b | 2 |
    +-----+-------+
    -- The COUNT function aggregates values that have the same key. The usage of the following aggregate functions is similar to that of the COUNT function, and therefore is not described in detail in this topic.

AVG

  • Syntax
    DOUBLE AVG(DOUBLE value)
    DECIMAL AVG(DECIMAL value)
  • Description

    This function calculates the average value of a column.

  • Parameters

    value: the column whose average value you want to calculate. The column must be of the DOUBLE or DECIMAL type. If the specified column is of the STRING or BIGINT type, the values in the column are implicitly converted to those of the DOUBLE type before calculation. If the specified column is of another data type, an error is returned. If a value in the specified column is NULL, the row that contains this value is not included in the calculation. The values of the BOOLEAN type are not included in the calculation.

  • Return value

    If the specified column is of the DECIMAL type, this function returns a value of the DECIMAL type. If the specified column is of another valid data type, this function returns a value of the DOUBLE type.

  • Examples
    -- Assume that the tbla table has a value column whose data type is BIGINT.
    +-------+
    | value |
    +-------+
    | 1 |
    | 2 |
    | NULL |
    +-------+
    -- The average value of this column is: (1 + 2)/2 = 1.5.
    SELECT AVG(value) AS avg FROM tbla;
    +------+
    | avg |
    +------+
    | 1.5 |
    +------+

MAX

  • Syntax
    MAX(value)
  • Description

    This function calculates the maximum value of a column.

  • Parameters

    value: the column whose maximum value you want to calculate. The column can be of any data type. If a value in the specified column is NULL, the row that contains this value is not included in the calculation. The values of the BOOLEAN type are not included in the calculation.

  • Return value

    This function returns a value of the same type as value.

  • Examples
    -- Assume that the tbla table has a col1 column whose data type is BIGINT.
    +------+
    | col1 |
    +------+
    | 1 |
    +------+
    | 2 |
    +------+
    | NULL |
    +------+
    SELECT MAX(value) FROM tbla; -- The return value is 2.

MIN

  • Syntax
    MIN(value)
  • Description

    This function calculates the minimum value of a column.

  • Parameters

    value: the column whose minimum value you want to calculate. The column can be of any data type. If a value in the specified column is NULL, the row that contains this value is not included in the calculation. The values of the BOOLEAN type are not included in the calculation.

  • Examples
    -- Assume that the tbla table has a value column whose data type is BIGINT.
    +------+
    | value|
    +------+
    | 1 |
    +------+
    | 2 |
    +------+
    | NULL |
    +------+
    SELECT MIN(value) FROM tbla; -- The return value is 1.

MEDIAN

  • Syntax
    DOUBLE MEDIAN(DOUBLE number)
    DECIMAL MEDIAN(DECIMAL number)
  • Description

    This function calculates the median value of a column.

  • Parameters

    value: the column whose median value you want to calculate. The column must be of the DOUBLE or DECIMAL type. If the specified column is of the STRING or BIGINT type, the values in the column are implicitly converted to those of the DOUBLE type before calculation. If the specified column is of another data type, an error is returned.

  • Return value

    This function returns a value of the DOUBLE or DECIMAL type.

  • Examples
    -- Assume that the tbla table has a value column whose data type is BIGINT.
    +------+
    | value|
    +------+
    | 1 |
    +------+
    | 2 |
    +------+
    | 3 |
    +------+
    | 4 |
    +------+
    | 5 |
    +------+
    SELECT MEDIAN(value) FROM tbla; -- The return value is 3.0.

STDDEV

  • Syntax
    DOUBLE STDDEV(DOUBLE number)
    DECIMAL STDDEV(DECIMAL number)
  • Description

    This function calculates the population standard deviation of a column.

  • Parameters

    value: the column whose population standard deviation you want to calculate. The column must be of the DOUBLE or DECIMAL type. If the specified column is of the STRING or BIGINT type, the values in the column are implicitly converted to those of the DOUBLE type before calculation. If the specified column is of another data type, an error is returned.

  • Return value

    This function returns a value of the DOUBLE or DECIMAL type.

  • Examples
    -- Assume that the tbla table has a value column whose data type is BIGINT.
    +------+
    | value|
    +------+
    | 1 |
    +------+
    | 2 |
    +------+
    | 3 |
    +------+
    | 4 |
    +------+
    | 5 |
    +------+
    SELECT STDDEV(value) FROM tbla; -- The return value is 1.4142135623730951.

STDDEV_SAMP

  • Syntax
    DOUBLE STDDEV_SAMP(DOUBLE number)
    DECIMAL STDDEV_SAMP(DECIMAL number)
  • Description

    This function calculates the sample standard deviation of a column.

  • Parameters

    value: the column whose sample standard deviation you want to calculate. The column must be of the DOUBLE or DECIMAL type. If the specified column is of the STRING or BIGINT type, the values in the column are implicitly converted to those of the DOUBLE type before calculation. If the specified column is of another data type, an error is returned.

  • Return value

    This function returns a value of the DOUBLE or DECIMAL type.

  • Examples
    -- Assume that the tbla table has a value column whose data type is BIGINT.
    +------+
    | value|
    +------+
    | 1 |
    +------+
    | 2 |
    +------+
    | 3 |
    +------+
    | 4 |
    +------+
    | 5 |
    +------+
    SELECT STDDEV_SAMP(value) FROM tbla; -- The return value is 1.5811388300841898.

SUM

  • Syntax
    SUM(value)
  • Description

    This function calculates the sum of a column.

  • Parameters

    value: the column whose sum you want to calculate. The column must be of the DOUBLE, DECIMAL, or BIGINT type. If the specified column is of the STRING type, the values in the column are implicitly converted to those of the DOUBLE type before calculation. If a value in the specified column is NULL, the row that contains this value is not included in the calculation. The values of the BOOLEAN type are not included in the calculation.

  • Return value

    If the specified column is of the BIGINT type, this function returns a value of the BIGINT type. If the specified column is of the DOUBLE or STRING type, this function returns a value of the DOUBLE type.

  • Examples
    -- Assume that the tbla table has a value column whose data type is BIGINT.
    +------+
    | value|
    +------+
    | 1 |
    +------+
    | 2 |
    +------+
    | NULL |
    +------+
    SELECT SUM(value) FROM tbla; -- The return value is 3.

WM_CONCAT

  • Syntax
    STRING WM_CONCAT(STRING separator, STRING str)
  • Description

    This function uses the delimiter that is specified by separator to connect the values in str.

  • Parameters
    • separator: the delimiter, which is a constant of the STRING type. If you specify a constant of another type or a non-constant, an error is returned.
    • str: the values you want to connect, which must be of the STRING type. If the specified values are of the BIGINT, DOUBLE, or DATETIME type, they are implicitly converted to those of the STRING type before calculation. If other types of values are specified, an error is returned.
  • Return value
    This function returns a value of the STRING type.
    Note If test_src in the SELECT WM_CONCAT(',', name) FROM test_src; statement is an empty set, NULL is returned.
  • Examples:

    Connect values after data grouping and sorting.

    -- Create a table named test.
    CREATE TABLE test(id int , alphabet string);
    -- Insert data into the test table.
    INSERT INTO test VALUES (1,'a'),(1,'b'),(1,'c'),(2,'D'),(2,'E'),(2,'F');
    -- Group and sort values in the table based on the id column. Then, connect values in the same group.
    SELECT id,WM_CONCAT('',alphabet) FROM test GROUP BY id ORDER BY id LIMIT 100;
    +------------+------------+
    | id         | _c1        |
    +------------+------------+
    | 1          | abc        |
    | 2          | DEF        |
    +------------+------------+

COLLECT_LIST

  • Syntax
    ARRAY COLLECT_LIST(col)
  • Description

    This function aggregates the expressions that are specified by col in a given group into an array.

  • Parameters

    col: a table column, which can be of any data type.

  • Return value

    This function returns a value of the ARRAY type.

Note If you need to use new data types, such as TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY, in MaxCompute SQL, you must enable the new data types.
  • Session level: To enable new data types at the session level, you must insert set odps.sql.type.system.odps2=true; before the SQL statement. Then, commit and execute them together.
  • Project level: The owner of a project can enable new data types for the project by running the following command:
    setproject odps.sql.type.system.odps2=true;
    For more information about setproject, see Project operations. For the precautions you must take when you enable data types at the project level, see Date types.

COLLECT_SET

  • Syntax
    ARRAY COLLECT_SET(col)
  • Description

    This function aggregates the expressions that are specified by col in a given group into an array without duplicate elements.

  • Parameters

    col: a table column, which can be of any data type.

  • Return value

    This function returns a value of the ARRAY type.

Note If you need to use new data types, such as TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY, in MaxCompute SQL, you must enable the new data types.
  • Session level: To enable new data types at the session level, you must insert set odps.sql.type.system.odps2=true; before the SQL statement. Then, commit and execute them together.
  • Project level: The owner of a project can enable new data types for the project by running the following command:
    setproject odps.sql.type.system.odps2=true;
    For more information about setproject, see Project operations. For the precautions you must take when you enable data types at the project level, see Date types.

NUMERIC_HISTOGRAM

  • Syntax
    MAP<DOUBLE, DOUBLE> NUMERIC_HISTOGRAM(BIGINT buckets, DOUBLE value)
  • Description

    This function returns the approximate histogram of a specific column.

  • Parameters
    • buckets: the maximum number of buckets in the column whose approximate histogram is returned. The value must be of the BIGINT type.
    • value: the column whose approximate histogram you want to obtain. The column must be of the DOUBLE type.
  • Return value

    This function returns a value of the MAP<DOUBLE,DOUBLE> type. In the return value, a key indicates an x-coordinate, and its value indicates the height on the approximate histogram of the column.

PERCENTILE_APPROX

  • Syntax
    DOUBLE PERCENTILE_APPROX(DOUBLE col, p [, B])) 
    ARRAY<DOUBLE>  PERCENTILE_APPROX(DOUBLE col, ARRAY(p1 [, p2]...) [, B])
  • Description

    This function calculates an approximate percentile value and applies to the scenarios where large amounts of data are calculated. This function returns the approximate percentile value of the col column at the given percentage p. PERCENTILE_APPROX labels the data entries in a column from 1. For example, if a column contains the data entries of 100, 200, and 300, the sequence numbers of the entries are 1, 2, and 3. If you calculate the 0.5th percentile value of the column, the PERCENTILE_APPROX function returns 150, which corresponds to 1.5 (3 × 0.5) between sequence numbers of 1 and 2.

  • Parameters
    • p: the given percentage that ranges from 0.0 to 1.0. You can specify one percentage to return an approximate percentile value or multiple percentages to return an array that consists of percentile values.
    • B: the accuracy of the return value. A higher accuracy indicates a more accurate value. If you do not specify this parameter, 10000 is used. If the number of values in column col is less than B, the exact percentile value is returned.
  • Return value

    percentile_approx(double col, p [, B])) returns a single approximate percentile value. percentile_approx(double col, array(p1 [, p2]...)[, B]) returns an array that consists of multiple percentile values.

  • Examples
    SELECT PERCENTILE_APPROX(10.0, 0.5, 100); -- The return value is 10.0.
    SELECT PERCENTILE_APPROX(10.0, array(0.5, 0.4, 0.1), 100); -- The return value is [10.0,10.0,10.0].

APPROX_DISTINCT

  • Syntax
    APPROX_DISTINCT(value)
  • Description

    This function returns the approximate number of distinct input values.

  • Parameters

    value: the input values.

  • Return value

    This function returns a value of the BIGINT type. If all input values are NULL, this function returns zero. This function produces a standard error of 5%.

  • Examples
    SELECT key, APPROX_DISTINCT(value) 
    FROM  VALUES
        (1, 99),
        (1, 100),
        (2, 100),
        (2, 100),
        (3, NULL) 
    AS t(key,value) 
    GROUP BY key;
    -- The following result is returned:
    +------------+------------+
    | key        | _c1        |
    +------------+------------+
    | 1          | 2          |
    | 2          | 1          |
    | 3          | 1          |
    +------------+------------+

ANY_VALUE

  • Syntax
    ANY_VALUE(value)
  • Description

    This function returns a non-deterministic value from a specific column.

  • Parameters

    value: the column from which a single non-deterministic value you want to return. The value of this parameter can be of any data type. If a value in the specified column is NULL, the row that contains this value is not included in the calculation.

  • Return value

    This function returns a value of the same type as the input value.

  • Examples
    SELECT key, ANY_VALUE(value) 
    FROM VALUES 
        (1, 'value1'),
        (1, 'value2'),
        (2, 'value3'),
        (2, NULL),
        (3, NULL) 
    AS t(key, value) 
    GROUP BY key;
    -- The following result is returned:
    +------------+------------+
    | key        | _c1        |
    +------------+------------+
    | 1          | value1     |
    | 2          | value3     |
    | 3          | NULL       |
    +------------+------------+

ARG_MAX

  • Syntax
    ARG_MAX(valueToMaximize, valueToReturn)
  • Description

    This function finds the row where the maximum value of valueToMaximize resides and returns the value of valueToReturn in the row.

  • Parameters
    • valueToMaximize: the column that contains the values to compare. The value of this parameter can be of any data type.
    • valueToReturn: the column that contains the value to return. The value of this parameter can be of any data type.
  • Return value

    This function returns a value of the same type as valueToReturn.

  • Examples
    SELECT key, ARG_MAX(comp, value) 
    FROM VALUES 
        (1, 99,   'value1'),
        (1, 100,  'value2'),
        (2, 99,   'value3'),
        (2, 100,   NULL),
        (3, NULL, 'value4') 
    AS t(key,comp, value) 
    GROUP BY key;
    
    -- The following result is returned:
    +------------+------------+
    | key        | _c1        |
    +------------+------------+
    | 1          | value2     |
    | 2          | NULL       |
    | 3          | NULL       |
    +------------+------------+

ARG_MIN

  • Syntax
    ARG_MIN(valueToMinimize, valueToReturn)
  • Description

    This function finds the row where the minimum value of valueToMinimize resides and returns the value of valueToReturn in the row.

  • Parameters
    • valueToMinimize: the column that contains the values to compare. The value of this parameter can be of any data type.
    • valueToReturn: the column that contains the value to return. The value of this parameter can be of any data type.
  • Return value

    This function returns a value of the same type as valueToReturn.

  • Examples
    SELECT key, ARG_MIN(comp, value) 
    FROM VALUES 
      (1, 99, 'value1'),
      (1, 100,  'value2'),
      (2, 99, 'value3'),
      (2, 100, NULL),
      (3, NULL, 'value4') 
    AS t(key,comp, value) 
    GROUP BY key;
    
    -- The following result is returned:
    +------------+------------+
    | key        | _c1        |
    +------------+------------+
    | 1          | value1     |
    | 2          | value3     |
    | 3          | NULL       |
    +------------+------------+