Aggregate functions are used to calculate a result for a group of data.

Except for count(), count_if(), max_by(), min_by(), and approx_distinct(), all other aggregate functions ignore null values and return null for no input rows or when all values are null. For example, the sum() function returns null rather than 0, and the avg() function does not include null values in the count. The coalesce() function converts null into 0.

arbitrary(x) → [same as input]

Returns an arbitrary non-null value of x.

array_agg(x) → array<[same as input]>

Returns an array created from the input x elements.

avg(x) → double

Returns the average (arithmetic mean) of all input values.

bool_and(boolean) → boolean

Returns TRUE if every input value is TRUE. Otherwise, this function returns FALSE.

bool_or(boolean) → boolean

Returns TRUE if one input value is TRUE. Otherwise, this function returns FALSE.

checksum(x) → varbinary

Returns an order-insensitive checksum of the specified values.

count (*) → bigint

Returns the number of input rows.

count(x) → bigint

Returns the number of non-null input values.

count_if(x) → bigint

Returns the number of TRUE input values. This function is equivalent to count(CASE WHEN x THEN 1 END).

every(boolean) → boolean

The alias for bool_and().

geometric_mean(x) → double

Returns the geometric mean of all input values.

max_by(x, y) → [same as x]

Returns the first value of x associated with the maximum value of y over all associations.

max_by(x, y, n) → array<[same as x]>

Returns the first n values of x associated with the first n largest values of all input values of y in descending order.

min_by(x, y) → [same as x]

Returns the first value of x associated with the minimum value of y over all associations.

min_by(x, y, n) → array<[same as x]>

Returns the first n values of x associated with the first n smallest values of all input values of y in ascending order.

max(x) → [same as input]

Returns the maximum value of all input values.

max(x, n) → array<[same as x]>

Returns the first n largest values of all input values of x.

min(x) → [same as input]

Returns the minimum value of all input values.

min(x, n) → array<[same as x]>

Returns the first n smallest values of all input values of x.

sum(x) → [same as input]

Returns the sum of all input values.

Bitwise aggregate functions

bitwise_and_agg(x) → bigint

Returns the bitwise AND of all input values in two's complement representation.

bitwise_or_agg(x) → bigint

Returns the bitwise OR of all input values in two's complement representation.

Map aggregate functions

histogram(x) → map<K,bigint>

Returns a map that contains the number of times each input value occurs.

map_agg(key, value) → map<K,V>

Returns a map created from the input key-value pairs.

map_union(x<K, V>)map<K,V>

Returns the union of all input maps. If a key is found in multiple input maps, the value of the key in the resulting map comes from an arbitrary input map.

multimap_agg(key, value) → map<K,array<V>>

Returns a multimap generated from the input key-value pairs. Each key can be related to multiple values.

Approximate aggregate functions

approx_distinct(x) → bigint

Returns the approximate number of distinct input values. This function provides an approximate calculation of count(DISTINCT x). If all input values are null, 0 is returned.

This function provides a standard error of 2.3%, which is the standard deviation of the approximately normal error distribution over all possible sets. It does not ensure an upper threshold on the error for a specific input set.

approx_distinct(x, e) → bigint

Returns the approximate number of distinct input values. This function provides an approximate calculation of count(DISTINCT x). If all input values are null, 0 is returned.

This function produces a standard error of no more than e, which is the standard deviation of the approximately normal error distribution over all possible sets. It does not ensure an upper threshold on the error for a specific input set. The implementation of this function requires that e be in the range of [0.01150, 0.26000].

approx_percentile(x, percentage) → [same as x]

Returns the approximate percentile for all input values of x at the specified percentage. The value of percentage must range from 0 to 1 and must be constant for all input rows. Example: approx_percentile(x, 0.5).

approx_percentile(x, percentages) → array<[same as x]>

Returns the approximate percentile for all input values of x at each of the specified percentages. Each element of the percentages array must range from 0 to 1, and the array must be constant for all input rows.

approx_percentile(x, w, percentage) → [same as x]

Returns the approximate weighted percentile for all input values of x by using the per-item weight w at the percentage. The weight must contain at least one integer value. It is effectively a replication count for the value x in the percentile set. The value of percentage must range from 0 to 1 and must be constant for all input rows.

approx_percentile(x, w, percentage, accuracy) → [same as x]

Returns the approximate weighted percentile for all input values of x by using the per-item weight w at the percentage, with a maximum rank error of accuracy. The weight must contain at least one integer value. It is effectively a replication count for the value x in the percentile set. The value of percentage must range from 0 to 1 and must be constant for all input rows. The value of accuracy must be a constant between 0 and 1 for all input rows.

approx_percentile(x, w, percentages) → array<[same as x]>

Returns the approximate weighted percentile for all input values of x by using the per-item weight w at each of the given percentages specified in the array. The weight must contain at least one integer value. It is effectively a replication count for the value x in the percentile set. Each element of the percentages array must range from 0 to 1, and the array must be constant for all input rows.

numeric_histogram(buckets, value, weight) → map<double, double>

Computes the approximate histogram based on the number of buckets for all values with a per-item weight of weight. The following algorithm applies:

Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872.

The value of buckets must be of the BIGINT type. The values of value and weight must be numeric.

numeric_histogram(buckets, value) → map<double, double>

Computes the approximate histogram based on the number of buckets for all values. This function is equivalent to the variant of numeric_histogram() that takes a weight, with a per-item weight of 1.

Statistical aggregate functions

corr(y, x) → double

Returns the correlation coefficient of input values.

covar_pop(y, x) → double

Returns the population covariance of input values.

covar_samp(y, x) → double

Returns the sample covariance of input values.

regr_intercept(y, x) → double

Returns the linear regression intercept of input values. y is the dependent value. x is the independent value.

regr_slope(y, x) → double

Returns the linear regression slope of input values. y is the dependent value. x is the independent value.

stddev(x) → double

The alias for stddev_samp().

stddev_pop(x) → double

Returns the population standard deviation of all input values.

stddev_samp(x) → double

Returns the sample standard deviation of all input values.

variance(x) → double

The alias for var_samp().

var_pop(x) → double

Returns the population variance of all input values.

var_samp(x) → double

Returns the sample variance of all input values.

Characteristic aggregate functions

group_concat

A string aggregate function.

Syntax: GROUP_CONCAT(expr [,expr …])

By default, fields are separated by commas (,).