The functions in this section are used for computing datasets. Aggregation functions ignore null values, and will return a null value for empty or null values. For example, the sum() function returns a null value instead of 0, and the avg() function only counts non-null values. In addition, the coalesce function converts null to 0.
Note: The following aggregation functions are excluded: count(), count_if(), max_by(), min_by(), and approx_distinct().
General aggregation functions
arbitrary(x) → [same as input]
Returns an arbitrary non-null value of x.
array_agg(x) → array<[same as input]>
Returns an array from the input x elements.
avg(x) → double
Returns the average (arithmetic mean) of all input values.
bool_and(boolean) → boolean
Returns TRUE if every input value is TRUE, otherwise this function returns FALSE.
bool_or(boolean) → boolean
Returns TRUE if any input value is TRUE, otherwise this function returns FALSE.
checksum(x) → varbinary
Returns an order-insensitive checksum of the specified values.
count(*) → bigint
Returns the number of input rows.
count(x) → bigint
Returns the number of unique non-null input values.
count_if(x) → bigint
Returns the number of TRUE input values.
every(boolean) → boolean
One of the alias for bool_and().
geometric_mean(x) → double
Returns the geometric mean of all input values.
max_by(x, y) → [same as x]
Returns the value of x related to the maximum value of y over all input values.
max_by(x, y, n) → array<[same as x]>
Returns n values of x related to the n largest of all input values of y in descending order of y.
min_by(x, y) → [same as x]
Returns the value of x related to the minimum value of y over all input values.
min_by(x, y, n) → array<[same as x]>
Returns n values of x related to the n smallest of all input values of y in ascending order of y.
max(x) → [same as input]
Returns the maximum value of all input values.
max(x, n) → array<[same as x]>
Returns n largest values of all input values of x.
min(x) → [same as input]
Returns the minimum value of all input values.
min(x, n) → array<[same as x]>
Returns n smallest values of all input values of x.
sum(x) → [same as input]
Returns the sum of all input values.
Bitwise aggregate functions
bitwise_and_agg(x) → bigint
Returns AND of all input values in two’s complement representation.
bitwise_or_agg(x) → bigint
Returns OR of all input values in two’s complement representation.
Map aggregation functions
histogram(x) → map<K,bigint>
Returns a map containing the number of times each input value occurs.
map_agg(key, value) → map<K,V>
Returns a map created from the input key-value pairs.
map_union(x<K, V>)
→ map<K,V>
Returns the union of all input maps. If a key is found in multiple input maps, the resulting map’s key value will be from an arbitrary input map.
multimap_agg(key, value) → map<K,array<V>>
Returns a multimap generated from the input key-value pairs. Each key can be related to multiple values.
Approximate aggregation functions
approx_distinct(x) → bigint
Returns the rough number of distinct input values. This function provides a rough calculation of count (DISTINCT x). If all input values are null, this function returns 0.
This function provides a standard error of 2.3%, which is the standard deviation of the (approximation normal) error distribution over all possible sets. It does not guarantee an upper bound on the error for any specific input set.
approx_distinct(x, e) → bigint
Returns the rough number of distinct input values. This function provides a rough count of (DISTINCT x). If all input values are null, this function returns 0.
This function produces a standard error less than e, which is the standard deviation of the (roughly normal) error distribution over all possible sets. It does not ensure an upper bound on the error for any specific input set. Currently, this function requires e to be in the range of [0.01150, 0.26000].
approx_percentile(x, percentage) → [same as x]
Returns the rough percentile for all input values of x for the specified percentage. The percentage value must range between 0 to 1 and must be constant for all input rows. For example, approx_percentile(x, 0.5) calculates the rough percentile.
approx_percentile(x, percentage) → array<[same as x]>
Returns the rough percentile for all input values of x at each of the specified percentages. Each element of the percentages array must be between 0 and 1, and the array must be constant for all input rows.
approx_percentile(x, w, percentage) → [same as x]
Returns the roughly weighed percentile for all input values of x using the per-item weight w at the percentage p. The weight must be an integer greater or equal to 1. It is a replication count for the value x in the percentile set. The value of p must be between 0 and 1 and must be a constant for all input rows.
approx_percentile(x, w, percentage, accuracy) → [same as x]
Returns the roughly weighed percentile for all input values of x using the per-item weight w at the percentage p, with a maximum rank error of accuracy. The weight must be an integer value greater or equal to 1. It is a replication count for the value x in the percentile set. The value of p must be between 0 and 1 and must be constant for all input rows. Accuracy must be a value between 0 and 1, and it must be a constant for all input rows.
approx_percentile(x, w, percentages) → array<[same as x]>
Returns the roughly weighed percentile for all input values of x using the per-item weight w at each of the given percentages specified in the array. The weight must be an integer value of at least 1. It is effectively a replication count for the value x in the percentile set. Each element of the array must be between 0 and 1, and the array must be constant for all input rows.
numeric_histogram(buckets, value, weight) → map<double, double>
Computes an approximate histogram with the specified number of buckets out of total buckets for all values with a per-item weight w. The algorithm is based loosely on:
Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872.
Note: buckets must be a bigint, value, and weight must be numeric.
numeric_histogram(buckets, value) → map<double, double>
Computes an approximate histogram with the specified number of buckets out of the total buckets for all values. This function is equivalent to the variant of numeric_histogram() that takes a weight, with a per-item weight of 1.
Statistical aggregation functions
corr(y, x) → double
Returns the correlation coefficient of input values.
covar_pop(y, x) → double
Returns the population covariance of input values.
covar_samp(y, x) → double
Returns the sample covariance of input values.
regr_intercept(y, x) → double
Returns linear regression intercept of input values where y is the dependent value, and x is the independent value.
regr_slope(y, x) → double
Returns linear regression slope of input values where y is the dependent value, and x is the independent value.
stddev(x) → double
This is an alias for stddev_samp.
stddev_pop(x) → double
Returns the population standard deviation of all input values.
stddev_samp(x) → double
Returns the sample standard deviation of all input values.
variance(x) → double
One of the alias for var_samp().
var_pop(x) → double
Returns the population variance of all input values.
var_samp(x) → double
Returns the sample variance of all input values.
Characteristic aggregation functions
group_concat
This function concatenates strings from a group into a single string.
Currently, this function can only be used when the GROUP BY clause contains all table partition columns (dimension tables) that participate in computation during aggregation.
Syntax: GROUP_CONCAT(expr [,expr …])
Use SEPARATOR to specify the separator character after aggregation, which is a comma (,) by default.