This topic describes the aggregation operations that are supported by Alibaba Cloud MaxCompute SDK for Python (PyODPS) DataFrame, explains how to group and aggregate data, and shows how to write custom aggregations. PyODPS DataFrame also provides a HyperLogLog-based operation that estimates the number of distinct values in a column.

```
from odps.df import DataFrame

# Create a DataFrame from the MaxCompute table pyodps_iris.
iris = DataFrame(o.get_table('pyodps_iris'))
```

- Call the `describe` function to view the count, maximum, minimum, mean, and standard deviation of the numeric columns in the DataFrame.

  ```
  >>> print(iris.describe())
  ```

  The following result is returned:

  ```
      type  sepal_length  sepal_width  petal_length  petal_width
  0  count    150.000000   150.000000    150.000000   150.000000
  1   mean      5.843333     3.054000      3.758667     1.198667
  2    std      0.828066     0.433594      1.764420     0.763161
  3    min      4.300000     2.000000      1.000000     0.100000
  4    max      7.900000     4.400000      6.900000     2.500000
  ```

- Perform an aggregation on a single column.

  ```
  >>> iris.sepallength.max()
  ```

  The following result is returned:

  ```
  7.9
  ```

- To aggregate over the distinct values of a column, call the `unique` function before you call the corresponding aggregate function.

  ```
  >>> iris.name.unique().cat(sep=',')
  ```

  The following result is returned:

  ```
  u'Iris-setosa,Iris-versicolor,Iris-virginica'
  ```

- If all columns support the same aggregation operation, you can perform the aggregation on the entire DataFrame.

  ```
  >>> iris.exclude('category').mean()
  ```

  The following result is returned:

  ```
     sepal_length  sepal_width  petal_length  petal_width
  1      5.843333     3.054000      3.758667     1.198667
  ```

- Call the `count` function to calculate the total number of rows in the DataFrame.

  ```
  >>> iris.count()
  ```

  The following result is returned:

  ```
  150
  ```

| Aggregation | Description |
| --- | --- |
| `count` or `size` | Calculates the number of rows. |
| `unique` | Calculates the number of distinct values. |
| `min` | Calculates the minimum value. |
| `max` | Calculates the maximum value. |
| `sum` | Calculates the sum of the values. |
| `mean` | Calculates the mean value. |
| `median` | Calculates the median value. |
| `quantile(p)` | Calculates the p-quantile. This function returns an exact result only when it is applied to integers. |
| `var` | Calculates the variance. |
| `std` | Calculates the standard deviation. |
| `moment` | Calculates the Nth central moment or the Nth moment. |
| `skew` | Calculates the sample skewness. The result is an unbiased estimate. |
| `kurtosis` | Calculates the sample kurtosis. The result is an unbiased estimate. |
| `cat` | Concatenates strings with a separator. |
| `tolist` | Aggregates a column into a list. |

**Note** On both the MaxCompute and pandas backends, PyODPS DataFrame ignores null values in aggregation operations on columns. This behavior differs from that of pandas DataFrame and is consistent with SQL logic.
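The SQL-style null handling described in the note can be illustrated outside PyODPS. The following is a minimal plain-Python sketch; the `sql_avg` helper is hypothetical and is not part of PyODPS:

```python
def sql_avg(values):
    """SQL-style AVG: null (None) values are skipped, not propagated."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # AVG over all-null input is NULL in SQL
    return sum(non_null) / len(non_null)

# None is ignored, so the mean is (1.0 + 3.0) / 2 = 2.0
print(sql_avg([1.0, None, 3.0]))
```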

## Group and aggregate data

- PyODPS DataFrame provides the `groupby` function to group data. After the data is grouped, call the `agg` or `aggregate` function to aggregate the data. The result columns include the grouped columns and the aggregated columns.

  ```
  >>> iris.groupby('name').agg(iris.sepallength.max(), smin=iris.sepallength.min())
  ```

  The following result is returned:

  ```
                name  sepallength_max  smin
  0      Iris-setosa              5.8   4.3
  1  Iris-versicolor              7.0   4.9
  2   Iris-virginica              7.9   4.9
  ```

- PyODPS DataFrame provides the `value_counts` function, which groups the data by a specified column and sorts the groups in descending order by the number of rows in each group. The following examples produce the same result.

  - Call the `groupby` function.

    ```
    >>> iris.groupby('name').agg(count=iris.name.count()).sort('count', ascending=False).head(5)
    ```

    The following result is returned:

    ```
                  name  count
    0   Iris-virginica     50
    1  Iris-versicolor     50
    2      Iris-setosa     50
    ```

  - Call the `value_counts` function.

    ```
    >>> iris['name'].value_counts().head(5)
    ```

    The following result is returned:

    ```
                  name  count
    0   Iris-virginica     50
    1  Iris-versicolor     50
    2      Iris-setosa     50
    ```

- After the data is grouped, you can retrieve a single column by its name and aggregate it directly. In this case, you can apply only aggregate functions to the retrieved column.

  ```
  >>> iris.groupby('name').petallength.sum()
  ```

  The following result is returned:

  ```
     petallength_sum
  0             73.2
  1            213.0
  2            277.6
  ```

  ```
  >>> iris.groupby('name').agg(iris.petallength.notnull().sum())
  ```

  The following result is returned:

  ```
                name  petallength_sum
  0      Iris-setosa               50
  1  Iris-versicolor               50
  2   Iris-virginica               50
  ```

- You can also group data by a constant. This operation requires a Scalar to be initialized.

  ```
  >>> from odps.df import Scalar
  >>> iris.groupby(Scalar(1)).petallength.sum()
  ```

  The following result is returned:

  ```
     petallength_sum
  0            563.8
  ```
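The group-count-sort pattern that `value_counts` performs can be sketched in plain Python with `collections.Counter`; the sample data below is made up for illustration:

```python
from collections import Counter

# Group by value and count occurrences, then sort in descending order of
# count, which mirrors what value_counts does for a column.
names = (['Iris-setosa'] * 51
         + ['Iris-versicolor'] * 50
         + ['Iris-virginica'] * 50)
counts = Counter(names).most_common()
print(counts)  # [('Iris-setosa', 51), ('Iris-versicolor', 50), ('Iris-virginica', 50)]
```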

## Write custom aggregations

Use the `agg` or `aggregate` function to call custom aggregations on columns. A custom aggregation requires a class that provides the following methods:

- `buffer()`: Returns a mutable object, such as a list or a dict. The size of `buffer` must not grow with the amount of data.
- `__call__(buffer, *val)`: Aggregates values into `buffer`.
- `merge(buffer, pbuffer)`: Merges `pbuffer` into `buffer`.
- `getvalue(buffer)`: Returns the final value.

```
class Agg(object):
    # Computes the arithmetic mean: buffer holds [running sum, count].

    def buffer(self):
        return [0.0, 0]

    def __call__(self, buffer, val):
        buffer[0] += val
        buffer[1] += 1

    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]
        buffer[1] += pbuffer[1]

    def getvalue(self, buffer):
        if buffer[1] == 0:
            return 0.0
        return buffer[0] / buffer[1]
```

```
>>> iris.sepalwidth.agg(Agg)
```

The following result is returned:

```
3.0540000000000007
```
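To see how the four methods cooperate, the `Agg` class can be driven by hand in plain Python, simulating two partitions whose buffers are merged before the final value is read:

```python
class Agg(object):
    # Same mean aggregation as defined above: buffer holds [running sum, count].
    def buffer(self):
        return [0.0, 0]

    def __call__(self, buffer, val):
        buffer[0] += val
        buffer[1] += 1

    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]
        buffer[1] += pbuffer[1]

    def getvalue(self, buffer):
        if buffer[1] == 0:
            return 0.0
        return buffer[0] / buffer[1]

mean_agg = Agg()

# Each partition accumulates into its own buffer ...
buf1 = mean_agg.buffer()
for v in [3.5, 3.0]:
    mean_agg(buf1, v)

buf2 = mean_agg.buffer()
for v in [2.5]:
    mean_agg(buf2, v)

# ... then partial buffers are merged and the final value is produced.
mean_agg.merge(buf1, buf2)
print(mean_agg.getvalue(buf1))  # (3.5 + 3.0 + 2.5) / 3 = 3.0
```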

- If the data type of the output differs from that of the input, you must specify the output data type.

  ```
  >>> iris.sepalwidth.agg(Agg, 'float')
  ```

- You can use custom aggregations when you group and aggregate data.

  ```
  >>> iris.groupby('name').sepalwidth.agg(Agg)
  ```

  The following result is returned:

  ```
     sepalwidth_aggregation
  0                   3.418
  1                   2.770
  2                   2.974
  ```

- You can use the `agg` function to call a custom aggregation on multiple columns.

  ```
  class Agg(object):

      def buffer(self):
          return [0.0, 0.0]

      def __call__(self, buffer, val1, val2):
          buffer[0] += val1
          buffer[1] += val2

      def merge(self, buffer, pbuffer):
          buffer[0] += pbuffer[0]
          buffer[1] += pbuffer[1]

      def getvalue(self, buffer):
          if buffer[1] == 0:
              return 0.0
          return buffer[0] / buffer[1]
  ```

  ```
  >>> from odps.df import agg
  >>> to_agg = agg([iris.sepalwidth, iris.sepallength], Agg, rtype='float')  # Call the custom aggregation on two columns.
  >>> iris.groupby('name').agg(val=to_agg)
  ```

  The following result is returned:

  ```
                name       val
  0      Iris-setosa  0.682781
  1  Iris-versicolor  0.466644
  2   Iris-virginica  0.451427
  ```

- To call an existing user-defined aggregate function (UDAF) in MaxCompute, you only need to specify the function name.

  ```
  >>> iris.groupby('name').agg(iris.sepalwidth.agg('your_func'))  # Aggregate the values in a single column.
  >>> to_agg = agg([iris.sepalwidth, iris.sepallength], 'your_func', rtype='float')
  >>> iris.groupby('name').agg(to_agg.rename('val'))  # Aggregate the values in multiple columns.
  ```

**Note** Due to the limits of Python user-defined functions (UDFs), you cannot specify the LIST or DICT type as the input or output data type of a custom aggregation.

## HyperLogLog counting

PyODPS DataFrame provides the `hll_count` operation, which uses the HyperLogLog algorithm to estimate the number of distinct values in a column. Because the result is an estimate rather than an exact count, this operation is most useful when you process large amounts of data and an approximate number of distinct values is sufficient.

```
>>> import numpy as np
>>> import pandas as pd
>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
```

The following result is returned:

```
63270
```

```
>>> df.a.nunique()
```

The following result is returned:

```
63250
```

**Note** The `splitter` parameter is used to split the column values before the number of distinct values is calculated.
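For intuition, the estimator behind `hll_count` can be sketched in plain Python. The following toy HyperLogLog is illustrative only and is not the implementation that PyODPS uses:

```python
import hashlib
import math

class ToyHLL(object):
    """A minimal HyperLogLog sketch for illustration only."""

    def __init__(self, p=10):
        self.p = p               # number of index bits
        self.m = 1 << p          # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Deterministic 64-bit hash of the value.
        h = int.from_bytes(hashlib.sha1(str(value).encode('utf-8')).digest()[:8], 'big')
        idx = h >> (64 - self.p)                       # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Linear-counting correction for small cardinalities.
            est = self.m * math.log(float(self.m) / zeros)
        return int(est)

hll = ToyHLL()
for i in range(10000):
    hll.add(i)
print(hll.count())  # close to 10000, within a few percent
```

The memory use is fixed at `m` small registers regardless of how many values are added, which is why this kind of estimator scales to very large inputs.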