Window functions compute aggregations over partitions of a DataFrame while preserving the original row structure. Unlike groupby aggregations that collapse each group into a single summary row, window functions return one result per input row — making them the right tool for rankings, running totals, and row-offset lookups.
The PyODPS DataFrame API supports window functions through groupby combined with mutate or column selection.
Usage examples
The examples below use the pyodps_iris table. For details on the dataset, see Use DataFrame to process data.
Group by name and apply window functions with mutate
mutate applies window functions across each group and returns a DataFrame with the same number of rows as the original. The following example groups by name and computes a cumulative sum and a row number within each group.
iris = DataFrame(o.get_table('pyodps_iris'))
grouped = iris.groupby('name')
print(grouped.mutate(grouped.sepallength.cumsum(), grouped.sort('sepallength').row_number()).head(10))Output:
name sepallength_sum row_number
0 Iris-setosa 250.3 1
1 Iris-setosa 250.3 2
2 Iris-setosa 250.3 3
3 Iris-setosa 250.3 4
4 Iris-setosa 250.3 5
5 Iris-setosa 250.3 6
6 Iris-setosa 250.3 7
7 Iris-setosa 250.3 8
8 Iris-setosa 250.3 9
9 Iris-setosa 250.3 10Select columns using a window function
Pass a window function expression directly as a column in a selection to add computed columns alongside existing ones.
iris = DataFrame(o.get_table('pyodps_iris'))
print(iris['name', 'sepallength', iris.groupby('name').sort('sepallength').sepallength.cumcount()].head(5))Output:
name sepallength sepallength_count
0 Iris-setosa 4.3 1
1 Iris-setosa 4.4 2
2 Iris-setosa 4.4 3
3 Iris-setosa 4.4 4
4 Iris-setosa 4.5 5Aggregate by scalar
Grouping by Scalar(1) treats the entire DataFrame as a single group, applying the window function globally. The processing logic is identical to group-based aggregation.
from odps.df import Scalar
iris = DataFrame(o.get_table('pyodps_iris'))
iris.groupby(Scalar(1)).sort('sepallength').sepallength.cumcount()Supported window functions
Ranking functions
| Function | Description |
|---|---|
rank | Returns the rank of a row within its group. Rows with equal values receive the same rank, leaving gaps in the sequence. |
dense_rank | Returns the dense rank of a row within its group. Equal values receive the same rank with no gaps. |
percent_rank | Returns the relative rank of a row within its group. |
row_number | Returns the sequential row number within the group, starting from 1. |
Analytic functions
| Function | Description |
|---|---|
lag | Returns the value from the row that is offset rows before the current row. If no such row exists, returns default. |
lead | Returns the value from the row that is offset rows after the current row. If no such row exists, returns default. |
nth_value | Returns the Nth value within the group. |
cume_dist | Returns the proportion of rows in the group whose values are less than or equal to the current row's value. |
qcut | Divides the group into N bins based on the data sequence and returns the bin number for each row. If data cannot be evenly distributed, extra rows go into the first bin by default. |
Cumulative functions
| Function | Description |
|---|---|
cumsum | Calculates the cumulative sum within each group. |
cummean | Calculates the cumulative mean within each group. |
cummedian | Calculates the cumulative median within each group. |
cumstd | Calculates the cumulative standard deviation within each group. |
cummax | Calculates the cumulative maximum within each group. |
cummin | Calculates the cumulative minimum within each group. |
cumcount | Calculates the cumulative count within each group. |
Parameters
Ranking functions
rank, dense_rank, percent_rank, and row_number accept the following parameters.
| Parameter | Default | Description |
|---|---|---|
sort | "" (empty string) | The column to sort by within the group. |
ascending | True | Sort order. Set to False for descending order. |
lag and lead
In addition to the ranking function parameters, lag and lead accept:
| Parameter | Default | Description |
|---|---|---|
offset | — | Number of rows before (for lag) or after (for lead) the current row to retrieve. |
default | — | Value to return when the target row does not exist (for example, looking before the first row or after the last row). |
Cumulative functions
In addition to the ranking function parameters, cumsum, cummean, cummedian, cumstd, cummax, cummin, and cumcount accept:
| Parameter | Default | Description |
|---|---|---|
unique | False | Set to True to deduplicate values before computing. |
preceding | — | The start of the window frame (number of preceding rows to include). |
following | — | The end of the window frame (number of following rows to include). |