Perform aggregations with the Python SDK (sum, count, grouping statistics, percentiles) - Tablestore

The aggregation feature of a search index allows you to perform value calculations and grouping statistics on query results. Value aggregations include minimum value, maximum value, sum, average value, count, distinct count, and percentile statistics. Grouping aggregations support grouping by field value, range, geographical location, filter, histogram, and composite aggregation. You can combine multiple aggregations in a single request.

Note

The aggregation feature is supported in Python SDK V5.2.1 and later.

Features

The following tables describe the aggregation features.

Value aggregation

Feature	Description
Minimum value	Returns the minimum value from a field, similar to the SQL `min` function.
Maximum value	Returns the maximum value from a field, similar to the SQL `max` function.
Sum	Returns the sum of values in a numeric field, similar to the SQL `sum` function.
Average value	Returns the average of values in a numeric field, similar to the SQL `avg` function.
Count	Returns the number of values for a specified field or the total number of rows in a search index, similar to the SQL `count` function.
Distinct count	Returns the number of distinct values for a specified field, similar to the SQL `count(distinct)` function.
Percentile statistics	Calculates the percentile distribution of a dataset. For example, during routine system operations and maintenance (O&M), you can use this feature to analyze the distribution of request latencies, such as the P25, P50, P90, and P99 values.

Grouping aggregation

Feature	Description
Group by field value	Groups query results based on the values of a field. Rows with the same field value are placed in the same group. The response returns each group's value and its row count. Note When the number of groups is large, the results of this aggregation may be approximate.
Group by range	Groups query results based on specified ranges of a field. Rows with field values that fall into the same range are placed in the same group. The response returns the number of items in each range.
Group by geographical location	Groups query results based on the distance from a central point. Rows within the same distance range are placed in the same group. The response returns the number of items in each range.
Group by filter	Groups query results based on a set of filters. The response returns the number of documents that match each filter. The results are returned in the same order that the filters are specified.
Histogram aggregation	Groups query results based on a specified numeric interval. Rows with field values that fall into the same interval are placed in the same group. The response returns each group's value and its corresponding count.
Composite aggregation	Groups query results by multiple fields, similar to `GROUP BY column1, column2, ...` in SQL. You can use pagination to retrieve all the grouped results.

Value aggregation

Minimum value

Returns the minimum value from a field, similar to the SQL min function.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to aggregate. The data type of the field must be Long, Double, or Date.
missing	A default value for documents missing the specified field. If this parameter is set, its value is used in the aggregation. Otherwise, documents missing the field are ignored.

Example

Calculate the minimum score among people who are 18 years old.

query = TermQuery('age', 18)
agg = Min('score', name='min')

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, aggs=[agg]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print("name: %s, value: %s" % (agg_result.name, str(agg_result.value)))

Maximum value

Returns the maximum value from a field, similar to the SQL max function.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to aggregate. The data type of the field must be Long, Double, or Date.
missing	A default value for documents missing the specified field. If this parameter is set, its value is used in the aggregation. Otherwise, documents missing the field are ignored.

Example

Calculate the maximum score among people who are 18 years old. If a person does not have a score, a default value of 0 is used.

query = TermQuery('age', 18)
agg = Max('score', missing_value=0, name='max')

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, aggs=[agg]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print("name: %s, value: %s" % (agg_result.name, str(agg_result.value)))

Sum

Returns the sum of values in a numeric field, similar to the SQL sum function.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to aggregate. The data type of the field must be Long or Double.
missing	A default value for documents missing the specified field. If this parameter is set, its value is used in the aggregation. Otherwise, documents missing the field are ignored.

Example

Calculate the sum of all scores for people who are 18 years old.

query = TermQuery('age', 18)
agg = Sum('score', name='sum')

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, aggs=[agg]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print("name: %s, value: %s" % (agg_result.name, str(agg_result.value)))

Average value

Returns the average of values in a numeric field, similar to the SQL avg function.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to aggregate. The data type of the field must be Long, Double, or Date.
missing	A default value for documents missing the specified field. If this parameter is set, its value is used in the aggregation. Otherwise, documents missing the field are ignored.

Example

Calculate the average score for people who are 18 years old.

query = TermQuery('age', 18)
agg = Avg('score', name='avg')

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, aggs=[agg]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print("name: %s, value: %s" % (agg_result.name, str(agg_result.value)))

Count

Returns the number of values for a specified field or the total number of rows in a search index, similar to the SQL count function.

Note

You can count rows that match a query in two ways:

To get the total number of matching rows, set the get_total_count query parameter to True.
To count only the rows where a specific field has a value (useful for sparse columns), use the count aggregation on that field.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to be used for the aggregation. The data type of the field must be Long, Double, Boolean, Keyword, `GeoPoint`, or Date.

Example

Count the number of people who are 18 years old and have a recorded exam score.

query = TermQuery('age', 18)
agg = Count('score', name='count')

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, aggs=[agg]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print("name: %s, value: %s" % (agg_result.name, str(agg_result.value)))

Distinct count

Returns the number of distinct values for a specified field, similar to the SQL count(distinct) function.

Note

The result of a distinct count is an approximation.

When the distinct count is less than 10,000, the result is nearly exact.
When the distinct count reaches 100 million, the error rate is approximately 2%.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to be used for the aggregation. The data type of the field must be Long, Double, Boolean, Keyword, `GeoPoint`, or Date.
missing	A default value for documents missing the specified field. If this parameter is set, its value is used in the aggregation. Otherwise, documents missing the field are ignored.

Example

Count the number of distinct names among people who are 18 years old.

query = TermQuery('age', 18)
agg = DistinctCount('name', name='distinct_name')

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, aggs=[agg]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print("name: %s, value: %s" % (agg_result.name, str(agg_result.value)))

Percentile statistics

Calculates the percentile distribution of a dataset. For example, during routine system O&M, you can use this feature to analyze the distribution of request latencies, such as the P25, P50, P90, and P99 values.

Note

Percentile statistics are an approximation. Accuracy is higher for percentiles at the extremes of the distribution (such as 1% and 99%) than for those in the middle (such as 50%).

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to be used for the aggregation. The data type of the field must be Long, Double, or Date.
percentiles	A list of percentile values, such as `[50, 90, 99]`. You can specify one or more values.
missing_value	A default value for documents missing the specified field. If this parameter is set, its value is used in the aggregation. Otherwise, documents missing the field are ignored.

Example

query = TermQuery('product', '10010')
agg = Percentiles('latency', percentiles_list=[50, 90, 95])

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, aggs=[agg]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print("name: %s" % agg_result.name)
    for item in agg_result.value:
        print("  percentile: %s, value: %s" % (str(item.key), str(item.value)))

Grouping aggregation

Group by field value

Groups query results based on the values of a field. Rows with the same field value are placed in the same group. The response returns each group's value and its row count.

Note

When the number of groups is large, the results of this aggregation may be approximate.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to group by. The data type of the field must be Long, Double, Boolean, Keyword, or Date.
size	The number of groups to return. The default value is 10. The maximum value is 2,000. If the number of groups exceeds 2,000, only the first 2,000 groups are returned.
group_by_sort	The sorting rule for the returned groups. By default, groups are sorted by row count in descending order. If multiple rules are set, they are applied sequentially. You can sort by group key in lexicographical order, by row count, or by the result of a sub-aggregation in ascending or descending order.
sub_aggs and sub_group_bys	Sub-aggregations and sub-group-bys that perform further analysis on the grouped data. For example, after grouping products by category, you can add `Max` and `Min` sub-aggregations to find the highest and lowest price for each category.

Example 1

Group people who are 18 years old by score, and retrieve the 10 most common scores and the number of people for each score.

query = TermQuery('age', 18)
group_by = GroupByField('score', size=10)

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  key: %s, count: %d" % (item.key, item.row_count))

Example 2

Group people who are 18 years old by score, and retrieve the two least common scores and the number of people for each score.

group_by = GroupByField('score', size=2, group_by_sort=[RowCountSort(sort_order=SortOrder.ASC)])

search_response = client.search(table_name, index_name,
    SearchQuery(TermQuery('age', 18), limit=0, get_total_count=True, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  key: %s, count: %d" % (item.key, item.row_count))

Example 3

Group people who are 18 years old by score, retrieve the two most common scores and their counts, and then get information about the top three individuals in each group, sorted by primary key.

sort = RowCountSort(sort_order=SortOrder.DESC)
sub_agg = [TopRows(limit=3, sort=Sort([PrimaryKeySort(sort_order=SortOrder.DESC)]), name='top_rows')]
group_by = GroupByField('score', size=2, group_by_sort=[sort], sub_aggs=sub_agg)

search_response = client.search(table_name, index_name,
    SearchQuery(TermQuery('age', 18), limit=0, get_total_count=True, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  key: %s, count: %d" % (item.key, item.row_count))
        for sub_agg in item.sub_aggs:
            print("    sub_agg: %s" % sub_agg.name)
            for entry in sub_agg.value:
                print("      value: %s" % str(entry))

Example 4

Group people who are 18 years old by score and gender.

sort = RowCountSort(sort_order=SortOrder.ASC)
sub_group = GroupByField('sex', size=10, group_by_sort=[sort])
group_by = GroupByField('score', size=10, group_by_sort=[sort], sub_group_bys=[sub_group])

search_response = client.search(table_name, index_name,
    SearchQuery(TermQuery('age', 18), limit=0, get_total_count=True, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  key: %s, count: %d" % (item.key, item.row_count))
        for sub_group in item.sub_group_bys:
            print("    sub_group: %s" % sub_group.name)
            for sub_item in sub_group.items:
                print("      key: %s, count: %s" % (str(sub_item.key), str(sub_item.row_count)))

Group by range

Groups query results based on specified ranges of a field. Rows with field values that fall into the same range are placed in the same group. The response returns the count of items in each range.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to group by. The data type of the field must be Long or Double.
range[double_from, double_to)	The range for grouping. The start and end values can be set to represent negative and positive infinity.
sub_aggs and sub_group_bys	Sub-aggregations and sub-group-bys that perform further analysis on the grouped data. For example, to find which province contributes the most to sales within a certain sales volume range, you can group by sales volume and then add a `GroupByField` as a sub-group-by for the province.

Example

Count the number of people who are 18 years old and have scores in the [80, 90) and [90, 100) ranges.

query = TermQuery('age', 18)
group_by = GroupByRange(field_name='score', ranges=[(80, 90), (90, 100)])

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  range: %.1f~%.1f, count: %d" % (item.range_from, item.range_to, item.row_count))

Group by geographical location

Groups query results based on the distance from a central point. Rows within the same distance range are placed in the same group. The response returns the count of items in each range.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field must be of the `GeoPoint` type.
origin(double lat, double lon)	The latitude and longitude of the central point. `lat` is the latitude and `lon` is the longitude.
range[double_from, double_to)	The range for grouping, in meters. The start and end values can be set to represent negative and positive infinity.
sub_aggs and sub_group_bys	Sub-aggregations and sub-group-bys that perform further analysis on the grouped data.

Example

Count the number of people who are 18 years old and whose homes are within 1 kilometer, and between 1 and 2 kilometers, from their school. The school's coordinates are (31, 116).

query = TermQuery('age', 18)
group_by = GroupByGeoDistance(field_name='address', origin=GeoPoint(31, 116), ranges=[(0, 1000), (1000, 2000)])

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  range: %.1f~%.1f, count: %d" % (item.range_from, item.range_to, item.row_count))

Group by filter

Groups query results based on a set of filters. The response returns the number of documents that match each filter. The results are returned in the same order that the filters are specified.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
filter	The filter conditions. Results are returned in the same order that the filters are specified.
sub_aggs and sub_group_bys	Sub-aggregations and sub-group-bys that perform further analysis on the grouped data.

Example

Count the number of people who are 18 years old and scored 100 in math, and the number of people who scored 100 in Chinese.

query = TermQuery('age', 18)
filter1 = TermQuery('math', 100)
filter2 = TermQuery('chinese', 100)
filters = [filter1, filter2]
group_by = GroupByFilter(filters)

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    i = 0
    for item in group_by_result.items:
        print("  filter: %s=%s, count: %d" % (str(filters[i].field_name), str(filters[i].column_value), item.row_count))
        i += 1

Histogram aggregation

Groups query results based on a specified numeric interval. Rows with field values that fall into the same interval are placed in the same group. The response returns each interval and its corresponding document count.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to retrieve its result.
field	The field to be used for the aggregation. The data type of the field must be Long or Double.
interval	The interval for the aggregation.
field_range[min,max]	The range for the aggregation. This parameter works with `interval` to limit the number of groups. The value of `(fieldRange.max - fieldRange.min) / interval` cannot exceed 2,000.
min_doc_count	The minimum document count for a group to be included in the results. Groups with fewer documents than this value are omitted.
missing_value	A default value for documents missing the specified field. If this parameter is set, its value is used in the aggregation. Otherwise, documents missing the field are ignored.

Example

query = TermQuery('product', '10010')
group_by = GroupByHistogram(field_name='latency', interval=100, field_range=FieldRange(0, 10000), missing_value=0)

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  key: %s, value: %s" % (item.key, item.value))

Composite aggregation

Composite aggregation groups query results based on multiple fields, similar to the GROUP BY column1, column2, ... clause in SQL. Unlike group by field value, composite aggregation supports grouping across multiple sources simultaneously and supports pagination to retrieve all grouped results. A maximum of 32 sources are allowed in sources.

Note

Python SDK 6.4.4 and later supports composite aggregation.

Parameters

Parameter	Description
name	A custom name for the aggregation, used to distinguish different aggregations and obtain their corresponding results. The default value is `group_by_composite`.
sources	The list of sources. Multiple GroupBy types, such as `GroupByField`, can be used as sources. The sources list must be non-empty and can contain a maximum of 32 sources. Note The size parameter cannot be set for GroupBy in a source. Simply specify `field_name` and `name`. The number of groups is controlled by the `size` or `suggested_size` parameter of `GroupByComposite`. Set the `name` parameter for GroupBy in a source. The name is returned in the `source_group_by_names` field, which helps identify each source.
size	The number of groups to return in each response. This is an optional parameter with a default value of 10 and a maximum value of 2000. It precisely controls the number of groups to return. If the specified value exceeds the maximum value, the server returns an error. This parameter cannot be specified at the same time as `suggested_size`.
next_token	The next token. You do not need to set this for the first request. For subsequent requests, use the `next_token` from the previous response.
suggested_size	The suggested number of groups. This is an optional parameter that accepts any positive integer or -1. In suggestion mode, if the specified value exceeds the maximum value (2000), it is automatically adjusted to the maximum value without returning an error. A value of -1 also defaults to the maximum value. This is suitable for scenarios such as data exploration and batch processing where you want to retrieve as much data as possible without triggering an out-of-bounds error. This parameter cannot be specified at the same time as `size`.
sub_aggs and sub_group_bys	Sub-aggregations and sub-group-bys that perform further analysis on the grouped data.

Note

In the returned result, the keys for each group is a list of strings that corresponds one-to-one with the sources. When the field value corresponding to a source is empty, the value at the corresponding position in keys is None.
When the number of groups is large, we recommend that you set the size or suggested_size parameter and use next_token to paginate through the results to avoid returning too much data at once. When the next_token in the response is None, it indicates that all group results have been retrieved.

Example 1

Group people who are 18 years old by the score field, and get the number of rows for each score value.

query = TermQuery('age', 18)
# When a GroupByField is used as a source, you only need to specify the field_name and name parameters. The size parameter cannot be set.
source = GroupByField('score', name='group_by_score')
group_by = GroupByComposite(sources=[source])

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, get_total_count=True, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.NONE))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    print("source_names: %s" % str(group_by_result.source_group_by_names))
    for item in group_by_result.items:
        print("  keys: %s, count: %d" % (str(item.keys), item.row_count))

Example 2

Group by the score and sex fields for individuals aged 18 to obtain the number of rows for each score and sex combination.

query = TermQuery('age', 18)
source1 = GroupByField('score', name='group_by_score')
source2 = GroupByField('sex', name='group_by_sex')
group_by = GroupByComposite(sources=[source1, source2])

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, get_total_count=True, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.NONE))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    print("source_names: %s" % str(group_by_result.source_group_by_names))
    for item in group_by_result.items:
        # The keys parameter is a list of strings that corresponds to the sources in order. If a field value is empty, the corresponding element is None.
        print("  keys: %s, count: %d" % (str(item.keys), item.row_count))

Example 3

Use next_token to retrieve all grouped results in a paginated manner. The results are grouped by the score field, and you can retrieve 2 groups at a time to obtain all results through pagination.

query = TermQuery('age', 18)
source = GroupByField('score')
group_by = GroupByComposite(sources=[source], size=2)

# Initial request
search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, get_total_count=True, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.NONE))

group_by_result = search_response.group_by_results[0]
all_items = list(group_by_result.items)

# Paginate to retrieve the remaining results. The operation is complete when next_token is None.
while group_by_result.next_token is not None:
    group_by = GroupByComposite(sources=[source], size=2, next_token=group_by_result.next_token)
    search_response = client.search(table_name, index_name,
        SearchQuery(query, limit=0, get_total_count=True, group_bys=[group_by]),
        ColumnsToGet(return_type=ColumnReturnType.NONE))

    group_by_result = search_response.group_by_results[0]
    all_items.extend(group_by_result.items)

# Print all the grouped results.
for item in all_items:
    print("keys: %s, count: %d" % (str(item.keys), item.row_count))

Example 4

You can use a sub-aggregation to group data by the score field and calculate the maximum value of score for each group.

query = TermQuery('age', 18)
source = GroupByField('score')
sub_agg = Max('score')
group_by = GroupByComposite(sources=[source], sub_aggs=[sub_agg])

search_response = client.search(table_name, index_name,
    SearchQuery(query, limit=0, get_total_count=True, group_bys=[group_by]),
    ColumnsToGet(return_type=ColumnReturnType.NONE))

for group_by_result in search_response.group_by_results:
    print("name: %s" % group_by_result.name)
    for item in group_by_result.items:
        print("  keys: %s, count: %d" % (str(item.keys), item.row_count))
        for sub_agg in item.sub_aggs:
            print("    sub_agg: %s, value: %s" % (sub_agg.name, str(sub_agg.value)))

Tablestore:Aggregation

Features

Value aggregation

Minimum value

Maximum value

Sum

Average value

Count

Distinct count

Percentile statistics

Grouping aggregation

Group by field value

Group by range

Group by geographical location

Group by filter

Histogram aggregation

Composite aggregation

Related documentation