You can perform aggregation operations to obtain the minimum value, maximum value, sum, average, and count and distinct count of rows. You can also perform aggregation operations to group results by field value, range, geographical location, or filter. You can perform multiple aggregation operations for complex queries.

Note Tablestore SDK for Python V5.2.1 or later supports the aggregation feature.

Background information

The following table describes the aggregation methods.
Method Description
Minimum value

The aggregation method that can be used to return the minimum value of a field. This method can be used in a similar manner as the SQL MIN function.

Maximum value

The aggregation method that can be used to return the maximum value of a field. This method can be used in a similar manner as the SQL MAX function.

Sum

The aggregation method that can be used to return the sum of all values for a numeric field. This method can be used in a similar manner as the SQL SUM function.

Average value

The aggregation method that can be used to return the average of all values for a numeric field. This method is used in a similar manner as the SQL AVG function.

Count The aggregation method that can be used to return the total number of values for a specified field or the total number of rows in a search index. This method can be used in a similar manner as the SQL COUNT function.
Distinct count The aggregation method that can be used to return the number of distinct values for a field. This method can be used in a similar manner as the SQL COUNT(DISTINCT) function.
Percentile statistics

A percentile value indicates the relative position of a value in a dataset. For example, when you collect statistics for the response time of each request during the routine O&M of your system, you must analyze the response time distribution by using percentiles such as p25, p50, p90, and p99.

Group by field value
The aggregation method that can be used to group query results based on field values. The values that are the same are grouped together. The identical value of each group and the number of identical values in each group are returned.
Note The calculated number may be different from the actual number if the number of values in a group is very large.
Group by range

The aggregation method that can be used to group query results based on the value ranges of a field. Field values that are within a specified range are grouped together. The number of values in each range is returned.

Group by geographical location

The aggregation method that can be used to group query results based on geographical locations to a central point. Query results in distances that are within a specified range are grouped together. The number of items in each range is returned.

Group by filter

The aggregation method that can be used to filter the query results and group them together to obtain the number of results that match each filter. Results are returned in the order in which the filters are specified.

Query by histogram

The aggregation method that can be used to group query results based on specific data intervals. Field values that are within the same range are grouped together. The value range of each group and the number of values in each group are returned.

Prerequisites

  • The OTSClient is initialized. For more information, see Initialization.
  • A data table is created. Data is written to the table.
  • A search index is created for the data table. For more information, see Create search indexes.

Minimum value

The aggregation method that can be used to return the minimum value of a field. This method can be used in a similar manner as the SQL MIN function.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE data types are supported.
    missing

    The default value for the field that is used to perform the aggregation operation on a row when the field value is empty.

    • If you do not specify a value for the missing parameter, the row is ignored.
    • If you specify a value for the missing parameter, the value of this parameter is used as the field value of the row.
  • Examples

    The following code provides an example on how to use aggregation to obtain the lowest score from the scores of people aged 18:

    query = TermQuery('age', 18)
    agg = Min('score', name = 'min')
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=0, aggs=[agg]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for agg_result in search_response.agg_results:
        print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Maximum value

The aggregation method that can be used to return the maximum value of a field. This method can be used in a similar manner as the SQL MAX function.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE data types are supported.
    missing

    The default value for the field that is used to perform the aggregation operation on a row when the field value is empty.

    • If you do not specify a value for the missing parameter, the row is ignored.
    • If you specify a value for the missing parameter, the value of this parameter is used as the field value of the row.
  • Examples

    The following code provides an example on how to use aggregation to obtain the highest score from the scores of people aged 18. By default, if a person has no score, a value of 0 is used as the score for aggregation.

    query = TermQuery('age', 18)
    agg = Max('score', missing_value = 0, name = 'max')
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=0, aggs=[agg]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for agg_result in search_response.agg_results:
        print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Sum

The aggregation method that can be used to return the sum of all values for a numeric field. This method can be used in a similar manner as the SQL SUM function.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE data types are supported.
    missing

    The default value for the field that is used to perform the aggregation operation on a row when the field value is empty.

    • If you do not specify a value for the missing parameter, the row is ignored.
    • If you specify a value for the missing parameter, the value of this parameter is used as the field value of the row.
  • Examples

    The following code provides an example on how to use aggregation to obtain the sum of the scores of people aged 18:

    query = TermQuery('age', 18)
    agg = Sum('score', name = 'sum')
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for agg_result in search_response.agg_results:
        print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Average value

The aggregation method that can be used to return the average of all values for a numeric field. This method is used in a similar manner as the SQL AVG function.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE data types are supported.
    missing

    The default value for the field that is used to perform the aggregation operation on a row when the field value is empty.

    • If you do not specify a value for the missing parameter, the row is ignored.
    • If you specify a value for the missing parameter, the value of this parameter is used as the field value of the row.
  • Examples

    The following code provides an example on how to use aggregation to obtain the average of the scores of people aged 18:

    query = TermQuery('age', 18)
    agg = Avg('score', name = 'avg')
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for agg_result in search_response.agg_results:
        print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Count

The aggregation method that can be used to return the total number of values for a specified field or the total number of rows in a search index. This method can be used in a similar manner as the SQL COUNT function.
Note You can use the following methods to query the total number of rows in a search index or the total number of rows that match the query conditions:
  • Use the count feature of aggregation. Set the count parameter to * in the request.
  • Use the query feature to obtain the number of rows that match the query conditions. Set the setGetTotalCount parameter to true in the query. Use MatchAllQuery to obtain the total number of rows in a search index.

You can use the name of a column as the value of the count expression to query the number of rows that contain the column in a search index. This method is suitable for scenarios that involve sparse columns.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used for the aggregation operation. Only the following data types are supported: LONG, DOUBLE, BOOLEAN, KEYWORD, and GEOPOINT.
  • Examples

    The following code provides an example on how to use aggregation to obtain the number of people aged 18 who have taken the exam and have scores:

    
    query = TermQuery('age', 18)
    agg = Count('score', name = 'count')
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for agg_result in search_response.agg_results:
        print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Distinct count

The aggregation method that can be used to return the number of distinct values for a field. This method can be used in a similar manner as the SQL COUNT(DISTINCT) function.
Note The number of distinct values is an approximate number.
  • If the total number of rows before the distinct count feature is used is less than 10,000, the calculated result is an exact value.
  • If the total number of rows before the distinct count feature is used is greater than or equal to 100 million, the error rate is approximately 2%.
  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used for the aggregation operation. Only the following data types are supported: LONG, DOUBLE, BOOLEAN, KEYWORD, and GEOPOINT.
    missing

    The default value for the field that is used to perform the aggregation operation on a row when the field value is empty.

    • If you do not specify a value for the missing parameter, the row is ignored.
    • If you specify a value for the missing parameter, the value of this parameter is used as the field value of the row.
  • Examples

    The following code provides an example on how to use aggregation to obtain the number of distinct names of people aged 18:

    query = TermQuery('age', 18)
    agg = DistinctCount('name', name = 'distinct_name')
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for agg_result in search_response.agg_results:
        print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Percentile statistics

A percentile value indicates the relative position of a value in a dataset. For example, when you collect statistics for the response time of each request during the routine O&M of your system, you must analyze the response time distribution by using percentiles such as p25, p50, p90, and p99.

Note To improve the accuracy of the results, we recommend that you specify extreme percentile values such as p1 and p99. If you use extreme percentile values instead of other values such as p50, the returned results are more accurate.
  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE data types are supported.
    percentiles The percentiles such as p50, p90, and p99. You can specify one or more percentiles.
    missing_value The default value for the field that is used to perform the aggregation operation on a row when the field value is empty.
    • If you do not specify a value for the missing_value parameter, the row is ignored.
    • If you specify a value for the missing_value parameter, the value of this parameter is used as the field value of the row.
  • Examples
    query = TermQuery('product', '10010')
    agg = Percentiles('latency', percentiles_list = [50, 90, 95])
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for agg_result in search_response.agg_results:
        print('percentiles:%s' % agg_result.name)
        for item in agg_result.value:
            print('%s:%s' % (str(item.key), str(item.value)))

Group by field value

The aggregation method that can be used to group query results based on field values. The values that are the same are grouped together. The identical value of each group and the number of identical values in each group are returned.
Note The calculated number may be different from the actual number if the number of values in a group is very large.
  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used for the aggregation operation. Only the following data types are supported: LONG, DOUBLE, BOOLEAN, and KEYWORD.
    size The number of returned groups.
    group_by_sort The sorting rules for groups. By default, groups are sorted based on the number of items in the groups in descending order. If you configure multiple sorting rules, the groups are sorted based on the order in which the rules are configured. Supported parameters:
    • Sort by value in alphabetical order
    • Sort by value in reverse alphabetical order
    • Sort by row count in ascending order
    • Sort by row count in descending order
    • Sort by the values that are obtained from sub-aggregation results in ascending order
    • Sort by the values that are obtained from sub-aggregation results in descending order
    sub_aggs and sub_group_bys The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results.
    • Scenario

      Query the number of products in each category, and the maximum and minimum product prices in each category.

    • Methods

      Group query results by product category to obtain the number of products in each category. Then, perform two sub-aggregation operations to obtain the maximum and minimum product prices in each category.

    • Sample results
      • Fruits: 5. The maximum price is CNY 15. The minimum price is CNY 3.
      • Toiletries: 10. The maximum price is CNY 98. The minimum price is CNY 1.
      • Electronic devices: 3. The maximum price is CNY 8,699. The minimum price is CNY 2,300.
      • Other products: 15. The maximum price is CNY 1,000. The minimum price is CNY 80.
  • Example 1

    The following code provides an example on how to group people aged 18 by score and obtain the top 10 common scores and the number of people in each group:

    query = TermQuery('age', 18)
    group_by = GroupByField('score', size = 10)
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=20, group_bys = [group_by]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        for item in group_by.items:
            print("key:%s, count:%d" % (item.key, item.row_count))
  • Example 2

    The following code provides an example on how to group people aged 18 by score and obtain the two least common scores and the number of people in each group:

    group_by = GroupByField('score', size = 2, group_by_sort = [RowCountSort(sort_order=SortOrder.ASC)])
    search_response = client.search(table_name, index_name,
                                    SearchQuery(TermQuery('age', 18), limit=100, get_total_count=True, group_bys = [group_by]),
                                    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        for item in group_by.items:
            print("key:%s, count:%d" % (item.key, item.row_count))
  • Example 3

    The following code provides an example on how to group people aged 18 by score and obtain the top two common scores and the number of people in each group. Then, sort each group of people by primary key and obtain the information of the top three people in each group:

    sort = RowCountSort(sort_order = SortOrder.DESC)
    sub_agg = [TopRows(limit=3,sort=Sort([PrimaryKeySort(sort_order=SortOrder.DESC)]), name = 't1')]
    
    group_by = GroupByField('l', size = 2, group_by_sort = [sort], sub_aggs = sub_agg)
    search_response = client.search(table_name, index_name,
                                    SearchQuery(TermQuery('age', 18), limit=100, get_total_count=True, group_bys = [group_by]),
                                    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        for item in group_by.items:
            print("\tkey:%s, count:%d" % (item.key, item.row_count))
            for sub_agg in item.sub_aggs:
                print("\t\tname:%s:" % sub_agg.name)
                for entry in sub_agg.value:
                    print("\t\t\tvalue:%s" % str(entry))
  • Example 4

    The following code provides an example on how to group people aged 18 by score and gender:

    sort = RowCountSort(sort_order = SortOrder.ASC)
    sub_group = GroupByField('sex', size = 10, group_by_sort = [sort])
    
    group_by = GroupByField('score', size = 10, group_by_sort = [sort], sub_group_bys = [sub_group])
    search_response = client.search(table_name, index_name,
                                    SearchQuery(TermQuery('age', 18), limit=100, get_total_count=True, group_bys = [group_by]),
                                    ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        for item in group_by.items:
            print("\tkey:%s, count:%d" % (item.key, item.row_count))
            for sub_group in item.sub_group_bys:
                print("\t\tname:%s:" % sub_group.name)
                for sub_item in sub_group.items:
                    print("\t\t\tkey:%s, count:%s" % (str(sub_item.key), str(sub_item.row_count)))

Group by range

The aggregation method that can be used to group query results based on the value ranges of a field. Field values that are within a specified range are grouped together. The number of values in each range is returned.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE data types are supported.
    range[double_from, double_to) The value ranges for grouping.

    The value range can start from Double.MIN_VALUE and end with Double.MAX_VALUE.

    sub_aggs and sub_group_bys The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results.

    For example, after you group query results by sales volume and by province, you can obtain the province that has the largest proportion of sales volume in a specified range. You must specify a value for GroupByField in GroupByRange to perform this query.

  • Examples

    The following code provides an example on how to group people aged 18 by score range and obtain the numbers of people with scores in the [80, 90) and [90, 100) intervals:

    query = TermQuery('age', 18)
    group_by = GroupByRange(field_name = 'score', ranges = [(80, 90),(90, 100)])
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=0, group_bys = [group_by]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        for item in group_by.items:
            print("range:%.1f~%.1f, count:%d" % (item.range_from, item.range_to, item.row_count))

Group by geographical location

The aggregation method that can be used to group query results based on geographical locations to a central point. Query results in distances that are within a specified range are grouped together. The number of items in each range is returned.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used for the aggregation operation. Only the GEOPOINT data type is supported.
    origin(double lat, double lon) The longitude and latitude of the central point.

    lat indicates the latitude of the central point. lon indicates the longitude of the central point.

    range[double_from, double_to) The distance ranges that are used for grouping. Unit: meters.

    The value range can start from Double.MIN_VALUE and end with Double.MAX_VALUE.

    sub_aggs and sub_group_bys The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results.
  • Examples

    The following code provides an example on how to group people aged 18 by the geographical distance from their school to their home locations and obtain the numbers of people who live within one kilometer from the school and people who live one to two kilometers away from school: The latitude and longitude of the school is (31,116).

    query = TermQuery('age', 18)
    group_by = GroupByGeoDistance(field_name = 'address', origin=GeoPoint(31, 116), ranges = [(0, 1000), (1000,2000)])
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=2, group_bys = [group_by]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        for item in group_by.items:
            print("range:%.1f~%.1f, count:%d" % (item.range_from, item.range_to, item.row_count))

Group by filter

The aggregation method that can be used to filter the query results and group them together to obtain the number of results that match each filter. Results are returned in the order in which the filters are specified.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    filter The filters that can be used for the query. Results are returned in the order in which the filters are specified.
    sub_aggs and sub_group_bys The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results.
  • Examples

    The following code provides an example on how to group people aged 18 by filter and obtain the numbers of people who scored 100 in the math exam and who scored 100 in the Chinese exam:

    query = TermQuery('age', 18)
    filter1 = TermQuery('math', 100)
    filter2 = TermQuery('chinese', 100)
    filters = [filter1, filter2]
    group_by = GroupByFilter(filters)
    
    search_response = client.search(
        table_name, index_name,
        SearchQuery(query, next_token = None, limit=2, group_bys = [group_by]),
        columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        i = 0
        for item in group_by.items:
            print("filter:%s=%s, count:%d" % (str(filters[i].field_name), str(filters[i].column_value), item.row_count))
            i=i+1

Query by histogram

The aggregation method that can be used to group query results based on specific data intervals. Field values that are within the same range are grouped together. The value range of each group and the number of values in each group are returned.

  • Parameters
    Parameter Description
    name The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
    field The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE data types are supported.
    interval The data interval that is used to obtain aggregation results.
    field_range[min,max] The range that is used together with the interval parameter to limit the number of groups. The value that is calculated by using the (fieldRange.max-fieldRange.min)/interval formula cannot exceed 2,000.
    min_doc_count The minimum number of rows. If the number of rows in a group is less than the minimum number of rows, the aggregation results for the group are not returned.
    missing_value The default value for the field that is used to perform the aggregation operation on a row when the field value is empty.
    • If you do not specify a value for the missing_value parameter, the row is ignored.
    • If you specify a value for the missing_value parameter, the value of this parameter is used as the field value of the row.
  • Examples
    query = TermQuery('product', '10010')
    group_by = GroupByHistogram(field_name = 'latency', interval = 100, field_range = FieldRange(0, 10000), missing_value = 0)
    
    search_response = client.search(table_name, index_name,
                                    SearchQuery(query, next_token = None, limit=2, group_bys = [group_by]),
                                    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))
    
    for group_by in search_response.group_by_results:
        print("name:%s" % group_by.name)
        print("groups:")
        for item in group_by.items:
            print("%s:%s" % (item.key, item.value))