Group statistics - aggregate clause - OpenSearch

A single search query may retrieve tens of thousands of documents, and you often do not need to view all of them to get the information you need. If you only need statistics about the retrieved documents, use an aggregate clause.

Syntax

Syntax of an aggregate clause:

group_key:field, range:number1~number2, agg_fun:func1#func2, agg_filter:filter_clause, agg_sampler_threshold:number, agg_sampler_step:number, max_group:number

Parameters:

Parameter	Type	Required	Valid value	Default value	Description
group_key:field	field: an attribute field	Yes	Fields of the INT, LITERAL, INT_ARRAY, or LITERAL_ARRAY type. If the attribute field is of the INT_ARRAY or LITERAL_ARRAY type and an item in the array is repeated, the number of its occurrences is counted.		Specifies the name of the field for which you want to collect statistics. You must configure an attribute field in this parameter.
agg_fun		Yes	The built-in functions count(), sum(id), max(id), min(id), and distinct_count(id).		Specifies the statistics to calculate. Set func to count(), sum(id), max(id), min(id), or distinct_count(id) to calculate the number of documents, the sum of field values, the maximum field value, the minimum field value, or the number of unique field values, respectively. To use multiple functions at a time, separate them with number signs (#). The sum(), max(), and min() functions can reference multiple fields combined with basic arithmetic operators, for example, max(hits+replies).
range		No	The values between number1 and number2 and values greater than number2. Values of fields of the STRING type cannot be aggregated to collect statistics.		Generates statistics by value range, which is useful for analyzing data distribution. You can specify only one range parameter in the aggregate clause.
agg_filter		No			Specifies a filter condition. Only documents that meet the condition are included in the statistics.
agg_sampler_threshold	INT	No			Specifies the threshold for document sampling. Retrieved documents that are ranked within the threshold are all counted one by one. Documents that are ranked beyond the threshold are sampled at the interval specified by the agg_sampler_step parameter.
agg_sampler_step	INT	No			Specifies the sampling step size, which is the interval at which documents ranked beyond the agg_sampler_threshold value are sampled. For the sum() and count() functions, the system multiplies the statistics of the sampled documents by the step size to estimate the statistics of all documents beyond the threshold. The system then adds this estimate to the statistics of the documents within the threshold to generate the final statistics.
max_group	INT	No		1000	Specifies the maximum number of groups (key-items pairs) that can be returned.
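
The parameters above can also be assembled programmatically. The following Python sketch is illustrative only: the helper names `build_group` and `build_aggregate_clause` are hypothetical and are not part of any OpenSearch SDK. It joins parameters with commas, functions with number signs (#), and multiple group_key calculations with semicolons, following the syntax described in this topic.

```python
def build_group(group_key, agg_fun, range_=None, agg_filter=None,
                agg_sampler_threshold=None, agg_sampler_step=None,
                max_group=None):
    """Build one group_key calculation (hypothetical helper, not an SDK call)."""
    if not group_key or not agg_fun:
        raise ValueError("group_key and agg_fun are required")
    # Multiple functions are separated with number signs (#).
    parts = [f"group_key:{group_key}", "agg_fun:" + "#".join(agg_fun)]
    if range_ is not None:
        low, high = range_
        parts.append(f"range:{low}~{high}")
    if agg_filter is not None:
        parts.append(f"agg_filter:{agg_filter}")
    if agg_sampler_threshold is not None:
        parts.append(f"agg_sampler_threshold:{agg_sampler_threshold}")
    if agg_sampler_step is not None:
        parts.append(f"agg_sampler_step:{agg_sampler_step}")
    if max_group is not None:
        parts.append(f"max_group:{max_group}")
    return ",".join(parts)


def build_aggregate_clause(groups):
    """Join several group_key calculations with semicolons."""
    return ";".join(build_group(**group) for group in groups)


clause = build_aggregate_clause([
    {"group_key": "group_id", "agg_fun": ["sum(price)", "max(price)"]},
    {"group_key": "company_id", "agg_fun": ["count()"]},
])
print(clause)
# group_key:group_id,agg_fun:sum(price)#max(price);group_key:company_id,agg_fun:count()
```

The helper only builds the clause string. Sending the query and encoding it for transport are outside the scope of this sketch.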

Usage notes

An aggregate clause is optional.
The fields that are referenced in the preceding parameters must be configured as attribute fields when you define the application schema.
The result of an aggregate clause is returned in the facet node of the search result. The statistics calculated by the functions that are specified by the agg_fun parameter, such as sum() and count(), are displayed in this node.
You can specify multiple group_key parameters in an aggregate clause to collect statistics for different fields at the same time. Separate calculations for the fields with semicolons (;).

Example:

group_key:field1,agg_fun:func1;group_key:field2,agg_fun:func2

To display the statistics in the return result, you must set the format of the config clause to full JSON.
The distinct_count feature is supported only in exclusive clusters. To use this feature, add the enable_accurate_statistics parameter to a kvpairs clause and set it to true. When this feature is used, the system returns only the statistics in the facet node for a query.
In an exclusive cluster, the count(), max(), min(), and sum() functions also support accurate statistics. To enable this, add the enable_accurate_statistics parameter to a kvpairs clause and set it to true.
The system can return accurate statistics for up to 100,000 documents. If more than 100,000 documents match the specified conditions, the returned statistics may be inaccurate because of engine performance limits. For an exclusive cluster, you can add the enable_accurate_statistics parameter to a kvpairs clause and set it to true so that the system returns more accurate statistics.

Examples

Use the following query clause to obtain the statistics of documents that contain "Zhejiang University". Statistics are calculated separately for the group_id and company_id fields. For the group_id field, the statistics include the sum and the maximum value of the price field. For the company_id field, the statistics include the number of documents for each company.

query=default:'Zhejiang University'&&aggregate=group_key:group_id,agg_fun:sum(price)#max(price);group_key:company_id,agg_fun:count()

Sample return result:

{
    status: "OK",
    result: {
        searchtime: 0.015634,
        total: 5,
        num: 1,
        viewtotal: 5,
        items: [        // The return result.
            { ... }
        ],
        facet: [
            {
                key: "group_id",
                items: [
                    {
                        value: 43,
                        sum: 81,
                        max: 20,
                    },
                    {
                        value: 63,
                        sum: 91,
                        max: 50,
                    },
                ],
            },
            {
                key: "company_id",
                items: [
                    {
                        value: 13,
                        count: 4,
                    },
                    {
                        value: 10,
                        count: 1,
                    },
                ],
            },
        ],
    },
    errors: [ ],
    tracer: "",
}
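
To read the statistics from a result of this shape, a client can walk the facet node. The following Python sketch is a minimal illustration, not an official client: the function name `facet_to_dict` is hypothetical, and the `sample` dictionary is a hand-written Python rendering of the facet part of the sample result above.

```python
def facet_to_dict(response):
    """Map each group_key to {group value: {statistic name: number}}."""
    facets = {}
    for group in response.get("result", {}).get("facet", []):
        per_value = {}
        for item in group.get("items", []):
            # Every key except "value" is a statistic such as sum, max, or count.
            stats = {name: num for name, num in item.items() if name != "value"}
            per_value[item["value"]] = stats
        facets[group["key"]] = per_value
    return facets


# Hand-written copy of the facet node from the sample result (assumption:
# the response has already been decoded into Python dictionaries and lists).
sample = {
    "status": "OK",
    "result": {
        "facet": [
            {"key": "group_id", "items": [
                {"value": 43, "sum": 81, "max": 20},
                {"value": 63, "sum": 91, "max": 50},
            ]},
            {"key": "company_id", "items": [
                {"value": 13, "count": 4},
                {"value": 10, "count": 1},
            ]},
        ]
    },
}

facets = facet_to_dict(sample)
print(facets["group_id"][43]["sum"])      # 81
print(facets["company_id"][13]["count"])  # 4
```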

Use the following query clause to obtain the statistics of documents that contain "Zhejiang University" based on the group_id field. The sum of the price field values is calculated. Documents that are ranked beyond the first 10,000 are sampled, and the sampling step size is set to 5.
```
query=default:'Zhejiang University'&&aggregate=group_key:group_id,agg_fun:sum(price),agg_sampler_threshold:10000,agg_sampler_step:5
```
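
The estimation that sampling applies to sum() and count() can be sketched locally. The following Python code is a simplified illustration under stated assumptions, not the engine's implementation: the function name `estimate_sum_and_count` is hypothetical, grouping by group_key is omitted, and picking the first document of each step is an assumption because this topic does not specify which document in a step is sampled.

```python
def estimate_sum_and_count(values, threshold, step):
    """Estimate sum and count over ranked values using threshold and step."""
    # Documents ranked within the threshold are all counted.
    head = values[:threshold]
    # Beyond the threshold, one document per step is sampled (assumed: the first).
    sampled = values[threshold::step]
    # Sampled statistics are multiplied by the step size, then added to the head.
    est_sum = sum(head) + sum(sampled) * step
    est_count = len(head) + len(sampled) * step
    return est_sum, est_count


# Hypothetical price values, already sorted by rank.
prices = [1] * 10 + [2] * 10
print(estimate_sum_and_count(prices, threshold=10, step=5))  # (30, 20)
```

In this small example the estimate equals the exact result because all values beyond the threshold are identical. With varied values, the estimate can differ from the exact sum.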
Use the following query clause to obtain the statistics of documents that contain "Zhejiang University" based on the group_id field. The aggregate clause counts the documents whose group_id values are in the range from 10 to 50.
```
query=default:'Zhejiang University'&&aggregate=group_key:group_id,agg_fun:count(),range:10~50
```
Use the following query clause to obtain the statistics of documents that contain "Zhejiang University" based on the group_id field. For each group, the maximum value of hits + replies is calculated among documents whose create_timestamp values are greater than 1423456781.
```
query=default:'Zhejiang University'&&aggregate=group_key:group_id,agg_fun:max(hits+replies),agg_filter:create_timestamp>1423456781
```
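
To make the semantics of this last query concrete, the following Python sketch computes the same statistic over an in-memory list. It is only a local illustration: the function name `max_hits_plus_replies` and the sample documents are made up, and the code does not reflect how the engine evaluates the clause.

```python
def max_hits_plus_replies(docs, min_timestamp):
    """Per group_id, return max(hits + replies) over documents passing the filter."""
    result = {}
    for doc in docs:
        # Equivalent of agg_filter:create_timestamp>min_timestamp
        if doc["create_timestamp"] <= min_timestamp:
            continue
        total = doc["hits"] + doc["replies"]
        group = doc["group_id"]
        # Equivalent of agg_fun:max(hits+replies) grouped by group_key:group_id
        if group not in result or total > result[group]:
            result[group] = total
    return result


# Made-up documents for illustration only.
docs = [
    {"group_id": 43, "hits": 10, "replies": 2, "create_timestamp": 1423456790},
    {"group_id": 43, "hits": 50, "replies": 9, "create_timestamp": 1423456700},
    {"group_id": 63, "hits": 7, "replies": 1, "create_timestamp": 1423456800},
    {"group_id": 63, "hits": 3, "replies": 8, "create_timestamp": 1423456900},
]
print(max_hits_plus_replies(docs, 1423456781))  # {43: 12, 63: 11}
```

The second document has the largest hits + replies value in group 43, but it is excluded because its create_timestamp value does not pass the filter.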