aggregate clause - OpenSearch - Alibaba Cloud Documentation Center

Overview

A single keyword in a search query may match tens of thousands of documents. In many scenarios, you do not need to view every document and instead need statistics about the matching documents. In this case, you can use an aggregate clause to aggregate data.

Syntax

aggregate=group_key:field, range:number1~number2, agg_fun:func1#func2, agg_filter:filter_clause, agg_sampler_threshold:number, agg_sampler_step:number, max_group:number

group_key: required. The field based on which you want to obtain statistics. The field that you specify must be an attribute field of the INTEGER or STRING type.
agg_fun: required. The following built-in functions are supported: count(), sum(), max(), and min(). You can use count() to calculate the number of documents, use sum(id) to obtain the sum of values in the id field, use max(id) to obtain the maximum value in the id field, and use min(id) to obtain the minimum value in the id field. You can include multiple functions in an aggregate clause and separate functions with number signs (#). You can use basic arithmetic operators in the sum(), max(), and min() functions to define relationships between fields.
range: optional. The value range over which the system collects statistics. Configure this parameter if you want to obtain information about data distribution. You can specify only one range. With range:number1~number2, the system collects statistics for values between number1 and number2 and for values greater than number2. The range parameter does not support values of the STRING type.
agg_filter: optional. The filter conditions. Only documents that match the specified conditions are included in the statistics.
agg_sampler_threshold: optional. The threshold value for document sampling. The system collects data from every document that is ranked within the threshold and samples the documents that are ranked beyond the threshold.
agg_sampler_step: optional. The step size for statistical sampling. The system samples the documents that are ranked beyond the threshold based on the step size that you specify. If you use the sum() or count() function, the system calculates the result as follows: it multiplies the number of documents sampled beyond the threshold by the step size to produce an estimate, and then adds the number of documents ranked within the threshold to produce the final result.
max_group: the maximum number of groups that the system can return. Default value: 1000.
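
The parameters above can be assembled into a clause string programmatically. The following is a minimal sketch in Python, not part of any OpenSearch SDK: the function name build_aggregate_clause and its argument names are hypothetical and chosen only for illustration.

```python
def build_aggregate_clause(group_key, agg_funs, range_=None, agg_filter=None,
                           sampler_threshold=None, sampler_step=None,
                           max_group=None):
    """Assemble an aggregate clause string (hypothetical helper, not an SDK API)."""
    if not group_key or not agg_funs:
        raise ValueError("group_key and agg_fun are required")
    parts = [f"group_key:{group_key}"]
    if range_ is not None:
        low, high = range_
        parts.append(f"range:{low}~{high}")
    # Multiple functions are separated with number signs (#).
    parts.append("agg_fun:" + "#".join(agg_funs))
    if agg_filter is not None:
        parts.append(f"agg_filter:{agg_filter}")
    if sampler_threshold is not None:
        parts.append(f"agg_sampler_threshold:{sampler_threshold}")
    if sampler_step is not None:
        parts.append(f"agg_sampler_step:{sampler_step}")
    if max_group is not None:
        parts.append(f"max_group:{max_group}")
    return "aggregate=" + ",".join(parts)


clause = build_aggregate_clause("company_id", ["count()"],
                                sampler_threshold=5, sampler_step=2)
print(clause)
```

Optional parameters are omitted from the string when they are not supplied, so the required group_key and agg_fun parameters always appear.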

Examples:

Simple statistics

aggregate=group_key:group_id,agg_fun:sum(price)
Sample statistical results:
{
  "result": {
    "facet": [
      {
        "key": "group_id",
        "items": [
          {
            "value": 43,
            "sum": 81
          },
          {
            "value": 63,
            "sum": 91
          }
        ]
      }
    ]
  }
}
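
To read these statistics in application code, the facet list can be turned into a lookup table. The sketch below assumes the response has already been parsed into a Python dict with the same shape as the sample result. The helper name facet_to_dict is hypothetical.

```python
# Assumed: the parsed response has the same shape as the sample result.
response = {
    "result": {
        "facet": [
            {
                "key": "group_id",
                "items": [
                    {"value": 43, "sum": 81},
                    {"value": 63, "sum": 91},
                ],
            }
        ]
    }
}


def facet_to_dict(response, group_key, stat):
    """Map each group value to one statistic for the given group key."""
    for facet in response["result"]["facet"]:
        if facet["key"] == group_key:
            return {item["value"]: item[stat] for item in facet["items"]}
    return {}  # no facet was returned for this group key


sums = facet_to_dict(response, "group_id", "sum")
print(sums)  # {43: 81, 63: 91}
```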

Sampling statistics

aggregate=group_key:company_id,agg_fun:count(),agg_sampler_threshold:5,agg_sampler_step:2

In this example, the threshold value is set to 5 and the step size is set to 2. The system collects data from each of the first five documents that match the specified conditions and from every second document after the fifth document. If you use the sum() or count() function, the system calculates the result as follows: it multiplies the number of documents sampled after the threshold is reached by 2 to produce an estimate, and then adds 5 to the estimate to produce the final result.
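
The arithmetic for count() can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: it assumes that the first document after the threshold is sampled and that every step-th document after it is sampled as well. The function name estimated_count is hypothetical.

```python
def estimated_count(total_matches, threshold, step):
    """Estimate count() under sampling (a sketch; engine behavior may differ)."""
    if total_matches <= threshold:
        return total_matches  # every document is counted exactly
    beyond = total_matches - threshold
    # Assumption: sampling starts with the first document after the
    # threshold and then takes every step-th document.
    sampled = (beyond + step - 1) // step
    return threshold + sampled * step


print(estimated_count(11, threshold=5, step=2))  # 11
print(estimated_count(12, threshold=5, step=2))  # 13
```

With 12 matching documents, the estimate is 13 rather than 12, which shows why sampled statistics are approximate.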

Multi-dimension statistics

aggregate=group_key:company_id,agg_fun:sum(id)#max(id)#min(id)
In this example, documents are grouped by the company_id field, and the sum(), max(), and min() functions are used to aggregate the values in the id field for each group. When you specify multiple functions, separate the functions with number signs (#).
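
To make the semantics concrete, the following Python sketch computes the same three statistics locally: documents are grouped by company_id, and sum, max, and min are computed over id within each group. The documents are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical documents; only the fields used by the clause are shown.
docs = [
    {"id": 1, "company_id": 10},
    {"id": 4, "company_id": 10},
    {"id": 7, "company_id": 20},
]

# Group the id values by company_id.
groups = defaultdict(list)
for doc in docs:
    groups[doc["company_id"]].append(doc["id"])

# Compute one statistic per function in agg_fun for each group.
stats = {
    company: {"sum": sum(ids), "max": max(ids), "min": min(ids)}
    for company, ids in groups.items()
}
print(stats)
```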

Multi-group key statistics

aggregate=group_key:id,agg_fun:sum(price)&&aggregate=group_key:company_id,agg_fun:count(),agg_sampler_threshold:5,agg_sampler_step:2
You can use multiple aggregate clauses in a statement. You need to separate aggregate clauses with two ampersands (&&).
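
As a minimal sketch, a statement with several aggregate clauses can be composed by joining the clause strings with two ampersands. The clause strings are taken from the examples in this section. How the statement is then sent to the service is outside the scope of this sketch.

```python
# Clause strings taken from the examples in this section.
clauses = [
    "aggregate=group_key:id,agg_fun:sum(price)",
    "aggregate=group_key:company_id,agg_fun:count(),"
    "agg_sampler_threshold:5,agg_sampler_step:2",
]

# Aggregate clauses in one statement are separated with two ampersands (&&).
statement = "&&".join(clauses)
print(statement)
```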

Accurate statistics

config=cluster:general.default_agg&&aggregate=group_key:company_id,agg_fun:count()
If you append the .default_agg suffix to the cluster name, the accurate statistics feature is enabled. The system provides accurate statistics if the number of documents that match the conditions is smaller than the value of the rank_size parameter. If the number of documents that match the conditions is much larger than the value of the rank_size parameter, the system does not collect data from the documents that are ranked beyond the value of the rank_size parameter.

Semi-accurate statistics

aggregate=group_key:company_id,agg_fun:distinct_count(brand)
If you include the distinct_count function in an aggregate clause, the semi-accurate statistics feature is enabled. This feature uses the HyperLogLog (HLL) algorithm to estimate the number of distinct values. In most cases, the accuracy of the results is higher than 99%.

Usage notes

An aggregate clause is optional.
The fields that you specify in an aggregate clause must be attribute fields that you specified in the schema.json file.
The execution result of an aggregate clause is returned to the facet node on the Searcher node. The result includes the agg_fun parameter that indicates the functions that you specified in the aggregate clause, such as sum() and count().
You can specify multiple group keys in an aggregate clause and separate the group keys with semicolons (;).
To obtain the data on the facet node, specify fulljson as the value of the format parameter in the config clause.
The system can return accurate statistics of up to 100,000 documents. If the number of documents that match the specified conditions exceeds 100,000, the statistics that are returned may be inaccurate due to the limits on engine performance.