aggregate clause

Last Updated: Aug 27, 2024

Overview

A single search query can retrieve tens of thousands of documents. If you need only statistics about the retrieved documents rather than the documents themselves, you can use an aggregate clause to obtain those statistics.

Syntax

aggregate=group_key:field, range:number1~number2, agg_fun:func1#func2, agg_filter:filter_clause, agg_sampler_threshold:number, agg_sampler_step:number, max_group:number
  • group_key: mandatory. This parameter specifies the field by which the statistics are grouped. The field must be an attribute field of the INTEGER or STRING type.

  • agg_fun: mandatory. The following built-in functions are supported: count(), sum(id), max(id), and min(id). count() returns the number of documents, sum(id) returns the sum of the values in the id field, max(id) returns the maximum value in the id field, and min(id) returns the minimum value in the id field. You can include multiple functions in the aggregate clause and separate them with number signs (#). You can use basic arithmetic operators in the sum(id), max(id), and min(id) functions to define relationships between fields.

  • range: specifies a range. The system only queries data in the specified range. Configure this parameter if you want to obtain information about data distribution. You can specify only one range. In the preceding syntax, the values between number1 and number2 and the values greater than number2 are queried. You cannot specify a value of the STRING type as the value of the range parameter.

  • agg_filter: optional. This parameter specifies filter conditions. The system returns only the documents that match the specified conditions.

  • agg_sampler_threshold: optional. This parameter specifies a threshold value for document sampling. The system collects data from every document that is ranked higher than the threshold value and samples the documents that are ranked lower than the threshold value.

  • agg_sampler_step: optional. This parameter specifies the step size for statistical sampling. The system samples the documents that are ranked lower than the threshold value at the step size that you specify. For the sum() and count() functions, the system calculates the final result as follows: it multiplies the number of documents sampled after the threshold value is reached by the step size to obtain an estimate, and then adds the estimate to the number of documents that are ranked higher than the threshold value.

  • max_group: specifies the maximum number of groups that the system can return. Default value: 1000.
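The parameters above can be assembled into a clause string programmatically. The following Python sketch is illustrative only: the function name build_aggregate_clause and its argument names are hypothetical and not part of any SDK, the parameter order simply follows the syntax line, and the returned string is not URL-encoded.

```python
def build_aggregate_clause(group_key, agg_funs, range_=None, agg_filter=None,
                           sampler_threshold=None, sampler_step=None,
                           max_group=None):
    """Assemble an aggregate clause string (hypothetical helper, not an SDK call)."""
    if not group_key:
        raise ValueError("group_key is mandatory")
    if not agg_funs:
        raise ValueError("agg_fun is mandatory")

    parts = [f"group_key:{group_key}"]
    if range_ is not None:
        low, high = range_
        parts.append(f"range:{low}~{high}")
    # Multiple functions are separated with number signs (#).
    parts.append("agg_fun:" + "#".join(agg_funs))
    if agg_filter is not None:
        parts.append(f"agg_filter:{agg_filter}")
    if sampler_threshold is not None:
        parts.append(f"agg_sampler_threshold:{sampler_threshold}")
    if sampler_step is not None:
        parts.append(f"agg_sampler_step:{sampler_step}")
    if max_group is not None:
        parts.append(f"max_group:{max_group}")
    return "aggregate=" + ",".join(parts)


print(build_aggregate_clause("company_id", ["sum(id)", "max(id)", "min(id)"]))
# aggregate=group_key:company_id,agg_fun:sum(id)#max(id)#min(id)
```

Keeping the mandatory parameters as positional arguments and the optional ones as keyword arguments mirrors the mandatory/optional split in the parameter list.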

Example:

  • Simple statistics

aggregate=group_key:group_id,agg_fun:sum(price)
Example of statistical results:
{
  "result": {
    "facet": [
      {
        "key": "group_id",
        "items": [
          {
            "value": 43,
            "sum": 81
          },
          {
            "value": 63,
            "sum": 91
          }
        ]
      }
    ]
  }
}
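A client can read the statistics from the facet node once the response is parsed. The following Python sketch assumes that the response body is valid JSON with quoted keys and has the shape shown above; the function name facet_to_dict is hypothetical.

```python
import json

# Sample response text, assumed to be valid JSON with the structure shown above.
response_text = """
{
  "result": {
    "facet": [
      {
        "key": "group_id",
        "items": [
          {"value": 43, "sum": 81},
          {"value": 63, "sum": 91}
        ]
      }
    ]
  }
}
"""


def facet_to_dict(response, stat):
    """Map each group key to {group value: statistic} (hypothetical helper)."""
    result = {}
    for facet in response["result"]["facet"]:
        result[facet["key"]] = {
            item["value"]: item[stat] for item in facet["items"]
        }
    return result


stats = facet_to_dict(json.loads(response_text), "sum")
print(stats)  # {'group_id': {43: 81, 63: 91}}
```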
  • Sampling statistics

aggregate=group_key:company_id,agg_fun:count(),agg_sampler_threshold:5,agg_sampler_step:2

In this example, the threshold value is set to 5 and the step size is set to 2. The system collects data from the first five documents that match the specified conditions and from every second document after the fifth document. For the sum() and count() functions, the system multiplies the number of documents that are collected after the threshold value is reached by 2 to obtain an estimate, and then adds 5 to the estimate to generate the final result.
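The arithmetic for count() can be worked through in a few lines. The following Python sketch is a simplified model, not the engine's implementation: the function name estimate_count is hypothetical, and it assumes that the first document after the threshold is sampled, followed by every step-th document.

```python
def estimate_count(total_matches, threshold, step):
    """Model the sampled count() result (hypothetical helper).

    Documents ranked within the threshold are counted exactly. After the
    threshold, one document per step is sampled and each sampled document
    is weighted by the step size.
    """
    counted_exactly = min(total_matches, threshold)
    # Assumption: sampling starts with the first document after the threshold.
    sampled = len(range(threshold, total_matches, step))
    return counted_exactly + sampled * step


print(estimate_count(11, threshold=5, step=2))  # 5 + 3 * 2 = 11
print(estimate_count(12, threshold=5, step=2))  # 5 + 4 * 2 = 13
print(estimate_count(3, threshold=5, step=2))   # 3, threshold not reached
```

With 12 matching documents the model returns 13, which illustrates why the sampled result is an estimate rather than an exact count.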
  • Multi-dimensional statistics

aggregate=group_key:company_id,agg_fun:sum(id)#max(id)#min(id)
In this example, the sum(), max(), and min() functions aggregate the values of the id field for each company_id group. When you specify multiple functions, separate the functions with number signs (#).
  • Multi-group-key statistics

aggregate=group_key:id,agg_fun:sum(price)&&aggregate=group_key:company_id,agg_fun:count(),agg_sampler_threshold:5, agg_sampler_step:2
You can use multiple aggregate clauses in a statement. You need to separate the aggregate clauses with two ampersands (&&).
  • Exact statistics

config=cluster:general.default_agg&&aggregate=group_key:company_id,agg_fun:count()
If you append the .default_agg suffix to the cluster name, the exact statistics feature is enabled. The system provides exact statistics if the number of documents that match the conditions is smaller than the value of the rank_size parameter. If the number of documents that match the conditions is much larger than the value of the rank_size parameter, the system does not collect data from the documents that are ranked beyond the value of the rank_size parameter.
  • Semi-exact statistics

aggregate=group_key:company_id,agg_fun:distinct_count(brand)
If you include the distinct_count function in an aggregate clause, the semi-exact statistics feature is enabled. The semi-exact statistics feature uses the HyperLogLog (HLL) algorithm to obtain statistics. In most cases, the accuracy of the results that are obtained by using the semi-exact statistics feature is higher than 99%.
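To show why an HLL-based count is approximate, the following Python sketch implements a textbook HyperLogLog estimator. It is not the engine's implementation: the class name, the precision p=12, and the use of SHA-1 as the hash function are all assumptions chosen for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog sketch (illustrative, not the engine's code)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, m >= 128

    def add(self, item):
        # Derive a 64-bit hash from the item (SHA-1 is an arbitrary choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Position of the leftmost 1 bit in the remaining bits, 1-indexed.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(30000):
    hll.add(f"brand-{i % 10000}")       # 10,000 distinct values, each seen 3 times

estimate = hll.count()
print(estimate, abs(estimate - 10000) / 10000)  # estimate and relative error
```

The sketch stores only one small integer per register regardless of how many values are added, which is the trade-off behind the semi-exact result.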

Usage notes

  • The aggregate clause is optional.

  • The fields that you specify in the aggregate clause must be the attribute fields that you specify in the schema.json file.

  • The execution result of the aggregate clause is returned to the facet node on the Searcher node. The result includes the agg_fun parameter that indicates the functions that you specified in the aggregate clause, such as sum() and count().

  • You can specify multiple group keys in the aggregate clause and separate them with semicolons (;).

  • To obtain the data on the facet node, specify fulljson as the value of the format parameter in the config clause.

  • The system can return accurate statistics for up to 100,000 documents. If the number of documents that match the specified conditions exceeds 100,000, the returned statistics may be inaccurate due to the limits on engine performance.