distinct clause

Overview

You can include a distinct clause in a query to disperse the retrieved documents so that the results are not dominated by a single source. For example, if a query retrieves a large number of documents and several documents from the same user receive the highest scores, most of the results on a page come from that user, which degrades the display effect and the user experience. With a distinct clause, the system extracts documents from the result set based on the rules that you specify, and then re-sorts them so that documents from each user are displayed.

Syntax

{
  "distinct" : {
    "default" : {
      "dist_key" : "field",
      "dist_count" : number,
      "dist_times" : number,
      "dist_filter" : "filter_expression",
      "reserved" : boolean,
      "max_item_count" : number,
      "grade" : []
    },
    "rank" : {
      "dist_key" : "field",
      "dist_count" : number,
      "dist_times" : number,
      "dist_filter" : "filter_expression",
      "reserved" : boolean,
      "max_item_count" : number,
      "grade" : []
    },
    "rerank" : {
      "dist_key" : "field",
      "dist_count" : number,
      "dist_times" : number,
      "dist_filter" : "filter_expression",
      "reserved" : boolean,
      "max_item_count" : number,
      "grade" : []
    }
  }
}

By default, documents are dispersed in both the rough sort and fine sort phases so that the dispersing effect is preserved in the final results. You can use the same dispersing rule in both phases or specify a different rule for each phase. The rules take effect as follows:

If you specify only the default rule, documents are dispersed in both the rough sort and fine sort phases by using the default rule.
If you specify only the rank rule, documents are dispersed only in the rough sort phase by using the rank rule.
If you specify only the rerank rule, documents are dispersed only in the fine sort phase by using the rerank rule.
If you specify both the default and rank rules, documents are dispersed in the rough sort phase by using the rank rule and in the fine sort phase by using the default rule.
If you specify both the default and rerank rules, documents are dispersed in the rough sort phase by using the default rule and in the fine sort phase by using the rerank rule.
If you specify both the rank and rerank rules, documents are dispersed in the rough sort phase by using the rank rule and in the fine sort phase by using the rerank rule.
If you specify the default, rank, and rerank rules at the same time, documents are dispersed in the rough sort phase by using the rank rule and in the fine sort phase by using the rerank rule.
You must specify at least one of the default, rank, and rerank rules.
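
The precedence rules above can be summarized in a few lines of code. The following is a minimal sketch only: the function name resolve_phase_rules and the "rough" and "fine" keys are hypothetical names chosen for illustration and are not part of OpenSearch.

```python
from typing import Dict, Optional


def resolve_phase_rules(rules: Dict[str, dict]) -> Dict[str, Optional[dict]]:
    """Return the dispersing rule that applies in each sort phase.

    `rules` maps "default", "rank", and/or "rerank" to a rule body.
    A phase maps to None when no rule applies to it.
    """
    allowed = {"default", "rank", "rerank"}
    unknown = set(rules) - allowed
    if unknown:
        raise ValueError(f"unknown rule names: {sorted(unknown)}")
    if not rules:
        raise ValueError("specify at least one of default, rank, and rerank")

    default = rules.get("default")
    return {
        # The rank rule overrides the default rule in the rough sort phase.
        "rough": rules.get("rank", default),
        # The rerank rule overrides the default rule in the fine sort phase.
        "fine": rules.get("rerank", default),
    }


if __name__ == "__main__":
    by_company = {"dist_key": "company_id"}
    by_user = {"dist_key": "user_id"}
    # default + rank: rank is used for rough sort, default for fine sort.
    print(resolve_phase_rules({"default": by_company, "rank": by_user}))
```

The sketch treats the default rule as a fallback for whichever phase has no dedicated rule, which reproduces every combination in the list above.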

Parameters

dist_key: required. The attribute field based on which you want to disperse the documents that are obtained.
dist_count: optional. The number of documents to extract for each dist_key value in one round of document extraction. Default value: 1.
dist_times: optional. The number of rounds of document extraction to perform. Default value: 1.
dist_filter: optional. The filter conditions. Documents that are filtered out based on these conditions are not dispersed. During fine sorting, the filtered-out documents are sorted together with the documents that are extracted by the distinct clause. By default, all retrieved documents are dispersed.
reserved: optional. This parameter specifies whether to retain the remaining documents that are not extracted. Valid values: true and false. Default value: true. If you set the value of this parameter to false, the system discards the documents that are not extracted. In this case, the value of the total_hit response parameter may be inaccurate.
max_item_count: optional. The maximum number of documents that can be retained during dispersing. The maximum number of documents that are retained is max(max_item_count, hit).

To ensure that the final results are stable in page turning, you can set this parameter to the maximum number of documents that can be queried. For example, if 10 results are returned per page and up to 100 pages can be returned, you can set this parameter to 1000 (10 × 100).

grade: optional. The threshold values that the system uses to classify documents into grades. The system extracts documents from each grade based on the threshold value that you specify for the grade. If you omit the grade parameter, all documents are classified into a single grade. Documents are classified based on the relevance scores that are calculated for rough sorting. To specify multiple grades, separate the threshold values with vertical bars (|). The number of grades that you can specify is not limited.
Example 1: grade:3.0. Documents are classified into two grades. Documents with a score less than 3.0 are in the first grade. Documents with a score greater than or equal to 3.0 are in the second grade.
Example 2: grade:3.0|5.0. Documents are classified into three grades. Documents with a score less than 3.0 are in the first grade. Documents with a score greater than or equal to 3.0 but less than 5.0 are in the second grade. Documents with a score greater than or equal to 5.0 are in the third grade.
The sort order of grades must match the sort order of documents during rough sorting. If documents are sorted in descending order during rough sorting, grades are also sorted in descending order. If documents are sorted in ascending order during rough sorting, grades are also sorted in ascending order.
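
The threshold semantics in the two grade examples can be checked with a short sketch. It assumes thresholds written in ascending order in the vertical-bar form shown above. The helper names parse_grade and grade_index are hypothetical and only illustrate the boundaries; they do not describe how OpenSearch implements grading.

```python
from bisect import bisect_right
from typing import List


def parse_grade(spec: str) -> List[float]:
    """Parse a spec such as "3.0|5.0" into ascending thresholds."""
    thresholds = [float(part) for part in spec.split("|") if part.strip()]
    if thresholds != sorted(thresholds):
        raise ValueError("this sketch assumes ascending thresholds")
    return thresholds


def grade_index(score: float, thresholds: List[float]) -> int:
    """Return the 0-based grade of a rough sort score.

    A score equal to a threshold belongs to the higher grade,
    matching "greater than or equal to" in the examples.
    """
    return bisect_right(thresholds, score)


if __name__ == "__main__":
    thresholds = parse_grade("3.0|5.0")
    for score in (2.9, 3.0, 4.99, 5.0):
        print(score, "-> grade", grade_index(score, thresholds) + 1)
```

With n thresholds, the sketch yields n + 1 grades, which agrees with the two-grade and three-grade examples.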

Example:

"distinct" : {
     "default": {  
      "dist_key" : "company_id",
      "dist_count":2,
      "dist_times" : 10
    }
}
In this example, the system performs 10 rounds of document extraction based on the company_id field and extracts up to two documents for each company_id value in each round. Documents that are not extracted are ranked lower.

How the dist_count and dist_times parameters work

The following examples show how to specify the dist_count and dist_times parameters in a distinct clause and how the system obtains distinct results based on the values of these parameters:

Assume that a query retrieves the following six documents. Each document has two attribute fields: id and name. The id field is the primary key, and the name field is used as the distinct key.

doc 1: id:1 name:a

doc 2: id:2 name:a

doc 3: id:3 name:a

doc 4: id:4 name:b

doc 5: id:5 name:c

doc 6: id:6 name:c

Case 1:

"distinct" : {
     "default": {  
      "dist_key" : "name",
      "dist_count":2,
      "dist_times" : 1
    }
}
# The following results are obtained after dispersing: doc1, doc2, doc4, doc5, and doc6.

Case 2:

"distinct" : {
     "default": {  
      "dist_key" : "name",
      "dist_count":1,
      "dist_times" : 2
    }
}
# The following results are obtained after dispersing: doc1, doc4, doc5, doc2, and doc6.

Case 3:

"distinct" : {
     "default": {  
      "dist_key" : "name",
      "dist_count":1,
      "dist_times" : 1
    }
}
# The following results are obtained after dispersing: doc1, doc4, and doc5.
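
The three cases can be reproduced with a small simulation. This is a sketch under stated assumptions, not the engine's actual algorithm: it assumes the input list is already in rank order, it models only the documents that are extracted, and it ignores the reserved, dist_filter, max_item_count, and grade parameters. The function name disperse is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def disperse(docs: List[dict], dist_key: str,
             dist_count: int = 1, dist_times: int = 1) -> List[dict]:
    """Return the extracted documents in extraction order.

    Each round walks the ranked list once and takes up to `dist_count`
    not-yet-extracted documents for every distinct value of `dist_key`.
    """
    taken = [False] * len(docs)
    extracted: List[dict] = []
    for _ in range(dist_times):
        picked_this_round: Dict[object, int] = defaultdict(int)
        for i, doc in enumerate(docs):
            if taken[i]:
                continue
            key = doc[dist_key]
            if picked_this_round[key] < dist_count:
                picked_this_round[key] += 1
                taken[i] = True
                extracted.append(doc)
    return extracted


DOCS = [
    {"id": 1, "name": "a"}, {"id": 2, "name": "a"}, {"id": 3, "name": "a"},
    {"id": 4, "name": "b"}, {"id": 5, "name": "c"}, {"id": 6, "name": "c"},
]

if __name__ == "__main__":
    for count, times in ((2, 1), (1, 2), (1, 1)):
        ids = [d["id"] for d in disperse(DOCS, "name", count, times)]
        print(f"dist_count={count}, dist_times={times}: {ids}")
```

Under these assumptions, the simulation yields the same document order as Case 1, Case 2, and Case 3. The comparison between Case 1 and Case 2 shows that the same number of documents per key can be interleaved differently depending on how the total is split between dist_count and dist_times.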

distinct uniq plug-in

As described above, if the reserved parameter is set to false, the values of the total and viewtotal parameters in the search results are inaccurate. Paging or other processing that relies on these values may then produce errors. To address this, OpenSearch provides the distinct uniq plug-in, which ensures that the values of the total and viewtotal parameters are accurate when the dist_times, dist_count, and reserved parameters are set to 1, 1, and false, respectively.

To use the distinct uniq plug-in, include duniqfield:field in the kvpairs clause.

Take note of the following items:

The value of the duniqfield parameter in the kvpairs clause must be the same as the value of the dist_key parameter in the distinct clause.
This plug-in works only if the dist_times parameter is set to 1, the dist_count parameter is set to 1, and the reserved parameter is set to false. If you change the values of these parameters to other values, this plug-in does not work.
For performance reasons, this plug-in returns at most 5,000 results for each query, even if the query produces more than 5,000 results.

Example:

{
  "distinct" : {
    "default": {  
      "dist_key" : "company_id",
      "dist_count":1,
      "dist_times" : 1,
      "reserved" : false
    }
  },
  "kvpairs" : {
    "duniqfield":"company_id"
  }
}
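
The preconditions listed above can be checked on the client before a query is sent. The following sketch builds the same fragment as the example and verifies the conditions. The helper names build_uniq_query and uniq_plugin_applies are hypothetical, and inspecting only the default rule is an assumption based on the example above.

```python
import json
from typing import Any, Dict


def build_uniq_query(dist_key: str) -> Dict[str, Any]:
    """Build a query fragment that satisfies the plug-in preconditions."""
    return {
        "distinct": {
            "default": {
                "dist_key": dist_key,
                "dist_count": 1,
                "dist_times": 1,
                "reserved": False,
            }
        },
        "kvpairs": {"duniqfield": dist_key},
    }


def uniq_plugin_applies(query: Dict[str, Any]) -> bool:
    """Check the documented conditions on the default rule (assumption)."""
    rule = query.get("distinct", {}).get("default", {})
    kvpairs = query.get("kvpairs", {})
    return (
        "dist_key" in rule
        # dist_count and dist_times default to 1; reserved defaults to true.
        and rule.get("dist_count", 1) == 1
        and rule.get("dist_times", 1) == 1
        and rule.get("reserved", True) is False
        and kvpairs.get("duniqfield") == rule["dist_key"]
    )


if __name__ == "__main__":
    query = build_uniq_query("company_id")
    print(json.dumps(query, indent=2))
    print("plug-in conditions met:", uniq_plugin_applies(query))
```

Because reserved defaults to true, a rule that omits reserved does not meet the conditions even when dist_count and dist_times are left at their defaults.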

Usage notes

The fields that you specify in a distinct clause must be defined as attribute fields in the schema.json file.
The ARRAY type is not supported. Only the INT and LITERAL types are supported.