distinct clause - OpenSearch - Alibaba Cloud Documentation Center

Description

You can include a distinct clause in a query statement to disperse the documents that the query retrieves. Dispersing helps the system return more varied results and improves user experience. For example, a query retrieves a large number of documents, but multiple documents from one user have high scores and rank at the top. As a result, most of the results displayed on the same page come from that user, which degrades the display and the user experience. In this case, you can include a distinct clause in the statement to extract specific documents from the retrieved set based on the rules that you specify in the clause. The system then disperses the documents and sorts them in a new order so that documents from each user are displayed.

Syntax

distinct=dist_key:field,dist_count:1,dist_times:1,reserved:false

Parameters:

dist_key: required. The attribute field based on which you want to disperse the documents that are obtained.
dist_times: optional. The number of times that you want to perform document extraction. Default value: 1.
dist_count: optional. The number of documents that you want to extract for each value of the dist_key field each time that document extraction is performed. Default value: 1.
reserved: optional. Specifies whether to retain the remaining documents that are not extracted. Valid values: true and false. Default value: true. If you set the value of this parameter to false, the system discards the documents that are not extracted. In this case, the value of the total_hit response parameter may be inaccurate.
update_total_hit: optional. Default value: false. If you set the reserved parameter to false and the update_total_hit parameter to true, the system subtracts the number of discarded documents from the value of the total_hit response parameter. The resulting value may still be inaccurate. If you set the update_total_hit parameter to false, the value of the total_hit parameter includes the number of discarded documents.
dist_filter: optional. The filter conditions. Documents that are filtered based on the specified conditions are not dispersed. During fine sorting, the system sorts these filtered documents together with the documents that are extracted by using the distinct clause. By default, the system disperses all documents that are obtained.
grade: optional. The score thresholds based on which the system classifies documents into grades. The system extracts documents from each grade separately. If you do not include the grade parameter in the distinct clause, the system classifies all documents into one grade. Separate multiple thresholds with vertical bars (|). The number of thresholds that you can specify is not limited.
Example 1: grade:3.0. Documents are classified into two grades. Documents with a score less than 3.0 are classified into the first grade. Documents with a score greater than or equal to 3.0 are classified into the second grade.
Example 2: grade:3.0|5.0. Documents are classified into three grades. Documents with a score less than 3.0 are classified into the first grade. Documents with a score greater than or equal to 3.0 but less than 5.0 are classified into the second grade. Documents with a score greater than or equal to 5.0 are classified into the third grade.
The sorting order of grades must be the same as the sorting order of documents during rough sorting. If documents are sorted in descending order during rough sorting, grades are also sorted in descending order. If documents are sorted in ascending order during rough sorting, grades are also sorted in ascending order.
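The threshold rule of the grade parameter can be illustrated with a short Python sketch. The function name grade_index and the zero-based grade numbering are assumptions made for illustration. The sketch only mirrors the boundary rule described above and is not the engine's implementation.

```python
from bisect import bisect_right


def grade_index(score, grade_spec):
    """Return the zero-based grade that a score falls into.

    grade_spec uses the syntax of the grade parameter, for example "3.0|5.0".
    Hypothetical helper for illustration only.
    """
    # Sorting is a precaution in case thresholds are not listed in ascending order.
    thresholds = sorted(float(t) for t in grade_spec.split("|"))
    # A score equal to a threshold belongs to the higher grade, so count
    # the thresholds that are less than or equal to the score.
    return bisect_right(thresholds, score)


for score in (2.9, 3.0, 4.9, 5.0):
    print(score, "-> grade", grade_index(score, "3.0|5.0"))
```

With grade:3.0|5.0, a score of 2.9 maps to grade 0 (the first grade), scores of 3.0 and 4.9 map to grade 1, and a score of 5.0 maps to grade 2.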

Example:

distinct=dist_key:company_id,dist_count:2,dist_times:10
In this example, the system performs 10 rounds of document extraction based on the company_id field and extracts up to 2 documents for each company_id value in each round. The system assigns lower ranks to documents that are not extracted.
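If clauses are assembled in application code, a small helper can keep the parameter order and separators consistent. The following Python sketch is one possible way to do this. The function name build_distinct_clause is hypothetical, and only the clause syntax itself comes from this topic.

```python
def build_distinct_clause(dist_key, dist_count=1, dist_times=1, reserved=None):
    """Assemble a distinct clause string from its parameters.

    Hypothetical helper: it only concatenates the documented key:value pairs.
    """
    if not dist_key:
        raise ValueError("dist_key is required")
    parts = [
        f"dist_key:{dist_key}",
        f"dist_count:{dist_count}",
        f"dist_times:{dist_times}",
    ]
    # Emit reserved only when the caller sets it, so that the default applies otherwise.
    if reserved is not None:
        parts.append("reserved:" + ("true" if reserved else "false"))
    return "distinct=" + ",".join(parts)


print(build_distinct_clause("company_id", dist_count=2, dist_times=10))
```

The call above prints the same clause as the preceding example.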

dist_count and dist_times

The following examples show how to specify the dist_count and dist_times parameters in a distinct clause and how the system obtains distinct results based on the values of these parameters.

Assume that the system obtains six documents for a query. Each document contains two attribute fields: id and name. The id field is the primary key field, and the name field is used as the distinct key.

doc 1: id:1 name:a

doc 2: id:2 name:a

doc 3: id:3 name:a

doc 4: id:4 name:b

doc 5: id:5 name:c

doc 6: id:6 name:c

Case 1:

distinct=dist_key:name,dist_count:2,dist_times:1 returns doc 1, doc 2, doc 4, doc 5, and doc 6 in sequence.

Case 2:

distinct=dist_key:name,dist_count:1,dist_times:2 returns doc 1, doc 4, doc 5, doc 2, and doc 6 in sequence.

Case 3:

distinct=dist_key:name,dist_count:1,dist_times:1 returns doc 1, doc 4, and doc 5 in sequence.
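The three cases can be reproduced with a small Python model of round-based extraction. This is a sketch under stated assumptions, not the engine's implementation: the function name disperse and the dictionary representation of documents are hypothetical, and the model assumes that each round scans the remaining documents in rank order and takes up to dist_count documents for each distinct key value. The model returns the extracted documents and the retained documents separately. The cases above list only the extracted documents.

```python
from collections import Counter


def disperse(docs, dist_key, dist_count=1, dist_times=1, reserved=True):
    """Model of document extraction (illustrative assumption, not the real engine).

    Returns (extracted, retained). retained is empty when reserved is False.
    """
    extracted = []
    remaining = list(docs)
    for _ in range(dist_times):
        taken = Counter()  # documents taken per key value in this round
        leftover = []
        for doc in remaining:
            key = doc[dist_key]
            if taken[key] < dist_count:
                taken[key] += 1
                extracted.append(doc)
            else:
                leftover.append(doc)
        remaining = leftover
    return extracted, (remaining if reserved else [])


docs = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "a"},
    {"id": 3, "name": "a"},
    {"id": 4, "name": "b"},
    {"id": 5, "name": "c"},
    {"id": 6, "name": "c"},
]


def ids(result):
    return [doc["id"] for doc in result]


for count, times in ((2, 1), (1, 2), (1, 1)):
    extracted, retained = disperse(docs, "name", dist_count=count, dist_times=times)
    print(f"dist_count:{count},dist_times:{times}", ids(extracted), ids(retained))
```

For the three parameter combinations, the extracted IDs are 1, 2, 4, 5, 6 for Case 1, 1, 4, 5, 2, 6 for Case 2, and 1, 4, 5 for Case 3, which matches the sequences listed above.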

Disperse documents in multiple phases

An OpenSearch Retrieval Engine Edition instance supports two phases of sorting on the Searcher node: rough sorting and fine sorting. The system determines the number of times that fine sorting is performed based on the number of scorers that you specify. You can specify two sub-distinct clauses in a distinct clause. The system executes the first sub-distinct clause during rough sorting and executes the second sub-distinct clause after rough sorting and fine sorting are complete. These two phases of document dispersing are called in-sorting dispersing and after-sorting dispersing. For example, you can specify a large dist_count value in the first sub-distinct clause and a small dist_count value in the second sub-distinct clause. The large dist_count value in the first sub-distinct clause ensures that the system extracts a sufficient number of documents during rough sorting. You can also skip dispersing during rough sorting and disperse documents only after rough sorting and fine sorting are complete.

Note
If documents are dispersed after rough sorting and fine sorting are complete and the value of start + hit is greater than the value of the rank_size parameter of the scorers, pagination may become unstable.

Syntax: distinct = sub_distinct_clause_when_sort;sub_distinct_clause_after_sort. In this syntax, sub_distinct_clause_when_sort is the sub-distinct clause that is used for document dispersing during rough sorting, and sub_distinct_clause_after_sort is the sub-distinct clause that is used for document dispersing after rough sorting and fine sorting are complete.
Based on this syntax, you can specify a distinct clause in the following formats. In the examples, none_sub_dist_clause is written as none_dist.
distinct = sub_dist_clause;none_sub_dist_clause: The system disperses documents based on sub_dist_clause during rough sorting and does not disperse documents after rough sorting and fine sorting are complete. Example: "distinct=dist_key:company_id,dist_count:1,dist_times:1;none_dist".
distinct = sub_dist_clause1;sub_dist_clause2: The system disperses documents based on sub_dist_clause1 during rough sorting and disperses documents based on sub_dist_clause2 after rough sorting and fine sorting are complete. Example: "distinct=dist_key:company_id,dist_count:2,dist_times:1;dist_key:company_id,dist_count:1,dist_times:1".
distinct = none_sub_dist_clause;sub_dist_clause: The system does not disperse documents during rough sorting and disperses documents based on sub_dist_clause after rough sorting and fine sorting are complete. Example: "distinct=none_dist;dist_key:company_id,dist_count:1,dist_times:1".
distinct = sub_dist_clause: The system disperses documents based on sub_dist_clause both during rough sorting and after rough sorting and fine sorting are complete. Example: "distinct=dist_key:company_id,dist_count:1,dist_times:1".
The distinct = none_sub_dist_clause;none_sub_dist_clause format is invalid. If you do not want the system to disperse the documents that are obtained, do not include the distinct clause in the statement.
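The two-phase formats can be sketched as a small Python helper that joins two sub-distinct clauses and rejects the invalid combination. The names sub_clause and two_phase_distinct are hypothetical and exist only for this illustration. The none_dist keyword and the semicolon separator are taken from the examples above.

```python
NONE_DIST = "none_dist"


def sub_clause(dist_key, dist_count=1, dist_times=1):
    """Build one sub-distinct clause (hypothetical helper)."""
    return f"dist_key:{dist_key},dist_count:{dist_count},dist_times:{dist_times}"


def two_phase_distinct(when_sort=None, after_sort=None):
    """Join the rough-sorting and after-sorting sub-distinct clauses.

    Pass None for a phase in which documents are not dispersed.
    """
    if when_sort is None and after_sort is None:
        # Two none_dist sub-clauses are invalid; the distinct clause must be omitted instead.
        raise ValueError("omit the distinct clause instead of using none_dist twice")
    return "distinct=" + (when_sort or NONE_DIST) + ";" + (after_sort or NONE_DIST)


# Large dist_count during rough sorting, small dist_count after sorting.
print(two_phase_distinct(sub_clause("company_id", 2, 1), sub_clause("company_id", 1, 1)))
# Disperse only after rough sorting and fine sorting are complete.
print(two_phase_distinct(None, sub_clause("company_id")))
```

The two calls print the same clauses as the sub_dist_clause1;sub_dist_clause2 example and the none_sub_dist_clause;sub_dist_clause example.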

distinct uniq plug-in

If you set the reserved parameter to false, the values of the total and viewtotal parameters may be inaccurate. Errors may occur if you implement pagination or perform other operations based on these values. To address this issue, OpenSearch provides the distinct uniq plug-in, which ensures that the values of the total and viewtotal parameters are accurate when the dist_times parameter is set to 1, the dist_count parameter is set to 1, and the reserved parameter is set to false.

To use the distinct uniq plug-in, include duniqfield:field in a kvpairs clause. For information about how to specify a kvpairs clause, see kvpairs clause. Example: kvpairs=duniqfield:name

Note:

The value of the duniqfield parameter in a kvpairs clause must be the same as the value of the dist_key parameter in a distinct clause.
This plug-in works only if the dist_times parameter is set to 1, the dist_count parameter is set to 1, and the reserved parameter is set to false. If you change the values of these parameters to other values, this plug-in does not work.
For performance reasons, this plug-in can return up to 5,000 query results for each query. If the number of query results is greater than 5,000, this plug-in returns 5,000 results.

Example:

distinct=dist_key:company_id,dist_count:1,dist_times:1,reserved:false&&kvpairs=duniqfield:company_id
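Because the plug-in works only for one parameter combination, it can help to check the settings before a query is sent. The following Python sketch shows one way to do this. The function names uniq_plugin_applies and distinct_uniq_query are hypothetical, and the check only encodes the conditions listed in the preceding notes.

```python
def uniq_plugin_applies(dist_key, dist_count, dist_times, reserved, duniqfield):
    """Return True if the settings meet the conditions listed for the plug-in."""
    return (
        duniqfield == dist_key  # duniqfield must equal dist_key
        and dist_count == 1
        and dist_times == 1
        and reserved is False
    )


def distinct_uniq_query(dist_key):
    """Build the distinct and kvpairs clauses that enable the plug-in."""
    distinct = f"distinct=dist_key:{dist_key},dist_count:1,dist_times:1,reserved:false"
    kvpairs = f"kvpairs=duniqfield:{dist_key}"
    # Clauses are joined with && as in the example above.
    return distinct + "&&" + kvpairs


print(distinct_uniq_query("company_id"))
print(uniq_plugin_applies("company_id", 2, 1, False, "company_id"))
```

The first call prints the same clauses as the preceding example. The second call prints False because dist_count is not 1.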

Usage notes

A distinct clause is optional.
The fields that you specify in a distinct clause must be attribute fields that you specified in the schema.json file.
The ARRAY type is not supported. Only the INT and LITERAL types are supported.
You can specify only one field to disperse documents.