distinct clause - OpenSearch - Alibaba Cloud Documentation Center

The `distinct` clause diversifies search results by limiting how many documents with the same field value appear together. This topic covers the clause syntax, the `dist_count` and `dist_times` parameters, and the `distinct uniq` plugin.

Function overview

The `distinct` clause improves result diversity. When multiple documents from the same user have high scores, they may dominate the top results, filling the page with content from a single source. The `distinct` clause addresses this by sampling documents from each user, giving documents from a wider range of users a chance to appear.

Syntax

The syntax for the `distinct` clause is: `dist_key:field,dist_count:1,dist_times:1,reserved:false`

Parameter	Type	Required	Valid values	Default value	Description
dist_key	string	Yes			The field used for diversification.
dist_times	int	No		1	The number of sampling rounds.
dist_count	int	No		1	The number of documents to sample in each round.
reserved	true/false	No	true/false	true	Whether to keep the remaining documents after sampling. If set to false, the remaining documents are discarded, which makes the `total` count of matched results inaccurate.
update_total_hit	true/false	No	true/false	false	When `reserved` is false, if you set `update_total_hit` to true, the final `total_hit` is reduced by the number of documents discarded by `distinct`. This number may not be exact. If false, the number is not reduced.
dist_filter	string	No			A filter condition. Documents that are filtered out do not participate in the `distinct` operation and are sorted together with the first group of results returned by `distinct`. By default, all documents participate in the `distinct` operation.
grade	float	No			Specifies thresholds to divide documents into grades. The `distinct` operation is then performed on documents within each grade. If this parameter is not specified, all documents are treated as a single grade. Documents are graded based on their scores from the first-dimension sort criterion. Use a pipe (|) to separate multiple thresholds. The number of grades is not limited. For example: 1. `grade:3.0`: Divides documents into two grades based on the first-dimension sort score. Documents with scores less than 3.0 are in the first grade. Documents with scores of 3.0 or higher are in the second grade. 2. `grade:3.0|5.0`: Divides documents into three grades. The first grade is for scores less than 3.0. The second grade is for scores of 3.0 or higher but less than 5.0. The third grade is for scores of 5.0 or higher. The order of the grades follows the sort order of the first-dimension criterion. For example, if the sort order is descending, the grades are also ordered from highest to lowest.
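
The grading rule in the `grade` row can be sketched in a few lines of Python. This is only an illustration of the threshold boundaries described above, not the engine's implementation; the helper names `parse_grade` and `grade_index` are hypothetical, and the returned index is just a label (0 for the lowest score range), not the order in which grades are returned.

```python
from bisect import bisect_right


def parse_grade(value):
    """Split a grade value such as '3.0|5.0' into a sorted list of thresholds."""
    return sorted(float(part) for part in value.split("|"))


def grade_index(score, thresholds):
    """Return 0 for scores below the first threshold, 1 for the next range, and so on.

    bisect_right places a score equal to a threshold in the higher grade,
    matching "scores of 3.0 or higher are in the second grade".
    """
    return bisect_right(thresholds, score)


thresholds = parse_grade("3.0|5.0")
for score in (2.9, 3.0, 4.9, 5.0):
    print(score, "-> grade", grade_index(score, thresholds))
```

With `grade:3.0|5.0`, a score of 2.9 falls into the first range, 3.0 and 4.9 into the second, and 5.0 into the third.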

Explanation of dist_count and dist_times

The following examples illustrate how `dist_count` and `dist_times` work. Assume you have six documents where `id` is the primary key and `name` is the diversification field:

doc1: id:11 name:a

doc2: id:22 name:a

doc3: id:33 name:a

doc4: id:44 name:b

doc5: id:55 name:c

doc6: id:66 name:c

Case 1: The setting `distinct=dist_key:name,dist_count:2,dist_times:1,reserved:false` specifies one sampling round and two documents per round. The diversified result is: doc1, doc2, doc4, doc5, and doc6. doc3 is discarded because `name:a` has already contributed two documents.

Case 2: The setting `distinct=dist_key:name,dist_count:1,dist_times:2,reserved:false` specifies two sampling rounds and one document per round. The first round returns doc1, doc4, and doc5. The second round returns doc2 and doc6. The diversified result is: doc1, doc4, doc5, doc2, and doc6.

Case 3: The setting `distinct=dist_key:name,dist_count:1,dist_times:1,reserved:false` specifies one sampling round and one document per round. The diversified result is: doc1, doc4, and doc5.
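
The three cases can be reproduced with a small Python model of the sampling rounds. This is a minimal sketch, not the engine's algorithm: the function name `simulate_distinct` is hypothetical, the input list is assumed to already be in ranked order, and appending the leftover documents at the end when `reserved` is true is an assumption about ordering that the text above does not spell out.

```python
def simulate_distinct(docs, dist_key, dist_count=1, dist_times=1, reserved=True):
    """Model dist_times sampling rounds, taking up to dist_count docs per key value per round."""
    selected = []
    remaining = list(docs)  # assumed to be in ranked order
    for _ in range(dist_times):
        picked = {}    # documents taken per key value in this round
        leftover = []  # documents passed over in this round
        for doc in remaining:
            key = doc[dist_key]
            if picked.get(key, 0) < dist_count:
                picked[key] = picked.get(key, 0) + 1
                selected.append(doc)
            else:
                leftover.append(doc)
        remaining = leftover
    if reserved:
        # Assumption: documents that were never sampled follow the sampled ones.
        selected.extend(remaining)
    return selected


DOCS = [
    {"id": 11, "name": "a"},
    {"id": 22, "name": "a"},
    {"id": 33, "name": "a"},
    {"id": 44, "name": "b"},
    {"id": 55, "name": "c"},
    {"id": 66, "name": "c"},
]


def ids(docs):
    return [doc["id"] for doc in docs]


print(ids(simulate_distinct(DOCS, "name", dist_count=2, dist_times=1, reserved=False)))  # Case 1
print(ids(simulate_distinct(DOCS, "name", dist_count=1, dist_times=2, reserved=False)))  # Case 2
print(ids(simulate_distinct(DOCS, "name", dist_count=1, dist_times=1, reserved=False)))  # Case 3
```

The printed ID lists correspond to the document orders given in Case 1, Case 2, and Case 3.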

Notes

The `distinct` clause is optional.
Fields used in the `distinct` clause must be configured as property fields in the application schema.
Only fields of the int and literal types are supported. Array types are not supported.
You cannot specify multiple `dist_key` parameters.
Sorting does not automatically remove duplicates. However, you can use the `distinct` clause to deduplicate results. For example, to remove duplicate articles that have the same title, you can set `dist_key` to `title`, `dist_times` to `1`, and `dist_count` to `1`.

The distinct uniq plugin

When `reserved` is `false`, the `total` and `viewtotal` values may become inaccurate, which can cause issues for pagination or other operations. The `distinct uniq` plugin corrects these values when `dist_times` is `1`, `dist_count` is `1`, and `reserved` is `false`. To use the plugin, add `duniqfield:field` to the `kvpairs` clause. For more information, see the kvpairs clause. For example: `kvpairs=duniqfield:name`

Note:

`field` must be the same as the `dist_key` in the `distinct` clause.
The plugin is effective only for queries where `dist_times` is `1`, `dist_count` is `1`, and `reserved` is `false`.
For performance reasons, the plugin returns a maximum `total` value of 5,000. If the actual number of search results is greater than 5,000, the `total` value is still returned as 5,000.
This limit of 5,000 for the `total` value applies only when the `distinct uniq` plugin is used. If the plugin is not used, the `total` value is not capped at 5,000.
If you use this plugin and the query hits a large volume of data, such as millions of records, the query may time out.
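
One way to read the notes above: with `dist_count` and `dist_times` both set to `1` and `reserved` set to `false`, one document survives per distinct `dist_key` value, so the corrected `total` should approximate the number of distinct values, capped at 5,000. The Python sketch below models only that reading. The function name `expected_uniq_total` is hypothetical, and the plugin's actual counting method is not described in this topic.

```python
UNIQ_TOTAL_CAP = 5000  # maximum total returned when the distinct uniq plugin is used


def expected_uniq_total(key_values, cap=UNIQ_TOTAL_CAP):
    """Approximate the corrected total: one surviving document per distinct key value, capped."""
    return min(len(set(key_values)), cap)


# Six matched documents with name values a, a, a, b, c, c leave three documents.
print(expected_uniq_total(["a", "a", "a", "b", "c", "c"]))

# More than 5,000 distinct values still report the capped total.
print(expected_uniq_total(range(6000)))
```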

Examples

Search for documents that contain "Zhejiang University" and were created after the timestamp 1402301230. Diversify the results based on the `company_id` field. Perform 10 sampling rounds and sample 2 results per round.
```
query=default:'Zhejiang University'&&filter=create_time>1402301230&&distinct=dist_key:company_id,dist_count:2,dist_times:10
```
Search for documents that contain "Zhejiang University". Use the `company_id` field to deduplicate the results, keeping only one document per company. Discard the remaining documents and use the `distinct uniq` plugin to correct the `total` value.
```
query=default:'Zhejiang University'&&distinct=dist_key:company_id,dist_count:1,dist_times:1,reserved:false&&kvpairs=duniqfield:company_id
```
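
If you assemble these query strings in application code, a small helper can keep the clause syntax consistent. The following Python sketch is illustrative only: `build_distinct_clause` is a hypothetical helper, not part of any OpenSearch SDK, and it does not perform URL encoding or validate parameter names.

```python
def build_distinct_clause(dist_key, **options):
    """Build a distinct clause such as 'distinct=dist_key:f,dist_count:1,dist_times:1'."""
    parts = [f"dist_key:{dist_key}"]
    for name, value in options.items():
        if isinstance(value, bool):
            # The clause syntax uses lowercase true/false.
            value = "true" if value else "false"
        parts.append(f"{name}:{value}")
    return "distinct=" + ",".join(parts)


# Clauses in a search request are joined with '&&', as in the examples above.
query = "&&".join([
    "query=default:'Zhejiang University'",
    build_distinct_clause("company_id", dist_count=1, dist_times=1, reserved=False),
    "kvpairs=duniqfield:company_id",
])
print(query)
```

The printed string matches the second example above. Keyword arguments are emitted in the order they are passed.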