Distinct Clause Overview for Search Result Diversification

The distinct clause diversifies search results by limiting how many documents from the same field value appear in a result set. This prevents one dominant value from monopolizing the results page.

Common use cases:

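Deduplication: Return only one result per title, company, or product SKU. Set dist_count:1, dist_times:1, and reserved:false to keep exactly one document per field value. With the default reserved:true, the remaining documents are retained after the extracted ones instead of being discarded.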
Correcting skew: When one seller or author dominates your top results, use dist_count and dist_times to cap their share and surface documents from other values.

Syntax

distinct=dist_key:<field>,dist_count:<n>,dist_times:<n>,reserved:<true|false>

Parameters

Parameter	Type	Required	Default	Description
`dist_key`	string	Yes	—	The field to scatter results by. Must be an attribute field of INT or LITERAL type.
`dist_count`	int	No	1	Number of documents to extract per scatter operation.
`dist_times`	int	No	1	Number of scatter operations to perform.
`reserved`	true/false	No	true	Whether to retain documents that were not extracted. Set to `false` to discard them.
`update_total_hit`	true/false	No	false	Applies only when `reserved` is `false`. When `true`, the number of discarded documents is subtracted from `total_hit`, although the adjusted value may still be inaccurate. When `false`, `total_hit` includes discarded documents.
`dist_filter`	string	No	—	A filter condition that exempts matching documents from scattering. Exempted documents are sorted alongside the first group of scattered documents. By default, all documents are scattered.
`grade`	float	No	—	One or more score thresholds (separated by `|`) that split documents into categories before scattering. Each category is scattered independently using the same `dist_count` and `dist_times` values. Categories are sorted in the same order as the first category. If omitted, all documents are treated as one category.
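As an illustration of how the parameters above combine into one clause, here is a minimal Python sketch. The helper name build_distinct is hypothetical and not part of any OpenSearch SDK. It also assumes that update_total_hit and grade use the same key:value form as the four parameters shown in the Syntax line, and it chooses to reject update_total_hit when reserved is true, because the table says the parameter applies only when reserved is false.

```python
def build_distinct(dist_key, dist_count=1, dist_times=1, reserved=True,
                   update_total_hit=None, grade=None):
    """Assemble a distinct clause string (hypothetical helper, not an SDK call)."""
    parts = [
        f"dist_key:{dist_key}",
        f"dist_count:{dist_count}",
        f"dist_times:{dist_times}",
        f"reserved:{str(reserved).lower()}",
    ]
    if update_total_hit is not None:
        # The parameter table says this applies only when reserved is false.
        if reserved:
            raise ValueError("update_total_hit applies only when reserved is false")
        # Assumption: same key:value form as the parameters in the Syntax line.
        parts.append(f"update_total_hit:{str(update_total_hit).lower()}")
    if grade:
        # Thresholds are separated by "|", for example 3.0|5.0.
        parts.append("grade:" + "|".join(str(t) for t in grade))
    return "distinct=" + ",".join(parts)


print(build_distinct("name", dist_count=2, dist_times=1, reserved=False))
```

Writing reserved explicitly even when it has its default value keeps the generated clause unambiguous when it is read back in logs.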

Warning

When reserved is set to false, the total and viewtotal response values become inaccurate. If your application uses these values for pagination or display, see the distinct uniq plug-in section.

How dist_count and dist_times work

dist_count controls how many documents to extract per field value in one operation. dist_times controls how many operations to run. The extracted documents are placed at the front of the result set: documents from the first operation come first, followed by documents from the second operation, and so on.

Test data:

Document	id	name
doc1	11	a
doc2	22	a
doc3	33	a
doc4	44	b
doc5	55	c
doc6	66	c

Example 1 — Extract 2 documents per operation, 1 operation:

distinct=dist_key:name,dist_count:2,dist_times:1,reserved:false

Result: doc1, doc2, doc4, doc5, doc6

Each name value contributes up to 2 documents. name:a contributes doc1 and doc2; name:b contributes doc4; name:c contributes doc5 and doc6. doc3 (third document with name:a) is discarded.

Example 2 — Extract 1 document per operation, 2 operations:

distinct=dist_key:name,dist_count:1,dist_times:2,reserved:false

Result: doc1, doc4, doc5, doc2, doc6

Operation 1 takes the top document from each value: doc1 (a), doc4 (b), doc5 (c). Operation 2 takes the next document from each value: doc2 (a), doc6 (c). doc3 is discarded.

Example 3 — Extract 1 document per operation, 1 operation:

distinct=dist_key:name,dist_count:1,dist_times:1,reserved:false

Result: doc1, doc4, doc5

One document per value is kept. All remaining documents are discarded.
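The three examples above can be reproduced with a small client-side simulation. This is only a sketch of the behavior described on this page, not the engine's actual implementation. The function name scatter is hypothetical, and the code assumes that documents arrive already ranked and that each operation emits its extracted documents in that original rank order, which matches the results listed above.

```python
from collections import defaultdict

# Test data from the table above, in ranked order.
DOCS = [
    {"doc": "doc1", "id": 11, "name": "a"},
    {"doc": "doc2", "id": 22, "name": "a"},
    {"doc": "doc3", "id": 33, "name": "a"},
    {"doc": "doc4", "id": 44, "name": "b"},
    {"doc": "doc5", "id": 55, "name": "c"},
    {"doc": "doc6", "id": 66, "name": "c"},
]


def scatter(docs, key, dist_count=1, dist_times=1, reserved=True):
    """Simulate distinct scattering on an already-ranked list of documents."""
    # Group documents by field value, preserving rank order inside each group.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)

    taken = set()
    result = []
    for _ in range(dist_times):
        # One operation: up to dist_count not-yet-extracted documents per value.
        picked = set()
        for group in groups.values():
            remaining = [d for d in group if d["doc"] not in taken]
            picked.update(d["doc"] for d in remaining[:dist_count])
        # Emit this operation's documents in their original rank order.
        result.extend(d for d in docs if d["doc"] in picked)
        taken |= picked

    if reserved:
        # Documents that were never extracted are kept after the extracted ones.
        result.extend(d for d in docs if d["doc"] not in taken)
    return [d["doc"] for d in result]


print(scatter(DOCS, "name", dist_count=2, dist_times=1, reserved=False))
print(scatter(DOCS, "name", dist_count=1, dist_times=2, reserved=False))
print(scatter(DOCS, "name", dist_count=1, dist_times=1, reserved=False))
```

Calling the same function with reserved=True appends the documents that were not extracted after the scattered ones instead of dropping them.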

grade parameter

Use grade to classify documents into score-based categories before scattering. Each category is scattered independently using the same dist_count and dist_times values. Categories are sorted in the same order as the first category.

Single threshold — grade:3.0 creates two categories:

Category	Score range
First	score < 3.0
Second	score >= 3.0

Two thresholds — grade:3.0|5.0 creates three categories:

Category	Score range
First	score < 3.0
Second	3.0 <= score < 5.0
Third	score >= 5.0

There is no limit on the number of thresholds.
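The threshold rules in the two tables above amount to a sorted-boundary lookup. The following Python sketch shows one way to compute the category for a score on the client side, for instance when checking expected results in a test. The function names are hypothetical, and the category index is zero-based (0 is the first category).

```python
from bisect import bisect_right


def parse_grade(grade):
    """Turn a grade value such as "3.0|5.0" into a sorted list of thresholds."""
    return sorted(float(t) for t in grade.split("|"))


def grade_category(score, thresholds):
    """Return the zero-based category index for a score.

    A score equal to a threshold belongs to the higher category,
    matching "score >= 3.0" in the tables above.
    """
    return bisect_right(thresholds, score)


thresholds = parse_grade("3.0|5.0")
for score in (2.9, 3.0, 4.9, 5.0):
    print(score, "-> category", grade_category(score, thresholds))
```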

Usage notes

The distinct clause is optional.
Fields referenced in dist_key must be configured as attribute fields in the application schema.
dist_key supports only INT and LITERAL field types. ARRAY fields are not supported.
Specify only one field per distinct clause.
Sorting does not remove duplicates. To deduplicate by a field (for example, title), use a distinct clause with dist_count:1, dist_times:1, and reserved:false.

distinct uniq plug-in

When reserved is false, the total and viewtotal response values are inaccurate. The distinct uniq plug-in corrects these values, but only when dist_count is 1, dist_times is 1, and reserved is false.

To enable the plug-in, add duniqfield:<field> to the kvpairs clause:

kvpairs=duniqfield:<field>

The <field> value must match dist_key.

Limitations:

Works only when dist_times=1, dist_count=1, and reserved=false. Changing any of these values disables the plug-in.
Returns a maximum of 5,000 results per query, even if more results match.
May time out on queries that hit millions of records.
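Because the plug-in applies only to one exact parameter combination, a client may want to guard against adding duniqfield to a query where it would have no effect. The Python sketch below encodes the first limitation above. The helper name duniq_kvpairs is hypothetical, and returning None for unsupported combinations is a design choice of this sketch, not documented service behavior.

```python
def duniq_kvpairs(dist_key, dist_count, dist_times, reserved):
    """Return the kvpairs clause for the distinct uniq plug-in, or None.

    The plug-in works only when dist_count is 1, dist_times is 1,
    and reserved is false; duniqfield must match dist_key.
    """
    if dist_count != 1 or dist_times != 1 or reserved:
        return None
    return f"kvpairs=duniqfield:{dist_key}"


print(duniq_kvpairs("company_id", 1, 1, False))
print(duniq_kvpairs("company_id", 2, 1, False))
```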

Examples

Diversify results by company, keep all documents:

Search for documents containing "Zhejiang University" with create_time > 1402301230, and scatter by company_id in 10 operations that each extract up to 2 documents per company. Because reserved defaults to true, documents that are not extracted are retained and ranked after the extracted ones.

query=default:'Zhejiang University'&&filter=create_time>1402301230&&distinct=dist_key:company_id,dist_count:2,dist_times:10

Deduplicate by company with accurate result count:

Search for documents containing "Zhejiang University", keep only one document per company_id, and use the distinct uniq plug-in to get accurate total and viewtotal values.

query=default:'Zhejiang University'&&distinct=dist_key:company_id,dist_count:1,dist_times:1,reserved:false&&kvpairs=duniqfield:company_id
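Both examples above join their clauses with &&. As a rough sketch, the Python snippet below assembles such a string from name/value pairs. The function join_clauses is hypothetical and only concatenates text. It does not URL-encode the result or send a request, both of which depend on the client or SDK in use.

```python
def join_clauses(clauses):
    """Join (name, value) pairs into one string, separated by "&&"."""
    return "&&".join(f"{name}={value}" for name, value in clauses)


query_string = join_clauses([
    ("query", "default:'Zhejiang University'"),
    ("distinct", "dist_key:company_id,dist_count:1,dist_times:1,reserved:false"),
    ("kvpairs", "duniqfield:company_id"),
])
print(query_string)
```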