This topic describes how to group the return values of a vector search.
Background
In some scenarios, the return values of a vector search need to be grouped. Here are some examples:
In a retrieval-augmented generation (RAG) system, a document is often split into multiple segments, and each segment is vectorized and stored in DashVector. To diversify the results of a vector search, the search should return the most similar segments from different documents instead of many segments from the same document.
In a product image retrieval scenario, a product often has multiple images, and each image is vectorized and stored in DashVector. To diversify the results of a vector search, the search should return the most similar images of different products instead of many images of the same product.
DashVector supports grouped vector search. In the preceding scenarios, you can call the grouped document search operation and set group_by_field to the field that stores the document ID or the product ID, respectively. For more information, see Grouped document search.
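The following is a minimal sketch of what grouping means for a ranked result list. It is an illustration in plain Python, not DashVector's implementation. The function name group_hits, the shape of each hit (a dict with id, score, and fields keys), and the assumption that a higher score means a more similar hit are all hypothetical; the actual score ordering depends on the metric of your collection.

```python
def group_hits(hits, group_by_field, group_count, group_topk):
    """Group hits by a field value, keeping the best hits per group.

    hits: list of dicts with 'id', 'score', and 'fields' keys (hypothetical shape).
    Assumption: a higher score means a more similar hit.
    """
    groups = {}  # dicts preserve insertion order, so groups stay ranked
    for hit in sorted(hits, key=lambda h: h['score'], reverse=True):
        key = hit['fields'][group_by_field]
        if key not in groups:
            if len(groups) == group_count:
                continue  # enough groups already; ignore hits from new groups
            groups[key] = []
        if len(groups[key]) < group_topk:
            groups[key].append(hit)
    return groups


# Made-up hits that stand in for the segments of three documents.
hits = [
    {'id': '1', 'score': 0.43, 'fields': {'document_id': 'paper-01'}},
    {'id': '2', 'score': 0.68, 'fields': {'document_id': 'paper-01'}},
    {'id': '3', 'score': 0.66, 'fields': {'document_id': 'paper-02'}},
    {'id': '4', 'score': 0.10, 'fields': {'document_id': 'paper-02'}},
    {'id': '5', 'score': 0.44, 'fields': {'document_id': 'paper-02'}},
    {'id': '6', 'score': 0.50, 'fields': {'document_id': 'paper-03'}},
]

grouped = group_hits(hits, 'document_id', group_count=2, group_topk=2)
for key, docs in grouped.items():
    print('group key:', key, [doc['id'] for doc in docs])
```

In this sketch, the hit from paper-03 is dropped even though it scores higher than some returned hits, because two groups have already been formed. This is the trade-off of grouping: diversity across groups takes precedence over the raw ranking.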
Example
Prerequisites
A cluster is created. For more information, see Create a cluster.
An API key is obtained. For more information, see Manage API keys.
The latest version of the DashVector SDK is installed. For more information, see Install DashVector SDK.
Insert documents with fields
Before you run the sample code, replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with the endpoint of your cluster.
import dashvector
import numpy as np

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)

# Define document_id and chunk_id in fields_schema so that they can be used for grouping.
ret = client.create(
    name='group_by_demo',
    dimension=4,
    fields_schema={'document_id': str, 'chunk_id': int}
)
assert ret

collection = client.get(name='group_by_demo')

# The content field is not defined in fields_schema, so it is a schema-free field.
ret = collection.insert([
    ('1', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 1, 'content': 'xxxA'}),
    ('2', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 2, 'content': 'xxxB'}),
    ('3', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 1, 'content': 'xxxC'}),
    ('4', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 2, 'content': 'xxxD'}),
    ('5', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 3, 'content': 'xxxE'}),
    ('6', np.random.rand(4), {'document_id': 'paper-03', 'chunk_id': 1, 'content': 'xxxF'}),
])
assert ret
Perform a grouped vector search
ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],
    group_by_field='document_id',  # Group return results by the value of the document_id field.
    group_count=2,  # Return two groups.
    group_topk=2,  # Return up to two documents from each group.
)
# Check whether the operation is successful.
if ret:
    print('query_group_by success')
    print(len(ret))
    print('------------------------')
    for group in ret:
        print('group key:', group.group_id)
        for doc in group.docs:
            prefix = ' -'
            print(prefix, doc)
The sample output is as follows:
query_group_by success
4
------------------------
group key: paper-01
- {"id": "2", "fields": {"document_id": "paper-01", "chunk_id": 2, "content": "xxxB"}, "score": 0.6807}
- {"id": "1", "fields": {"document_id": "paper-01", "chunk_id": 1, "content": "xxxA"}, "score": 0.4289}
group key: paper-02
- {"id": "3", "fields": {"document_id": "paper-02", "chunk_id": 1, "content": "xxxC"}, "score": 0.6553}
- {"id": "5", "fields": {"document_id": "paper-02", "chunk_id": 3, "content": "xxxE"}, "score": 0.4401}
Limitations
For the group_by_field parameter, you can specify only a field that is defined by using the fields_schema parameter when you create a collection. Schema-free fields do not support grouped search. For more information, see Create a collection and Schema-free.
group_count and group_topk are best-effort parameters. The number of groups returned and the number of documents returned in each group may be smaller than the specified values. DashVector gives a higher priority to group_count.
Larger values of group_count and group_topk increase the index scan workload, thereby increasing the time required for the API call. At present, the maximum value of group_count is 64 and that of group_topk is 16.
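The limits above can be checked on the client side before a request is sent, and the best-effort behavior can be detected after a response is received. The following is a minimal sketch in plain Python. The helper names check_group_params and shortfall are hypothetical and are not part of the DashVector SDK. Treating values below 1 as invalid is an assumption, and the groups argument is assumed to be a mapping from a group key to the list of documents returned for that group.

```python
MAX_GROUP_COUNT = 64  # maximum value of group_count stated above
MAX_GROUP_TOPK = 16   # maximum value of group_topk stated above


def check_group_params(group_count, group_topk):
    """Raise ValueError if a parameter is outside the documented range.

    Assumption: values below 1 are treated as invalid.
    """
    if not 1 <= group_count <= MAX_GROUP_COUNT:
        raise ValueError(f'group_count must be in [1, {MAX_GROUP_COUNT}]')
    if not 1 <= group_topk <= MAX_GROUP_TOPK:
        raise ValueError(f'group_topk must be in [1, {MAX_GROUP_TOPK}]')


def shortfall(groups, group_count, group_topk):
    """Report how far a best-effort result falls short of the request.

    groups: mapping from group key to the list of returned documents.
    Returns (number of missing groups, {group key: number of missing documents}).
    """
    missing_groups = max(0, group_count - len(groups))
    missing_docs = {
        key: group_topk - len(docs)
        for key, docs in groups.items()
        if len(docs) < group_topk
    }
    return missing_groups, missing_docs


check_group_params(group_count=2, group_topk=2)
result = shortfall({'paper-01': ['2', '1'], 'paper-03': ['6']}, group_count=3, group_topk=2)
print(result)
```

A caller can use the reported shortfall to decide whether to accept the smaller result or to issue a follow-up query, for example a query that is restricted to the groups that came back short.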