Grouped similarity search

This topic describes how to group the results of a vector search.

Background

In some scenarios, the results of a vector search need to be grouped. Here are some examples:

In a retrieval-augmented generation (RAG) system, a document is often split into multiple segments, and each segment is vectorized and stored in DashVector. To get diverse results, a search should return the most similar segments from different documents rather than many segments from the same document.
In product image retrieval, a product often has multiple images, and each image is vectorized and stored in DashVector. To get diverse results, a search should return the most similar images of different products rather than many images of the same product.

DashVector supports grouped vector search. In the scenarios above, you can call the grouped document search operation and set group_by_field to the field that stores the document ID or the product ID, respectively. For more information, see Grouped document search.
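The grouping idea can be illustrated without a cluster. The following is a minimal conceptual sketch in plain Python, not DashVector's implementation: the helper name group_top_hits and the hit dictionaries are hypothetical, and the sketch assumes that a higher score means a more similar hit.

```python
def group_top_hits(hits, group_by_field, group_count, group_topk):
    """Conceptual sketch only: keep up to `group_topk` hits for each of
    the first `group_count` groups, visiting hits from best to worst.
    Assumes a higher score means a more similar hit."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)
    groups = {}  # dicts preserve insertion order (Python 3.7+)
    for hit in ranked:
        key = hit["fields"][group_by_field]
        if key not in groups:
            if len(groups) >= group_count:
                continue  # enough groups already; ignore new group keys
            groups[key] = []
        if len(groups[key]) < group_topk:
            groups[key].append(hit)
    return groups


# Hypothetical hits: six chunks that belong to three documents.
hits = [
    {"id": "1", "score": 0.43, "fields": {"document_id": "paper-01"}},
    {"id": "2", "score": 0.68, "fields": {"document_id": "paper-01"}},
    {"id": "3", "score": 0.66, "fields": {"document_id": "paper-02"}},
    {"id": "4", "score": 0.30, "fields": {"document_id": "paper-02"}},
    {"id": "5", "score": 0.44, "fields": {"document_id": "paper-02"}},
    {"id": "6", "score": 0.20, "fields": {"document_id": "paper-03"}},
]

grouped = group_top_hits(hits, "document_id", group_count=2, group_topk=2)
for key, docs in grouped.items():
    print(key, [d["id"] for d in docs])
```

With these made-up scores, the sketch keeps chunks 2 and 1 for paper-01 and chunks 3 and 5 for paper-02, and drops paper-03 because only two groups were requested.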

Example

Prerequisites

A cluster is created. For more information, see Create a cluster.
An API key is obtained. For more information, see Manage API keys.
The latest version of the SDK is installed. For more information, see Install DashVector SDK.

Insert documents with fields

Note

Before you run the sample code, replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with the endpoint of your cluster.

import dashvector
import numpy as np

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
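# Create a collection. document_id and chunk_id are declared in fields_schema,
# so either of them can later be used as group_by_field.
ret = client.create(
    name='group_by_demo',
    dimension=4,
    fields_schema={'document_id': str, 'chunk_id': int}
)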
assert ret

collection = client.get(name='group_by_demo')

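# Insert six chunks that belong to three documents.
# content is not declared in fields_schema, so it cannot be used for grouping.
ret = collection.insert([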
    ('1', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 1, 'content': 'xxxA'}),
    ('2', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 2, 'content': 'xxxB'}),
    ('3', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 1, 'content': 'xxxC'}),
    ('4', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 2, 'content': 'xxxD'}),
    ('5', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 3, 'content': 'xxxE'}),
    ('6', np.random.rand(4), {'document_id': 'paper-03', 'chunk_id': 1, 'content': 'xxxF'}),
])
assert ret

Perform a grouped vector search

ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],
    group_by_field='document_id',  # Group return results by the value of the document_id field.
    group_count=2,  # Return two groups.
    group_topk=2,   # Return up to two documents from each group.
)
# Check whether the operation is successful.
if ret:
    print('query_group_by success')
    print(len(ret))
    print('------------------------')
    for group in ret:
        print('group key:', group.group_id)
        for doc in group.docs:
            print(' -', doc)

The sample output is as follows:

query_group_by success
4
------------------------
group key: paper-01
 - {"id": "2", "fields": {"document_id": "paper-01", "chunk_id": 2, "content": "xxxB"}, "score": 0.6807}
 - {"id": "1", "fields": {"document_id": "paper-01", "chunk_id": 1, "content": "xxxA"}, "score": 0.4289}
group key: paper-02
 - {"id": "3", "fields": {"document_id": "paper-02", "chunk_id": 1, "content": "xxxC"}, "score": 0.6553}
 - {"id": "5", "fields": {"document_id": "paper-02", "chunk_id": 3, "content": "xxxE"}, "score": 0.4401}
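In a RAG pipeline, a common next step is to collect the chunk IDs returned for each document. The following is a minimal sketch of that post-processing. The helper name collect_ids_by_group is hypothetical, and the SimpleNamespace objects are stand-ins that mimic the group_id and docs attributes used in the sample above so that the sketch runs without a cluster. It also assumes that each returned document exposes an id attribute, matching the "id" key in the sample output.

```python
from types import SimpleNamespace


def collect_ids_by_group(groups):
    """Map each group key to the list of document IDs returned for it.
    `groups` is any iterable of objects with `group_id` and `docs`."""
    return {group.group_id: [doc.id for doc in group.docs] for group in groups}


# Stand-in objects that mirror the sample output above (not real SDK objects).
groups = [
    SimpleNamespace(group_id='paper-01', docs=[
        SimpleNamespace(id='2'),
        SimpleNamespace(id='1'),
    ]),
    SimpleNamespace(group_id='paper-02', docs=[
        SimpleNamespace(id='3'),
        SimpleNamespace(id='5'),
    ]),
]

ids_by_group = collect_ids_by_group(groups)
for group_id, doc_ids in ids_by_group.items():
    print(group_id, doc_ids)
```

With a real query result, the same function could be called on the value returned by query_group_by after checking that the call succeeded.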

Limitations

Important

For the group_by_field parameter, you can specify only a field that is defined in the fields_schema parameter when the collection is created. Schema-free fields do not support grouped search. For more information, see Create a collection and Schema-free.
group_count and group_topk are best-effort parameters. The number of groups returned and the number of documents in each group may be smaller than the specified values. DashVector gives higher priority to group_count.
Larger values of group_count and group_topk increase the index scan workload and therefore the time required for the API call. Currently, the maximum value of group_count is 64 and the maximum value of group_topk is 16.
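To catch out-of-range values before a request is sent, the limits above can be checked on the client side. The following is a minimal sketch, not part of the DashVector SDK: the helper name check_group_params is hypothetical, the upper bounds are the current limits stated above, and the lower bound of 1 is an assumption.

```python
MAX_GROUP_COUNT = 64  # current maximum for group_count, as stated above
MAX_GROUP_TOPK = 16   # current maximum for group_topk, as stated above


def check_group_params(group_count, group_topk):
    """Raise ValueError if a value is outside the documented limits.
    The lower bound of 1 is an assumption made for this sketch."""
    if not 1 <= group_count <= MAX_GROUP_COUNT:
        raise ValueError(
            f'group_count must be between 1 and {MAX_GROUP_COUNT}, got {group_count}'
        )
    if not 1 <= group_topk <= MAX_GROUP_TOPK:
        raise ValueError(
            f'group_topk must be between 1 and {MAX_GROUP_TOPK}, got {group_topk}'
        )
    return group_count, group_topk


# Values within the limits pass through unchanged.
print(check_group_params(2, 2))
```

Because the limits may change, keeping them in named constants makes them easy to update in one place.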