Perform grouped document search using the Python SDK - DashVector

This topic describes how to perform grouped similarity searches in a collection by using the DashVector SDK for Python.

Prerequisites

A cluster is created. For more information, see Create a cluster.
An API key is obtained. For more information, see Manage API keys.
The latest version of the SDK is installed. For more information, see Install DashVector SDK.

API definition

Python

Collection.query_group_by(
        self,
        vector: Optional[Union[List[Union[int, float]], np.ndarray]] = None,
        *,
        group_by_field: str,
        group_count: int = 10,
        group_topk: int = 10,
        id: Optional[str] = None,
        filter: Optional[str] = None,
        include_vector: bool = False,
        partition: Optional[str] = None,
        output_fields: Optional[List[str]] = None,
        sparse_vector: Optional[Dict[int, float]] = None,
        async_req: bool = False,
    ) -> DashVectorResponse:

Example

Note

Before you run the sample code, replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with the endpoint of your cluster.

Python

import dashvector
import numpy as np

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
ret = client.create(
    name='group_by_demo',
    dimension=4,
    fields_schema={'document_id': str, 'chunk_id': int}
)
assert ret

collection = client.get(name='group_by_demo')

ret = collection.insert([
    ('1', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 1, 'content': 'xxxA'}),
    ('2', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 2, 'content': 'xxxB'}),
    ('3', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 1, 'content': 'xxxC'}),
    ('4', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 2, 'content': 'xxxD'}),
    ('5', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 3, 'content': 'xxxE'}),
    ('6', np.random.rand(4), {'document_id': 'paper-03', 'chunk_id': 1, 'content': 'xxxF'}),
])
assert ret

Perform a grouped similarity search by using a vector

Python

ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],
    group_by_field='document_id',  # Group the returned results by the value of the document_id field.
    group_count=2,  # Return two groups.
    group_topk=2,   # Return up to two documents from each group.
)
# Check whether the operation is successful.
if ret:
    print('query_group_by success')
    print(len(ret))
    print('------------------------')
    for group in ret:
        print('group key:', group.group_id)
        for doc in group.docs:
            prefix = ' -'
            print(prefix, doc)

The sample output is as follows:

query_group_by success
4
------------------------
group key: paper-01
 - {"id": "2", "fields": {"document_id": "paper-01", "chunk_id": 2, "content": "xxxB"}, "score": 0.6807}
 - {"id": "1", "fields": {"document_id": "paper-01", "chunk_id": 1, "content": "xxxA"}, "score": 0.4289}
group key: paper-02
 - {"id": "3", "fields": {"document_id": "paper-02", "chunk_id": 1, "content": "xxxC"}, "score": 0.6553}
 - {"id": "5", "fields": {"document_id": "paper-02", "chunk_id": 3, "content": "xxxE"}, "score": 0.4401}
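
The following local sketch illustrates how group_count and group_topk shape a grouped result. It does not call DashVector and is not the service's actual algorithm. The group_hits helper is hypothetical, the scores for documents 4 and 6 are made up for illustration, and the sketch assumes that a higher score ranks first, as in the sample output above. How a score relates to similarity depends on the metric of the collection.

```python
from collections import defaultdict


def group_hits(hits, group_by_field, group_count=10, group_topk=10):
    """Group scored hits locally (hypothetical helper, not part of the SDK)."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["fields"][group_by_field]].append(hit)

    # Keep at most group_topk hits per group, highest score first (assumption).
    trimmed = {
        key: sorted(docs, key=lambda h: h["score"], reverse=True)[:group_topk]
        for key, docs in groups.items()
    }

    # Rank groups by their best hit and keep at most group_count groups.
    ranked = sorted(trimmed.items(), key=lambda kv: kv[1][0]["score"], reverse=True)
    return ranked[:group_count]


hits = [
    {"id": "1", "fields": {"document_id": "paper-01"}, "score": 0.4289},
    {"id": "2", "fields": {"document_id": "paper-01"}, "score": 0.6807},
    {"id": "3", "fields": {"document_id": "paper-02"}, "score": 0.6553},
    {"id": "4", "fields": {"document_id": "paper-02"}, "score": 0.3100},  # made-up score
    {"id": "5", "fields": {"document_id": "paper-02"}, "score": 0.4401},
    {"id": "6", "fields": {"document_id": "paper-03"}, "score": 0.2000},  # made-up score
]

result = group_hits(hits, "document_id", group_count=2, group_topk=2)
for key, docs in result:
    print("group key:", key, [doc["id"] for doc in docs])
```

With group_count=2 and group_topk=2, only the two best-ranked groups survive, and each group keeps at most two documents, which mirrors the shape of the sample output.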

Perform a grouped similarity search by using the vector associated with the primary key

Python

ret = collection.query_group_by(
    id='1',
    group_by_field='document_id',  # Use a field that is defined in the schema of the demo collection.
)
# Check whether the query method is successfully called.
if ret:
    print('query_group_by success')
    print(len(ret))
    for group in ret:
        print('group:', group.group_id)
        for doc in group.docs:
            print(doc)
            print(doc.id)
            print(doc.vector)
            print(doc.fields)

Perform a grouped similarity search by using the vector or primary key and a conditional filter

Python

# Perform a grouped similarity search by using a vector (or a primary key) and a conditional filter.
ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],   # Specify a vector for search. Alternatively, specify a primary key by using the id parameter.
    group_by_field='document_id',
    filter='chunk_id > 1',         # Match only documents whose chunk_id value is greater than 1.
    output_fields=['document_id', 'chunk_id'], # Return only the document_id and chunk_id fields.
    include_vector=True
)

Perform a grouped search by using both dense and sparse vectors

Note

You can use a sparse vector to represent the keyword weight to implement a keyword-aware semantic vector search.

Python

# Perform a grouped similarity search by using both dense and sparse vectors.
ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],   # Specify a dense vector for search.
    sparse_vector={1: 0.3, 20: 0.7},  # Map each keyword index to its weight.
    group_by_field='document_id',
)
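
As a rough illustration of what a keyword-weight sparse vector can look like, the sketch below turns a token list into a Dict[int, float] that matches the sparse_vector type. It is a toy under stated assumptions: the keyword_weights helper and the vocab mapping are hypothetical, the weights are plain normalized term frequencies, and a real system would more likely use a trained sparse encoder.

```python
from collections import Counter
from typing import Dict, List


def keyword_weights(tokens: List[str], vocab: Dict[str, int]) -> Dict[int, float]:
    """Map known tokens to {index: weight} using normalized term frequency.

    Hypothetical helper for illustration; not part of the DashVector SDK.
    """
    counts = Counter(token for token in tokens if token in vocab)
    total = sum(counts.values())
    if total == 0:
        return {}  # No known keywords, so no sparse component.
    return {vocab[token]: count / total for token, count in counts.items()}


# Hypothetical vocabulary: keyword -> integer index in the sparse space.
vocab = {"vector": 1, "search": 20, "database": 35}

sparse = keyword_weights(["vector", "search", "search", "engine"], vocab)
print(sparse)
```

The resulting dictionary has the same shape as the sparse_vector argument in the example above and could be passed in its place.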

Request parameters

Note

You must specify either the vector parameter or the id parameter.

Parameter	Type	Default value	Description
group_by_field	str	-	Required. The name of the field by which results are grouped. The field must be defined in the collection schema. Schema-free fields are not supported.
vector	Optional[Union[List[Union[int, float]], np.ndarray]]	None	Optional. The vector.
id	Optional[str]	None	Optional. The primary key. The similarity search is performed based on the vector associated with the primary key.
group_count	int	10	Optional. The maximum number of groups to be returned. This is a best-effort parameter: the search usually returns the specified number of groups, but this is not guaranteed.
group_topk	int	10	Optional. The maximum number of similar results to be returned per group. This is a best-effort parameter and has a lower priority than group_count.
filter	Optional[str]	None	Optional. The conditional filter, which must comply with the syntax of an SQL WHERE clause. For more information, see Conditional filtering.
include_vector	bool	False	Optional. Specifies whether to return vector data.
partition	Optional[str]	None	Optional. The name of the partition.
output_fields	Optional[List[str]]	None	Optional. The fields to be returned. By default, all fields are returned.
sparse_vector	Optional[Dict[int, float]]	None	Optional. The sparse vector.
async_req	bool	False	Optional. Specifies whether to enable the asynchronous mode.
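
The constraints in the table can be checked before a request is sent. The following is a minimal sketch, not part of the SDK: build_group_query_kwargs is a hypothetical helper, and rejecting a call that passes both vector and id, or a non-positive group_count or group_topk, is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List, Optional


def build_group_query_kwargs(
    group_by_field: str,
    vector: Optional[List[float]] = None,
    id: Optional[str] = None,
    group_count: int = 10,
    group_topk: int = 10,
    **extra: Any,
) -> Dict[str, Any]:
    """Validate arguments and return keyword arguments for query_group_by.

    Hypothetical helper for illustration only.
    """
    if not group_by_field:
        raise ValueError("group_by_field is required")
    # Assumption: exactly one of vector or id is allowed.
    if (vector is None) == (id is None):
        raise ValueError("specify exactly one of vector or id")
    # Assumption: both limits must be positive integers.
    if group_count < 1 or group_topk < 1:
        raise ValueError("group_count and group_topk must be positive")

    kwargs: Dict[str, Any] = {
        "group_by_field": group_by_field,
        "group_count": group_count,
        "group_topk": group_topk,
    }
    if vector is not None:
        kwargs["vector"] = vector
    else:
        kwargs["id"] = id
    kwargs.update(extra)  # For example filter, partition, or output_fields.
    return kwargs


kwargs = build_group_query_kwargs(
    "document_id", vector=[0.1, 0.2, 0.3, 0.4], group_count=2, filter="chunk_id > 1"
)
print(kwargs)
# The result could then be passed on as: collection.query_group_by(**kwargs)
```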

Response parameters

Note

A DashVectorResponse object is returned, which contains the operation result, as described in the following table.

Parameter	Type	Description	Example
code	int	The returned status code. For more information, see Status codes.	0
message	str	The returned message.	success
request_id	str	The unique ID of the request.	19215409-ea66-4db9-8764-26ce2eb5bb99
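output	List[Group]	The grouped search results. Each group exposes group_id (the value of the grouping field) and docs (the documents in that group).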
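
The response fields can be combined into a small result handler. The sketch below is an assumption-laden illustration: flatten_groups is a hypothetical helper, it treats code 0 as success based on the example value in the table, and it reads score through getattr because the score appears in the sample output but its attribute name is not stated in this topic. SimpleNamespace objects stand in for a real response so that the code runs without a cluster.

```python
from types import SimpleNamespace


def flatten_groups(ret):
    """Return (group_id, doc_id, score) rows from a grouped response.

    Hypothetical helper; relies only on code, message, request_id, and output.
    """
    if ret.code != 0:  # Assumption: 0 indicates success.
        raise RuntimeError(
            f"query_group_by failed: code={ret.code}, "
            f"message={ret.message}, request_id={ret.request_id}"
        )
    rows = []
    for group in ret.output:
        for doc in group.docs:
            rows.append((group.group_id, doc.id, getattr(doc, "score", None)))
    return rows


# Stand-in response that mimics the documented shape (not a real SDK object).
fake_ret = SimpleNamespace(
    code=0,
    message="success",
    request_id="demo-request-id",
    output=[
        SimpleNamespace(
            group_id="paper-01",
            docs=[
                SimpleNamespace(id="2", score=0.6807),
                SimpleNamespace(id="1", score=0.4289),
            ],
        ),
        SimpleNamespace(group_id="paper-02", docs=[SimpleNamespace(id="3", score=0.6553)]),
    ],
)

rows = flatten_groups(fake_ret)
for row in rows:
    print(row)
```

Including request_id in the error message keeps a failed request traceable when you report it.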