DashVector: Grouped document search

Last Updated: Apr 22, 2024

This topic describes how to perform grouped similarity searches in a collection by using the DashVector SDK for Python.

Prerequisites

API definition

Collection.query_group_by(
    self,
    vector: Optional[Union[List[Union[int, float]], np.ndarray]] = None,
    *,
    group_by_field: str,
    group_count: int = 10,
    group_topk: int = 10,
    id: Optional[str] = None,
    filter: Optional[str] = None,
    include_vector: bool = False,
    partition: Optional[str] = None,
    output_fields: Optional[List[str]] = None,
    sparse_vector: Optional[Dict[int, float]] = None,
    async_req: bool = False,
) -> DashVectorResponse:

Example

Note

You need to replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with the endpoint of your cluster in the sample code for the code to run properly.

import dashvector
import numpy as np

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
ret = client.create(
    name='group_by_demo',
    dimension=4,
    fields_schema={'document_id': str, 'chunk_id': int}
)
assert ret

collection = client.get(name='group_by_demo')

ret = collection.insert([
    ('1', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 1, 'content': 'xxxA'}),
    ('2', np.random.rand(4), {'document_id': 'paper-01', 'chunk_id': 2, 'content': 'xxxB'}),
    ('3', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 1, 'content': 'xxxC'}),
    ('4', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 2, 'content': 'xxxD'}),
    ('5', np.random.rand(4), {'document_id': 'paper-02', 'chunk_id': 3, 'content': 'xxxE'}),
    ('6', np.random.rand(4), {'document_id': 'paper-03', 'chunk_id': 1, 'content': 'xxxF'}),
])
assert ret

Perform a grouped similarity search by using a vector

ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],
    group_by_field='document_id',  # Group return results by the value of the document_id field.
    group_count=2,  # Return two groups.
    group_topk=2,   # Return up to two documents from each group.
)
# Check whether the operation is successful.
if ret:
    print('query_group_by success')
    print(len(ret))
    print('------------------------')
    for group in ret:
        print('group key:', group.group_id)
        for doc in group.docs:
            prefix = ' -'
            print(prefix, doc)

The sample output is as follows:

query_group_by success
4
------------------------
group key: paper-01
 - {"id": "2", "fields": {"document_id": "paper-01", "chunk_id": 2, "content": "xxxB"}, "score": 0.6807}
 - {"id": "1", "fields": {"document_id": "paper-01", "chunk_id": 1, "content": "xxxA"}, "score": 0.4289}
group key: paper-02
 - {"id": "3", "fields": {"document_id": "paper-02", "chunk_id": 1, "content": "xxxC"}, "score": 0.6553}
 - {"id": "5", "fields": {"document_id": "paper-02", "chunk_id": 3, "content": "xxxE"}, "score": 0.4401}

Perform a grouped similarity search by using the vector associated with the primary key

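ret = collection.query_group_by(
    id='1',                        # Search by the vector associated with this primary key.
    group_by_field='document_id',  # Group by a field that is defined in the collection schema.
    include_vector=True,           # Return vector data so that doc.vector is populated.
)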
# Check whether the query method is successfully called.
if ret:
    print('query_group_by success')
    print(len(ret))
    for group in ret:
        print('group:', group.group_id)
        for doc in group.docs:
            print(doc)
            print(doc.id)
            print(doc.vector)
            print(doc.fields)

Perform a grouped similarity search by using the vector or primary key and a conditional filter

# Perform a grouped similarity search by using the vector or primary key and a conditional filter.
ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],   # Specify a vector for search. Alternatively, you can specify a primary key for search.
    group_by_field='name',
    filter='age > 18',             # Match only documents whose age field is greater than 18.
    output_fields=['name', 'age'], # Return only the name and age fields.
    include_vector=True
)

Perform a grouped search by using both dense and sparse vectors

Note

You can use a sparse vector to represent the keyword weight to implement a keyword-aware semantic vector search.

# Perform a grouped similarity search by using both dense and sparse vectors.
ret = collection.query_group_by(
    vector=[0.1, 0.2, 0.3, 0.4],      # Specify a dense vector for search.
    sparse_vector={1: 0.3, 20: 0.7},  # Map dimension indexes to keyword weights.
    group_by_field='name',
)
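To make the Dict[int, float] format of sparse_vector more concrete, the following sketch converts a list of keywords into a sparse vector whose weights are normalized term frequencies. The VOCAB mapping and the keywords_to_sparse_vector helper are hypothetical and are not part of the DashVector SDK. This is only a toy weighting scheme; a real deployment would typically rely on a trained sparse encoder.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical vocabulary: maps each known keyword to an integer dimension index.
VOCAB: Dict[str, int] = {"vector": 1, "search": 20, "group": 35}


def keywords_to_sparse_vector(tokens: List[str], vocab: Dict[str, int]) -> Dict[int, float]:
    """Return {dimension index: weight}, where weight is the normalized term frequency."""
    # Keywords that are missing from the vocabulary are ignored.
    counts = Counter(token for token in tokens if token in vocab)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {vocab[token]: count / total for token, count in counts.items()}


sparse = keywords_to_sparse_vector(["vector", "search", "search", "unknown"], VOCAB)
print(sparse)
```

The resulting dictionary has the same shape as the sparse_vector argument in the preceding example, so it could be passed to query_group_by together with a dense vector.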

Request parameters

Note

You must specify the vector or id parameter.

| Parameter | Type | Default value | Description |
| --- | --- | --- | --- |
| group_by_field | str | None | Required. The name of the field by which a grouped search is performed. Schema-free fields are not supported. |
| vector | Optional[Union[List[Union[int, float]], np.ndarray]] | None | Optional. The vector. |
| id | Optional[str] | None | Optional. The primary key. The similarity search is performed based on the vector associated with the primary key. |
| group_count | int | 10 | Optional. The maximum number of groups to be returned. This is a best-effort parameter. In general, the specified number of groups can be returned. |
| group_topk | int | 10 | Optional. The number of similar results to be returned per group. This is a best-effort parameter and has a lower priority than group_count. |
| filter | Optional[str] | None | Optional. The conditional filter, which must comply with the syntax of an SQL WHERE clause. For more information, see Conditional filtering. |
| include_vector | bool | False | Optional. Specifies whether to return vector data. |
| partition | Optional[str] | None | Optional. The name of the partition. |
| output_fields | Optional[List[str]] | None | Optional. The fields to be returned. By default, all fields are returned. |
| sparse_vector | Optional[Dict[int, float]] | None | Optional. The sparse vector. |
| async_req | bool | False | Optional. Specifies whether to enable the asynchronous mode. |
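The interaction between group_by_field, group_count, and group_topk can be illustrated with a small client-side sketch. It is not how the DashVector service implements grouped search; it only mimics the documented semantics on an in-memory list. The group_results helper is hypothetical, the scores for documents 4 and 6 are invented for illustration, and the code assumes that a higher score means a closer match, which depends on the metric of the collection.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]


def group_results(docs: List[Doc], group_by_field: str,
                  group_count: int = 10, group_topk: int = 10) -> List[Tuple[Any, List[Doc]]]:
    """Group scored documents by a field, then keep the best documents and the best groups."""
    groups: Dict[Any, List[Doc]] = defaultdict(list)
    for doc in docs:
        groups[doc["fields"][group_by_field]].append(doc)

    ranked = []
    for key, members in groups.items():
        # Assumption: a higher score means a closer match.
        members.sort(key=lambda d: d["score"], reverse=True)
        ranked.append((key, members[:group_topk]))  # At most group_topk documents per group.

    # Rank groups by their best document and keep at most group_count groups.
    ranked.sort(key=lambda item: item[1][0]["score"], reverse=True)
    return ranked[:group_count]


# Scores for IDs 1, 2, 3, and 5 are taken from the sample output; those for 4 and 6 are invented.
sample = [
    {"id": "1", "fields": {"document_id": "paper-01"}, "score": 0.4289},
    {"id": "2", "fields": {"document_id": "paper-01"}, "score": 0.6807},
    {"id": "3", "fields": {"document_id": "paper-02"}, "score": 0.6553},
    {"id": "4", "fields": {"document_id": "paper-02"}, "score": 0.30},
    {"id": "5", "fields": {"document_id": "paper-02"}, "score": 0.4401},
    {"id": "6", "fields": {"document_id": "paper-03"}, "score": 0.50},
]

result = group_results(sample, "document_id", group_count=2, group_topk=2)
for key, members in result:
    print(key, [doc["id"] for doc in members])
```

With group_count=2 and group_topk=2, the third group and the lowest-scoring document in paper-02 are dropped, which mirrors the grouping shown in the sample output of the vector-based example.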

Response parameters

Note

A DashVectorResponse object is returned, which contains the operation result, as described in the following table.

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| code | int | The returned status code. For more information, see Status codes. | 0 |
| message | str | The returned message. | success |
| request_id | str | The unique ID of the request. | 19215409-ea66-4db9-8764-26ce2eb5bb99 |
| output | List[Group] | Grouped similar results. | |
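As a rough sketch of how these response fields might be consumed, the following helper checks the status code and collects document IDs per group. The summarize_groups helper is hypothetical, and the response is simulated with SimpleNamespace stand-ins that expose only the attributes named on this page (code, message, request_id, output, group_id, docs, id). It assumes that a code of 0 indicates success, as the example values suggest.

```python
from types import SimpleNamespace
from typing import Dict, List


def summarize_groups(response) -> Dict[str, List[str]]:
    """Map each group_id to the IDs of its documents; raise if the request failed."""
    # Assumption: status code 0 means success (see Status codes).
    if response.code != 0:
        raise RuntimeError(
            f"query_group_by failed: code={response.code}, "
            f"message={response.message}, request_id={response.request_id}"
        )
    return {group.group_id: [doc.id for doc in group.docs] for group in response.output}


# Stand-in objects that imitate the shape of a response; not real SDK classes.
fake_response = SimpleNamespace(
    code=0,
    message="success",
    request_id="19215409-ea66-4db9-8764-26ce2eb5bb99",
    output=[
        SimpleNamespace(group_id="paper-01", docs=[SimpleNamespace(id="2"), SimpleNamespace(id="1")]),
        SimpleNamespace(group_id="paper-02", docs=[SimpleNamespace(id="3"), SimpleNamespace(id="5")]),
    ],
)

summary = summarize_groups(fake_response)
print(summary)
```

Including request_id in the error message keeps the identifier available for troubleshooting a failed request.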