Perform a similarity search in a collection by using the Python SDK

Collection.query() searches a DashVector collection for documents similar to a given vector or a stored document's vector. You can also retrieve documents by metadata filter alone.

Query modes

Collection.query() supports five query modes depending on the parameters you provide:

Mode	Required parameters	Description
Vector search	`vector`	Find documents closest to a given dense vector
Primary key search	`id`	Find documents closest to the vector of an existing document
Filtered vector search	(`vector` or `id`) + `filter`	Combine similarity search with metadata filtering
Hybrid search	`vector` + `sparse_vector`	Combine dense and sparse vectors for keyword-aware semantic search
Match query	`filter` only	Retrieve documents by metadata filter without similarity ranking

If neither vector nor id is specified, query() performs a match query using only the conditional filter.
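The mapping from arguments to mode can be sketched as a small function. `query_mode` is a hypothetical helper written for illustration: it mirrors the table above and is not part of the SDK. The table does not cover passing both vector and id, so this sketch rejects that combination as an assumption.

```python
from typing import Dict, Optional, Sequence


def query_mode(
    vector: Optional[Sequence[float]] = None,
    id: Optional[str] = None,
    filter: Optional[str] = None,
    sparse_vector: Optional[Dict[int, float]] = None,
) -> str:
    """Name the query mode that an argument combination selects.

    Hypothetical helper: it follows the mode table, not SDK internals.
    """
    if vector is not None and id is not None:
        # Assumption of this sketch: the table does not describe this case.
        raise ValueError("pass either vector or id, not both")
    if vector is not None and sparse_vector is not None:
        return "hybrid search"
    if vector is None and id is None:
        # Neither search key is given, so only the filter applies.
        return "match query"
    base = "vector search" if vector is not None else "primary key search"
    return f"filtered {base}" if filter is not None else base


print(query_mode(id="1", filter="age > 18"))
```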

Prerequisites

Before you begin, make sure that you have:

A DashVector cluster. See Create a cluster
An API key. See Manage API keys
DashVector SDK (latest version). See Install DashVector SDK
A collection with documents already inserted. See Create a collection and Insert documents

API signature

Collection.query(
    vector: Optional[Union[List[Union[int, float]], np.ndarray]] = None,
    id: Optional[str] = None,
    topk: int = 10,
    filter: Optional[str] = None,
    include_vector: bool = False,
    partition: Optional[str] = None,
    output_fields: Optional[List[str]] = None,
    sparse_vector: Optional[Dict[int, float]] = None,
    async_req: bool = False
) -> DashVectorResponse

Request parameters

Parameter	Type	Default	Description
`vector`	`Optional[Union[List[Union[int, float]], np.ndarray]]`	`None`	Dense vector for similarity search.
`id`	`Optional[str]`	`None`	Primary key of an existing document. The search uses that document's vector.
`topk`	`int`	`10`	Maximum number of results to return, ranked by similarity.
`filter`	`Optional[str]`	`None`	Conditional filter using SQL WHERE clause syntax. See Conditional filtering.
`include_vector`	`bool`	`False`	Whether to include vector data in the response.
`partition`	`Optional[str]`	`None`	Partition name. Limits the search scope to a specific partition.
`output_fields`	`Optional[List[str]]`	`None`	Fields to return. By default, all fields are returned.
`sparse_vector`	`Optional[Dict[int, float]]`	`None`	Sparse vector for keyword-aware semantic search. Each key is a dimension index, and each value is the weight.
`async_req`	`bool`	`False`	Whether to enable asynchronous mode.

Response

query() returns a DashVectorResponse object:

Field	Type	Description	Example
`code`	`int`	Status code. `0` indicates success. See Status codes.	`0`
`message`	`str`	Status message.	`success`
`request_id`	`str`	Unique request identifier.	`19215409-ea66-4db9-8764-26ce2eb5bb99`
`output`	`List[Doc]`	Similarity search results.	--
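A minimal sketch of checking these fields before using the results. `docs_or_raise` is a hypothetical helper, not an SDK function; it relies only on the four fields listed in the table above. The stand-in response at the end exists only so the sketch runs without a cluster.

```python
from types import SimpleNamespace
from typing import Any, List


def docs_or_raise(ret: Any) -> List[Any]:
    """Return the result documents on success, otherwise raise.

    Hypothetical helper: it reads only code, message, request_id, and output.
    """
    if ret.code != 0:
        # Include request_id so the failed call can be traced.
        raise RuntimeError(
            f"query failed: code={ret.code}, message={ret.message}, "
            f"request_id={ret.request_id}"
        )
    return list(ret.output or [])


# Stand-in for a real response, for illustration only.
fake_ret = SimpleNamespace(
    code=0,
    message="success",
    request_id="19215409-ea66-4db9-8764-26ce2eb5bb99",
    output=["doc-a", "doc-b"],
)
print(len(docs_or_raise(fake_ret)))
```

With a real collection, the call would look like `docs = docs_or_raise(collection.query(vector=[0.1, 0.2, 0.3, 0.4]))`.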

Examples

All examples below use the following client setup. Replace the placeholders with your actual values:

Placeholder	Description
`YOUR_API_KEY`	Your API key from the DashVector console
`YOUR_CLUSTER_ENDPOINT`	Your cluster endpoint URL

import dashvector
import numpy as np

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)

# Get the target collection
collection = client.get(name='quickstart')

Create the quickstart collection and insert documents before running these examples. See Create a collection and Insert documents.

Search by vector

Pass a dense vector to find the most similar documents.

ret = collection.query(
    vector=[0.1, 0.2, 0.3, 0.4]
)
# Check whether the query call succeeded.
if ret:
    print('query success')
    print(len(ret))
    for doc in ret:
        print(doc)
        print(doc.id)
        print(doc.vector)
        print(doc.fields)

To customize the result set, specify topk, output_fields, and include_vector:

ret = collection.query(
    vector=[0.1, 0.2, 0.3, 0.4],
    topk=100,
    output_fields=['name', 'age'],  # Return only the name and age fields.
    include_vector=True
)

Search by primary key

Use the id parameter to search with a stored document's vector, without providing the vector values directly.

ret = collection.query(
    id='1'
)
# Check whether the query call succeeded.
if ret:
    print('query success')
    print(len(ret))
    for doc in ret:
        print(doc)
        print(doc.id)
        print(doc.vector)
        print(doc.fields)

Combine id with topk and output_fields the same way as vector search:

ret = collection.query(
    id='1',
    topk=100,
    output_fields=['name', 'age'],  # Return only the name and age fields.
    include_vector=True
)

Search with a conditional filter

Add a filter parameter to narrow results by metadata. The filter follows SQL WHERE clause syntax.

# Search by vector and keep only the documents that match the filter.
ret = collection.query(
    vector=[0.1, 0.2, 0.3, 0.4],   # Query vector. You can pass id instead to search by primary key.
    topk=100,
    filter='age > 18',             # Keep only documents whose age field is greater than 18.
    output_fields=['name', 'age'], # Return only the name and age fields.
    include_vector=True
)

Tip: Combine filter with id instead of vector to filter results from a primary key-based search.
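Because filter is a plain string, values that come from variables have to be interpolated into it. The sketch below shows one cautious way to do that for numeric comparisons. `numeric_condition` is a hypothetical helper, and the operator set is an assumption based on standard SQL comparison operators; confirm the supported operators in Conditional filtering.

```python
from typing import Union

# Assumption: standard SQL comparison operators. Confirm the supported set
# in the Conditional filtering reference.
_ALLOWED_OPS = {">", ">=", "<", "<=", "="}


def numeric_condition(field: str, op: str, value: Union[int, float]) -> str:
    """Build a condition such as 'age > 18' for the filter parameter."""
    if not field.isidentifier():
        raise ValueError(f"unexpected field name: {field!r}")
    if op not in _ALLOWED_OPS:
        raise ValueError(f"unexpected operator: {op!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("value must be an int or float")
    return f"{field} {op} {value}"


print(numeric_condition("age", ">", 18))  # prints: age > 18
```

The result can be passed directly, for example `collection.query(id='1', filter=numeric_condition('age', '>', 18))`. Validating the field name and value type keeps unexpected input from changing the meaning of the filter expression.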

Hybrid search with dense and sparse vectors

Combine a dense vector with a sparse vector to perform keyword-aware semantic search. The sparse vector represents keyword weights that complement the dense embedding.

# Perform a similarity search by using both dense and sparse vectors.
ret = collection.query(
    vector=[0.1, 0.2, 0.3, 0.4],   # Specify a vector for search.
    sparse_vector={1: 0.3, 20: 0.7}
)

See Keyword-aware semantic search for configuration details.
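The sparse_vector argument is a dictionary from dimension index to weight. If your weights start out as a mostly-zero list, a conversion along these lines produces that shape. `to_sparse_vector` is a hypothetical helper; how the weights are generated (for example, by a sparse encoder) is outside the scope of this sketch.

```python
from typing import Dict, Sequence


def to_sparse_vector(weights: Sequence[float]) -> Dict[int, float]:
    """Convert a mostly-zero list of weights into {dimension index: weight}.

    Zero entries are dropped, so only dimensions that carry weight remain.
    """
    return {i: float(w) for i, w in enumerate(weights) if w != 0}


print(to_sparse_vector([0.0, 0.3, 0.0, 0.7]))  # prints: {1: 0.3, 3: 0.7}
```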

Match query with filter only

Omit both vector and id to retrieve documents based solely on metadata conditions, without similarity ranking.

# Match query: use a conditional filter only, with no vector or primary key.
ret = collection.query(
    topk=100,
    filter='age > 18',             # Match documents whose age field is greater than 18.
    output_fields=['name', 'age'], # Return only the name and age fields.
    include_vector=True
)

What's next

Keyword-aware semantic search -- Improve retrieval accuracy by combining dense and sparse vectors.
Conditional filtering -- Full filter expression syntax reference.