Implement keyword-aware semantic search with sparse vectors - DashVector

DashText is a sparse vector encoder recommended for DashVector. Semantic search captures meaning but can miss exact keyword matches -- domain-specific terms, product names, or error codes often fall through. DashText bridges this gap by converting text into sparse vectors using the BM25 algorithm. Combine these sparse vectors with dense vectors in DashVector to run keyword-aware semantic search -- retrieval that is both semantically relevant and keyword-precise.

Core example

The following Python snippet shows the minimal end-to-end flow: create a collection, encode a document, insert it, and run a hybrid query.

import dashvector
from dashtext import SparseVectorEncoder
from dashvector import Doc

# Connect to DashVector
client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)

# Create a collection with dot product metric (required for sparse vectors)
client.create('hybrid_collection', dimension=4, metric='dotproduct')
collection = client.get('hybrid_collection')

# Build a sparse vector encoder and encode a document
encoder = SparseVectorEncoder.default()
doc_sparse = encoder.encode_documents("Your document text here.")

# Insert the document with both dense and sparse vectors
collection.insert(Doc(
    id='doc1',
    vector=[0.1, 0.2, 0.3, 0.4],
    sparse_vector=doc_sparse
))

# Search with both dense and sparse vectors
query_sparse = encoder.encode_queries("Your search query here.")
results = collection.query(
    vector=[0.1, 0.1, 0.1, 0.1],
    sparse_vector=query_sparse
)

The sections below walk through each step in detail, with both Python and Java examples.

Prerequisites

Before you begin, make sure you have:

A DashVector cluster with an available API key and endpoint. For more information, see Vector introduction.
The Python or Java SDK installed (for Python, the dashvector and dashtext packages).

Note

All examples in this guide use 4-dimensional dense vectors for simplicity. In production, set the dimension to match your embedding model output.
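
As a small illustration of that note, a guard like the following can catch a dimension mismatch before a vector reaches the collection. The helper name check_dimension and the constant COLLECTION_DIMENSION are hypothetical and not part of the DashVector SDK; treat this as a sketch only.

```python
# Hypothetical guard, not part of the DashVector SDK.
COLLECTION_DIMENSION = 4  # should equal the dimension used when the collection was created


def check_dimension(vector, expected_dim=COLLECTION_DIMENSION):
    """Return the vector unchanged, or raise ValueError on a length mismatch."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(vector)}")
    return vector


checked = check_dimension([0.1, 0.2, 0.3, 0.4])
```

Calling check_dimension on the output of your embedding model before each insert or query reports the mismatch on the client side.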

Step 1: Create a collection that supports sparse vectors

Sparse vectors require the dot product distance metric. Create a collection with metric='dotproduct' to enable sparse vector support.

Python

import dashvector

client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
assert client

# Create a collection with dot product metric (required for sparse vectors)
ret = client.create('hybrid_collection', dimension=4, metric='dotproduct')
assert ret

collection = client.get('hybrid_collection')
assert collection

Java

import com.aliyun.dashvector.DashVectorClient;
import com.aliyun.dashvector.DashVectorCollection;
import com.aliyun.dashvector.models.requests.CreateCollectionRequest;
import com.aliyun.dashvector.models.responses.Response;
import com.aliyun.dashvector.proto.CollectionInfo;

DashVectorClient client =
  new DashVectorClient("YOUR_API_KEY", "YOUR_CLUSTER_ENDPOINT");

CreateCollectionRequest request = CreateCollectionRequest.builder()
            .name("hybrid_collection")
            .dimension(4)
            .metric(CollectionInfo.Metric.dotproduct)
            .dataType(CollectionInfo.DataType.FLOAT)
            .build();

Response<Void> response = client.create(request);
System.out.println(response);

DashVectorCollection collection = client.get("hybrid_collection");

Important

Only collections that use the dot product metric (metric='dotproduct') support sparse vectors.

Replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with the endpoint of your cluster for the code to run properly.
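
One way to see why the dot product metric fits hybrid search: under a dot product, a combined score can be written as the dense dot product plus the sparse dot product, so the two signals simply add. The sketch below assumes that additive model. The function names are illustrative, and this is not DashVector's internal scoring code.

```python
def dense_dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token_id: weight} maps."""
    # Only token IDs present in both maps contribute to the sum.
    return sum(weight * b[token_id] for token_id, weight in a.items() if token_id in b)


def hybrid_score(query_dense, query_sparse, doc_dense, doc_sparse):
    """Assumed additive model: dense similarity plus sparse (keyword) similarity."""
    return dense_dot(query_dense, doc_dense) + sparse_dot(query_sparse, doc_sparse)


# Toy values: the dense part contributes 0.5, the shared token 7 contributes 1.0.
score = hybrid_score([1.0, 0.0], {7: 0.5, 9: 0.5}, [0.5, 0.5], {7: 2.0, 11: 1.0})
```

Because the two parts add, scaling either vector changes how much that signal contributes to the final ranking.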

Step 2: Build a sparse vector encoder

DashText provides two approaches for creating a SparseVectorEncoder: a built-in encoder for general-purpose text, and a custom encoder trained on your own corpus for higher accuracy with domain-specific terms.

Use the built-in encoder

The built-in encoder is pre-trained on the Chinese Wikipedia corpus and uses Jieba for Chinese text segmentation. No additional training is needed.

Python

from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder.default()

Java

import com.aliyun.dashtext.encoder.SparseVectorEncoder;

SparseVectorEncoder encoder = SparseVectorEncoder.getDefaultInstance();

Train a custom encoder on your own corpus

For data with many domain-specific terms, train an encoder on your full corpus to improve accuracy. For more information, see Advanced use.

Python

from dashtext import SparseVectorEncoder

encoder = SparseVectorEncoder()

# Your own corpus (here: sample Chinese sentences describing DashVector)
corpus = [
    "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核，提供具备水平拓展能力的云原生、全托管的向量检索服务",
    "DashVector将其强大的向量管理、向量查询等多样化能力，通过简洁易用的SDK/API接口透出，方便被上层AI应用迅速集成",
    "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景，提供所需的高效向量检索能力",
    "简单灵活、开箱即用的SDK，使用极简代码即可实现向量管理",
    "自研向量相似性比对算法，快速高效稳定服务",
    "Schema-free设计，通过Schema实现任意条件下的组合过滤查询"
]

# Train the encoder on the full corpus
encoder.train(corpus)

Java

import com.aliyun.dashtext.encoder.SparseVectorEncoder;
import java.util.*;

SparseVectorEncoder encoder = new SparseVectorEncoder();

// Your own corpus (here: sample Chinese sentences describing DashVector)
List<String> corpus = Arrays.asList(
  "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核，提供具备水平拓展能力的云原生、全托管的向量检索服务",
  "DashVector将其强大的向量管理、向量查询等多样化能力，通过简洁易用的SDK/API接口透出，方便被上层AI应用迅速集成",
  "从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景，提供所需的高效向量检索能力",
  "简单灵活、开箱即用的SDK，使用极简代码即可实现向量管理",
  "自研向量相似性比对算法，快速高效稳定服务",
  "Schema-free设计，通过Schema实现任意条件下的组合过滤查询"
);

// Train the encoder on the full corpus
encoder.train(corpus);

Choose the right encoder

Encoder	Strengths	Limitations	Best for
Built-in	Ready to use, good generalization	Lower accuracy on specialized terms	General-purpose text, quick prototyping
Custom corpus	Higher accuracy for specialized vocabulary	Requires upfront training on the full corpus	Domain-specific data with field-specific terms

Choose the encoder that fits your data. If your data contains many domain-specific terms, we recommend training a custom encoder on your own corpus.
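
To illustrate why corpus statistics matter, the sketch below computes a common form of the BM25 inverse document frequency (IDF) over a tiny pre-tokenized corpus. A term that appears in every document gets a low weight, while a term that appears in one document gets a high weight. The function toy_idf and the sample tokens are hypothetical; this is not DashText's training code.

```python
import math
from collections import Counter


def toy_idf(tokenized_docs):
    """Return {token: idf} using a widely used BM25 IDF formula."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {
        token: math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for token, df in doc_freq.items()
    }


toy_corpus = [
    ["vector", "search", "service"],
    ["vector", "index"],
    ["vector", "proxima", "engine"],
]
idf = toy_idf(toy_corpus)
# "vector" appears in all three documents, "proxima" in only one,
# so idf["proxima"] is larger than idf["vector"].
```

An encoder trained on your own corpus learns such statistics from your documents, which is why it can weight field-specific terms more accurately than a general-purpose encoder.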

Step 3: Encode and insert a document

Encode document text into a sparse vector with encode_documents, then insert it alongside a dense vector into the collection.

Python

from dashvector import Doc

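# Sample sentence (Chinese): "DashVector is built on the Proxima kernel, an efficient vector engine
# developed in-house by Alibaba Cloud, and provides a cloud-native, fully managed vector retrieval
# service that scales horizontally."
document = "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核，提供具备水平拓展能力的云原生、全托管的向量检索服务。"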
doc_sparse_vector = encoder.encode_documents(document)

print(doc_sparse_vector)
# Output based on the built-in encoder:
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}

collection.insert(Doc(
    id='A',
    vector=[0.1, 0.2, 0.3, 0.4],
    sparse_vector=doc_sparse_vector
))

Java

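import com.aliyun.dashvector.models.Doc;
import com.aliyun.dashvector.models.Vector;
import com.aliyun.dashvector.models.requests.InsertDocRequest;
import com.aliyun.dashvector.models.responses.Response;
import java.util.Arrays;
import java.util.Map;

// Sample sentence (Chinese) describing DashVector and its Proxima vector engine
String document = "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核，提供具备水平拓展能力的云原生、全托管的向量检索服务。";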
Map<Long, Float> sparseVector = encoder.encodeDocuments(document);

System.out.println(sparseVector);
// Output based on the built-in encoder:
// {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}

Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();

// Build a Doc with both dense and sparse vectors
Doc doc = Doc.builder()
  .id("A")
  .sparseVector(sparseVector)
  .vector(vector)
  .build();

// Insert the document
Response<Void> response = collection.insert(InsertDocRequest.builder().doc(doc).build());

Each sparse vector is a map of token hash IDs to BM25 weights. Higher weights indicate more distinctive terms in the document.
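
The shape of such a map can be sketched with a toy encoder. The version below uses a BM25-style saturated term frequency for the weight, a whitespace split in place of a real tokenizer, and zlib.crc32 as a stand-in hash. All of these choices, the parameter values K1 and B, and the name toy_encode_document are assumptions for illustration; DashText's actual tokenizer, hash function, and parameters may differ.

```python
import zlib
from collections import Counter

K1 = 1.2   # assumed term-frequency saturation parameter
B = 0.75   # assumed length-normalization parameter


def toy_encode_document(text, avg_doc_len):
    """Map each token to a stand-in hash ID and a saturated term-frequency weight."""
    tokens = text.lower().split()  # whitespace split stands in for a real tokenizer
    counts = Counter(tokens)
    # Longer-than-average documents are penalized through this normalization term.
    norm = K1 * (1.0 - B + B * len(tokens) / avg_doc_len)
    return {
        zlib.crc32(token.encode("utf-8")): tf / (tf + norm)
        for token, tf in counts.items()
    }


toy_sparse = toy_encode_document("vector search with vector engines", avg_doc_len=5.0)
```

In this toy version, the token that occurs twice ("vector") receives a larger weight than the tokens that occur once, and every weight stays below 1.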

Step 4: Run a keyword-aware semantic search

Encode the search query into a sparse vector with encode_queries, then pass both dense and sparse vectors to query.

Python

query = "什么是向量检索服务？"  # "What is a vector retrieval service?"
sparse_vector = encoder.encode_queries(query)

print(sparse_vector)
# Output based on the built-in encoder:
# {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}

docs = collection.query(
    vector=[0.1, 0.1, 0.1, 0.1],
    sparse_vector=sparse_vector
)

Java

String query = "什么是向量检索服务？"; // "What is a vector retrieval service?"

Map<Long, Float> sparseVector = encoder.encodeQueries(query);

System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}

Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();

// Build a query with both dense and sparse vectors
QueryDocRequest request = QueryDocRequest.builder()
  .vector(vector)
  .sparseVector(sparseVector)
  .topk(100)
  .includeVector(true)
  .build();

Response<List<Doc>> response = collection.query(request);
System.out.println(response);
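
To see how the keyword side of the match works, the sketch below takes the two query weights printed above and a subset of the document weights printed in Step 3, then finds the token IDs they share. It assumes the sparse contribution is a dot product over shared token IDs; the variable names are illustrative.

```python
# Subset of the document sparse vector printed in Step 3.
doc_sparse = {
    380823393: 0.7262431704356519,
    1169440797: 0.8883757984694465,
    2045788977: 0.8414146776926797,
}

# Query sparse vector printed in Step 4.
query_sparse = {
    1169440797: 0.2947158712590364,
    2045788977: 0.7052841287409635,
}

# Token IDs that occur in both the query and the document.
shared_ids = sorted(set(doc_sparse) & set(query_sparse))

# Assumed keyword contribution: sum of weight products over the shared IDs.
keyword_score = sum(doc_sparse[t] * query_sparse[t] for t in shared_ids)
```

Both query tokens appear in the document, so both contribute. A document that shares no token IDs with the query would get a keyword contribution of zero and be ranked on its dense vector alone.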

Step 5: Tune results with weighted vectors

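Control the balance between semantic similarity (dense vectors) and keyword matching (sparse vectors) with the alpha parameter, which ranges from 0.0 to 1.0. Intermediate values blend the two signals; the endpoints behave as follows: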

Alpha value	Behavior
`0.0`	Only sparse vectors are used for distance measurement (pure keyword search)
`1.0`	Only dense vectors are used for distance measurement (pure semantic search)

Apply combine_dense_and_sparse to scale both vectors before querying:

Python

from dashtext import combine_dense_and_sparse

query = "什么是向量检索服务？"
sparse_vector = encoder.encode_queries(query)

# Set alpha to control the balance: 0.0 = keywords only, 1.0 = semantics only
alpha = 0.7
dense_vector = [0.1, 0.1, 0.1, 0.1]
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, alpha)

docs = collection.query(
    vector=scaled_dense_vector,
    sparse_vector=scaled_sparse_vector
)

Java

String query = "什么是向量检索服务？";

Map<Long, Float> sparseVector = encoder.encodeQueries(query);

System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}

Vector denseVector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();

// Scale the dense vector by alpha and the sparse vector by (1 - alpha),
// so that 0.0 = keywords only and 1.0 = semantics only
// (requires java.util.stream.Collectors)
float alpha = 0.1f;
sparseVector.replaceAll((key, value) -> value * (1 - alpha));
denseVector = Vector.builder().value(
            denseVector.getValue().stream().map(number -> number.floatValue() * alpha).collect(Collectors.toList())
    ).build();

// Query with weighted vectors
QueryDocRequest request = QueryDocRequest.builder()
  .vector(denseVector)
  .sparseVector(sparseVector)
  .topk(100)
  .includeVector(true)
  .build();

Response<List<Doc>> response = collection.query(request);
System.out.println(response);
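
The weighting described by the alpha table can be sketched in a few lines of plain Python: the dense vector is scaled by alpha and the sparse vector by (1 - alpha). The function scale_dense_and_sparse is a hypothetical stand-in written to match the table's behavior at 0.0 and 1.0; it is not the dashtext implementation of combine_dense_and_sparse.

```python
def scale_dense_and_sparse(dense, sparse, alpha):
    """Scale dense values by alpha and sparse weights by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {token_id: weight * (1.0 - alpha) for token_id, weight in sparse.items()}
    return scaled_dense, scaled_sparse


# alpha = 1.0 zeroes out the sparse weights (pure semantic search).
dense_only = scale_dense_and_sparse([0.1, 0.1], {42: 0.5}, 1.0)

# alpha = 0.0 zeroes out the dense values (pure keyword search).
sparse_only = scale_dense_and_sparse([0.1, 0.1], {42: 0.5}, 0.0)
```

A reasonable way to pick alpha is to start near the middle and adjust it against a set of queries with known good results.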

API reference

For the complete DashText API, see DashText SDK for Python.