DashText is the sparse vector encoder recommended for use with DashVector. Semantic search captures meaning but can miss exact keyword matches: domain-specific terms, product names, and error codes often fall through. DashText bridges this gap by converting text into sparse vectors using the BM25 algorithm. Combine these sparse vectors with dense vectors in DashVector to run keyword-aware semantic search, that is, retrieval that is both semantically relevant and keyword-precise.
Core example
The following Python snippet shows the minimal end-to-end flow: create a collection, encode a document, insert it, and run a hybrid query.
import dashvector
from dashtext import SparseVectorEncoder
from dashvector import Doc
# Connect to DashVector
client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
# Create a collection with dot product metric (required for sparse vectors)
client.create('hybrid_collection', dimension=4, metric='dotproduct')
collection = client.get('hybrid_collection')
# Build a sparse vector encoder and encode a document
encoder = SparseVectorEncoder.default()
doc_sparse = encoder.encode_documents("Your document text here.")
# Insert the document with both dense and sparse vectors
collection.insert(Doc(
    id='doc1',
    vector=[0.1, 0.2, 0.3, 0.4],
    sparse_vector=doc_sparse
))
# Search with both dense and sparse vectors
query_sparse = encoder.encode_queries("Your search query here.")
results = collection.query(
    vector=[0.1, 0.1, 0.1, 0.1],
    sparse_vector=query_sparse
)
The sections below walk through each step in detail, with both Python and Java examples.
Prerequisites
Before you begin, make sure you have:
A DashVector cluster with an available API key and endpoint. For more information, see Vector introduction.
The Python or Java SDK installed (the dashvector and dashtext packages for Python).
All examples in this guide use 4-dimensional dense vectors for simplicity. In production, set the dimension to match your embedding model output.
Step 1: Create a collection that supports sparse vectors
Sparse vectors require the dot product distance metric. Create a collection with metric='dotproduct' to enable sparse vector support.
Python
import dashvector
client = dashvector.Client(
    api_key='YOUR_API_KEY',
    endpoint='YOUR_CLUSTER_ENDPOINT'
)
assert client
# Create a collection with dot product metric (required for sparse vectors)
ret = client.create('hybrid_collection', dimension=4, metric='dotproduct')
assert ret
collection = client.get('hybrid_collection')
assert collection
Java
import com.aliyun.dashvector.DashVectorClient;
import com.aliyun.dashvector.DashVectorCollection;
import com.aliyun.dashvector.models.requests.CreateCollectionRequest;
import com.aliyun.dashvector.models.responses.Response;
import com.aliyun.dashvector.proto.CollectionInfo;
DashVectorClient client =
new DashVectorClient("YOUR_API_KEY", "YOUR_CLUSTER_ENDPOINT");
CreateCollectionRequest request = CreateCollectionRequest.builder()
    .name("hybrid_collection")
    .dimension(4)
    .metric(CollectionInfo.Metric.dotproduct)
    .dataType(CollectionInfo.DataType.FLOAT)
    .build();
Response<Void> response = client.create(request);
System.out.println(response);
DashVectorCollection collection = client.get("hybrid_collection");
Only collections that use the dot product metric (metric='dotproduct') support sparse vectors.
Before you run the code, replace YOUR_API_KEY with your API key and YOUR_CLUSTER_ENDPOINT with your cluster endpoint.
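One way to see why the dot product pairs naturally with sparse vectors: if the dense and sparse parts are imagined as one long concatenated vector, the dot product of the concatenation equals the dense dot product plus the sparse dot product. The sketch below checks this identity on toy values. The helper names and the 6-entry toy vocabulary are hypothetical, and the sketch does not describe how DashVector computes scores internally.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def densify(sparse, vocab_size):
    """Expand an {index: weight} map into a full list (toy helper, hypothetical)."""
    return [sparse.get(i, 0.0) for i in range(vocab_size)]


VOCAB_SIZE = 6  # toy vocabulary size, chosen only for this illustration

q_dense = [0.1, 0.1, 0.1, 0.1]
d_dense = [0.1, 0.2, 0.3, 0.4]
q_sparse = {1: 0.3, 4: 0.7}
d_sparse = {1: 0.9, 2: 0.5, 4: 0.8}

# Score computed on the concatenated (dense + densified sparse) vectors
concat_score = dot(q_dense + densify(q_sparse, VOCAB_SIZE),
                   d_dense + densify(d_sparse, VOCAB_SIZE))

# Same score computed as two separate parts, touching only shared sparse indices
split_score = dot(q_dense, d_dense) + sum(
    w * d_sparse.get(i, 0.0) for i, w in q_sparse.items()
)

print(concat_score, split_score)
```

Because the two parts simply add, scaling one part up or down shifts the balance between semantic and keyword matching, which is the idea behind the weighting in Step 5.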
Step 2: Build a sparse vector encoder
DashText provides two approaches for creating a SparseVectorEncoder: a built-in encoder for general-purpose text, and a custom encoder trained on your own corpus for higher accuracy with domain-specific terms.
Use the built-in encoder
The built-in encoder is pre-trained on the Chinese Wikipedia corpus and uses Jieba for Chinese text segmentation. No additional training is needed.
Python
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder.default()
Java
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
SparseVectorEncoder encoder = SparseVectorEncoder.getDefaultInstance();
Train a custom encoder on your own corpus
For data with many domain-specific terms, train an encoder on your full corpus to improve accuracy. For more information, see Advanced use.
Python
from dashtext import SparseVectorEncoder
encoder = SparseVectorEncoder()
# Your own corpus
corpus = [
"向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
"DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
"从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
"简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
"自研向量相似性比对算法,快速高效稳定服务",
"Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
]
# Train the encoder on the full corpus
encoder.train(corpus)
Java
import com.aliyun.dashtext.encoder.SparseVectorEncoder;
import java.util.*;
SparseVectorEncoder encoder = new SparseVectorEncoder();
// Your own corpus
List<String> corpus = Arrays.asList(
"向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务",
"DashVector将其强大的向量管理、向量查询等多样化能力,通过简洁易用的SDK/API接口透出,方便被上层AI应用迅速集成",
"从而为包括大模型生态、多模态AI搜索、分子结构分析在内的多种应用场景,提供所需的高效向量检索能力",
"简单灵活、开箱即用的SDK,使用极简代码即可实现向量管理",
"自研向量相似性比对算法,快速高效稳定服务",
"Schema-free设计,通过Schema实现任意条件下的组合过滤查询"
);
// Train the encoder on the full corpus
encoder.train(corpus);
Choose the right encoder
| Encoder | Strengths | Limitations | Best for |
|---|---|---|---|
| Built-in | Ready to use, good generalization | Lower accuracy on specialized terms | General-purpose text, quick prototyping |
| Custom corpus | Higher accuracy for specialized vocabulary | Requires upfront training on the full corpus | Domain-specific data with field-specific terms |
Choose the encoder based on your data. If your data contains a large number of field-specific terms, we recommend training a custom encoder on your own corpus.
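To illustrate why the training corpus matters, the following sketch computes BM25-style inverse document frequency (IDF) from a tiny corpus and turns it into normalized query weights. It is a simplified illustration, not DashText's implementation: the function names are hypothetical, pre-tokenized English word lists stand in for real tokenization, and normalizing the weights to sum to 1 is an assumption (the sample query output in Step 4 happens to sum to 1, which is consistent with it).

```python
import math
from collections import Counter


def document_frequencies(tokenized_corpus):
    """Count in how many documents each term appears."""
    df = Counter()
    for tokens in tokenized_corpus:
        df.update(set(tokens))
    return df


def idf(term, df, n_docs):
    """Smoothed BM25-style IDF: rare terms score higher than common ones."""
    n = df.get(term, 0)
    return math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)


def query_weights(tokens, df, n_docs):
    """Normalize IDF values so the query weights sum to 1 (an assumption)."""
    raw = {t: idf(t, df, n_docs) for t in set(tokens)}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


# Toy domain corpus: "vector" appears everywhere, "proxima" only once
corpus = [
    ["vector", "search", "service"],
    ["vector", "index", "proxima"],
    ["vector", "sdk", "api"],
    ["vector", "filter", "schema"],
]
df = document_frequencies(corpus)
weights = query_weights(["vector", "proxima"], df, len(corpus))
print(weights)
```

In this toy corpus the ubiquitous term "vector" receives a small weight and the rare term "proxima" a large one. A general-purpose corpus would produce different statistics for the same words, which is why training on your own data can change retrieval quality for specialized vocabulary.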
Step 3: Encode and insert a document
Encode document text into a sparse vector with encode_documents, then insert it alongside a dense vector into the collection.
Python
from dashvector import Doc
document = "向量检索服务DashVector基于阿里云自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。"
doc_sparse_vector = encoder.encode_documents(document)
print(doc_sparse_vector)
# Output based on the built-in encoder:
# {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
collection.insert(Doc(
    id='A',
    vector=[0.1, 0.2, 0.3, 0.4],
    sparse_vector=doc_sparse_vector
))
Java
String document = "向量检索服务DashVector基于达摩院自研的高效向量引擎Proxima内核,提供具备水平拓展能力的云原生、全托管的向量检索服务。";
Map<Long, Float> sparseVector = encoder.encodeDocuments(document);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {380823393: 0.7262431704356519, 414191989: 0.7262431704356519, 565176162: 0.7262431704356519, 904594806: 0.7262431704356519, 1005505802: 0.7262431704356519, 1169440797: 0.8883757984694465, 1240922502: 0.7262431704356519, 1313971048: 0.7262431704356519, 1317077351: 0.7262431704356519, 1490140460: 0.7262431704356519, 1574737055: 0.7262431704356519, 1760434515: 0.7262431704356519, 2045788977: 0.8414146776926797, 2141666983: 0.7262431704356519, 2509543087: 0.7262431704356519, 3180265193: 0.7262431704356519, 3845702398: 0.7262431704356519, 4106887295: 0.7262431704356519}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build a Doc with both dense and sparse vectors
Doc doc = Doc.builder()
    .id("28")
    .sparseVector(sparseVector)
    .vector(vector)
    .build();
// Insert the document
Response<Void> response = collection.insert(InsertDocRequest.builder().doc(doc).build());
Each sparse vector is a map of token hash IDs to BM25 weights. Higher weights indicate more distinctive terms in the document.
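The three distinct weights in the sample output above are numerically consistent with a saturating term-frequency formula of the form tf / (tf + k) for tf = 1, 2, and 3, where in standard BM25 k depends on the parameters k1 and b and on the document length relative to the average. This is an inference from the printed numbers, not a confirmed description of DashText's internals. The sketch below backs k out of the smallest sample weight and recomputes the other two; the helper name and the English tokens are hypothetical.

```python
from collections import Counter


def tf_weights(tokens, k):
    """Saturating term-frequency weight: grows with tf but never reaches 1."""
    counts = Counter(tokens)
    return {term: tf / (tf + k) for term, tf in counts.items()}


# Back out k from the sample weight, assuming that weight corresponds to tf = 1
k = 1 / 0.7262431704356519 - 1

# Hypothetical tokens: one term seen three times, one twice, one once
tokens = ["vector", "vector", "vector", "search", "search", "service"]
weights = tf_weights(tokens, k)
print(weights)
```

Under this assumption, a term that appears twice lands near 0.8414 and a term that appears three times near 0.8884, matching the two larger values in the sample output. Each repetition adds less weight than the previous one.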
Step 4: Run a keyword-aware semantic search
Encode the search query into a sparse vector with encode_queries, then pass both dense and sparse vectors to query.
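As a conceptual illustration of the keyword side of this search, the sketch below takes the sample query weights shown in this step and a subset of the sample document weights from Step 3, then sums the products over the token IDs they share. It assumes the sparse contribution is a plain dot product over shared IDs; how DashVector combines and ranks scores internally is not covered here.

```python
# Subset of the sample document weights from Step 3 (built-in encoder)
doc_sparse = {
    380823393: 0.7262431704356519,
    1169440797: 0.8883757984694465,
    2045788977: 0.8414146776926797,
}

# Sample query weights from this step (built-in encoder)
query_sparse = {
    1169440797: 0.2947158712590364,
    2045788977: 0.7052841287409635,
}

# Only token IDs present in both maps can contribute to the sparse score
shared_ids = sorted(query_sparse.keys() & doc_sparse.keys())
contributions = {i: query_sparse[i] * doc_sparse[i] for i in shared_ids}
sparse_score = sum(contributions.values())

print(shared_ids)
print(sparse_score)
```

Token IDs that appear only in the document, such as 380823393 here, add nothing to the sparse part of the score.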
Python
query = "什么是向量检索服务?"
sparse_vector = encoder.encode_queries(query)
print(sparse_vector)
# Output based on the built-in encoder:
# {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
docs = collection.query(
    vector=[0.1, 0.1, 0.1, 0.1],
    sparse_vector=sparse_vector
)
Java
String query = "什么是向量检索服务?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
Vector vector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Build a query with both dense and sparse vectors
QueryDocRequest request = QueryDocRequest.builder()
    .vector(vector)
    .sparseVector(sparseVector)
    .topk(100)
    .includeVector(true)
    .build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
Step 5: Tune results with weighted vectors
Control the balance between semantic similarity (dense vectors) and keyword matching (sparse vectors) using the alpha parameter:
| Alpha value | Behavior |
|---|---|
| 0.0 | Only sparse vectors are used for distance measurement (pure keyword search) |
| 1.0 | Only dense vectors are used for distance measurement (pure semantic search) |
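The behavior in the table is what you get if the dense vector is multiplied by alpha and the sparse vector by 1 - alpha. The following sketch spells that out with a hypothetical helper, scale_by_alpha; it is an assumption-based illustration of the weighting, not the source of DashText's own function.

```python
def scale_by_alpha(dense, sparse, alpha):
    """Weight dense values by alpha and sparse values by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {i: w * (1.0 - alpha) for i, w in sparse.items()}
    return scaled_dense, scaled_sparse


dense = [0.1, 0.1, 0.1, 0.1]
sparse = {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}

# alpha = 1.0 zeroes out the sparse part (pure semantic search)
dense_only = scale_by_alpha(dense, sparse, 1.0)

# alpha = 0.0 zeroes out the dense part (pure keyword search)
sparse_only = scale_by_alpha(dense, sparse, 0.0)

print(dense_only)
print(sparse_only)
```

Values between 0.0 and 1.0 blend the two signals; a value such as 0.7 leans toward semantic similarity while still rewarding exact keyword overlap.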
Apply combine_dense_and_sparse to scale both vectors before querying:
Python
from dashtext import combine_dense_and_sparse
query = "什么是向量检索服务?"
sparse_vector = encoder.encode_queries(query)
# Set alpha to control the balance: 0.0 = keywords only, 1.0 = semantics only
alpha = 0.7
dense_vector = [0.1, 0.1, 0.1, 0.1]
scaled_dense_vector, scaled_sparse_vector = combine_dense_and_sparse(dense_vector, sparse_vector, alpha)
docs = collection.query(
    vector=scaled_dense_vector,
    sparse_vector=scaled_sparse_vector
)
Java
String query = "什么是向量检索服务?";
Map<Long, Float> sparseVector = encoder.encodeQueries(query);
System.out.println(sparseVector);
// Output based on the built-in encoder:
// {1169440797: 0.2947158712590364, 2045788977: 0.7052841287409635}
Vector denseVector = Vector.builder().value(Arrays.asList(0.1f, 0.2f, 0.3f, 0.4f)).build();
// Scale dense and sparse vectors by the alpha weight factor
// Requires: import java.util.stream.Collectors;
float alpha = 0.1f;
// Weight the sparse vector by (1 - alpha) and the dense vector by alpha
sparseVector.replaceAll((key, value) -> value * (1 - alpha));
denseVector = Vector.builder().value(
    denseVector.getValue().stream().map(number -> number.floatValue() * alpha).collect(Collectors.toList())
).build();
// Query with weighted vectors
QueryDocRequest request = QueryDocRequest.builder()
    .vector(denseVector)
    .sparseVector(sparseVector)
    .topk(100)
    .includeVector(true)
    .build();
Response<List<Doc>> response = collection.query(request);
System.out.println(response);
API reference
For the complete DashText API, see DashText SDK for Python.