This topic describes how to use the Jina Embeddings v2 model to convert text into vectors and import the vectors into DashVector for vector search.
Prerequisites
DashVector:
A cluster is created. For more information, see Create a cluster.
An API key is obtained. For more information, see Manage API keys.
The latest version of the SDK is installed. For more information, see Install DashVector SDK.
Jina AI:
An API key is obtained. For more information, see the homepage of Jina AI.
Jina Embeddings v2 model
Overview
The Jina Embeddings v2 model is the only open source embedding model that supports an input length of 8,192 tokens. On the Massive Text Embedding Benchmark (MTEB), its functionality and performance rival those of OpenAI's closed-source text-embedding-ada-002 model.
Model name | Vector dimensions | Distance metric | Vector data type | Remarks
jina-embeddings-v2-small-en | 512 | Cosine | Float32 |
jina-embeddings-v2-base-en | 768 | Cosine | Float32 |
jina-embeddings-v2-base-zh | 768 | Cosine | Float32 |
For more information about the Jina Embeddings v2 model, see the homepage of Jina AI.
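Every model in the table uses the Cosine metric. As a rough illustration of what that metric measures, the following sketch computes cosine similarity between two vectors in plain Python. The function name `cosine_similarity` and the toy vectors are assumptions made for this illustration and are not part of the DashVector or Jina AI APIs. How DashVector reports a score for this metric is not covered here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; the models in the table produce 512 or 768 dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors that point in the same direction give a value close to 1 regardless of their length, and orthogonal vectors give a value close to 0.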
Example
Before you run the sample code, make the following replacements:
Replace {your-dashvector-api-key} in the sample code with your DashVector API key.
Replace {your-dashvector-cluster-endpoint} in the sample code with the endpoint of your DashVector cluster.
Replace {your-jina-api-key} in the sample code with your Jina AI API key.
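As an alternative to pasting keys into source code, one option is to read them from environment variables. The sketch below is separate from the sample code and assumes three hypothetical variable names (`DASHVECTOR_API_KEY`, `DASHVECTOR_ENDPOINT`, `JINA_API_KEY`). Neither service defines these names, so choose whatever suits your deployment.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names, chosen for this sketch only.
REQUIRED_VARS = ("DASHVECTOR_API_KEY", "DASHVECTOR_ENDPOINT", "JINA_API_KEY")


def load_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the three credentials, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then stand in for the placeholders, for example `Client(api_key=creds['DASHVECTOR_API_KEY'], endpoint=creds['DASHVECTOR_ENDPOINT'])`.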
from dashvector import Client
import requests
from typing import List


# Use the Jina Embeddings v2 model to embed text data into vector data.
def generate_embeddings(texts: List[str]) -> List[List[float]]:
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer {your-jina-api-key}'
    }
    data = {'input': texts, 'model': 'jina-embeddings-v2-base-zh'}
    response = requests.post('https://api.jina.ai/v1/embeddings', headers=headers, json=data)
    # Fail early with an HTTP error instead of a KeyError on an error response body.
    response.raise_for_status()
    return [record["embedding"] for record in response.json()["data"]]


# Create a DashVector client.
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a DashVector collection. The dimension (768) must match the output dimension of the model.
rsp = client.create('jina-text-embedding', 768)
assert rsp
collection = client.get('jina-text-embedding')
assert collection

# Convert text into a vector and store it in DashVector.
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Perform a vector search.
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)
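The sample hard-codes 768 as the collection dimension, which matches jina-embeddings-v2-base-zh in the model table. If you switch to another model, the dimension passed to `client.create` has to change with it. A minimal sketch of one way to keep the two in sync, using only the values from the table; `MODEL_DIMENSIONS` and `dimension_for` are hypothetical names introduced here:

```python
from typing import Dict

# Output dimensions copied from the model table in this topic.
MODEL_DIMENSIONS: Dict[str, int] = {
    "jina-embeddings-v2-small-en": 512,
    "jina-embeddings-v2-base-en": 768,
    "jina-embeddings-v2-base-zh": 768,
}


def dimension_for(model: str) -> int:
    """Look up the vector dimension for a model listed in the table."""
    try:
        return MODEL_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None
```

With this helper, `client.create('jina-text-embedding', dimension_for('jina-embeddings-v2-small-en'))` would request a 512-dimensional collection instead of relying on a literal.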