Vectorize text data by using Jina Embeddings model - DashVector

This topic describes how to vectorize text data by using a Jina Embeddings v2 model and import the resulting vectors into DashVector for vector search.

Prerequisites

DashVector:
- A cluster is created. For more information, see Create a cluster.
- An API key is obtained. For more information, see Manage API keys.
- The SDK of the latest version is installed. For more information, see Install DashVector SDK.
Jina AI:
- An API key is obtained. For more information, see the homepage of Jina AI.

Jina Embeddings v2 model

Overview

Jina Embeddings v2 is an open-source family of embedding models that accept input text of up to 8,192 tokens. On the Massive Text Embedding Benchmark (MTEB), its performance rivals that of OpenAI's closed-source text-embedding-ada-002 model.

Model name	Vector dimensions	Distance metric	Vector data type	Remarks
jina-embeddings-v2-small-en	512	Cosine	Float32	Maximum text length: 8,192
jina-embeddings-v2-base-en	768	Cosine	Float32	Maximum text length: 8,192
jina-embeddings-v2-base-zh	768	Cosine	Float32	Maximum text length: 8,192

Note

For more information about the Jina Embeddings v2 model, see the homepage of Jina AI.
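
The table above ties each model to a fixed vector dimension and to the cosine metric. The following is a minimal sketch of what those two columns mean in practice. It uses only the Python standard library. The names `MODEL_DIMENSIONS`, `check_dimension`, and `cosine_similarity` are illustrative helpers and are not part of the DashVector SDK or the Jina AI API.

```python
import math
from typing import Dict, Sequence

# Vector dimensions copied from the model table above.
MODEL_DIMENSIONS: Dict[str, int] = {
    "jina-embeddings-v2-small-en": 512,
    "jina-embeddings-v2-base-en": 768,
    "jina-embeddings-v2-base-zh": 768,
}


def check_dimension(model: str, vector: Sequence[float]) -> None:
    """Raise ValueError if the vector length does not match the model's dimension."""
    expected = MODEL_DIMENSIONS[model]
    if len(vector) != expected:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, got {len(vector)}"
        )


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between a and b (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A collection that stores vectors from one of these models should be created with the same dimension as the model, otherwise inserts and queries are likely to be rejected.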

Example

Note

Before you run the sample code, make the following replacements:

- Replace {your-dashvector-api-key} with your DashVector API key.
- Replace {your-dashvector-cluster-endpoint} with the endpoint of your DashVector cluster.
- Replace {your-jina-api-key} with your Jina AI API key.
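
As an alternative to pasting keys into the source file, the values can be read from environment variables. The following is a sketch under that assumption. The helper `read_setting` and the variable names `DASHVECTOR_API_KEY`, `DASHVECTOR_ENDPOINT`, and `JINA_API_KEY` are hypothetical and are not defined by DashVector or Jina AI.

```python
import os
from typing import Mapping


def read_setting(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return a required, non-empty setting or raise a descriptive error."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable names; choose whatever fits your deployment.
# dashvector_api_key = read_setting("DASHVECTOR_API_KEY")
# dashvector_endpoint = read_setting("DASHVECTOR_ENDPOINT")
# jina_api_key = read_setting("JINA_API_KEY")
```

The returned strings can then be passed wherever the placeholders appear in the sample code.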

Python

from dashvector import Client
import requests
from typing import List


# Use the Jina Embeddings v2 model to embed text data into vector data.
def generate_embeddings(texts: List[str]) -> List[List[float]]:
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer {your-jina-api-key}'
    }
    data = {'input': texts, 'model': 'jina-embeddings-v2-base-zh'}
    response = requests.post('https://api.jina.ai/v1/embeddings', headers=headers, json=data, timeout=30)
    # Fail early on an HTTP error instead of a KeyError on the missing "data" field.
    response.raise_for_status()
    return [record["embedding"] for record in response.json()["data"]]


# Create a DashVector client.
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a DashVector collection.
# The dimension (768) must match the output dimension of jina-embeddings-v2-base-zh.
rsp = client.create('jina-text-embedding', 768)
assert rsp
collection = client.get('jina-text-embedding')
assert collection

# Convert text into a vector and store it in DashVector.
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Perform a vector search.
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)
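
The example above embeds and inserts a single text. For more than a handful of texts, it is usually cheaper to send several texts per embedding request and to insert the results together. The following is a minimal sketch of the bookkeeping involved. The helpers `chunked` and `build_docs` are hypothetical, and the batch size is an arbitrary choice and not a documented limit of either service.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items, each with at most size elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_docs(
    ids: Sequence[str], vectors: Sequence[List[float]]
) -> List[Tuple[str, List[float]]]:
    """Pair each ID with its vector as (id, vector) tuples."""
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    return list(zip(ids, vectors))
```

Assuming that `collection.insert` also accepts a list of `(id, vector)` tuples, each batch could be written as `collection.insert(build_docs(batch_ids, generate_embeddings(batch_texts)))`, where `batch_ids` and `batch_texts` are matching slices produced with `chunked`.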