DashVector: Vectorize text data by using the Jina Embeddings model

Last Updated: Apr 11, 2024

This topic describes how to vectorize text data by using the Jina Embeddings v2 model and import the resulting vectors into DashVector for vector search.

Prerequisites

Jina Embeddings v2 model

Overview

The Jina Embeddings v2 model is the only open source embedding model that supports a text length of 8,192 tokens. On the Massive Text Embedding Benchmark (MTEB), its functionality and performance rival those of OpenAI's closed-source text-embedding-ada-002 model.

| Model name | Vector dimensions | Distance metric | Vector data type | Remarks |
|---|---|---|---|---|
| jina-embeddings-v2-small-en | 512 | Cosine | Float32 | Maximum text length: 8,192 tokens |
| jina-embeddings-v2-base-en | 768 | Cosine | Float32 | Maximum text length: 8,192 tokens |
| jina-embeddings-v2-base-zh | 768 | Cosine | Float32 | Maximum text length: 8,192 tokens |

Note

For more information about the Jina Embeddings v2 model, see the homepage of Jina AI.
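
The model specifications in this topic have two practical consequences: the vector length depends on the model that produced it, and search results are ranked by cosine similarity. The following is a minimal sketch of both ideas in plain Python. The names MODEL_DIMENSIONS, check_dimension, and cosine_similarity are illustrative and are not part of the DashVector or Jina AI SDKs.

```python
import math
from typing import Dict, List

# Vector dimensions per model, copied from the model table in this topic.
MODEL_DIMENSIONS: Dict[str, int] = {
    'jina-embeddings-v2-small-en': 512,
    'jina-embeddings-v2-base-en': 768,
    'jina-embeddings-v2-base-zh': 768,
}


def check_dimension(model: str, vector: List[float]) -> None:
    """Raise ValueError if the vector length does not match the model's dimensions."""
    expected = MODEL_DIMENSIONS[model]
    if len(vector) != expected:
        raise ValueError(f'{model} produces {expected}-dimensional vectors, got {len(vector)}')


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A check such as check_dimension can be run on a vector before it is written to a collection, so that a mismatch between the model and the collection dimension surfaces early.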

Example

Note

You must perform the following operations for the code to run properly:

  1. Replace {your-dashvector-api-key} in the sample code with your DashVector API key.

  2. Replace {your-dashvector-cluster-endpoint} in the sample code with the endpoint of your DashVector cluster.

  3. Replace {your-jina-api-key} in the sample code with your Jina AI API key.
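
As an alternative to pasting the three values into the source file, they could be read from environment variables at startup. The following is a minimal sketch. The variable names DASHVECTOR_API_KEY, DASHVECTOR_ENDPOINT, and JINA_API_KEY and the helper functions are hypothetical and are not defined by either SDK.

```python
import os
from typing import Dict


def read_required_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, '')
    if not value:
        raise RuntimeError(f'Environment variable {name} is not set')
    return value


def load_credentials() -> Dict[str, str]:
    # The variable names below are hypothetical; any names can be used.
    return {
        'dashvector_api_key': read_required_env('DASHVECTOR_API_KEY'),
        'dashvector_endpoint': read_required_env('DASHVECTOR_ENDPOINT'),
        'jina_api_key': read_required_env('JINA_API_KEY'),
    }
```

The returned values can then be passed wherever the sample code in this topic uses a placeholder string.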

from dashvector import Client
import requests
from typing import List


# Use the Jina Embeddings v2 model to embed text data into vector data.
def generate_embeddings(texts: List[str]) -> List[List[float]]:
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer {your-jina-api-key}'
    }
    data = {'input': texts, 'model': 'jina-embeddings-v2-base-zh'}
    response = requests.post('https://api.jina.ai/v1/embeddings', headers=headers, json=data)
    # Fail fast on HTTP errors (for example, an invalid API key) instead of hitting a KeyError below.
    response.raise_for_status()
    return [record["embedding"] for record in response.json()["data"]]


# Create a DashVector client.
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

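# Create a DashVector collection. The dimension (768) matches the
# vector dimensions of the jina-embeddings-v2-base-zh model used above.
rsp = client.create('jina-text-embedding', 768)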
assert rsp
collection = client.get('jina-text-embedding')
assert collection

# Convert text into a vector and store it in DashVector.
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Perform a vector search.
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)
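
The sample embeds one text per call and inserts a single (ID, vector) tuple. To handle several texts at once, the response parsing and the ID pairing can be factored into small helpers that are testable without network access. The following is a sketch under stated assumptions: the helper names are hypothetical, and the response body is assumed to follow the layout {"data": [{"index": 0, "embedding": [...]}, ...]}, of which the sample code confirms only the "data" and "embedding" keys.

```python
from typing import Any, Dict, List, Tuple


def parse_embeddings(body: Dict[str, Any]) -> List[List[float]]:
    """Extract embeddings from a response body, ordered by each record's index.

    Assumption: each record may carry an "index" key giving its input position.
    Records without that key keep their original position.
    """
    records = body['data']
    ordered = sorted(enumerate(records), key=lambda pair: pair[1].get('index', pair[0]))
    return [record['embedding'] for _, record in ordered]


def pair_ids_with_vectors(ids: List[str], vectors: List[List[float]]) -> List[Tuple[str, List[float]]]:
    """Build (id, vector) tuples, the same shape the sample passes to collection.insert."""
    if len(ids) != len(vectors):
        raise ValueError(f'{len(ids)} ids but {len(vectors)} vectors')
    return list(zip(ids, vectors))
```

If collection.insert accepts a list of such tuples in the same way it accepts the single tuple in the sample (an assumption to verify against the DashVector SDK reference), the list returned by pair_ids_with_vectors could be passed to it to store several texts in one call.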