
DashVector: Use open source embedding models from ModelScope to convert text into vectors

Last Updated: Apr 11, 2024

This topic describes how to use open source embedding models from ModelScope to convert text into vectors and store the vectors in DashVector to perform vector searches.

ModelScope seeks to build a next-generation, open source model-as-a-service (MaaS) platform that provides AI developers with flexible, easy-to-use, and cost-efficient one-stop model services.

By bringing together industry-leading pre-trained models, ModelScope aims to reduce duplicated R&D costs and foster a greener, more open environment for AI development and model services, thereby contributing to the digital economy. ModelScope provides various high-quality open source models that you can try out and download free of charge.

On ModelScope, you can:

  • Use and download pre-trained models free of charge.

  • Run command line-based model inference to quickly validate model performance.

  • Fine-tune models with your own data for customization.

  • Take part in theoretical and hands-on training to improve your R&D skills.

  • Share your ideas with the community.

Prerequisites

  • DashVector: A DashVector cluster is created and an API key is obtained. The sample code in this topic requires both the API key and the cluster endpoint.

  • ModelScope: The latest version of the ModelScope SDK is installed by running the pip install -U modelscope command.

CoROM word embedding models

Overview

| Model ID | Vector dimensions | Distance metric | Vector data type | Remarks |
| --- | --- | --- | --- | --- |
| damo/nlp_corom_sentence-embedding_chinese-base | 768 | Cosine | Float32 | Chinese, general domain, base; maximum text length: 512 |
| damo/nlp_corom_sentence-embedding_english-base | 768 | Cosine | Float32 | English, general domain, base; maximum text length: 512 |
| damo/nlp_corom_sentence-embedding_chinese-base-ecom | 768 | Cosine | Float32 | Chinese, eCommerce, base; maximum text length: 512 |
| damo/nlp_corom_sentence-embedding_chinese-base-medical | 768 | Cosine | Float32 | Chinese, healthcare, base; maximum text length: 512 |
| damo/nlp_corom_sentence-embedding_chinese-tiny | 256 | Cosine | Float32 | Chinese, general domain, tiny; maximum text length: 512 |
| damo/nlp_corom_sentence-embedding_english-tiny | 256 | Cosine | Float32 | English, general domain, tiny; maximum text length: 512 |
| damo/nlp_corom_sentence-embedding_chinese-tiny-ecom | 256 | Cosine | Float32 | Chinese, eCommerce, tiny; maximum text length: 512 |
| damo/nlp_corom_sentence-embedding_chinese-tiny-medical | 256 | Cosine | Float32 | Chinese, healthcare, tiny; maximum text length: 512 |

Note

For more information about the CoROM models, visit the CoROM model page.

Example

Note

Make the following replacements for the code to run properly:

  1. Replace {your-dashvector-api-key} in the sample code with your DashVector API key.

  2. Replace {your-dashvector-cluster-endpoint} in the sample code with the endpoint of your DashVector cluster.

  3. Replace {model_id} in the sample code with the model ID in the preceding table.

  4. Note that tiny models produce 256-dimensional vectors, so pass dimension=256 to client.create() for those models.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from typing import List
from dashvector import Client


pipeline_se = pipeline(Tasks.sentence_embedding, model='{model_id}')


def generate_embeddings(texts: List[str]):
    inputs = {'source_sentence': texts}
    result = pipeline_se(input=inputs)
    return result['text_embedding']


########### Universal sample code for storing vectors and performing vector searches in DashVector ###########
# Create a DashVector client.
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a DashVector collection.
# Note: You must set the dimension parameter to the number of dimensions specified in the model.
rsp = client.create('CoROM-text-embedding', dimension=768)
assert rsp
collection = client.get('CoROM-text-embedding')
assert collection

# Convert text into a vector and store it in DashVector.
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Perform a vector search.
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)
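All of the embedding models in this topic use the Cosine distance metric. As a quick, DashVector-independent illustration of what that metric compares, cosine similarity can be computed in pure Python:

```python
import math


def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

DashVector computes this on the server side during a query; the sketch only clarifies why the vectors used for insertion and for search must come from the same model, with matching dimensions.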

GTE word embedding models

Overview

| Model ID | Vector dimensions | Distance metric | Vector data type | Remarks |
| --- | --- | --- | --- | --- |
| damo/nlp_gte_sentence-embedding_chinese-base | 768 | Cosine | Float32 | Chinese, general domain, base; maximum text length: 512 |
| damo/nlp_gte_sentence-embedding_chinese-large | 768 | Cosine | Float32 | Chinese, general domain, large; maximum text length: 512 |
| damo/nlp_gte_sentence-embedding_chinese-small | 512 | Cosine | Float32 | Chinese, general domain, small; maximum text length: 512 |
| damo/nlp_gte_sentence-embedding_english-base | 768 | Cosine | Float32 | English, general domain, base; maximum text length: 512 |
| damo/nlp_gte_sentence-embedding_english-large | 768 | Cosine | Float32 | English, general domain, large; maximum text length: 512 |
| damo/nlp_gte_sentence-embedding_english-small | 384 | Cosine | Float32 | English, general domain, small; maximum text length: 512 |

Note

For more information about the GTE models, visit the GTE model page.

Example

The sample code is the same as that for the CoROM word embedding models. Replace the model ID and the vector dimensions in the sample code with the values from the preceding table.
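For the GTE models, only two values change relative to the CoROM sample: the model ID passed to pipeline() and the dimension passed to client.create(). A minimal sketch, with the dimensions taken from the preceding table (the collection name shown in the comment is an illustrative choice):

```python
# Vector dimensions of the GTE models, as listed in the preceding table.
GTE_DIMENSIONS = {
    'damo/nlp_gte_sentence-embedding_chinese-base': 768,
    'damo/nlp_gte_sentence-embedding_chinese-large': 768,
    'damo/nlp_gte_sentence-embedding_chinese-small': 512,
    'damo/nlp_gte_sentence-embedding_english-base': 768,
    'damo/nlp_gte_sentence-embedding_english-large': 768,
    'damo/nlp_gte_sentence-embedding_english-small': 384,
}

model_id = 'damo/nlp_gte_sentence-embedding_chinese-small'
dimension = GTE_DIMENSIONS[model_id]
print(dimension)  # 512

# In the CoROM sample code, these values replace '{model_id}' and dimension=768:
# pipeline_se = pipeline(Tasks.sentence_embedding, model=model_id)
# rsp = client.create('GTE-text-embedding', dimension=dimension)
```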

Udever multilingual universal word embedding models

Overview

| Model ID | Vector dimensions | Distance metric | Vector data type | Remarks |
| --- | --- | --- | --- | --- |
| damo/udever-bloom-560m | 1,024 | Cosine | Float32 | Model parameters: 560m; maximum text length: 2,048 |
| damo/udever-bloom-1b1 | 1,536 | Cosine | Float32 | Model parameters: 1b1; maximum text length: 2,048 |
| damo/udever-bloom-3b | 2,048 | Cosine | Float32 | Model parameters: 3b; maximum text length: 2,048 |
| damo/udever-bloom-7b1 | 4,096 | Cosine | Float32 | Model parameters: 7b1; maximum text length: 2,048 |

Note

For more information about the Udever models, visit the Udever model page.

Example

The sample code is the same as that for the CoROM word embedding models. Replace the model ID and the vector dimensions in the sample code with the values from the preceding table.
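The Udever models accept inputs of up to 2,048 tokens, so longer documents should be split before embedding. The character-based splitter below is an illustrative approximation only, because the real limit is measured in tokens and depends on the model's tokenizer:

```python
def split_text(text, max_chars=1024):
    # Split text into chunks of at most max_chars characters.
    # Character counts only approximate the 2,048-token limit;
    # exact budgeting requires the model's tokenizer.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


chunks = split_text('a' * 2500, max_chars=1024)
print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Each chunk can then be embedded and inserted separately, with its own vector ID.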

StructBERT FAQ models

Overview

| Model ID | Vector dimensions | Distance metric | Vector data type | Remarks |
| --- | --- | --- | --- | --- |
| damo/nlp_structbert_faq-question-answering_chinese-base | 768 | Cosine | Float32 | Chinese, general domain, base; maximum text length: unlimited |
| damo/nlp_structbert_faq-question-answering_chinese-finance-base | 768 | Cosine | Float32 | Chinese, finance, base; maximum text length: unlimited |
| damo/nlp_structbert_faq-question-answering_chinese-gov-base | 768 | Cosine | Float32 | Chinese, eGovernment, base; maximum text length: unlimited |

Note

For more information about the StructBERT FAQ models, visit the StructBERT model page.

Example

Note

Make the following replacements for the code to run properly:

  1. Replace {model_id} in the sample code with the model ID in the preceding table.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from typing import List

# Use a distinct variable name so that the imported pipeline() factory is not shadowed.
pipeline_faq = pipeline(Tasks.faq_question_answering, model='{model_id}')


def generate_embeddings(texts: List[str]):
    return pipeline_faq.get_sentence_embedding(texts)
Note

For more information about the code for storing vectors and performing vector searches in DashVector, see the sample code in the "Example" section of CoROM word embedding models.
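When you reuse that storage code, each embedding must be paired with an ID before insertion. A minimal sketch of building (id, vector) tuples in the format that the CoROM sample passes to collection.insert(); the ID scheme and placeholder vectors are illustrative:

```python
def to_insert_tuples(embeddings, prefix='FAQ'):
    # Pair each vector with a generated ID; the (id, vector) tuple format
    # matches the single-tuple insert in the CoROM sample code.
    return [(f'{prefix}-{i}', vec) for i, vec in enumerate(embeddings)]


# Placeholder vectors standing in for generate_embeddings([...]) output.
fake_embeddings = [[0.1, 0.2], [0.3, 0.4]]
print(to_insert_tuples(fake_embeddings))
# [('FAQ-0', [0.1, 0.2]), ('FAQ-1', [0.3, 0.4])]
```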

More word embedding models

| Model name | Model ID | Vector dimensions | Distance metric | Vector data type | Remarks |
| --- | --- | --- | --- | --- | --- |
| Bert entity embedding model, Chinese, general domain, base | damo/nlp_bert_entity-embedding_chinese-base | 768 | Cosine | Float32 | Default maximum text length: 128 |
| MiniLM word embedding model, English, text retrieval | damo/nlp_minilm_ibkd_sentence-embedding_english-msmarco | 384 | Cosine | Float32 | Default maximum text length: 128 |
| MiniLM word embedding model, English, IBKD-STS | damo/nlp_minilm_ibkd_sentence-embedding_english-sts | 384 | Cosine | Float32 | Default maximum text length: 128 |
| text2vec-base-chinese | thomas/text2vec-base-chinese | 768 | Cosine | Float32 | Default maximum text length: unknown |
| text2vec-large-chinese | thomas/text2vec-large-chinese | 1,024 | Cosine | Float32 | Default maximum text length: unknown |

Note

  1. The sample code is the same as that for the CoROM word embedding models. Replace the model ID and the vector dimensions in the sample code with the values from the preceding table.

  2. For more open source word embedding models on ModelScope, visit this page.