
Model overview

Last Updated: Dec 16, 2024

Overview

Text Embedding is a unified multilingual text embedding model developed by Tongyi Lab on top of large language models (LLMs). It helps developers convert text in mainstream languages into high-quality vector representations.

| Model | Name | Vector dimension | Maximum lines per request | Maximum tokens per line | Supported languages |
| --- | --- | --- | --- | --- | --- |
| Text Embedding | text-embedding-v1 | 1,536 | 25 | 2,048 | Chinese, English, Spanish, French, Portuguese, and Indonesian |
| Text Embedding | text-embedding-async-v1 | 1,536 | 100,000 | 2,048 | Chinese, English, Spanish, French, Portuguese, and Indonesian |
| Text Embedding | text-embedding-v2 | 1,536 | 25 | 2,048 | Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian |
| Text Embedding | text-embedding-async-v2 | 1,536 | 100,000 | 2,048 | Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian |
| Text Embedding | text-embedding-v3 | 1,024, 768, or 512 | 6 | 8,192 | Over 50 languages, including Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian |

Note

Currently, only text-embedding-v3 is supported.
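Because each request accepts a limited number of input lines (6 for text-embedding-v3, 25 for the synchronous v1 and v2 models), callers typically split larger corpora into request-sized batches before embedding. A minimal sketch in plain Python — `batch_texts` is an illustrative helper, not part of any SDK:

```python
def batch_texts(texts, batch_size):
    """Split a list of input lines into request-sized batches."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# 14 inputs under text-embedding-v3's limit of 6 lines per request
# yields 3 requests of sizes 6, 6, and 2.
batches = batch_texts([f"line {i}" for i in range(14)], batch_size=6)
print([len(b) for b in batches])  # [6, 6, 2]
```

Each resulting batch can then be submitted as one embedding request.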

The text-embedding-v2 model incorporates the following updates based on text-embedding-v1:

  • More supported languages: text-embedding-v2 supports Japanese, Korean, German, and Russian.

  • Performance: Evaluation results from publicly accessible datasets show that the overall performance of text-embedding-v2 is improved by using a pre-trained model as a foundation and by applying supervised fine-tuning (SFT) strategies.

Note

The text-embedding-v3 model incorporates the following updates based on text-embedding-v2:

  • More supported languages: text-embedding-v3 supports more than 50 languages, including Italian, Polish, Vietnamese, and Thai.

  • Input token length: The maximum token length increases from 2,048 to 8,192.

  • Variable dense vector dimension: text-embedding-v3 allows you to select a dense vector dimension of 512, 768, or 1,024. To reduce the cost of downstream tasks while maintaining high performance, the maximum vector dimension is reduced from 1,536 to 1,024.

  • Unified treatment of query and document types: text-embedding-v3 does not differentiate between the types of input text and maintains high performance. You do not need to specify query or document for the text_type parameter.

  • Support for sparse vectors: text-embedding-v3 supports dense and sparse vectors. You can specify the output_type parameter to control whether the output is dense vectors, sparse vectors, or both.

  • Performance: Evaluation results from publicly accessible datasets show that the overall performance of text-embedding-v3 is improved by using a pre-trained model as a foundation and by applying SFT strategies.
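The v3-specific parameters described above (`dimension`, `output_type`) can be pictured as part of a request payload. In this sketch the parameter names follow this page, but the payload shape itself is illustrative, not an exact API schema:

```python
# Illustrative embedding request for text-embedding-v3.
# Parameter names (dimension, output_type) follow the text above;
# the surrounding payload structure is an assumption, not an exact API schema.
request = {
    "model": "text-embedding-v3",
    "input": ["Tongyi text embeddings support over 50 languages."],
    "parameters": {
        "dimension": 768,        # one of 512, 768, or 1024
        "output_type": "dense",  # dense vectors, sparse vectors, or both
    },
}
print(request["parameters"]["dimension"])  # 768
```

With v3 there is no need to set a `text_type` of query or document; the same call shape serves both roles.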

| Model | MTEB | MTEB (retrieval task) | CMTEB | CMTEB (retrieval task) |
| --- | --- | --- | --- | --- |
| text-embedding-v1 | 58.30 | 45.47 | 59.84 | 56.59 |
| text-embedding-v2 | 60.13 | 49.49 | 62.17 | 62.78 |
| text-embedding-v3 | 63.39 | 55.41 | 68.92 | 73.23 |

  • Different vector dimensions of text-embedding-v3

| Model | Vector dimension | MTEB | MTEB (retrieval task) | CMTEB | CMTEB (retrieval task) |
| --- | --- | --- | --- | --- | --- |
| text-embedding-v3 | 1,024 | 63.39 | 55.41 | 68.92 | 73.23 |
| text-embedding-v3 | 768 | 62.43 | 54.74 | 67.90 | 72.29 |
| text-embedding-v3 | 512 | 62.11 | 54.30 | 66.81 | 71.88 |

  • Normalization: By default, text-embedding-v2 and text-embedding-v3 normalize output vectors.
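Because text-embedding-v2 and text-embedding-v3 output unit-norm vectors by default, cosine similarity between two embeddings reduces to a plain dot product. A small self-contained check in pure Python — the vectors here are made-up stand-ins for real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy unit-norm vectors standing in for normalized embeddings.
v1 = [0.6, 0.8]
v2 = [0.8, 0.6]
dot = sum(x * y for x, y in zip(v1, v2))

# For normalized vectors, the two measures coincide.
print(round(cosine_similarity(v1, v2), 6), round(dot, 6))  # 0.96 0.96
```

This is why downstream similarity search over these embeddings can skip the per-query norm computation and use a dot-product index directly.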