Overview
Text Embedding is a unified multilingual text embedding model developed by Tongyi Lab based on large language models (LLMs). Text Embedding helps developers convert text in mainstream languages into high-quality vector representations.
| Model | Name | Vector dimension | Maximum lines per request | Maximum token length per line | Supported languages |
| --- | --- | --- | --- | --- | --- |
| Text Embedding | text-embedding-v1 | 1,536 | 25 | 2,048 | Chinese, English, Spanish, French, Portuguese, and Indonesian |
| Text Embedding | text-embedding-async-v1 | 1,536 | 100,000 | 2,048 | Chinese, English, Spanish, French, Portuguese, and Indonesian |
| Text Embedding | text-embedding-v2 | 1,536 | 25 | 2,048 | Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian |
| Text Embedding | text-embedding-async-v2 | 1,536 | 100,000 | 2,048 | Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian |
| Text Embedding | text-embedding-v3 | 1,024, 768, or 512 | 6 | 8,192 | Over 50 languages, including Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian |
Currently, only text-embedding-v3 is supported.
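Because each request caps the number of input lines (6 for text-embedding-v3, per the table above), client code typically splits a larger corpus into batches before calling the service. A minimal client-side sketch; the embedding call itself is omitted, and the helper name `chunk` is our own:

```python
def chunk(texts, batch_size):
    """Split input lines into batches that respect the per-request
    line limit (6 for text-embedding-v3, per the table above)."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

texts = [f"sentence {i}" for i in range(20)]
batches = list(chunk(texts, 6))  # text-embedding-v3 allows 6 lines per request
print([len(b) for b in batches])  # [6, 6, 6, 2]
```

Each batch can then be submitted as one request; for bulk workloads the async models accept up to 100,000 lines per request instead.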
The text-embedding-v2 model incorporates the following updates over text-embedding-v1:
- More supported languages: text-embedding-v2 adds support for Japanese, Korean, German, and Russian.
- Performance: evaluation results on publicly available datasets show improved overall performance, achieved by building on a pre-trained model and applying supervised fine-tuning (SFT) strategies.
The text-embedding-v3 model incorporates the following updates over text-embedding-v2:
- More supported languages: text-embedding-v3 supports more than 50 languages, adding Italian, Polish, Vietnamese, and Thai, among others.
- Longer input: the maximum token length per line increases from 2,048 to 8,192.
- Variable dense vector dimension: text-embedding-v3 lets you set the dense vector dimension to 1,024, 768, or 512. To reduce the cost of downstream tasks while maintaining high performance, the maximum vector dimension is lowered from 1,536 to 1,024.
- Unified treatment of query and document types: text-embedding-v3 does not differentiate between input text types and maintains high performance either way, so you no longer need to set the text_type parameter to query or document.
- Sparse vector support: text-embedding-v3 can output dense vectors, sparse vectors, or both; use the output_type parameter to choose.
- Performance: evaluation results on publicly available datasets show improved overall performance, achieved by building on a pre-trained model and applying supervised fine-tuning (SFT) strategies.
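A sparse vector stores only the non-zero entries, typically as a map from token index to weight, which makes lexical-style matching cheap. The service's actual response format is not shown in this section, so the sketch below uses made-up index/weight pairs purely to illustrate how two sparse vectors are compared with a sparse dot product:

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors stored as {index: weight} maps.
    Only indices present in both vectors contribute."""
    # Iterate over the smaller map for efficiency.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)

# Hypothetical sparse embeddings (token index -> weight).
query = {101: 0.8, 2054: 0.5, 3177: 0.2}
doc = {101: 0.6, 3177: 0.4, 9021: 0.9}
print(sparse_dot(query, doc))  # ~0.56 (0.8*0.6 + 0.2*0.4)
```

Dense vectors, by contrast, are fixed-length float arrays and are compared with a full dot product or cosine similarity.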
| Model | MTEB | MTEB (retrieval task) | CMTEB | CMTEB (retrieval task) |
| --- | --- | --- | --- | --- |
| text-embedding-v1 | 58.30 | 45.47 | 59.84 | 56.59 |
| text-embedding-v2 | 60.13 | 49.49 | 62.17 | 62.78 |
| text-embedding-v3 | 63.39 | 55.41 | 68.92 | 73.23 |
Different vector dimensions of text-embedding-v3
| Model | Vector dimension | MTEB | MTEB (retrieval task) | CMTEB | CMTEB (retrieval task) |
| --- | --- | --- | --- | --- | --- |
| text-embedding-v3 | 1,024 | 63.39 | 55.41 | 68.92 | 73.23 |
| text-embedding-v3 | 768 | 62.43 | 54.74 | 67.90 | 72.29 |
| text-embedding-v3 | 512 | 62.11 | 54.30 | 66.81 | 71.88 |
Normalization: By default, text-embedding-v2 and text-embedding-v3 normalize output vectors.
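Because the returned vectors are already normalized, cosine similarity between two embeddings reduces to a plain dot product, and no client-side re-normalization is needed. A minimal self-contained sketch (plain Python, no SDK; the toy vectors stand in for real embeddings):

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm, mirroring the default output."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])   # -> [0.6, 0.8]
b = normalize([4.0, 3.0])   # -> [0.8, 0.6]

# For unit vectors, cosine similarity equals the dot product.
cos = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print(abs(cos - dot(a, b)) < 1e-9)  # True
```

This is why many vector databases let you use the cheaper inner-product metric directly when the stored embeddings are unit-normalized.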