A multimodal embedding model converts text, images, or videos into a unified, high-dimensional floating-point vector representation with customizable dimensionality, for use in cross-modal retrieval, text-to-image search, video classification, and image classification.
Core capabilities
Cross-modal retrieval: Enables semantic search across modalities, such as text-to-image, image-to-video, and image-to-image search.
Semantic similarity: Measures semantic similarity between content from different modalities within a unified vector space.
Content classification and clustering: Groups, tags, and clusters content based on semantic embeddings.
Key feature: Embeddings generated from all modalities (text, images, and video) reside in the same semantic space. This enables direct cross-modal matching and comparison using methods like cosine similarity. For more information on model selection and application methods, see text and multimodal embedding.
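Because all modalities share one vector space, cross-modal matching reduces to a vector comparison. A minimal cosine-similarity sketch in pure Python follows; the short vectors here are illustrative placeholders, not real model output.

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative 4-dimensional stand-ins for a text embedding and an image embedding.
text_vec = [0.1, 0.3, -0.2, 0.4]
image_vec = [0.09, 0.28, -0.22, 0.41]

print(round(cosine_similarity(text_vec, image_vec), 4))
```

In practice the vectors come from the embeddings array in the model response, and a value near 1 indicates high semantic similarity.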
Model availability varies by region (see the tables below). Use an API key from the region in which the model is deployed to call the service.
Vector types
The multimodal embedding models support two vector generation methods:

Multimodal independent vectors: The model generates an independent vector for each input (such as text, an image, a video, or multiple images) in the contents array. For example, if you provide one piece of text and one image, the model returns two independent vectors. This method suits scenarios that compare different types of content individually, such as image-to-image or text-to-image search.

Multimodal fused embedding: The model fuses all inputs in contents into a single embedding, creating a comprehensive, cross-modal semantic representation. This is ideal for scenarios that require a holistic understanding of multimodal content, such as fusing product images and descriptive text into a unified representation for retrieval. For qwen3-vl-embedding, enable fusion mode by setting enable_fusion=true. Fused embeddings support the following combinations:

Text and image
Text and video
Multiple images and text (by passing multiple image entries)
Image, video, and text

qwen2.5-vl-embedding supports only fused embeddings, not independent embeddings. tongyi-embedding-vision-plus and tongyi-embedding-vision-flash support only independent embeddings.
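As an illustration, the two methods differ mainly in request shape. The sketch below builds the two payloads without sending anything; the field layout is assumed from the HTTP request body and SDK examples on this page, the URLs are placeholders, and placing enable_fusion under parameters is an assumption based on the SDK example.

```python
# Illustrative payloads only; nothing is sent.
independent_request = {
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [
            {"text": "white sports shoes"},
            {"image": "https://example.com/shoe.png"},
        ]
    },
}

fused_request = {
    "model": "qwen3-vl-embedding",
    "input": {
        "contents": [
            {"text": "white sports shoes"},
            {"image": "https://example.com/shoe.png"},
        ]
    },
    # Assumed placement: the SDK passes enable_fusion as a call argument.
    "parameters": {"enable_fusion": True},
}

# Independent mode returns one vector per element; fusion returns one vector total.
expected_independent_vectors = len(independent_request["input"]["contents"])
expected_fused_vectors = 1
print(expected_independent_vectors, expected_fused_vectors)
```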
For model descriptions, selection guidance, and usage instructions, see text and multimodal vectorization.
Model overview
Singapore
Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price | Free quota (Note)
tongyi-embedding-vision-plus | 1152 (default), 1024, 512, 256, 128, 64 | 1,024 tokens | Up to 3 MB per image | Up to 10 MB per video file | Image/Video: $0.09; Text: $0.09 | 1 million tokens. Valid for 90 days after activating Model Studio.
tongyi-embedding-vision-flash | 768 (default), 512, 256, 128, 64 | 1,024 tokens | Up to 3 MB per image | Up to 10 MB per video file | Image/Video: $0.03; Text: $0.09 | 1 million tokens. Valid for 90 days after activating Model Studio.
Beijing
Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price |
qwen3-vl-embedding | 2560 (default), 2048, 1536, 1024, 768, 512, 256 | 32,000 tokens | Max. 1 image, up to 5 MB | Up to 50 MB per video file | Image/Video: $0.258 Text: $0.1 |
multimodal-embedding-v1 | 1024 | 512 tokens | Up to 8 images, 3 MB each | Up to 10 MB per video file | Free trial |
Input format and language limits
Fused multimodal model
Model | Text | Image | Video | Request limit |
qwen3-vl-embedding | Supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, and SGI (URL or Base64 supported). | MP4, AVI, and MOV (URL only). | Max 20 elements per request (up to 5 images). |
Independent multimodal model
Model | Text | Image | Video | Request limit |
tongyi-embedding-vision-plus | Chinese and English | JPG, PNG, and BMP (URL or Base64 supported). | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only). | No element count limit; requests are limited by the token count per batch. |
tongyi-embedding-vision-flash | Chinese and English | JPG, PNG, and BMP (URL or Base64 supported). | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only). | No element count limit; requests are limited by the token count per batch.
multimodal-embedding-v1 | | | | Up to 20 content elements per request, including a maximum of 1 image and 1 video.
All models support text, image, and video inputs, individually or in combination. The tongyi-embedding-vision-plus and tongyi-embedding-vision-flash models also support multi_images input for image sequences.
Model capabilities
Model | Default dimension | Type | Supported inputs | Description
qwen3-vl-embedding | 2,560 | Independent / Fusion | text, image, video, multi_images | Fusion mode is enabled with the enable_fusion parameter.
tongyi-embedding-vision-plus | 1,152 | Independent only | text, image, video, multi_images | Supports multi_images input for image sequences.
tongyi-embedding-vision-flash | 768 | Independent only | text, image, video, multi_images | Supports multi_images input for image sequences.
multimodal-embedding-v1 | 1,024 | Independent only | text, image, video | The dimension parameter is not supported; the dimension is fixed at 1024.
Prerequisites
Get an API key and export it as an environment variable. If you call the service through an SDK, install the DashScope SDK.
HTTP call
POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
Request

A request uses one of two modes: multimodal independent embedding or multimodal fused embedding (see Vector types).
Request headers

Content-Type: The content type of the request. Must be application/json.

Authorization: The authentication credentials, using a Model Studio API key. Example: Bearer sk-xxx.

Request body

model: The model name. Set this to a model name from the Model overview.

input: The input content.

parameters: Parameters for embedding generation. For HTTP calls, these must be wrapped in the parameters object.
Response

A response is either a successful response or an error response, with the following fields:

output: The result of the task.

request_id: The unique identifier of the request. Use it for tracing and troubleshooting issues.

code: The error code. Returned only when the request fails. See error codes for details.

message: The detailed error message. Returned only when the request fails. See error codes for details.

usage: Statistics on token usage.
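Putting the endpoint, headers, and body together, an HTTP request can be sketched as follows. The payload is only built and printed here, not sent; the model name and dimension come from the tables above, and the placeholder key falls back to sk-xxx for illustration.

```python
import json
import os

# Endpoint from the HTTP call section above.
url = "https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding"

# Headers as described above; the key is read from the environment.
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY', 'sk-xxx')}",
}

# Request body: model, input.contents, and the parameters object.
body = {
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [
            {"text": "General multimodal representation model example"},
        ]
    },
    "parameters": {"dimension": 1152},
}

print(json.dumps(body, ensure_ascii=False, indent=2))
```

To actually send the request, POST this body with these headers using any HTTP client, such as urllib.request.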
SDK usage
In this SDK version, the request body structure differs from that of native HTTP calls. The SDK's input parameter corresponds to input.contents in an HTTP request.
Code examples
Image embedding
Image URL
import dashscope
import json
from http import HTTPStatus
# Replace this with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
    print(json.dumps(result, ensure_ascii=False, indent=4))

Local image
To generate an image embedding from a local image, first convert the image to Base64 format, as shown in the following example.
import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
# Read the file and convert it to Base64.
base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png" # Modify this based on the actual format, such as jpg or bmp.
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data
input = [{'image': image_data}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
    print(json.dumps(result, ensure_ascii=False, indent=4))

Video embedding
The multimodal embedding model currently accepts video input only via URL. Passing local video files directly is not supported.
import dashscope
import json
from http import HTTPStatus
# Replace this with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
print(json.dumps(result, ensure_ascii=False, indent=4))
Text embedding
import dashscope
import json
from http import HTTPStatus
text = "General multimodal representation model example"
input = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
    print(json.dumps(result, ensure_ascii=False, indent=4))

Fused embedding
import dashscope
import json
import os
from http import HTTPStatus
# Multimodal fused embedding: Combines text, image, and video into a single fused embedding.
# Ideal for applications like cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
# The input contains text, an image, and a video. Set enable_fusion=True to generate a single fused embedding.
input_data = [
{"text": text},
{"image": image},
{"video": video}
]
# Use qwen3-vl-embedding to generate a fused embedding.
resp = dashscope.MultiModalEmbedding.call(
# If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-vl-embedding",
input=input_data,
enable_fusion=True,
# Optional parameter: Specify the embedding dimension. Supported values are 2560, 2048, 1536, 1024, 768, 512, and 256. The default is 2560.
# parameters={"dimension": 1024}
)
print(json.dumps(resp, ensure_ascii=False, indent=4))
Multi-image fused embedding
This example shows how to use the qwen3-vl-embedding model to fuse multiple images and text into a single embedding. To perform this multi-image fusion, pass multiple image items. This method is ideal for comprehensive semantic retrieval of products using multi-angle images and text descriptions.
import dashscope
import json
import os
from http import HTTPStatus
# Multi-image + text fused embedding: Fuses multiple product images and a description text into a single embedding.
# Suitable for comprehensive semantic retrieval using multi-angle product images and a text description.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image1 = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
image2 = "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"
# Pass multiple image items for multi-image fusion. `enable_fusion=True` fuses all inputs into a single embedding.
input_data = [
{"text": text},
{"image": image1},
{"image": image2}
]
resp = dashscope.MultiModalEmbedding.call(
# If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-vl-embedding",
input=input_data,
enable_fusion=True
)
print(json.dumps(resp, ensure_ascii=False, indent=4))
Output
{
"status_code": 200,
"request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
"code": "",
"message": "",
"output": {
"embeddings": [
{
"index": 0,
"embedding": [
-0.009490966796875,
-0.024871826171875,
-0.031280517578125,
...
],
"type": "text"
}
]
},
"usage": {
"input_tokens": 10,
"input_tokens_details": {
"image_tokens": 0,
"text_tokens": 10
},
"output_tokens": 1,
"total_tokens": 11
}
}

Error codes
If the model call fails and returns an error message, see Error messages for resolution.
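A response's status_code, code, and message fields are enough to branch on success or failure. The handling pattern below is a sketch: summarize_response and the stub response class are illustrative helpers, not part of the SDK, and the error code shown is only an example.

```python
from http import HTTPStatus

def summarize_response(resp):
    # Works with any object exposing status_code, code, and message,
    # such as the SDK responses used in the examples above.
    if resp.status_code == HTTPStatus.OK:
        return "ok"
    return f"error {resp.code}: {resp.message}"

# Stub standing in for a failed call; a real response exposes the same fields.
class ThrottledResp:
    status_code = 429
    code = "Throttling"
    message = "Requests throttling triggered."

print(summarize_response(ThrottledResp()))
```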