Multimodal embedding models transform text, images, or videos into unified vector representations suitable for tasks like video classification, image classification, and text-to-image search.
Core capabilities
Cross-modal retrieval: Search across different content types, like text-to-image, image-to-video, or image search.
Semantic similarity: Calculate similarity between different content types in a unified vector space.
Content classification and clustering: Group, tag, and cluster content based on semantic embeddings.
Key attribute: All modalities (text, image, video) generate embeddings in the same semantic space, enabling direct cross-modal matching and comparison using methods like cosine similarity. For more information about model selection and application methods, see Text and multimodal embedding.
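Because all modalities share one vector space, a text embedding and an image embedding can be compared directly. A minimal sketch of cosine similarity in plain Python (the vectors here are illustrative toy values, not real model output):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a text embedding and an image embedding;
# real vectors come from the model and share the same dimension.
text_vec = [0.1, 0.3, -0.2]
image_vec = [0.05, 0.28, -0.25]
print(round(cosine_similarity(text_vec, image_vec), 3))  # 0.981
```

The same function works for text-to-image, image-to-video, or any other cross-modal pair, precisely because the model emits all modalities into one space.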
This model service is only available in the China (Beijing) region. Use an API key from this region to call the service.
Vector types
The multimodal embedding model supports two vector generation methods:
Multimodal independent vectors: The model generates an independent vector for each input (such as text, an image, a video, or multiple images) in the `contents` array. For example, if you provide one piece of text and one image, the model returns two independent vectors. This method suits scenarios that compare different types of content individually, such as image-to-image or text-to-image search.

Multimodal fused embedding: The model fuses all inputs in `contents` into a single embedding for cross-modal semantic representation. This is ideal for scenarios that require a comprehensive understanding of multimodal content, such as fusing product images and descriptive text into a unified representation for retrieval. For `qwen3-vl-embedding`, enable fusion mode by setting `enable_fusion=true`. Fused embeddings support the following combinations:

Text and image
Text and video
Multiple images and text (by passing multiple `image` entries)
Image, video, and text

`qwen2.5-vl-embedding` supports only fused embeddings, not independent embeddings. `tongyi-embedding-vision-plus` and `tongyi-embedding-vision-flash` support only independent embeddings.
For model descriptions, selection guidance, and usage, see Text and multimodal embedding.
Model overview
Singapore
| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price | Free quota (Note) |
| --- | --- | --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | 1152 (default), 1024, 512, 256, 128, 64 | 1,024 tokens | Up to 3 MB per image | Up to 10 MB per video file | Image/Video: $0.09; Text: $0.09 | 1 million tokens. Validity period: 90 days after activating Model Studio |
| tongyi-embedding-vision-flash | 768 (default), 512, 256, 128, 64 | 1,024 tokens | Up to 3 MB per image | Up to 10 MB per video file | Image/Video: $0.03; Text: $0.09 | 1 million tokens. Validity period: 90 days after activating Model Studio |
Beijing
| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price |
| --- | --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2560 (default), 2048, 1536, 1024, 768, 512, 256 | 32,000 tokens | Max. 1 image, up to 5 MB | Up to 50 MB per video file | Image/Video: $0.258; Text: $0.1 |
| multimodal-embedding-v1 | 1024 | 512 tokens | Up to 8 images, 3 MB each | Up to 10 MB per video file | Free trial |
Input format and language limits
Fused multimodal model

| Model | Text | Image | Video | Request limit |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | Supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German. | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, and SGI (URL or Base64 supported). | MP4, AVI, and MOV (URL only). | Max 20 elements per request (up to 5 images). |

Independent multimodal model

| Model | Text | Image | Video | Request limit |
| --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | Chinese and English | JPG, PNG, and BMP (URL or Base64 supported). | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only). | No element count limit; requests are limited by the token count per batch. |
| tongyi-embedding-vision-flash | Chinese and English | JPG, PNG, and BMP (URL or Base64 supported). | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only). | No element count limit; requests are limited by the token count per batch. |
| multimodal-embedding-v1 | Chinese and English | JPG, PNG, and BMP (URL or Base64 supported). | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only). | Up to 20 content elements per request, including a maximum of 1 image and 1 video. |

All models support text, image, and video inputs, individually or in combination. The `tongyi-embedding-vision-plus` and `tongyi-embedding-vision-flash` models also support `multi_images` input for image sequences.
Model capabilities
| Model | Default dimension | Type | Supported inputs | Description |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2,560 | Independent / Fusion | text, image, video, multi_images | Fusion mode is enabled with the `enable_fusion` parameter. |
| tongyi-embedding-vision-plus | 1,152 | Independent only | text, image, video, multi_images | Supports the `dimension` parameter. |
| tongyi-embedding-vision-flash | 768 | Independent only | text, image, video, multi_images | Supports the `dimension` parameter. |
| multimodal-embedding-v1 | 1,024 | Independent only | text, image, video | The `dimension` parameter is not supported; the dimension is fixed at 1024. |
Prerequisites
Get an API key and export it as an environment variable. If you use an SDK to make calls, install the DashScope SDK.
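Before running any example, it helps to verify that the key is actually exported. A minimal sketch (the `require_api_key` helper is illustrative, not part of the SDK):

```python
import os

def require_api_key(env_var="DASHSCOPE_API_KEY"):
    # The DashScope SDK reads the key from DASHSCOPE_API_KEY by default;
    # failing fast here gives a clearer error than a rejected API call.
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before calling the API.")
    return key
```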
HTTP call
POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
Request

The reference below applies to both request types: multimodal independent embedding and multimodal fused embedding.

Request headers

Content-Type: The content type of the request. Must be `application/json`.

Authorization: The authentication credentials using a Model Studio API key. Example: `Bearer sk-xxx`.

Request body

model: The model name. Set this to a model name from the Model overview.

input: The input content.

parameters: Parameters for vector processing, such as `dimension`.

Response

A response is either successful or an error; error responses carry `code` and `message`.

output: The embedding results.

request_id: Unique identifier for the request. Use it to trace and troubleshoot issues.

code: The error code. Returned only when the request fails. See error codes for details.

message: Detailed error message. Returned only when the request fails. See error codes for details.

usage: Token usage statistics.
SDK usage
The SDK request body differs from HTTP calls: `input` in the SDK corresponds to `input.contents` in HTTP.
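As a sketch of that mapping, the content list passed to the SDK's `input` argument nests under `input.contents` in the HTTP body (the URL and parameter values here are placeholders, and the exact field layout should be verified against the HTTP reference above):

```python
# Content list as passed to the SDK's `input` argument.
contents = [
    {"text": "example query"},
    {"image": "https://example.com/image.png"},  # placeholder URL
]

# Equivalent HTTP request body for the endpoint above:
# the SDK's `input` becomes `input.contents`.
http_body = {
    "model": "tongyi-embedding-vision-plus",
    "input": {"contents": contents},
    "parameters": {"dimension": 512},  # optional vector-processing options
}
```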
Code examples
Image embedding
Image URL
import dashscope
import json
from http import HTTPStatus
# Replace this with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Local image
Convert a local image to Base64 and call the model for embedding:
import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and convert it to Base64.
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png" # Modify this based on the actual format, such as jpg or bmp.
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data
input = [{'image': image_data}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Video embedding
Video input must be a URL. Local videos are not supported.
import dashscope
import json
from http import HTTPStatus
# Replace this with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
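Because video input must be a remote URL, a quick pre-check can reject local paths before spending an API call. `is_remote_url` is a hypothetical helper, not part of the SDK:

```python
from urllib.parse import urlparse

def is_remote_url(value):
    # Local file paths parse with an empty scheme; accept only http(s) URLs.
    return urlparse(value).scheme in ("http", "https")

print(is_remote_url("https://example.com/clip.mp4"))  # True
print(is_remote_url("/home/user/clip.mp4"))           # False
```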
Text embedding
import dashscope
import json
from http import HTTPStatus
text = "General multimodal representation model example"
input = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Fused embedding
import dashscope
import json
import os
from http import HTTPStatus
# Multimodal fused embedding: Combines text, image, and video into a single fused embedding.
# Ideal for applications like cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
# The input contains text, an image, and a video. Set enable_fusion=True to generate a single fused embedding.
input_data = [
{"text": text},
{"image": image},
{"video": video}
]
# Use qwen3-vl-embedding to generate a fused embedding.
resp = dashscope.MultiModalEmbedding.call(
# If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-vl-embedding",
input=input_data,
enable_fusion=True,
# Optional parameter: Specify the embedding dimension. Supported values are 2560, 2048, 1536, 1024, 768, 512, and 256. The default is 2560.
# parameters={"dimension": 1024}
)
print(json.dumps(resp, ensure_ascii=False, indent=4))
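To use the result, read the vector from `output.embeddings[0].embedding`, matching the response shape shown in the Output section. A sketch against a truncated stand-in payload (a real fused vector has the model's full dimension, 2560 by default for qwen3-vl-embedding):

```python
# Truncated stand-in for a parsed response; real vectors are full-length.
resp_dict = {
    "output": {
        "embeddings": [
            {"index": 0, "embedding": [-0.0094, -0.0248, -0.0312]}
        ]
    },
    "usage": {"total_tokens": 11},
}

vector = resp_dict["output"]["embeddings"][0]["embedding"]
print(len(vector))  # 3 (truncated example only)
```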
Multi-image fused embedding
This example shows how to use the qwen3-vl-embedding model to fuse multiple images and text into a single embedding. To perform this multi-image fusion, pass multiple image items. This method is ideal for comprehensive semantic retrieval of products using multi-angle images and text descriptions.
import dashscope
import json
import os
from http import HTTPStatus
# Multi-image + text fused embedding: Fuses multiple product images and a description text into a single embedding.
# Suitable for comprehensive semantic retrieval using multi-angle product images and a text description.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image1 = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
image2 = "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"
# Pass multiple image items for multi-image fusion. `enable_fusion=True` fuses all inputs into a single embedding.
input_data = [
{"text": text},
{"image": image1},
{"image": image2}
]
resp = dashscope.MultiModalEmbedding.call(
# If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-vl-embedding",
input=input_data,
enable_fusion=True
)
print(json.dumps(resp, ensure_ascii=False, indent=4))
Output
{
"status_code": 200,
"request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
"code": "",
"message": "",
"output": {
"embeddings": [
{
"index": 0,
"embedding": [
-0.009490966796875,
-0.024871826171875,
-0.031280517578125,
...
],
"type": "text"
}
]
},
"usage": {
"input_tokens": 10,
"input_tokens_details": {
"image_tokens": 0,
"text_tokens": 10
},
"output_tokens": 1,
"total_tokens": 11
}
}

Error codes
If the model call fails and returns an error message, see Error messages for resolution.