Multimodal embedding models transform text, images, or videos into floating-point vectors in a unified semantic space; the available embedding dimensions depend on the model (see Model overview). These models are suitable for tasks such as video classification, image classification, and text-to-image search.
Core capabilities
Cross-modal retrieval: Perform cross-modal semantic searches, such as text-to-image search, image-to-video search, and search by image.
Semantic similarity calculation: Measure the semantic similarity between content of different modalities in a unified vector space.
Content classification and clustering: Perform intelligent grouping, tagging, and clustering analysis based on the semantic embeddings of content.
Key attribute: The embeddings generated from all modalities (text, image, and video) exist in the same semantic space. This allows for direct cross-modal matching and comparison using methods such as cosine similarity. For more information about model selection and application methods, see Text and multimodal embedding.
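Because embeddings from different modalities share one vector space, cross-modal matching reduces to a plain vector comparison. The following is a minimal sketch of cosine similarity with NumPy; the toy vectors stand in for, say, a text embedding and an image embedding returned by the service.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors in the shared space.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors used only for illustration.
text_embedding = [0.012, -0.034, 0.051, 0.008]
image_embedding = [0.010, -0.029, 0.047, 0.015]
print(cosine_similarity(text_embedding, image_embedding))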
Model availability differs by region, as listed in the Model overview. You must use an API key from the corresponding region to call the service.
Model overview
Singapore
| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1M input tokens) | Free quota (Note) |
| --- | --- | --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | 1152, 1024, 512, 256, 128, 64 | 1,024 tokens | A single file cannot be larger than 3 MB | Up to 10 MB | Image/Video: $0.09; Text: $0.09 | 1 million tokens; valid for 90 days after activating Model Studio |
| tongyi-embedding-vision-flash | 768, 512, 256, 128, 64 | 1,024 tokens | A single file cannot be larger than 3 MB | Up to 10 MB | Image/Video: $0.03; Text: $0.09 | 1 million tokens; valid for 90 days after activating Model Studio |
Beijing
| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1M input tokens) |
| --- | --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2560, 2048, 1536, 1024, 768, 512, 256 | 32,000 tokens | Max 1 image, up to 5 MB | Up to 50 MB | Image/Video: $0.258; Text: $0.1 |
| multimodal-embedding-v1 | 1024 | 512 tokens | Max 8 images, up to 3 MB each | Up to 10 MB | Free trial |
Input format and language limits
Fused multimodal embedding models

| Model | Text | Image | Video | Max elements per request |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | Supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64 supported) | MP4, AVI, MOV (URL only) | Up to 20 content elements in total; images, text, and videos share this limit. |

Independent multimodal embedding models

| Model | Text | Image | Video | Max elements per request |
| --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | Chinese/English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, MKV (URL only) | No limit on the number of content elements; the total token count must not exceed the model's token limit. |
| tongyi-embedding-vision-flash | Chinese/English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, MKV (URL only) | No limit on the number of content elements; the total token count must not exceed the model's token limit. |
| multimodal-embedding-v1 | Chinese/English | JPG, PNG, BMP (URL or Base64 supported) | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, MKV (URL only) | Up to 20 content elements in total: max 1 image, 1 video, and 20 text entries, sharing the limit. |
The API supports uploading a single text segment, a single image, or a single video file. It also allows combinations of different types, such as text and an image. Some models support multiple inputs of the same content type, such as multiple images. For more information, see the limits for the specific model.
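For illustration, a combined request could assemble its input list as follows. Whether a given model accepts this combination, and how many elements it allows, is governed by the limits above; the URL is only a placeholder.
# Hypothetical combined input: one text element plus one image element.
# The element keys ('text', 'image', 'video') match the SDK examples below.
combined_input = [
    {'text': 'A white cat sitting on a windowsill'},
    {'image': 'https://example.com/cat.png'},  # placeholder URL
]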
Prerequisites
You must obtain an API key and set the API key as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.
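A quick sanity check before the first call, assuming you have installed the DashScope SDK for Python and exported DASHSCOPE_API_KEY:
import os

import dashscope

# The DashScope SDK reads DASHSCOPE_API_KEY from the environment by default;
# fail early if the key is missing.
if not os.getenv("DASHSCOPE_API_KEY"):
    raise RuntimeError("Set the DASHSCOPE_API_KEY environment variable before calling the service.")
print("dashscope SDK imported and API key found.")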
HTTP call
POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
Request headers
`Content-Type`: The content type of the request. Set this to application/json, or to text/event-stream to enable Server-Sent Events (SSE) responses.
`Authorization`: The authentication credential. Use a Model Studio API key in the format Bearer <your API key>.

Request body
`model`: The model name. Set this to a model name from the Model overview.
`input`: The input content.
`parameters`: Parameters for vector processing, such as the embedding dimension. For HTTP calls, these parameters must be wrapped in the `parameters` object. For SDK calls, you can pass them directly.

Response
`output`: The task output information, including the generated embeddings.
`request_id`: The unique identifier for the request. Use it for tracing and troubleshooting.
`code`: The error code. Returned only when the request fails. See Error codes for details.
`message`: The detailed error message. Returned only when the request fails. See Error codes for details.
`usage`: Statistics about the request, such as the number of input tokens.
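The following is a minimal sketch of calling this endpoint directly with the requests library, assuming the DASHSCOPE_API_KEY environment variable is set. Note that in the HTTP body the content list is nested under input.contents, unlike the SDK's `input` parameter (see SDK usage below).
import json
import os

import requests

url = "https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY')}",
}
body = {
    "model": "multimodal-embedding-v1",
    "input": {
        # In the HTTP body, the content list is nested under input.contents.
        "contents": [
            {"text": "General multimodal representation model example"}
        ]
    },
}
resp = requests.post(url, headers=headers, json=body)
print(resp.status_code)
print(json.dumps(resp.json(), ensure_ascii=False, indent=4))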
SDK usage
The request body structure for the current SDK version is different from that of a native HTTP call. The `input` parameter in the SDK corresponds to `input.contents` in the HTTP call.
Code examples
Generate image embeddings
Use an image URL
import dashscope
import json
from http import HTTPStatus
# Replace the URL with the URL of your image.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
    print(json.dumps(result, ensure_ascii=False, indent=4))
Use a local image
You can use the following sample code to convert a local image to Base64 format and then call the tongyi-embedding-vision-plus model to generate an embedding.
import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
# Read the file and convert to Base64.
base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png" # Modify this based on the actual format, such as jpg or bmp.
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data.
input = [{'image': image_data}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
    print(json.dumps(result, ensure_ascii=False, indent=4))
Generate video embeddings
The multimodal embedding model currently supports video file input only as a URL. Passing local videos directly is not supported.
import dashscope
import json
from http import HTTPStatus
# Replace the URL with the URL of your video.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
print(json.dumps(result, ensure_ascii=False, indent=4))
Generate text embeddings
import dashscope
import json
from http import HTTPStatus
text = "General multimodal representation model example"
input = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
model="tongyi-embedding-vision-plus",
input=input
)
if resp.status_code == HTTPStatus.OK:
result = {
"status_code": resp.status_code,
"request_id": getattr(resp, "request_id", ""),
"code": getattr(resp, "code", ""),
"message": getattr(resp, "message", ""),
"output": resp.output,
"usage": resp.usage
}
    print(json.dumps(result, ensure_ascii=False, indent=4))
Generate fused embeddings
import dashscope
import json
import os
from http import HTTPStatus
# Multimodal fused embedding: Fuses text, an image, and a video into a single fused embedding.
# Suitable for scenarios such as cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
# The input contains text, an image, and a video. The model will fuse them into a single fused embedding.
input_data = [
{
"text": text,
"image": image,
"video": video
}
]
# Use qwen3-vl-embedding to generate the fused embedding.
resp = dashscope.MultiModalEmbedding.call(
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-vl-embedding",
input=input_data,
# Optional parameter: Specify the vector dimension (supported values: 2560, 2048, 1536, 1024, 768, 512, 256; default: 2560)
# parameters={"dimension": 1024}
)
print(json.dumps(resp, ensure_ascii=False, indent=4))
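The response object is dictionary-like, and the vectors sit under output.embeddings, as shown in the output example below. Here is a minimal sketch of extracting the first embedding from a successful call; resp stands for the response returned by any of the examples above.
from http import HTTPStatus

def first_embedding(resp):
    # Return the first embedding vector from a successful response, or None on failure.
    if resp.status_code != HTTPStatus.OK:
        print("Request failed:", resp.code, resp.message)
        return None
    item = resp.output["embeddings"][0]
    print("index:", item["index"], "dimension:", len(item["embedding"]))
    return item["embedding"]

# Example: vector = first_embedding(resp)  # resp from any of the calls above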
Output example
{
"status_code": 200,
"request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
"code": "",
"message": "",
"output": {
"embeddings": [
{
"index": 0,
"embedding": [
-0.009490966796875,
-0.024871826171875,
-0.031280517578125,
...
],
"type": "text"
}
]
},
"usage": {
"input_tokens": 10,
"input_tokens_details": {
"image_tokens": 0,
"text_tokens": 10
},
"output_tokens": 1,
"total_tokens": 11
}
}
Error codes
If a call fails, see Error messages for troubleshooting.