The multimodal embedding model converts text, images, or videos into vectors. It is suitable for tasks such as video classification, image classification, and cross-modal retrieval.
Core capabilities
Cross-modal retrieval: Perform cross-modal semantic searches, such as text-to-image, image-to-video, and image-to-image search.
Semantic similarity calculation: Measure the semantic similarity between content of different modalities in a unified vector space.
Content classification and clustering: Perform intelligent grouping, tagging, and clustering analysis based on the semantic vector embeddings.
Key attribute: The vector embeddings generated from all modalities, such as text, images, and videos, are in the same semantic space. You can directly perform cross-modal matching and comparison by calculating cosine similarity. For more information about how to select and use the models, see Text and multimodal embeddings.
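Because embeddings from all modalities land in the same space, cross-modal matching reduces to a cosine similarity calculation. The following is a minimal sketch with NumPy; the short placeholder vectors stand in for the model's 1,024-dimensional output:

import numpy as np

# Placeholder vectors; in practice, these come from the model's
# output.embeddings[i].embedding fields (1,024 dimensions each).
text_vec = np.array([0.1, -0.2, 0.3])
image_vec = np.array([0.1, -0.1, 0.4])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the norms;
    # values closer to 1 indicate higher semantic similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(text_vec, image_vec))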
This model service is available only in the China (Beijing) region. You must use an API key from the China (Beijing) region.
Overview
| Model | Embedding dimensions | Text length limit | Image/video size limit | Price (per 1,000 input tokens) |
| --- | --- | --- | --- | --- |
| multimodal-embedding-v1 (available only in the China (Beijing) region) | 1,024 | 512 tokens | Image size: ≤ 3 MB; quantity: 1 | Free trial, with no token quota limit |
The following input type and format limits apply when you call the general-purpose multimodal embedding API.
| Input type | Language/format limits |
| --- | --- |
| Text | Chinese and English |
| Image | JPG, PNG, and BMP. Supports input in Base64 format or as a URL. |
| Multiple images | Same as a single image (JPG, PNG, and BMP; Base64 or URL). |
| Video | MP4, MPEG, MPG, WEBM, AVI, FLV, MKV, and MOV |
Each call accepts a single text segment, a single image, or a single video file, or one combination of different types, such as text plus an image, as shown in the sketch below. Only one combination is allowed per call, and each file must meet the length and size limits in the preceding table.
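For example, a combined text-and-image request with the Python SDK (described below) looks like this sketch; the text and image URL are placeholders:

import dashscope

# One call can combine modalities: a single text segment plus a single image.
inputs = [
    {'text': 'An example query'},
    {'image': 'https://example.com/sample.png'}  # Placeholder URL.
]
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)
print(resp.status_code)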
Prerequisites
You must obtain an API key and set the API key as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.
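For the Python SDK, a minimal sketch of the key setup, assuming your China (Beijing) region key is stored in the DASHSCOPE_API_KEY environment variable:

import os
import dashscope

# The DashScope SDK reads DASHSCOPE_API_KEY automatically; setting
# dashscope.api_key explicitly is an alternative. The fallback value
# here is a placeholder.
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY", "sk-xxxx")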
HTTP
POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
Request

The request consists of headers and a request body.

Headers

| Header | Description |
| --- | --- |
| Content-Type | The request content type. Set to application/json, or to text/event-stream to enable Server-Sent Events (SSE) responses. |
| Authorization | The authentication credential for the request. This API uses a Model Studio API key. Example: Bearer sk-xxxx. |

Request body

| Parameter | Description |
| --- | --- |
| model | The name of the model, from the Overview table. |
| input | The input content. |
| parameters | (Optional) An object containing additional parameters. |

Response

A successful response contains output, request_id, and usage. An error response contains code and message instead.

| Field | Description |
| --- | --- |
| output | The task output. |
| request_id | The unique request ID. You can use this ID to trace and troubleshoot issues. |
| code | The error code for a failed request. Not returned when the request succeeds. For more information, see 429-Error messages. |
| message | Detailed information about a failed request. Not returned when the request succeeds. For more information, see 429-Error messages. |
| usage | Statistics about the tokens consumed by the request. |
Use the SDK
The request body structure for SDK calls is different from that for native HTTP calls. The input parameter in the SDK corresponds to the input.contents parameter in HTTP calls.
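For reference, the following sketch shows the same kind of text request made over raw HTTP with the requests library; the query text is a placeholder:

import os
import requests

url = "https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY')}",
}
# In HTTP calls, the content list sits under input.contents, rather than
# being passed directly as input as in the SDK.
body = {
    "model": "multimodal-embedding-v1",
    "input": {"contents": [{"text": "An example query"}]},
}
resp = requests.post(url, headers=headers, json=body)
print(resp.json())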
Sample code
Generate an image embedding
Use an image URL
import dashscope
import json
from http import HTTPStatus
# In practice, replace the URL with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
inputs = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # If the call fails, print basic error information.
    print(f"Request failed: {getattr(resp, 'code', '')} - {getattr(resp, 'message', '')}")

Use a local image
The following sample code shows how to convert a local image to the Base64 format and then call the model to generate an embedding.
import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. In practice, replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and encode it as Base64.
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png"  # Modify this based on the actual format, such as jpg or bmp.
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data.
inputs = [{'image': image_data}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # If the call fails, print basic error information.
    print(f"Request failed: {getattr(resp, 'code', '')} - {getattr(resp, 'message', '')}")

Generate a video embedding
The multimodal embedding model supports only video file URLs as input. Local video files are not supported.
import dashscope
import json
from http import HTTPStatus
# In practice, replace the URL with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
inputs = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # If the call fails, print basic error information.
    print(f"Request failed: {getattr(resp, 'code', '')} - {getattr(resp, 'message', '')}")

Generate a text embedding
import dashscope
import json
from http import HTTPStatus
text = "General multimodal embedding model example"
inputs = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # If the call fails, print basic error information.
    print(f"Request failed: {getattr(resp, 'code', '')} - {getattr(resp, 'message', '')}")

Sample response
{
    "status_code": 200,
    "request_id": "b5623e99-ea0c-9509-9b25-20bcc99d03e9",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.020782470703125,
                    -0.01399993896484375,
                    -0.0229949951171875,
                    ...
                ],
                "type": "text"
            }
        ]
    },
    "usage": {
        "input_tokens": 12,
        "image_tokens": 0
    }
}
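To work with the vectors, read them from output.embeddings. The following is a minimal sketch that repeats the text-embedding call above and extracts the first vector, assuming a successful response:

import dashscope

# Repeat the text-embedding call; the model name comes from the Overview table.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=[{'text': 'General multimodal embedding model example'}]
)
# output.embeddings contains one entry per input item.
embedding = resp.output["embeddings"][0]
vector = embedding["embedding"]  # A 1,024-dimensional list of floats.
print(embedding["type"], len(vector))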
Error codes
If a call fails, see 429-Error messages for troubleshooting.