
Alibaba Cloud Model Studio: Multimodal embedding API details

Last Updated: Nov 10, 2025

The multimodal embedding model converts text, images, and videos into vectors. It is suitable for tasks such as video classification, image classification, and cross-modal retrieval.

Core capabilities

  • Cross-modal retrieval: Perform cross-modal semantic searches, such as text-to-image, image-to-video, and image-to-image search.

  • Semantic similarity calculation: Measure the semantic similarity between content of different modalities in a unified vector space.

  • Content classification and clustering: Perform intelligent grouping, tagging, and clustering analysis based on the semantic vector embeddings.

Key attribute: The vector embeddings generated from all modalities, such as text, images, and videos, are in the same semantic space. You can directly perform cross-modal matching and comparison by calculating cosine similarity. For more information about how to select and use the models, see Text and multimodal embeddings.
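Because all modalities share one semantic space, cross-modal matching reduces to computing cosine similarity between two embedding vectors. The following sketch shows the calculation in plain Python; the short example vectors are hypothetical stand-ins for the real 1,024-dimension embeddings returned by the API.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for a text embedding and an image embedding.
# Real embeddings from this API are 1,024-dimension float lists.
text_vec = [0.02, -0.01, 0.03]
image_vec = [0.05, 0.007, 0.026]

score = cosine_similarity(text_vec, image_vec)  # higher = more similar
```

A score close to 1 indicates semantically similar content, regardless of which modalities the two vectors came from.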
Important

This model service is available only in the China (Beijing) region. You must use an API key from the China (Beijing) region.

Overview

Model: multimodal-embedding-v1 (available only in the China (Beijing) region)

  • Embedding dimensions: 1,024

  • Text length limit: 512 tokens

  • Image size limit: ≤ 3 MB (quantity: 1)

  • Video size limit: ≤ 10 MB

  • Price (1,000 input tokens): free trial, with no token quota limit

The following input type and format limits apply when you call the general-purpose multimodal embedding API.

  • Text: Chinese and English.

  • Image: JPG, PNG, and BMP. Supports input in Base64 format or as a URL.

  • Multiple images: Same formats as a single image (JPG, PNG, and BMP; Base64 or URL).

  • Video: MP4, MPEG, MPG, WEBM, AVI, FLV, MKV, and MOV.

The API supports a single text segment, a single image, or a single video file. It also allows combinations of different types, such as text and an image. Only one combination is allowed per call, and files must meet the length and size requirements in the table.
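Under the combination rules above, a single call can mix modalities. The following sketch assembles a valid text-plus-image request body as a Python literal (the image URL is a placeholder):

```python
# One call may combine different modalities, for example text plus an image.
# Each element of "contents" is a {"modality_type": value} dictionary.
contents = [
    {"text": "Multimodal embedding model"},
    {"image": "https://example.com/sample.jpg"},  # placeholder URL
]

request_body = {
    "model": "multimodal-embedding-v1",
    "input": {"contents": contents},
}
```

Only one such combination is allowed per call; each element must still satisfy the size and length limits in the table.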

Prerequisites

You must obtain an API key and set the API key as an environment variable. If you use an SDK to make calls, you must also install the DashScope SDK.

HTTP

POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding

Request

Multimodal embedding

curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "multimodal-embedding-v1",
    "input": {
        "contents": [ 
            {"text": "Multimodal embedding model"},
            {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"},
            {"multi_images": [
                "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
                "https://img.alicdn.com/imgextra/i2/O1CN01dSYhpw1nSoamp31CD_!!6000000005089-2-tps-1765-1639.png"
                ]
              }
        ]
    }
}'

Headers

Content-Type string (Required)

The request content type. Set this parameter to application/json.

Authorization string (Required)

The identity authentication credentials for the request. This API uses a Model Studio API key for identity authentication. Example: Bearer sk-xxxx.

Request body

model string (Required)

The name of the model, from the Overview table.

input object (Required)

The input content.

Property

contents array (Required)

A list of content to process. Each element is a dictionary or a string that specifies the type and value of the content. The format is {"modality_type": "input_string_or_image/video_url"}. The text, image, video, and multi_images modality types are supported.

  • Text: The key is text. The value is a string. You can also pass the string directly without using a dictionary.

  • Image: The key is image. The value can be a publicly accessible URL or a Base64-encoded Data URI. The format for a Data URI is data:image/{format};base64,{data}, where {format} is the image format, such as jpeg or png, and {data} is the Base64-encoded string.

  • Video: The key is video. The value must be a publicly accessible URL.

  • Multiple images: The key is multi_images. The value is an array of images, each in the same format as a single image.


Response

Successful response

{
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.026611328125,
                    -0.016571044921875,
                    -0.02227783203125,
                    ...
                ],
                "type": "text"
            },
            {
                "index": 1,
                "embedding": [
                    0.051544189453125,
                    0.007717132568359375,
                    0.026611328125,
                    ...
                ],
                "type": "image"
            },
            {
                "index": 2,
                "embedding": [
                    -0.0217437744140625,
                    -0.016448974609375,
                    0.040679931640625,
                    ...
                ],
                "type": "video"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "image_tokens": 896
    },
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}
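A response like the one above can be unpacked into per-modality vectors. The following sketch groups the embeddings by their type field; the truncated example values stand in for real 1,024-dimension embeddings:

```python
# A trimmed-down response in the shape shown above (placeholder values).
response = {
    "output": {
        "embeddings": [
            {"index": 0, "embedding": [-0.026, -0.016], "type": "text"},
            {"index": 1, "embedding": [0.051, 0.007], "type": "image"},
        ]
    },
    "usage": {"input_tokens": 10, "image_tokens": 896},
    "request_id": "example-request-id",
}

# Group vectors by modality type for downstream similarity comparisons.
vectors_by_type = {
    item["type"]: item["embedding"]
    for item in response["output"]["embeddings"]
}
```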

Error response

{
    "code":"InvalidApiKey",
    "message":"Invalid API-key provided.",
    "request_id":"fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1"
}

output object

The task output.

Property

embeddings array

A list of embedding results. Each object corresponds to an element in the input list.

Property

index int

The index of the result in the input list.

embedding array

The generated 1024-dimension embedding.

type string

The input type corresponding to the result. Valid values: text, image, video, or multi_images.

request_id string

The unique request ID. You can use this ID to trace and troubleshoot issues.

code string

The error code for a failed request. This parameter is not returned if the request is successful. For more information, see 429-Error messages.

message string

The detailed information about a failed request. This parameter is not returned if the request is successful. For more information, see 429-Error messages.

usage object

Statistics about the output.

Property

input_tokens int

The number of tokens in the input content of the request.

image_tokens int

The number of tokens in the input image or video. For video input, the system extracts a specific number of frames, which is determined by the system configuration. The number of tokens is then calculated based on the processed image or video frames.

image_count int

The number of images in the request.

duration int

The duration of the input video in seconds.

Use the SDK

The request body structure for SDK calls is different from that for native HTTP calls. The input parameter in the SDK corresponds to the input.contents parameter in HTTP calls.
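As a sketch of that mapping, the HTTP request body and the equivalent SDK argument relate as follows (the payload values are placeholders):

```python
# HTTP request body: content items live under input.contents.
http_body = {
    "model": "multimodal-embedding-v1",
    "input": {
        "contents": [{"text": "Multimodal embedding model"}]
    },
}

# SDK call: the same content items are passed directly as `input`.
sdk_input = [{"text": "Multimodal embedding model"}]

# The SDK `input` argument corresponds to input.contents in the HTTP body.
assert http_body["input"]["contents"] == sdk_input
```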

Sample code

Generate an image embedding

Use an image URL

import dashscope
import json
from http import HTTPStatus

# In practice, replace the URL with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
inputs = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Print the error details if the call fails.
    print(f"Request failed: {resp.code} - {resp.message}")

Use a local image

The following sample code shows how to convert a local image to the Base64 format and then call the model to generate an embedding.

import dashscope
import base64
import json
from http import HTTPStatus

# Read the image and convert it to Base64. In practice, replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format. Modify this based on the actual format, such as jpg or bmp.
image_format = "png"
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data.
inputs = [{'image': image_data}]

# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Print the error details if the call fails.
    print(f"Request failed: {resp.code} - {resp.message}")

Generate a video embedding

The multimodal embedding model supports only video file URLs as input. Local video files are not supported.

import dashscope
import json
from http import HTTPStatus

# In practice, replace the URL with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
inputs = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Print the error details if the call fails.
    print(f"Request failed: {resp.code} - {resp.message}")

Generate a text embedding

import dashscope
import json
from http import HTTPStatus

text = "General multimodal embedding model example"
inputs = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="multimodal-embedding-v1",
    input=inputs
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    # Print the error details if the call fails.
    print(f"Request failed: {resp.code} - {resp.message}")

Sample response

{
    "status_code": 200,
    "request_id": "b5623e99-ea0c-9509-9b25-20bcc99d03e9",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.020782470703125,
                    -0.01399993896484375,
                    -0.0229949951171875,
                    ...
                ],
                "type": "text"
            }
        ]
    },
    "usage": {
        "input_tokens": 12,
        "image_tokens": 0
    }
}

Error codes

If a call fails, see 429-Error messages for troubleshooting.
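As a sketch of client-side triage, assuming the error fields shown earlier (code, message, request_id), a parsed response dictionary can be summarized like this:

```python
def summarize_error(resp: dict) -> str:
    # A failed call returns "code" and "message"; a successful one omits them.
    code = resp.get("code")
    if not code:
        return "success"
    # Include request_id so the failure can be traced and troubleshot.
    return f"{code}: {resp.get('message', '')} (request_id={resp.get('request_id', '')})"

# Example error response from the documentation above.
error = {
    "code": "InvalidApiKey",
    "message": "Invalid API-key provided.",
    "request_id": "fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1",
}
print(summarize_error(error))
# → InvalidApiKey: Invalid API-key provided. (request_id=fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1)
```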