Alibaba Cloud Model Studio: Multimodal embedding API details

Last Updated: Feb 05, 2026

Multimodal embedding models transform text, images, or videos into floating-point vectors in a unified semantic space. These models are suitable for tasks such as video classification, image classification, and text-to-image search.

Core capabilities

  • Cross-modal retrieval: Perform cross-modal semantic searches, such as text-to-image search, image-to-video search, and search by image.

  • Semantic similarity calculation: Measure the semantic similarity between content of different modalities in a unified vector space.

  • Content classification and clustering: Perform intelligent grouping, tagging, and clustering analysis based on the semantic embeddings of content.

Key attribute: The embeddings generated from all modalities (text, image, and video) exist in the same semantic space. This allows for direct cross-modal matching and comparison using methods such as cosine similarity. For more information about model selection and application methods, see Text and multimodal embedding.
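
For example, once you have the embedding of a text query and the embedding of an image, you can rank candidates by cosine similarity. The following is a minimal sketch, not part of the API itself; it assumes you have already extracted two embedding lists from a response and uses NumPy:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors. In practice, use the `embedding` fields returned by the API.
text_embedding = [0.12, -0.03, 0.55]
image_embedding = [0.10, -0.01, 0.60]
print(cosine_similarity(text_embedding, image_embedding))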
Important

Model availability varies by region (see the Model overview below). You must use an API key from the same region as the model that you call.

Model overview

Singapore

| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1M input tokens) | Free quota (Note) |
| --- | --- | --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | 1152, 1024, 512, 256, 128, 64 | 1,024 tokens | A single file cannot be larger than 3 MB | Video file size up to 10 MB | Image/Video: $0.09; Text: $0.09 | 1 million tokens; valid for 90 days after activating Model Studio |
| tongyi-embedding-vision-flash | 768, 512, 256, 128, 64 | 1,024 tokens | A single file cannot be larger than 3 MB | Video file size up to 10 MB | Image/Video: $0.03; Text: $0.09 | 1 million tokens; valid for 90 days after activating Model Studio |

Beijing

| Model | Embedding dimensions | Text length limit | Image size limit | Video size limit | Price (per 1M input tokens) |
| --- | --- | --- | --- | --- | --- |
| qwen3-vl-embedding | 2560, 2048, 1536, 1024, 768, 512, 256 | 32,000 tokens | Max 1 image, up to 5 MB | Video file size up to 50 MB | Image/Video: $0.258; Text: $0.1 |
| multimodal-embedding-v1 | 1024 | 512 tokens | Max 8 images, up to 3 MB each | Video file size up to 10 MB | Free trial |

Input format and language limits

Fused multimodal embedding models

| Model | Text | Image | Video | Max elements per request |
| --- | --- | --- | --- | --- |
| qwen3-vl-embedding | Supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German | JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, SGI (URL or Base64) | MP4, AVI, MOV (URL only) | At most 20 content elements per request; images, text, and videos share this limit. |

Independent multimodal embedding models

| Model | Text | Image | Video | Max elements per request |
| --- | --- | --- | --- | --- |
| tongyi-embedding-vision-plus | Chinese/English | JPG, PNG, BMP (URL or Base64) | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, MKV (URL only) | No limit on the number of content elements; the total token count must not exceed the token limit. |
| tongyi-embedding-vision-flash | Chinese/English | JPG, PNG, BMP (URL or Base64) | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, MKV (URL only) | No limit on the number of content elements; the total token count must not exceed the token limit. |
| multimodal-embedding-v1 | Chinese/English | JPG, PNG, BMP (URL or Base64) | MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, MKV (URL only) | At most 20 content elements per request: max 1 image, 1 video, and 20 text entries, sharing the total limit. |

The API supports uploading a single text segment, a single image, or a single video file. It also allows combinations of different types, such as text and an image. Some models support multiple inputs of the same content type, such as multiple images. For more information, see the limits for the specific model.

Prerequisites

You must obtain an API key and set it as an environment variable. If you call the service through an SDK, you must also install the DashScope SDK.
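
If you prefer to configure the key in code rather than through the environment variable, the DashScope Python SDK also accepts an explicit key. The snippet below is a minimal sketch; sk-xxx is a placeholder for your own Model Studio API key.

import os
import dashscope

# The SDK reads DASHSCOPE_API_KEY from the environment by default.
# To set the key explicitly, assign it here (placeholder shown).
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY", "sk-xxx")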

HTTP call

POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding

Request

Multimodal independent embeddings

In this mode, each content element receives its own embedding. qwen3-vl-embedding can be used in two ways: if you input text, an image, and a video together, it generates one fused embedding; if you input them as separate elements, it generates an independent embedding for each item. The example below passes separate elements to tongyi-embedding-vision-plus, which supports only independent embeddings.
curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [ 
            {"text": "Multimodal embedding model"},
            {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"},
            {"multi_images": [
                "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
                "https://img.alicdn.com/imgextra/i2/O1CN01dSYhpw1nSoamp31CD_!!6000000005089-2-tps-1765-1639.png"
                ]
              }
        ]
    }
}'

Multimodal fused embedding

qwen3-vl-embedding can be used in two ways: if you input text, an image, and a video together (as shown in the code example below), it generates one fused embedding. If you input them separately, it generates an independent embedding for each item.
curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-embedding",
    "input": {
        "contents": [
            {"text": "This is a test text for generating a multimodal fused embedding",
             "image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png",
             "video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
            }
        ]
    },
    "parameters": {
        "dimension": 1024,
        "output_type": "dense",
        "fps": 0.5
    }
}'

Request headers

Content-Type string (Required)

The content type of the request. Must be application/json.

Authorization string (Required)

The authentication credentials using a Model Studio API key.

Example: Bearer sk-xxxx

Request body

model string (Required)

The model name. Set this to a model name from the Model overview.

input object (Required)

The input content.

Properties

contents array (Required)

A list of content to process. Each element is a dictionary or a string that specifies the content type and value. The format is `{"modality_type": "input_string_or_image/video_url"}`. The `text`, `image`, `video`, and `multi_images` modality types are supported; a sketch of each format follows the list below.

qwen3-vl-embedding supports both fused and independent embedding generation. When `text`, `image`, and `video` are placed in the same object, a fused embedding is generated. When they are separated into independent elements, an embedding is generated for each element. qwen2.5-vl-embedding supports only fused embeddings and does not support independent embeddings.
  • Text: The key is text. The value is a string. You can also pass the string directly without a dictionary.

  • Image: The key is image. The value can be a publicly accessible URL or a Base64-encoded Data URI. The Base64 format is data:image/{format};base64,{data}, where {format} is the image format (such as jpeg or png) and {data} is the Base64-encoded string.

  • Multiple images: Only the tongyi-embedding-vision-plus and tongyi-embedding-vision-flash models support this type. The key is multi_images. The value is a list of images. Each item is an image that follows the format described above. You can include up to 8 images.

  • Video: The key is video. The value must be a publicly accessible URL.
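
The following sketch shows each format as a Python list that could be sent as input.contents (or as the SDK input parameter). All URLs and the Base64 string are placeholders.

# Each element shows one supported content format. URLs and Base64 data are placeholders.
contents = [
    "A plain text query",                                # text passed directly as a string
    {"text": "Multimodal embedding model"},              # text as a dictionary
    {"image": "https://example.com/sample.jpg"},         # publicly accessible image URL
    {"image": "data:image/png;base64,iVBORw0KGgo..."},   # Base64 Data URI (truncated)
    {"multi_images": [                                   # up to 8 images; tongyi-embedding-vision-plus/flash only
        "https://example.com/a.png",
        "https://example.com/b.png",
    ]},
    {"video": "https://example.com/clip.mp4"},           # publicly accessible video URL
]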

parameters object (Optional)

Parameters for vector processing. For HTTP calls, these parameters must be wrapped in the `parameters` object. For SDK calls, you can use these parameters directly.

Properties

output_type string Optional

Specifies the output vector format. Currently, only `dense` is supported.

dimension integer Optional

Specifies the dimension of the output vector. The supported values vary by model:

  • qwen3-vl-embedding supports 2,560, 2,048, 1,536, 1,024, 768, 512, and 256. The default value is 2,560.

  • tongyi-embedding-vision-plus supports 64, 128, 256, 512, 1,024, and 1,152. The default value is 1,152.

  • tongyi-embedding-vision-flash supports 64, 128, 256, 512, and 768. The default value is 768.

  • multimodal-embedding-v1 does not support this parameter. It returns a fixed 1,024-dimensional vector.

fps float Optional

Controls the frame sampling rate for video input, in frames per second. A smaller value extracts fewer frames. The range is [0, 1]. The default value is 1.0.

instruct string Optional

Adds a custom task description to help the model understand the query intent. Write the description in English; a suitable instruction can improve performance by approximately 1% to 5%.
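
Putting these options together, the following sketch shows an illustrative `parameters` object for an HTTP request body. The values are examples only; choose a dimension that your model supports, and the instruct text here is a hypothetical task description.

# Illustrative `parameters` object for an HTTP request body. Values are examples only.
parameters = {
    "dimension": 1024,        # output vector dimension (must be supported by the chosen model)
    "output_type": "dense",   # currently the only supported output format
    "fps": 0.5,               # sample video frames at 0.5 frames per second
    "instruct": "Retrieve product images that match the query.",  # hypothetical task description in English
}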

Response

Successful response

{
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.026611328125,
                    -0.016571044921875,
                    -0.02227783203125,
                    ...
                ],
                "type": "text"
            },
            {
                "index": 1,
                "embedding": [
                    0.051544189453125,
                    0.007717132568359375,
                    0.026611328125,
                    ...
                ],
                "type": "image"
            },
            {
                "index": 2,
                "embedding": [
                    -0.0217437744140625,
                    -0.016448974609375,
                    0.040679931640625,
                    ...
                ],
                "type": "video"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "image_tokens": 896
    },
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}

Error response

{
    "code":"InvalidApiKey",
    "message":"Invalid API-key provided.",
    "request_id":"fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1"
}

output object

The task output information.

Properties

embeddings array

A list of vector results. Each object corresponds to an element in the input list.

Properties

index int

The index of the result in the input list.

embedding array

The generated embedding.

type string

The input type corresponding to the result: `text`, `image`, `video`, `multi_images`, or `vl` (this type is returned only when using qwen3-vl-embedding).

request_id string

The unique identifier of the request. Use it for tracing and troubleshooting.

code string

The error code. Returned only when the request fails. See error codes for details.

message string

Detailed error message. Returned only when the request fails. See error codes for details.

usage object

Statistics about the output.

Properties

input_tokens int

The number of tokens in the input for this request.

image_tokens int

The number of tokens for the images or videos in this request. The system samples frames from the input video, with the maximum number of frames controlled by the system configuration. Tokens are then calculated based on the processing result.

image_count int

The number of images in this request.

duration int

The duration of the video in this request (in seconds).
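
As a minimal sketch of how a client might consume this structure, the following function collects the returned vectors by input index and type. It assumes resp_json is the parsed JSON body of a successful response.

import json

def extract_embeddings(resp_json):
    # Map (index, type) -> embedding from a successful response body.
    return {
        (item["index"], item["type"]): item["embedding"]
        for item in resp_json["output"]["embeddings"]
    }

# Usage sketch, where resp_text is the raw JSON string of a successful response:
# vectors = extract_embeddings(json.loads(resp_text))
# print(len(vectors[(0, "text")]), "dimensions")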

SDK usage

The request body structure for the current SDK version is different from that of a native HTTP call. The `input` parameter in the SDK corresponds to `input.contents` in the HTTP call.
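
For example, an HTTP body whose input.contents holds a text element and an image element corresponds to the following SDK call (a sketch; the image URL is a placeholder):

import dashscope

# HTTP body excerpt:
#   "input": {"contents": [{"text": "query"}, {"image": "https://example.com/a.png"}]}
# Equivalent SDK call: pass the contents list directly as `input`.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=[{"text": "query"}, {"image": "https://example.com/a.png"}],
)
print(resp.status_code)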

Code examples

Generate image embeddings

Use an image URL

import dashscope
import json
from http import HTTPStatus
# Replace the URL with the URL of your image.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Use a local image

You can use the following sample code to convert a local image to Base64 format and then call the model for embedding. This example uses tongyi-embedding-vision-plus.

import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and convert to Base64.
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png"  # Modify this based on the actual format, such as jpg or bmp.
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data.
input = [{'image': image_data}]

# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Generate video embeddings

The multimodal embedding model currently supports video file input only as a URL. Passing local videos directly is not supported.
import dashscope
import json
from http import HTTPStatus
# Replace the URL with the URL of your video.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
    

Generate text embeddings

import dashscope
import json
from http import HTTPStatus

text = "General multimodal representation model example"
input = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Generate fused embeddings

import dashscope
import json
import os
from http import HTTPStatus

# Multimodal fused embedding: Fuses text, an image, and a video into a single fused embedding.
# Suitable for scenarios such as cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"

# The input contains text, an image, and a video. The model will fuse them into a single fused embedding.
input_data = [
    {
        "text": text,
        "image": image,
        "video": video
    }
]

# Use qwen3-vl-embedding to generate the fused embedding.
resp = dashscope.MultiModalEmbedding.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    # Optional parameter: Specify the vector dimension (supported values: 2560, 2048, 1536, 1024, 768, 512, 256; default: 2560)
    # parameters={"dimension": 1024}
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

Output example

{
    "status_code": 200,
    "request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.009490966796875,
                    -0.024871826171875,
                    -0.031280517578125,
                    ...
                ],
                "type": "text"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "input_tokens_details": {
            "image_tokens": 0,
            "text_tokens": 10
        },
        "output_tokens": 1,
        "total_tokens": 11
    }
}

Error codes

If a call fails, see Error messages for troubleshooting.