
Alibaba Cloud Model Studio: Multimodal embeddings API

Last Updated: Mar 26, 2026

A multimodal embedding model converts text, images, or videos into a unified, high-dimensional floating-point vector representation with customizable dimensionality, for use in cross-modal retrieval, text-to-image search, video classification, and image classification.

Core capabilities

  • Cross-modal retrieval: Enables semantic search across modalities, such as text-to-image, image-to-video, and image-to-image search.

  • Semantic similarity: Measures semantic similarity between content from different modalities within a unified vector space.

  • Content classification and clustering: Groups, tags, and clusters content based on its semantic embeddings.

Key feature: Embeddings generated from all modalities (text, images, and video) reside in the same semantic space. This enables direct cross-modal matching and comparison using methods like cosine similarity. For more information on model selection and application methods, see text and multimodal embedding.
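Because all modalities share one vector space, cross-modal matching reduces to an ordinary vector comparison. The following sketch computes cosine similarity in pure Python; the two sample vectors are short placeholders standing in for a text embedding and an image embedding, not real model output:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors; real embeddings have hundreds or thousands of dimensions.
text_vec = [0.1, 0.3, -0.2, 0.5]
image_vec = [0.2, 0.1, -0.1, 0.4]

print(round(cosine_similarity(text_vec, image_vec), 4))
```

A score close to 1.0 indicates strong semantic similarity; scores near 0 indicate unrelated content.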
Important

This model service is only available in the China (Beijing) region. Use an API key from this region to call the service.

Vector types

The multimodal vector model supports two vector generation methods:

  • Multimodal independent vectors: The model generates an independent vector for each input (such as text, an image, a video, or multiple images) in the contents array. For example, if you provide one piece of text and one image, the model returns two independent vectors. This method is suitable for scenarios that require comparing different types of content individually, such as image-to-image or text-to-image search.

  • Multimodal fused embedding: Fuses all inputs in contents into a single embedding to create a comprehensive, cross-modal semantic representation. This is ideal for scenarios that require a comprehensive understanding of multimodal content, such as fusing product images and descriptive text into a unified representation for retrieval. For qwen3-vl-embedding, you can enable the fusion mode by setting enable_fusion=true. Fused embeddings support the following combinations:

    • Text and image

    • Text and video

    • Multiple images and text (by passing multiple image entries)

    • Image, video, and text

qwen2.5-vl-embedding supports only fused embeddings, not independent embeddings. tongyi-embedding-vision-plus and tongyi-embedding-vision-flash support only independent embeddings.

For model descriptions, selection guidance, and usage instructions, see text and multimodal vectorization.

Model overview

Singapore

tongyi-embedding-vision-plus
  • Embedding dimensions: 1152 (default), 1024, 512, 256, 128, 64
  • Text length limit: 1,024 tokens
  • Image size limit: up to 3 MB per image
  • Video size limit: up to 10 MB per video file
  • Price: Image/Video: $0.09; Text: $0.09
  • Free quota (Note): 1 million tokens, valid for 90 days after activating Model Studio

tongyi-embedding-vision-flash
  • Embedding dimensions: 768 (default), 512, 256, 128, 64
  • Text, image, and video limits: same as tongyi-embedding-vision-plus
  • Price: Image/Video: $0.03; Text: $0.09
  • Free quota (Note): same as tongyi-embedding-vision-plus

Beijing

qwen3-vl-embedding
  • Embedding dimensions: 2560 (default), 2048, 1536, 1024, 768, 512, 256
  • Text length limit: 32,000 tokens
  • Image size limit: max. 1 image, up to 5 MB
  • Video size limit: up to 50 MB per video file
  • Price: Image/Video: $0.258; Text: $0.1

multimodal-embedding-v1
  • Embedding dimensions: 1024 (fixed)
  • Text length limit: 512 tokens
  • Image size limit: up to 8 images, 3 MB each
  • Video size limit: up to 10 MB per video file
  • Price: free trial

Input format and language limits

Fused multimodal model

qwen3-vl-embedding
  • Text: supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German.
  • Image: JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, and SGI (URL or Base64 supported).
  • Video: MP4, AVI, and MOV (URL only).
  • Request limit: max 20 elements per request (up to 5 images).

Independent multimodal model

tongyi-embedding-vision-plus and tongyi-embedding-vision-flash
  • Text: Chinese and English.
  • Image: JPG, PNG, and BMP (URL or Base64 supported).
  • Video: MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only).
  • Request limit: no element count limit; requests are limited by the token count per batch.

multimodal-embedding-v1
  • Text, image, and video formats: same as above.
  • Request limit: up to 20 content elements per request, including a maximum of 1 image and 1 video.

All models support text, image, and video inputs, individually or in combination. The tongyi-embedding-vision-plus and tongyi-embedding-vision-flash models also support multi_images input for image sequences.

Model capabilities

qwen3-vl-embedding
  • Default dimension: 2,560
  • Type: independent / fusion
  • Supported inputs: text, image, video, multi_images
  • Description: Fusion mode, enabled with the enable_fusion parameter, combines multimodal inputs into a single vector.

tongyi-embedding-vision-plus
  • Default dimension: 1,152
  • Type: independent only
  • Supported inputs: text, image, video, multi_images
  • Description: supports multi_images sequences with up to 8 images.

tongyi-embedding-vision-flash
  • Default dimension: 768
  • Type, inputs, and description: same as tongyi-embedding-vision-plus

multimodal-embedding-v1
  • Default dimension: 1,024
  • Supported inputs: text, image, video
  • Description: the dimension parameter is not supported; the dimension is fixed at 1,024.

Prerequisites

Get an API key and export it as an environment variable. If you make calls through an SDK, install the DashScope SDK.

HTTP call

POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding

Request

Multimodal independent embedding

The following example uses the tongyi-embedding-vision-plus model to generate independent embeddings (one embedding for each input). You can replace this with other model names. The multi_images type is supported only by tongyi-embedding-vision-plus and tongyi-embedding-vision-flash. The qwen3-vl-embedding model also supports a fused embedding mode, which is enabled by setting enable_fusion=true. For details, see the "Multimodal fused embedding" tab.
curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [ 
            {"text": "Multimodal embedding model"},
            {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"},
            {"multi_images": [
                "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
                "https://img.alicdn.com/imgextra/i2/O1CN01dSYhpw1nSoamp31CD_!!6000000005089-2-tps-1765-1639.png"
                ]
              }
        ]
    }
}'

Multimodal fused embedding

The qwen3-vl-embedding model supports fused embedding generation. Set enable_fusion=true to fuse all inputs into a single embedding. This supports various combinations, such as text with an image, text with a video, multiple images with text, and a mix of an image, a video, and text. The following example demonstrates a mixed fusion of multiple images, a video, and text.
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-embedding",
    "input": {
        "contents": [
            {"text": "Product description text"},
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"},
            {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"}
        ]
    },
    "parameters": {
        "enable_fusion": true
    }
}'

Request headers

Content-Type string (Required)

The content type of the request. Must be application/json.

Authorization string (Required)

The authentication credentials using a Model Studio API key.

Example: Bearer sk-xxxx

Request body

model string(Required)

The model name. Set this to a model name from the Model overview.

input object (Required)

The input content.

Properties

contents array(Required)

A list of content to be processed. Each element is a dictionary or a string that specifies the content type and value. The format is {"modality type": "an input string, or an image or video URL"}. The supported modality types are text, image, video, and multi_images.

The qwen3-vl-embedding model supports both independent and fused embedding generation. To generate a fused embedding, set the boolean parameter enable_fusion to true. The qwen2.5-vl-embedding model supports only fused embeddings.
  • Text: The key is text and the value is a string. You can also pass the string directly without an object.

  • Image: The key is image. The value can be a publicly accessible URL or a Base64-encoded Data URI. The Base64 format is data:image/{format};base64,{data}, where {format} is the image format (e.g., jpeg or png), and {data} is the Base64-encoded string.

  • Multiple images: This type is supported only by the tongyi-embedding-vision-plus and tongyi-embedding-vision-flash models. The key is multi_images, and the value is a list of images, with each item following the image format described above.

  • Video: The key is video. The value must be a publicly accessible URL.
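The element formats above can be sketched as plain Python structures. The URLs below are hypothetical placeholders; replace them with your own publicly accessible resources:

```python
# One entry per modality; each entry is a single-key dict.
contents = [
    {"text": "Multimodal embedding model"},            # plain string
    {"image": "https://example.com/sample.png"},       # public URL or Base64 Data URI
    {"video": "https://example.com/sample.mp4"},       # public URL only
    {"multi_images": [                                 # tongyi-embedding-vision-plus/flash only
        "https://example.com/front.png",
        "https://example.com/back.png",
    ]},
]

# Each element must carry exactly one supported modality key.
valid_keys = {"text", "image", "video", "multi_images"}
assert all(len(c) == 1 and next(iter(c)) in valid_keys for c in contents)
print(len(contents))
```

This list is what you place under input.contents in an HTTP request body, or pass directly as the SDK's input argument.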

parameters object (Optional)

Parameters for embedding generation. For HTTP calls, these must be wrapped in the parameters object. For SDK calls, you can use these parameters directly.

Properties

output_type string Optional

Specifies the format of the output embedding. Currently, only dense is supported.

dimension integer Optional

Specifies the dimension of the output embedding. Supported values vary by model:

  • qwen3-vl-embedding: Supports 2,560, 2,048, 1,536, 1,024, 768, 512, and 256. The default is 2,560.

  • tongyi-embedding-vision-plus: Does not support this parameter. It returns an embedding with a fixed dimension of 1,152.

  • tongyi-embedding-vision-flash: Does not support this parameter. It returns an embedding with a fixed dimension of 768.

  • multimodal-embedding-v1: Does not support this parameter. It returns an embedding with a fixed dimension of 1,024.
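As a quick reference, the per-model rules above can be encoded in a small lookup. This is a hypothetical client-side helper for illustration, not part of the DashScope SDK:

```python
# Supported `dimension` values per model, per the list above.
SUPPORTED_DIMENSIONS = {
    "qwen3-vl-embedding": {2560, 2048, 1536, 1024, 768, 512, 256},
}
# Models that reject the parameter and always return a fixed dimension.
FIXED_DIMENSIONS = {
    "tongyi-embedding-vision-plus": 1152,
    "tongyi-embedding-vision-flash": 768,
    "multimodal-embedding-v1": 1024,
}

def check_dimension(model, dimension=None):
    """Raise ValueError if `dimension` is not valid for `model`."""
    if model in FIXED_DIMENSIONS and dimension is not None:
        raise ValueError(f"{model} does not support the dimension parameter "
                         f"(fixed at {FIXED_DIMENSIONS[model]})")
    if model in SUPPORTED_DIMENSIONS and dimension is not None \
            and dimension not in SUPPORTED_DIMENSIONS[model]:
        raise ValueError(f"{model} does not support dimension={dimension}")

check_dimension("qwen3-vl-embedding", 1024)  # valid, no error
```

Validating the dimension before sending the request avoids a round trip that would fail with a parameter error.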

fps float Optional

Controls how many video frames are sampled. A smaller value extracts fewer frames. The range is [0, 1]. The default is 1.0.

instruct string Optional

Provide a custom task description to help the model understand the query intent. English descriptions are recommended and can improve performance by 1% to 5%.

enable_fusion bool Optional

Specifies whether to generate a fused embedding. This parameter is supported only by the qwen3-vl-embedding model. If set to true, all multimodal content in the contents array is fused into a single embedding. The default is false, which generates an independent embedding for each modality. Fused embeddings support combinations like text with an image, text with a video, multiple images with text (by passing multiple image items), and a mix of an image, a video, and text. This is suitable for retrieval use cases that require a comprehensive understanding of multimodal content.
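Taken together, a full parameters object combining the options above can be sketched as follows. The values are illustrative, and the instruct string is a hypothetical task description:

```python
# Illustrative `parameters` object for an HTTP call.
# SDK calls pass these directly as keyword arguments instead.
parameters = {
    "output_type": "dense",   # only supported value
    "dimension": 1024,        # qwen3-vl-embedding only
    "fps": 0.5,               # sample fewer video frames
    "instruct": "Retrieve product images matching the query.",  # hypothetical
    "enable_fusion": True,    # qwen3-vl-embedding only
}

assert 0 <= parameters["fps"] <= 1
print(sorted(parameters))
```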

Response

Successful response

{
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.026611328125,
                    -0.016571044921875,
                    -0.02227783203125,
                    ...
                ],
                "type": "text"
            },
            {
                "index": 1,
                "embedding": [
                    0.051544189453125,
                    0.007717132568359375,
                    0.026611328125,
                    ...
                ],
                "type": "image"
            },
            {
                "index": 2,
                "embedding": [
                    -0.0217437744140625,
                    -0.016448974609375,
                    0.040679931640625,
                    ...
                ],
                "type": "video"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "image_tokens": 896
    },
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}
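A successful response can be unpacked as ordinary JSON. The sketch below indexes the embeddings by their type field, using a truncated stand-in for the payload above (the real embedding arrays are much longer):

```python
import json

# Truncated stand-in for the successful response shown above.
response_text = """
{
    "output": {
        "embeddings": [
            {"index": 0, "embedding": [-0.0266, -0.0165], "type": "text"},
            {"index": 1, "embedding": [0.0515, 0.0077], "type": "image"},
            {"index": 2, "embedding": [-0.0217, -0.0164], "type": "video"}
        ]
    },
    "usage": {"input_tokens": 10, "image_tokens": 896},
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}
"""

data = json.loads(response_text)
# Group embeddings by modality type for downstream use.
by_type = {e["type"]: e["embedding"] for e in data["output"]["embeddings"]}
print(sorted(by_type))  # ['image', 'text', 'video']
```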

Error response

{
    "code":"InvalidApiKey",
    "message":"Invalid API-key provided.",
    "request_id":"fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1"
}

output object

The result of the task.

Properties

embeddings array

A list of embedding results. Each object corresponds to an item in the input list.

Properties

index int

The index of the item in the input list.

embedding array

The generated embedding array. The dimension depends on the model and the dimension parameter.

type string

The type of content that this embedding represents: text, image, video, multi_images, fused, or vl. The vl type is returned only when using the qwen3-vl-embedding model.

request_id string

Unique identifier for the request. Use it to trace and troubleshoot issues.

code string

The error code. Returned only when the request fails. See error codes for details.

message string

Detailed error message. Returned only when the request fails. See error codes for details.

usage object

Statistics on token usage.

Properties

input_tokens int

The number of tokens in the input content.

image_tokens int

The number of tokens for the image or video input. For video inputs, the system samples frames up to a configured limit and then calculates tokens based on the processing results.

image_count int

The number of images in the input.

duration int

The duration of the video input in seconds.

SDK usage

When you call the service through the SDK, the request body structure differs from that of a native HTTP call: the SDK's input parameter corresponds to input.contents in the HTTP request body.
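The mapping can be sketched with plain data structures: the list you pass as the SDK's input argument is exactly the input.contents array of the HTTP body. The image URL here is a placeholder:

```python
contents = [
    {"text": "Multimodal embedding model"},
    {"image": "https://example.com/sample.png"},  # placeholder URL
]

# HTTP request body: contents is wrapped in "input",
# and any options go under a separate "parameters" object.
http_body = {
    "model": "tongyi-embedding-vision-plus",
    "input": {"contents": contents},
}

# The equivalent SDK call passes the same list directly:
#   dashscope.MultiModalEmbedding.call(
#       model="tongyi-embedding-vision-plus", input=contents)
assert http_body["input"]["contents"] is contents
print("SDK input == HTTP input.contents")
```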

Code examples

Image embedding

Image URL

import dashscope
import json
from http import HTTPStatus
# Replace this with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Local image

To generate an image embedding from a local image, first convert the image to Base64 format, as shown in the following example.

import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and convert it to Base64.
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format.
image_format = "png"  # Modify this based on the actual format, such as jpg or bmp.
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data
input = [{'image': image_data}]

# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Video embedding

The multimodal embedding model currently accepts video input only via URL. Passing local video files directly is not supported.
import dashscope
import json
from http import HTTPStatus
# Replace this with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
    

Text embedding

import dashscope
import json
from http import HTTPStatus

text = "General multimodal representation model example"
input = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))

Fused embedding

import dashscope
import json
import os

# Multimodal fused embedding: Combines text, image, and video into a single fused embedding.
# Ideal for applications like cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"

# The input contains text, an image, and a video. Set enable_fusion=True to generate a single fused embedding.
input_data = [
    {"text": text},
    {"image": image},
    {"video": video}
]

# Use qwen3-vl-embedding to generate a fused embedding.
resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True,
    # Optional: specify the embedding dimension (SDK calls pass parameters directly as keyword arguments).
    # Supported values are 2560, 2048, 1536, 1024, 768, 512, and 256. The default is 2560.
    # dimension=1024,
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

Multi-image fused embedding

This example shows how to use the qwen3-vl-embedding model to fuse multiple images and text into a single embedding. To perform this multi-image fusion, pass multiple image items. This method is ideal for comprehensive semantic retrieval of products using multi-angle images and text descriptions.

import dashscope
import json
import os

# Multi-image + text fused embedding: Fuses multiple product images and a description text into a single embedding.
# Suitable for comprehensive semantic retrieval using multi-angle product images and a text description.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image1 = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
image2 = "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"

# Pass multiple image items for multi-image fusion. `enable_fusion=True` fuses all inputs into a single embedding.
input_data = [
    {"text": text},
    {"image": image1},
    {"image": image2}
]

resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

Output

{
    "status_code": 200,
    "request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.009490966796875,
                    -0.024871826171875,
                    -0.031280517578125,
                    ...
                ],
                "type": "text"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "input_tokens_details": {
            "image_tokens": 0,
            "text_tokens": 10
        },
        "output_tokens": 1,
        "total_tokens": 11
    }
}

Error codes

If the model call fails and returns an error message, see Error messages for resolution.