
Alibaba Cloud Model Studio: Multimodal embeddings API

Last Updated: Mar 30, 2026

Multimodal embedding models transform text, images, or videos into unified vector representations suitable for tasks like video classification, image classification, and text-to-image search.

Core capabilities

  • Cross-modal retrieval: Search across different content types, like text-to-image, image-to-video, or image search.

  • Semantic similarity: Calculate similarity between different content types in a unified vector space.

  • Content classification and clustering: Group, tag, and cluster content based on semantic embeddings.

Key attribute: All modalities (text, image, video) generate embeddings in the same semantic space, enabling direct cross-modal matching and comparison using methods like cosine similarity. For more information about model selection and application methods, see Text and multimodal embedding.
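Because all modalities share one semantic space, cross-modal matching reduces to a plain vector operation. The following is a minimal sketch of cosine similarity over two embeddings (no API call; the toy vectors stand in for real model output):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); closer to 1 means more semantically similar.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a text embedding and an image embedding.
text_vec = [0.1, 0.3, 0.5]
image_vec = [0.2, 0.2, 0.6]
score = cosine_similarity(text_vec, image_vec)
```

In practice, a and b are embedding arrays returned by the API; for retrieval, rank candidates by score in descending order.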
Important

This model service is only available in the China (Beijing) region. Use an API key from this region to call the service.

Vector types

The multimodal embedding model supports two vector generation methods:

  • Multimodal independent vectors: The model generates an independent vector for each input (such as text, an image, a video, or multiple images) in the contents array. For example, if you provide one piece of text and one image, the model returns two independent vectors. This method is suitable for scenarios that require comparing different types of content individually, such as image-to-image or text-to-image search.

  • Multimodal fused embedding: Fuses all inputs in contents into a single embedding for cross-modal semantic representation. Ideal for scenarios that require a comprehensive understanding of multimodal content, such as fusing product images and descriptive text into a unified representation for retrieval. For qwen3-vl-embedding, enable fusion mode by setting enable_fusion=true. Fused embeddings support the following combinations:

    • Text and image

    • Text and video

    • Multiple images and text (by passing multiple image entries)

    • Image, video, and text

qwen2.5-vl-embedding supports only fused embeddings, not independent embeddings. tongyi-embedding-vision-plus and tongyi-embedding-vision-flash support only independent embeddings.

For model descriptions, selection guidance, and usage, see Text and multimodal embedding.

Model overview

Singapore

tongyi-embedding-vision-plus

  • Embedding dimensions: 1152 (default), 1024, 512, 256, 128, 64
  • Text length limit: 1,024 tokens
  • Image size limit: Up to 3 MB per image
  • Video size limit: Up to 10 MB per video file
  • Price: Image/Video: $0.09; Text: $0.09
  • Free quota (Note): 1 million tokens, valid for 90 days after activating Model Studio

tongyi-embedding-vision-flash

  • Embedding dimensions: 768 (default), 512, 256, 128, 64
  • Text length limit: 1,024 tokens
  • Image size limit: Up to 3 MB per image
  • Video size limit: Up to 10 MB per video file
  • Price: Image/Video: $0.03; Text: $0.09
  • Free quota (Note): 1 million tokens, valid for 90 days after activating Model Studio

Beijing

qwen3-vl-embedding

  • Embedding dimensions: 2560 (default), 2048, 1536, 1024, 768, 512, 256
  • Text length limit: 32,000 tokens
  • Image size limit: Max. 1 image, up to 5 MB
  • Video size limit: Up to 50 MB per video file
  • Price: Image/Video: $0.258; Text: $0.1

multimodal-embedding-v1

  • Embedding dimensions: 1024
  • Text length limit: 512 tokens
  • Image size limit: Up to 8 images, 3 MB each
  • Video size limit: Up to 10 MB per video file
  • Price: Free trial

Input format and language limits

Fused multimodal model

qwen3-vl-embedding

  • Text: Supports 33 major languages, including Chinese, English, Japanese, Korean, French, and German.
  • Image: JPEG, PNG, WEBP, BMP, TIFF, ICO, DIB, ICNS, and SGI (URL or Base64 supported).
  • Video: MP4, AVI, and MOV (URL only).
  • Request limit: Max 20 elements per request (up to 5 images).

Independent multimodal model

tongyi-embedding-vision-plus and tongyi-embedding-vision-flash

  • Text: Chinese and English
  • Image: JPG, PNG, and BMP (URL or Base64 supported).
  • Video: MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only).
  • Request limit: No element count limit; requests are limited by the token count per batch.

multimodal-embedding-v1

  • Text: Chinese and English
  • Image: JPG, PNG, and BMP (URL or Base64 supported).
  • Video: MP4, MPEG, AVI, MOV, MPG, WEBM, FLV, and MKV (URL only).
  • Request limit: Up to 20 content elements per request, including a maximum of 1 image and 1 video.

All models support text, image, and video inputs, individually or in combination. The tongyi-embedding-vision-plus and tongyi-embedding-vision-flash models also support multi_images input for image sequences.

Model capabilities

qwen3-vl-embedding

  • Default dimension: 2,560
  • Type: Independent / Fusion
  • Supported inputs: text, image, video, multi_images
  • Description: Fusion mode, enabled with the enable_fusion parameter, combines multimodal inputs into a single vector.

tongyi-embedding-vision-plus

  • Default dimension: 1,152
  • Type: Independent only
  • Supported inputs: text, image, video, multi_images
  • Description: Supports multi_images sequences with up to 8 images.

tongyi-embedding-vision-flash

  • Default dimension: 768
  • Type: Independent only
  • Supported inputs: text, image, video, multi_images
  • Description: Supports multi_images sequences with up to 8 images.

multimodal-embedding-v1

  • Default dimension: 1,024
  • Type: Independent only
  • Supported inputs: text, image, video
  • Description: The dimension parameter is not supported. The dimension is fixed at 1,024.

Prerequisites

Get an API key and export the API key as an environment variable. If you use an SDK to make calls, install the DashScope SDK.

HTTP call

POST https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding

Request

Multimodal independent embedding

The following example uses the tongyi-embedding-vision-plus model to generate independent embeddings (one embedding for each input). You can replace this with other model names. The multi_images type is supported only by tongyi-embedding-vision-plus and tongyi-embedding-vision-flash. The qwen3-vl-embedding model also supports a fused embedding mode, which is enabled by setting enable_fusion=true. For details, see the "Multimodal fused embedding" tab.
curl --silent --location --request POST 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [ 
            {"text": "Multimodal embedding model"},
            {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"},
            {"multi_images": [
                "https://img.alicdn.com/imgextra/i2/O1CN019eO00F1HDdlU4Syj5_!!6000000000724-2-tps-2476-1158.png",
                "https://img.alicdn.com/imgextra/i2/O1CN01dSYhpw1nSoamp31CD_!!6000000005089-2-tps-1765-1639.png"
                ]
              }
        ]
    }
}'

Multimodal fused embedding

The qwen3-vl-embedding model supports fused embedding generation. Set enable_fusion=true to fuse all inputs into a single embedding. This supports various combinations, such as text with an image, text with a video, multiple images with text, and a mix of an image, a video, and text. The following example demonstrates a mixed fusion of multiple images, a video, and text.
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-embedding",
    "input": {
        "contents": [
            {"text": "Product description text"},
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"},
            {"image": "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"},
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"}
        ]
    },
    "parameters": {
        "enable_fusion": true
    }
}'

Request headers

Content-Type string (Required)

The content type of the request. Must be application/json.

Authorization string (Required)

The authentication credentials using a Model Studio API key.

Example: Bearer sk-xxxx

Request body

model string (Required)

The model name. Set this to a model name from the Model overview.

input object (Required)

The input content.

Properties

contents array (Required)

A list of content to process. Each element is a dictionary or string specifying content type and value: {"modality_type": "input_string_or_image/video_url"}. Supported types: text, image, video, multi_images.

The qwen3-vl-embedding model supports both independent and fused embedding generation. To generate a fused embedding, set the boolean parameter enable_fusion to true. The qwen2.5-vl-embedding model supports only fused embeddings.
  • Text: Key is text, value is a string. You can pass the string directly without a dictionary.

  • Image: Key is image. Value can be a public URL or Base64 Data URI (data:image/{format};base64,{data}), where {format} is the image format (like jpeg or png) and {data} is the Base64 string.

  • Multiple images: Supported by tongyi-embedding-vision-plus and tongyi-embedding-vision-flash only. Key is multi_images, value is a list of images (up to 8) following the format above.

  • Video: Key is video. Value must be a public URL.
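Putting these rules together, a contents array can be assembled programmatically. A sketch (the image bytes and video URL are placeholders, not real assets):

```python
import base64

# Placeholder bytes; in practice, read them from an image file.
image_bytes = b"\x89PNG\r\n\x1a\n"
b64 = base64.b64encode(image_bytes).decode("utf-8")

contents = [
    {"text": "Multimodal embedding model"},       # text item (a bare string also works)
    {"image": f"data:image/png;base64,{b64}"},    # Base64 Data URI form
    {"video": "https://example.com/clip.mp4"},    # video must be a public URL
]
```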

parameters object (Optional)

Parameters for vector processing. For HTTP calls, wrap these fields in the parameters object; for SDK calls, pass them directly as call arguments.

Properties

output_type string (Optional)

Output vector format. Currently only dense is supported.

dimension integer (Optional)

Output vector dimension. Supported values vary by model:

  • qwen3-vl-embedding: Supports 2,560, 2,048, 1,536, 1,024, 768, 512, and 256. The default is 2,560.

  • tongyi-embedding-vision-plus: Does not support this parameter. It returns an embedding with a fixed dimension of 1,152.

  • tongyi-embedding-vision-flash: Does not support this parameter. It returns an embedding with a fixed dimension of 768.

  • multimodal-embedding-v1: Does not support this parameter. It returns an embedding with a fixed dimension of 1,024.

fps float (Optional)

Controls video frame extraction rate. Range: [0, 1] (lower = fewer frames). Default: 1.0.

instruct string (Optional)

A custom task description that helps the model understand the query intent. Writing the instruction in English typically improves performance by 1-5%.

enable_fusion bool (Optional)

Specifies whether to generate a fused embedding. This parameter is supported only by the qwen3-vl-embedding model. If set to true, all multimodal content in the contents array is fused into a single embedding. The default is false, which generates an independent embedding for each modality. Fused embeddings support combinations like text with an image, text with a video, multiple images with text (by passing multiple image items), and a mix of an image, a video, and text. This is suitable for retrieval use cases that require a comprehensive understanding of multimodal content.
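As a sketch, an HTTP request body combining these parameters can be built and serialized as follows (the URL and parameter values are illustrative, not recommendations):

```python
import json

payload = {
    "model": "qwen3-vl-embedding",
    "input": {
        "contents": [
            {"text": "Product description text"},
            {"image": "https://example.com/product.png"},  # placeholder URL
        ]
    },
    "parameters": {
        "dimension": 1024,       # a supported value for qwen3-vl-embedding
        "instruct": "Retrieve product images matching the query.",
        "enable_fusion": True,   # fuse all contents into a single embedding
    },
}
body = json.dumps(payload)  # send as the POST body with Content-Type: application/json
```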

Response

Successful response

{
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.026611328125,
                    -0.016571044921875,
                    -0.02227783203125,
                    ...
                ],
                "type": "text"
            },
            {
                "index": 1,
                "embedding": [
                    0.051544189453125,
                    0.007717132568359375,
                    0.026611328125,
                    ...
                ],
                "type": "image"
            },
            {
                "index": 2,
                "embedding": [
                    -0.0217437744140625,
                    -0.016448974609375,
                    0.040679931640625,
                    ...
                ],
                "type": "video"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "image_tokens": 896
    },
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}

Error response

{
    "code":"InvalidApiKey",
    "message":"Invalid API-key provided.",
    "request_id":"fb53c4ec-1c12-4fc4-a580-cdb7c3261fc1"
}

output object

The embedding results.

Properties

embeddings array

List of vector results, one per input element.

Properties

index int

Index in the input list.

embedding array

The embedding vector.

type string

Input type: text, image, video, multi_images, or vl (only with qwen3-vl-embedding).

request_id string

Unique identifier for the request. Use for tracing and troubleshooting issues.

code string

The error code. Returned only when the request fails. See error codes for details.

message string

Detailed error message. Returned only when the request fails. See error codes for details.

usage object

Token usage statistics.

Properties

input_tokens int

Input tokens for this request.

image_tokens int

Image/video tokens for this request. For videos, the system samples frames (the sampling rate is controlled by the fps parameter) and calculates tokens from the sampled frames.

image_count int

Image count for this request.

duration int

Video duration in seconds.
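A sketch of consuming these fields, using a hard-coded stub shaped like the successful response above rather than a live call:

```python
# Stub response in the shape documented above (vectors truncated).
resp = {
    "output": {"embeddings": [
        {"index": 0, "embedding": [-0.0266, -0.0166], "type": "text"},
        {"index": 1, "embedding": [0.0515, 0.0077], "type": "image"},
    ]},
    "usage": {"input_tokens": 10, "image_tokens": 896},
    "request_id": "stub-request-id",
}

# Index the vectors by input position, keeping the modality for bookkeeping.
vectors = {e["index"]: (e["type"], e["embedding"])
           for e in resp["output"]["embeddings"]}
```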

SDK usage

The SDK request body differs from the HTTP request body: the input argument in an SDK call corresponds to input.contents in an HTTP call.
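The mapping can be illustrated with plain dictionaries (values are illustrative): the list passed as input to the SDK is the same list that sits at input.contents in the HTTP body.

```python
contents = [{"text": "Multimodal embedding model"}]

# HTTP body vs. SDK call arguments for the same request.
http_body = {"model": "tongyi-embedding-vision-plus", "input": {"contents": contents}}
sdk_kwargs = {"model": "tongyi-embedding-vision-plus", "input": contents}

# Both styles carry the identical contents list.
assert http_body["input"]["contents"] is sdk_kwargs["input"]
```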

Code examples

Image embedding

Image URL

import dashscope
import json
from http import HTTPStatus
# Replace this with your image URL.
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
input_data = [{'image': image}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    print(f"Request failed: {resp.code} - {resp.message}")

Local image

Convert a local image to Base64 and call the model for embedding:

import dashscope
import base64
import json
from http import HTTPStatus
# Read the image and convert it to Base64. Replace xxx.png with your image file name or path.
image_path = "xxx.png"
with open(image_path, "rb") as image_file:
    # Read the file and convert it to Base64.
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')
# Set the image format based on the actual file, such as jpg or bmp.
image_format = "png"
image_data = f"data:image/{image_format};base64,{base64_image}"
# Input data
input_data = [{'image': image_data}]

# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)
if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    print(f"Request failed: {resp.code} - {resp.message}")

Video embedding

Video input must be a URL. Local videos are not supported.
import dashscope
import json
from http import HTTPStatus
# Replace this with your video URL.
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
input_data = [{'video': video}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    print(f"Request failed: {resp.code} - {resp.message}")

Text embedding

import dashscope
import json
from http import HTTPStatus

text = "General multimodal representation model example"
input_data = [{'text': text}]
# Call the model API.
resp = dashscope.MultiModalEmbedding.call(
    model="tongyi-embedding-vision-plus",
    input=input_data
)

if resp.status_code == HTTPStatus.OK:
    result = {
        "status_code": resp.status_code,
        "request_id": getattr(resp, "request_id", ""),
        "code": getattr(resp, "code", ""),
        "message": getattr(resp, "message", ""),
        "output": resp.output,
        "usage": resp.usage
    }
    print(json.dumps(result, ensure_ascii=False, indent=4))
else:
    print(f"Request failed: {resp.code} - {resp.message}")

Fused embedding

import dashscope
import json
import os

# Multimodal fused embedding: Combines text, image, and video into a single fused embedding.
# Ideal for applications like cross-modal retrieval and image search.
text = "This is a test text for generating a multimodal fused embedding."
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"

# The input contains text, an image, and a video. Set enable_fusion=True to generate a single fused embedding.
input_data = [
    {"text": text},
    {"image": image},
    {"video": video}
]

# Use qwen3-vl-embedding to generate a fused embedding.
resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True,
    # Optional: specify the embedding dimension. Supported values: 2560, 2048, 1536, 1024, 768, 512, and 256. Default: 2560.
    # dimension=1024,
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

Multi-image fused embedding

This example shows how to use the qwen3-vl-embedding model to fuse multiple images and text into a single embedding. To perform this multi-image fusion, pass multiple image items. This method is ideal for comprehensive semantic retrieval of products using multi-angle images and text descriptions.

import dashscope
import json
import os

# Multi-image + text fused embedding: Fuses multiple product images and a description text into a single embedding.
# Suitable for comprehensive semantic retrieval using multi-angle product images and a text description.
text = "White sports shoes, lightweight and breathable, suitable for running and daily wear."
image1 = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
image2 = "https://img.alicdn.com/imgextra/i3/O1CN01rdstgY1uiZWt8gqSL_!!6000000006071-0-tps-1970-356.jpg"

# Pass multiple image items for multi-image fusion. `enable_fusion=True` fuses all inputs into a single embedding.
input_data = [
    {"text": text},
    {"image": image1},
    {"image": image2}
]

resp = dashscope.MultiModalEmbedding.call(
    # If the environment variable is not set, provide your DashScope API Key directly, e.g., api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-embedding",
    input=input_data,
    enable_fusion=True
)

print(json.dumps(resp, ensure_ascii=False, indent=4))

Output

{
    "status_code": 200,
    "request_id": "40532987-ba72-42aa-a178-bb58b52fb7f3",
    "code": "",
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.009490966796875,
                    -0.024871826171875,
                    -0.031280517578125,
                    ...
                ],
                "type": "text"
            }
        ]
    },
    "usage": {
        "input_tokens": 10,
        "input_tokens_details": {
            "image_tokens": 0,
            "text_tokens": 10
        },
        "output_tokens": 1,
        "total_tokens": 11
    }
}

Error codes

If the model call fails and returns an error message, see Error messages for resolution.