
Alibaba Cloud Model Studio: Visual understanding (Qwen-VL)

Last Updated: Nov 12, 2025

The Qwen-VL model can answer questions about the images or videos that you provide. It supports single-image and multi-image inputs for tasks such as image captioning, visual question answering, and object detection.

Try it online: Vision model (Singapore or Beijing)

Getting started

Prerequisites

You have obtained an API key and set it as the DASHSCOPE_API_KEY environment variable. The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.

The following examples show how to call the model to describe the content of an image. For more information about local files and image limits, see Upload local files and Input file limits.

Single-image input

OpenAI compatible

Python

from openai import OpenAI
import os

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",  # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                    },
                },
                {"type": "text", "text": "What is depicted in the image?"},
            ],
        },
    ],
)
print(completion.choices[0].message.content)

Response

This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand with the sea and sky in the background. The person and the dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight shines from the right side of the frame, adding a warm atmosphere to the scene.

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
  const response = await openai.chat.completions.create({
    model: "qwen3-vl-plus",   // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models 
    messages: [
      {
        role: "user",
        content: [{
            type: "image_url",
            image_url: {
              "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
            }
          },
          {
            type: "text",
            text: "What is depicted in the image?"
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}
main()

Response

This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand with the sea and sky in the background. The person and the dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight shines from the right side of the frame, adding a warm atmosphere to the scene.

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3-vl-plus",
  "messages": [
    {"role": "user",
     "content": [
        {"type": "image_url", "image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
        {"type": "text", "text": "What is depicted in the image?"}
    ]
  }]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand with the sea and sky in the background. The person and the dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight shines from the right side of the frame, adding a warm atmosphere to the scene.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1270,
    "completion_tokens": 54,
    "total_tokens": 1324
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen3-vl-plus",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            {"text": "What is depicted in the image?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',   # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Response

This is a photo taken on a beach. In the photo, there is a woman and a dog. The woman is sitting on the sand, smiling and interacting with the dog. The dog is wearing a collar and appears to be shaking hands with the woman. The background is the sea and the sky, and the sunlight shining on them creates a warm atmosphere.

Java

import java.util.Arrays;
import java.util.Collections;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    
    // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation(); 
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "What is depicted in the image?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  //  This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

This is a photo taken on a beach. In the photo, there is a person in a plaid shirt and a dog wearing a collar. The person and the dog are sitting face to face, seemingly interacting. The background is the sea and the sky, and the sunlight shining on them creates a warm atmosphere.

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"text": "What is depicted in the image?"}
                ]
            }
        ]
    }
}'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "This is a photo taken on a beach. In the photo, there is a person in a plaid shirt and a dog wearing a collar. They are sitting on the sand with the sea and sky in the background. Sunlight shines from the right side of the frame, adding a warm atmosphere to the scene."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 55,
    "input_tokens": 1271,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}
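
The preceding examples reference images by public URL. If your image is stored locally, one common approach (a minimal sketch, assuming the OpenAI-compatible endpoint accepts Base64 data URLs as described in Upload local files; the file name is hypothetical) is to encode the file and pass it as a data URL:

import base64
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore region base_url; for the China (Beijing) region use https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# "dog_and_girl.jpeg" is a hypothetical local file used for illustration.
with open("dog_and_girl.jpeg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[
        {
            "role": "user",
            "content": [
                # Pass the local image as a Base64-encoded data URL instead of a public URL.
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
                {"type": "text", "text": "What is depicted in the image?"},
            ],
        },
    ],
)
print(completion.choices[0].message.content)

For supported formats and size limits, see Upload local files and Input file limits.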

Multi-image input

The Qwen-VL model supports passing multiple images in a single request. This is useful for tasks such as product comparison and multi-page document processing. To do this, include multiple image objects in the content array of the user message.

Important

The number of images you can include in a request is limited by the model's maximum input tokens: the combined token count of all images and text must not exceed this limit.

OpenAI compatible

Python

import os
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",  #  This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {"role": "user","content": [
            {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
            {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
            {"type": "text", "text": "What do these images depict?"},
            ],
        }
    ],
)

print(completion.choices[0].message.content)

Response

Image 1 shows a scene of a woman and a Labrador retriever interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, shaking hands with the dog. The background is the ocean waves and the sky, and the whole picture is filled with a warm and pleasant atmosphere.

Image 2 shows a scene of a tiger walking in a forest. The tiger's fur is orange with black stripes. It is stepping forward, surrounded by dense trees and vegetation. The ground is covered with fallen leaves, and the whole picture gives a feeling of wild nature.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://help.aliyun.com/en/model-studio/models
        messages: [
          {role: "user",content: [
            {type: "image_url",image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}},
            {type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
            {type: "text", text: "What do these images depict?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

Response

In the first image, a person and a dog are interacting on a beach. The person is wearing a plaid shirt, and the dog is wearing a collar. They seem to be shaking hands or giving a high-five.

In the second image, a tiger is walking in a forest. The tiger's fur is orange with black stripes, and the background is green trees and vegetation.

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
          }
        },
        {
          "type": "text",
          "text": "What do these images depict?"
        }
      ]
    }
  ]
}'

Response

{
  "choices": [
    {
      "message": {
        "content": "Image 1 shows a scene of a woman and a Labrador retriever interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, shaking hands with the dog. The background is a sea view and a sunset sky, making the whole scene look very warm and harmonious.\n\nImage 2 shows a scene of a tiger walking in a forest. The tiger's fur is orange with black stripes. It is stepping forward, surrounded by dense trees and vegetation. The ground is covered with fallen leaves, and the whole picture is full of natural wildness and vitality.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2497,
    "completion_tokens": 109,
    "total_tokens": 2606
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen3-vl-plus",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
            {"text": "What do these images depict?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus', #  This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Response

These images show some animals and natural scenes. In the first image, a person and a dog are interacting on a beach. The second image is of a tiger walking in a forest.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                        Collections.singletonMap("text", "What do these images depict?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  //  This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

These images show some animals and natural scenes.

1. First image: A woman and a dog are interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, and the dog is wearing a collar and extending its paw to shake hands with the woman.
2. Second image: A tiger is walking in a forest. The tiger's fur is orange with black stripes, and the background is trees and leaves.

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                    {"text": "What do these images show?"}
                ]
            }
        ]
    }
}'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "These images show some animals and natural scenes. In the first image, a person and a dog are interacting on a beach. The second image is of a tiger walking in a forest."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 81,
    "input_tokens": 1277,
    "image_tokens": 2497
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Model selection

  • For tasks such as high-precision object detection and localization (including 3D localization), agent tool calling, document and webpage parsing, complex problem-solving, and long video understanding, we recommend using Qwen3-VL. A comparison of the models in this series is as follows:

    • qwen3-vl-plus: The most powerful model.

    • qwen3-vl-flash: A faster and more cost-effective option. It provides a balance between performance and cost, making it suitable for scenarios that are sensitive to response speed.

  • For general tasks such as simple image captioning and short video summary extraction, you can choose Qwen2.5-VL. A comparison of the models in this series is as follows:

    • qwen-vl-max: The best-performing version in the Qwen2.5-VL series.

    • qwen-vl-plus: Faster and provides a good balance between performance and cost.

For more information about model names, context, prices, and snapshot versions, see Model list. For information about concurrent request limits, see Throttling.

Model feature comparison

Qwen3-VL series

  • Deep thinking: Supported

  • Tool calling: Supported

  • Context cache: Supported by the stable versions of qwen3-vl-plus and qwen3-vl-flash

  • Structured output: Supported in non-thinking mode

  • Recognized languages: 33 languages: Chinese, Japanese, Korean, Indonesian, Vietnamese, Thai, English, French, German, Russian, Portuguese, Spanish, Italian, Swedish, Danish, Czech, Norwegian, Dutch, Finnish, Turkish, Polish, Swahili, Romanian, Serbian, Greek, Kazakh, Uzbek, Cebuano, Arabic, Urdu, Persian, Hindi/Devanagari, and Hebrew.

Qwen2.5-VL series

  • Deep thinking: Not supported

  • Tool calling: Not supported

  • Context cache: Supported by the stable versions of qwen-vl-max and qwen-vl-plus

  • Structured output: Supported by the stable and latest versions of qwen-vl-max and qwen-vl-plus

  • Recognized languages: 11 languages: Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.

Feature overview

Image Q&A

Describe, classify, or tag the content of an image, such as identifying people, places, animals, and plants.


If the sun is too bright, what item from this image should I use?

When the sun is too bright, you should use the pink sunglasses from the image. Sunglasses can effectively block strong light, reduce UV damage to your eyes, and help protect your vision and improve visual comfort in bright sunlight.

Creative writing

Generate vivid text descriptions from image or video content. This feature is suitable for creative scenarios such as story writing, copywriting, and short video scripts.


Please help me write an interesting social media post based on the content of the image.

Of course. This image is full of the rich atmosphere of the Mid-Autumn Festival, with a clever combination of classical and modern elements. Based on the main elements in the image, I have prepared several social media posts in different styles for you. You can choose the one you like.

Poetic and beautiful style

Tonight, the moon is bright for all to see, but where do the autumn thoughts fall? Chang'e flies to the moon, and the Jade Rabbit pounds medicine. The romance of the ancients is lit up tonight. May this bright moon light your way home and carry my deepest thoughts. Happy Mid-Autumn Festival!

Warm and blessing style

The moon is full, and people are reunited. The Mid-Autumn night is the gentlest. Watch the fireworks bloom, admire the full moon in the sky, eat a piece of mooncake, and wish for good health. May all that you and I wish for come true. Happy Mid-Autumn Festival to everyone, and may your family be happy!

Text recognition and information extraction

Recognize text and formulas in images, or extract information from receipts, certificates, and forms. This feature supports formatted text output. The Qwen3-VL model increases the number of supported languages to 33. For a list of supported languages, see Model feature comparison.


Extract the following from the image: ['Invoice Code', 'Invoice Number', 'Destination', 'Fuel Surcharge', 'Fare', 'Travel Date', 'Departure Time', 'Train Number', 'Seat Number']. Please output in JSON format.

{
  "Invoice Code": "221021325353",
  "Invoice Number": "10283819",
  "Destination": "Development Zone",
  "Fuel Surcharge": "2.0",
  "Fare": "8.00<Full>",
  "Travel Date": "2013-06-29",
  "Departure Time": "Serial",
  "Train Number": "040",
  "Seat Number": "371"
}
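
For reference, a minimal sketch of sending such an extraction prompt through the OpenAI-compatible interface (the invoice image URL is a placeholder for your own file, and whether the reply can be parsed directly depends on the model returning bare JSON):

import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # Singapore region
)

fields = ["Invoice Code", "Invoice Number", "Destination", "Fuel Surcharge", "Fare",
          "Travel Date", "Departure Time", "Train Number", "Seat Number"]

completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[{
        "role": "user",
        "content": [
            # "https://example.com/invoice.jpg" is a placeholder for the URL of your receipt image.
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.jpg"}},
            {"type": "text", "text": f"Extract the following from the image: {fields}. Please output in JSON format."},
        ],
    }],
)

reply = completion.choices[0].message.content
print(reply)
# If the reply is bare JSON like the sample above, it can be parsed directly;
# otherwise strip any surrounding formatting before calling json.loads(reply).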

Multi-disciplinary problem-solving

Solve problems from subjects such as mathematics, physics, and chemistry that are presented in images. This feature is suitable for various educational levels, from primary school to university and adult education.


Please solve the math problem in the image step by step.


Visual coding

Generate code from images or videos. For example, you can generate HTML, CSS, and JS code from design drafts or website screenshots.


Create a webpage using HTML and CSS based on my sketch. The main color should be black.


Webpage preview
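
As an illustration, a minimal sketch of one way to generate and save the page (reusing the OpenAI-compatible pattern from the earlier examples; the sketch image URL and output file name are placeholders):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # Singapore region
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[{
        "role": "user",
        "content": [
            # "https://example.com/sketch.png" is a placeholder for your design draft or website screenshot.
            {"type": "image_url", "image_url": {"url": "https://example.com/sketch.png"}},
            {"type": "text", "text": "Create a webpage using HTML and CSS based on my sketch. The main color should be black."},
        ],
    }],
)

# Save the generated markup so it can be opened in a browser.
# The model may wrap the code in Markdown fences; strip them if present before saving.
with open("generated_page.html", "w", encoding="utf-8") as f:
    f.write(completion.choices[0].message.content)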

Object detection

Supports 2D and 3D detection. You can use this feature to determine object orientation, perspective changes, and occlusion relationships. 3D detection is a new feature of the Qwen3-VL model.

The object detection performance of the Qwen2.5-VL model is robust within a resolution range of 480 × 480 to 2560 × 2560 pixels. Outside this range, the detection accuracy may decrease, and bounding box drift may occur. For more information about how to draw the detection results on the original image, see the FAQ.

2D detection


  • Return Box (bounding box) coordinates: Detect all food items in the image and output their bbox coordinates in JSON format.

  • Return Point (centroid) coordinates: Locate all food items in the image as points and output their point coordinates in XML format.

Visualization of 2D detection results
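
To produce a visualization like this yourself, one option (a minimal sketch using Pillow; the key names bbox_2d and label, the sample coordinates, and the file names are assumptions for illustration, so adjust the parsing to the output you actually receive and see the FAQ for details) is to parse the returned JSON and draw the boxes on the original image:

import json
from PIL import Image, ImageDraw

# Stand-in for the JSON text returned by the model in response to a bbox prompt.
model_reply = '[{"bbox_2d": [100, 120, 260, 300], "label": "sandwich"}]'
detections = json.loads(model_reply)

# "food.jpg" is a hypothetical local copy of the image that was sent to the model.
image = Image.open("food.jpg")
draw = ImageDraw.Draw(image)
for item in detections:
    x1, y1, x2, y2 = item["bbox_2d"]
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(0, y1 - 12)), item["label"], fill="red")
image.save("food_with_boxes.jpg")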


3D detection


Detect the car in the image and predict its 3D position. Output JSON: [{"bbox_3d": [x_center, y_center, z_center, x_size, y_size, z_size, roll, pitch, yaw], "label": "category"}].

Visualization of 3D detection results


Document parsing

Parse image-based documents, such as scanned copies or image PDFs, into QwenVL HTML or QwenVL Markdown format. The output accurately captures the text and preserves the position information of elements such as images and tables. The Qwen3-VL model adds support for parsing into Markdown format.

The recommended prompts are qwenvl html (to parse into HTML format) and qwenvl markdown (to parse into Markdown format).


qwenvl markdown.


Visualization of results
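
A minimal sketch of sending the qwenvl markdown prompt through the DashScope SDK (following the earlier DashScope examples; the scanned-document URL is a placeholder):

import os
import dashscope

# Singapore region base_url; for the China (Beijing) region use https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder URL for your scanned document or image-based PDF page.
            {"image": "https://example.com/scanned_page.png"},
            {"text": "qwenvl markdown"}  # use "qwenvl html" to parse into HTML format instead
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])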

Video understanding

Analyze video content, such as locating specific events and obtaining timestamps, or generating summaries of key time periods.

Please describe the series of actions of the person in the video. Output the start time (start_time), end time (end_time), and event (event) in JSON format. Please use HH:mm:ss to represent the timestamp.

{
  "events": [
    {
      "start_time": "00:00:00",
      "end_time": "00:00:05",
      "event": "A person walks towards a table holding a cardboard box and places it on the table."
    },
    {
      "start_time": "00:00:05",
      "end_time": "00:00:15",
      "event": "The person picks up a scanner and scans the label on the cardboard box."
    },
    {
      "start_time": "00:00:15",
      "end_time": "00:00:21",
      "event": "The person puts the scanner back in its place and then picks up a pen to write information in a notebook."
    }
  ]
}
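
A minimal sketch of how such a request might look through the OpenAI-compatible interface. It assumes a video file is passed with the video_url content type (check the video understanding documentation for the input formats and size limits your model supports); the video URL is a placeholder:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # Singapore region
)

completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[{
        "role": "user",
        "content": [
            # "video_url" is assumed here for video-file input; the URL is a placeholder.
            {"type": "video_url", "video_url": {"url": "https://example.com/warehouse_clip.mp4"}},
            {"type": "text", "text": "Please describe the series of actions of the person in the video. "
                                     "Output the start time (start_time), end time (end_time), and event (event) "
                                     "in JSON format. Please use HH:mm:ss to represent the timestamp."},
        ],
    }],
)
print(completion.choices[0].message.content)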

Core features

Enable or disable thinking mode

  • The qwen3-vl-plus and qwen3-vl-flash series models are hybrid thinking models: they can either respond directly or think before responding. You can use the enable_thinking parameter to control this behavior:

    • true: Enables thinking mode.

    • false (default): Disables thinking mode.

  • Models with the `thinking` suffix, such as qwen3-vl-235b-a22b-thinking, are thinking-only models. These models always think before responding, and this feature cannot be disabled.

Important
  • Model configuration: In general conversation scenarios that do not involve agent tool calls, we recommend that you do not set a System Message to maintain optimal performance. You can pass instructions, such as model role settings and output format requirements, through the User Message.

  • Prioritize streaming output: When thinking mode is enabled, both streaming and non-streaming output are supported. To avoid timeouts caused by long responses, we recommend using streaming output.

  • Limit thinking length: Thinking models can sometimes output lengthy reasoning processes. You can use the thinking_budget parameter to limit the length of the thinking process. If the number of tokens generated during the model's thinking process exceeds the thinking_budget, the reasoning content is truncated, and the model immediately starts generating the final response. The default value of thinking_budget is the model's maximum chain-of-thought length. For more information, see Model list.

OpenAI compatible

enable_thinking is not a standard OpenAI parameter. If you use the OpenAI Python SDK, pass it through extra_body.

import os
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""  # Define the full thinking process
answer_content = ""     # Define the full response
is_answering = False   # Determine if the thinking process has ended and the response has started
enable_thinking = True
# Create a chat completion request
completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
                    },
                },
                {"type": "text", "text": "How do I solve this problem?"},
            ],
        },
    ],
    stream=True,
    # The enable_thinking parameter enables the thinking process. The thinking_budget parameter sets the maximum number of tokens for the reasoning process.
    # For qwen3-vl-plus and qwen3-vl-flash, you can enable or disable thinking with enable_thinking. For models with the `thinking` suffix, such as qwen3-vl-235b-a22b-thinking, enable_thinking can only be set to true. This does not apply to other Qwen-VL models.
    extra_body={
        'enable_thinking': enable_thinking,
        "thinking_budget": 81920},

    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

if enable_thinking:
    print("\n" + "=" * 20 + "Thinking process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print the thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start the response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Full response" + "=" * 20 + "\n")
                is_answering = True
            # Print the response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Full thinking process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Full response" + "=" * 20 + "\n")
# print(answer_content)
import OpenAI from "openai";

// Initialize the OpenAI client
const openai = new OpenAI({
  // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;
let enableThinking = true;

let messages = [
    {
        role: "user",
        content: [
        { type: "image_url", image_url: { "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg" } },
        { type: "text", text: "Solve this problem" },
    ]
}]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qwen3-vl-plus',
            messages: messages,
            stream: true,
          // Note: In the Node.js SDK, non-standard parameters like enableThinking are passed as top-level properties and do not need to be placed in extra_body.
          enable_thinking: enableThinking,
          thinking_budget: 81920

        });

        if (enableThinking){console.log('\n' + '='.repeat(20) + 'Thinking process' + '='.repeat(20) + '\n');}

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Handle the thinking process
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Handle the formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Full response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

curl

# ======= Important =======
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen3-vl-plus",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
          }
        },
        {
          "type": "text",
          "text": "Please solve this problem"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true},
    "enable_thinking": true,
    "thinking_budget": 81920
}'

DashScope

Python

import os
import dashscope
from dashscope import MultiModalConversation

# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
enable_thinking=True
messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
            {"text": "Solve this problem."}
        ]
    }
]

response = MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen3-vl-plus",  
    messages=messages,
    stream=True,
    # The enable_thinking parameter enables the thinking process.
    # For qwen3-vl-plus and qwen3-vl-flash, you can enable or disable thinking with enable_thinking. For models with the `thinking` suffix, such as qwen3-vl-235b-a22b-thinking, enable_thinking can only be set to true. This does not apply to other Qwen-VL models.
    enable_thinking=enable_thinking,
    # The thinking_budget parameter sets the maximum number of tokens for the reasoning process.
    thinking_budget=81920,

)

# Define the full thinking process
reasoning_content = ""
# Define the full response
answer_content = ""
# Determine if the thinking process has ended and the response has started
is_answering = False

if enable_thinking:
    print("=" * 20 + "Thinking process" + "=" * 20)

for chunk in response:
    # If both the thinking process and the response are empty, ignore
    message = chunk.output.choices[0].message
    reasoning_content_chunk = message.get("reasoning_content", None)
    if (chunk.output.choices[0].message.content == [] and
        reasoning_content_chunk == ""):
        pass
    else:
        # If it is currently in the thinking process
        if reasoning_content_chunk is not None and chunk.output.choices[0].message.content == []:
            print(chunk.output.choices[0].message.reasoning_content, end="")
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If it is currently responding
        elif chunk.output.choices[0].message.content != []:
            if not is_answering:
                print("\n" + "=" * 20 + "Full response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content[0]["text"], end="")
            answer_content += chunk.output.choices[0].message.content[0]["text"]

# To print the full thinking process and the full response, uncomment the following code and run it.
# print("=" * 20 + "Full thinking process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Full response" + "=" * 20 + "\n")
# print(f"{answer_content}")

Java

// DashScope SDK version >= 2.21.10
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";}

    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re)?"":re; // default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Thinking process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(content.get(0).get("text"));
            if (!isFirstPrint) {
                System.out.println("\n====================Full response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(MultiModalMessage Msg)  {
        return MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(Msg))
                .enableThinking(true)
                .thinkingBudget(81920)
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, MultiModalMessage Msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(Msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(message -> {
            handleGenerationResult(message);
        });
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalMessage userMsg = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"),
                            Collections.singletonMap("text", "Please solve this problem")))
                    .build();
            streamCallWithMessage(conv, userMsg);
//             Print the final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Full response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
                    {"text": "Please solve this problem"}
                ]
            }
        ]
    },
    "parameters":{
        "enable_thinking": true,
        "incremental_output": true,
        "thinking_budget": 81920
    }
}'

Enable high-resolution mode

For images that contain a large amount of detail, you can set the vl_high_resolution_images parameter to True to enable high-resolution mode. When this mode is enabled, the default token limit per image increases from 1280 or 2560 to 16384.

Enabled (vl_high_resolution_images = True)

  • Token limit per image: 16384

  • Scenarios: Scenarios with rich content that require attention to detail

  • Cost and latency: Higher

Disabled (vl_high_resolution_images = False, default)

  • Token limit per image: 2560 for Qwen3-VL, qwen-vl-max-0813 and later, and qwen-vl-plus-0815 and later models; 1280 for other Qwen-VL models

  • Scenarios: Scenarios with fewer details, high speed requirements, or cost sensitivity

  • Cost and latency: Lower

OpenAI compatible

vl_high_resolution_images is not a standard OpenAI parameter. If you use the OpenAI Python SDK, pass it through extra_body.

Python

import os
import time
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def test_resolution(high_resolution=False):
    completion = client.chat.completions.create(
        model="qwen3-vl-plus",
        messages=[
           {"role": "user","content": [
               {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},},
               {"type": "text", "text": "What do these images depict?"},
                ],
            }
        ],
        extra_body={'enable_thinking': False,
                    "vl_high_resolution_images":high_resolution}

    )
    usage_info= completion.usage.prompt_tokens
    return {
        'usage_info': usage_info
    }

# Print the comparison result
print("\n==================== Token Usage Comparison ====================")
# Test low resolution
result_low = test_resolution(high_resolution=False)
# Wait a moment to avoid API limits
time.sleep(2)
# Test high resolution
result_high = test_resolution(high_resolution=True)


if result_low['usage_info'] and result_high['usage_info']:
    low_tokens = result_low['usage_info']
    high_tokens = result_high['usage_info']
    print(f"High-resolution mode - Total input tokens: {high_tokens}")
    print(f"Low-resolution mode - Total input tokens: {low_tokens}")
    print(f"Difference: {high_tokens - low_tokens} tokens")

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function test_resolution(high_resolution) {
    const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus",
        messages: [
        {role: "user",content: [
            {type: "image_url",image_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"}},
            {type: "text", text: "What do these images depict?" },
        ]}],
        enable_thinking: false,
        vl_high_resolution_images:high_resolution
    });
    return response.usage.prompt_tokens;
}


// Test low and high resolution
(async function main() {
    console.log("\n==================== Token Usage Comparison ====================")
    const result_low = await test_resolution(false);
    const result_high = await test_resolution(true);

    console.log("High resolution - Total input tokens:",result_high);
    console.log("Low resolution - Total input tokens:", result_low);
    console.log("Difference:", result_high-result_low);

})();

curl

# ======= Important =======
# The API keys for the Singapore and China (Beijing) regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following base_url is for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"
          }
        },
        {
          "type": "text",
          "text": "What do these images depict?"
        }
      ]
    }
  ],
  "enable_thinking": false,
  "vl_high_resolution_images":true
}'

Token usage comparison

==================== Token usage comparison ====================
Total input tokens (high resolution): 4073
Total input tokens (low resolution): 2518
Difference: 1555

DashScope

Python

import os
import time

import dashscope

# The following is the base_url for the Singapore region. To use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            {"text": "What does this image show?"}
        ]
    }
]

def test_resolution(high_resolution=False):
    """Tests the results of different resolution settings."""

    response = dashscope.MultiModalConversation.call(
        # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
        # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model='qwen3-vl-plus',
        messages=messages,
        enable_thinking=False,
        vl_high_resolution_images=high_resolution
    )
    return {
        'usage_info': response.usage
    }

# Print the comparison results.
print("\n==================== Token Usage Comparison ====================")
# Test low resolution.
result_low = test_resolution(high_resolution=False)
# Wait for a moment to avoid API rate limits.
time.sleep(2)
# Test high resolution.
result_high = test_resolution(high_resolution=True)

if result_low['usage_info'] and result_high['usage_info']:
    low_tokens = result_low['usage_info'].input_tokens_details['image_tokens']
    high_tokens = result_high['usage_info'].input_tokens_details['image_tokens']
    print(f"High-resolution mode - Image tokens: {high_tokens}")
    print(f"Low-resolution mode - Image tokens: {low_tokens}")
    print(f"Difference: {high_tokens - low_tokens} tokens ")

Java

// DashScope SDK version >= 2.21.10
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. To use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static Integer simpleMultiModalConversationCall(boolean highResolution)
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"),
                        Collections.singletonMap("text", "What does this image show?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .enableThinking(false)
                .messages(Arrays.asList(userMessage))
                .vlHighResolutionImages(highResolution)
                .build();
        MultiModalConversationResult result = conv.call(param);
        return result.getUsage().getImageTokens();
    }

    public static void main(String[] args) {
        try {
            // Call in high-resolution mode.
            Integer highResToken = simpleMultiModalConversationCall(true);
            // Call in low-resolution mode.
            Integer lowResToken = simpleMultiModalConversationCall(false);

            // Print the comparison results.
            System.out.println("=== Token Usage Comparison ===");
            System.out.println("High-resolution mode token count: " + highResToken);
            System.out.println("Low-resolution mode token count: " + lowResToken);
            System.out.println("Difference: " + (highResToken - lowResToken));
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. To use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution. ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
               {"text": "What does this image show?"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true,
        "enable_thinking": false
    }
}'

Token usage comparison

==================== Token Usage Comparison ====================
High-resolution mode - Image tokens: 4058
Low-resolution mode - Image tokens: 2503
Difference: 1555 tokens 

Video understanding

The Qwen-VL model can understand video content from a list of images (video frames) or a video file.

For optimal performance when understanding video files, use the latest version or a recent snapshot version of the model.

Video files

Video frame extraction

The Qwen-VL model understands video content by extracting frames. The frame extraction frequency determines the level of detail in the analysis. The method for controlling this frequency differs between software development kits (SDKs):

  • Use the DashScope SDK:

    The fps parameter controls the frame extraction rate: one frame is extracted every 1/fps seconds. The default value is 2.0, and the valid range is (0.1, 10). Use a higher fps value for fast-moving scenes and a lower fps value for static scenes or long videos (see the sketch after this list).

  • OpenAI compatible SDK: The frame extraction frequency is fixed at one frame every 0.5 seconds and cannot be customized.
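
To get a rough sense of scale, the following minimal sketch (plain arithmetic, not an SDK call; the function name is only illustrative) estimates how many frames are sampled for a given video duration and fps value:

# A minimal sketch: one frame is extracted every 1/fps seconds,
# so a video of N seconds yields roughly N * fps frames.
def estimated_frame_count(duration_seconds: float, fps: float = 2.0) -> int:
    return int(duration_seconds * fps)

print(estimated_frame_count(60, fps=2.0))   # ~120 frames for a 1-minute video
print(estimated_frame_count(60, fps=0.5))   # ~30 frames, suited to long or mostly static videos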

The following code samples show how to understand an online video specified by a URL. For more information, see how to upload a local file.

OpenAI compatible

When you directly input a video file to the Qwen-VL model using the OpenAI SDK or HTTP, set the "type" parameter in the user message to "video_url".

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus",
    messages=[
        {"role": "user","content": [{
            # When passing a video file directly, set the value of type to video_url.
            # When using the OpenAI SDK, one frame is extracted from the video every 0.5 seconds by default. This frequency cannot be changed. To customize the frame rate, use the DashScope SDK.
            "type": "video_url",            
            "video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {"type": "text","text": "What is the content of this video?"}]
         }]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
        // If the environment variable is not configured, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus",
        messages: [
        {role: "user",content: [
            // When passing a video file directly, set the value of type to video_url.
            // When using the OpenAI SDK, one frame is extracted from the video every 0.5 seconds by default. This frequency cannot be changed. To customize the frame rate, use the DashScope SDK.
            {type: "video_url", video_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {type: "text", text: "What is the content of this video?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

curl

# ======= Important =======
# API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
# The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "messages": [
    {"role": "user","content": [{"type": "video_url","video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
    {"type": "text","text": "What is the content of this video?"}]}]
}'

DashScope

Python

import dashscope
import os
# The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {"role": "user",
        "content": [
            # The fps parameter controls the video frame extraction frequency. It specifies that one frame is extracted every 1/fps seconds. For complete usage, see: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key ="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // The fps parameter controls the video frame extraction frequency. It specifies that one frame is extracted every 1/fps seconds. For complete usage, see: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
        Map<String, Object> params = new HashMap<>();
        params.put("video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4");
        params.put("fps", 2);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "What is the content of this video?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you use a model in the China (Beijing) region, you must use an API key for the China (Beijing) region. Get an API key: https://bailian.console.alibabacloud.com/?tab=model#/api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {"role": "user","content": [{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}]}]}
}'

Image lists

Image list limits

  • Qwen3-VL, Qwen2.5-VL, and QVQ series models: A minimum of 4 images and a maximum of 512 images.

  • Other models: A minimum of 4 images and a maximum of 80 images.

Video frame extraction

When you provide a video as a list of images (pre-extracted video frames), you can use the fps parameter to inform the model of the time interval between frames. This helps the model better understand the sequence, duration, and dynamics of events.

  • Use the DashScope SDK:

    You can use the fps parameter to indicate that the image list was extracted from the original video at a rate of one frame every 1/fps seconds. This parameter is supported by the Qwen2.5-VL and Qwen3-VL series models (see the sketch after this list).

  • Use an OpenAI compatible SDK:

    The fps parameter is not supported. The model assumes a video frame extraction frequency of one frame every 0.5 seconds.
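
If you already know the sampling interval used to produce the frame list, the fps value to pass is simply its reciprocal. The following minimal sketch (plain arithmetic, not an SDK call) shows the conversion:

# A minimal sketch: fps = 1 / (interval between consecutive frames, in seconds).
def fps_from_interval(interval_seconds: float) -> float:
    return 1.0 / interval_seconds

print(fps_from_interval(0.5))  # frames sampled every 0.5 s -> pass fps=2
print(fps_from_interval(2.0))  # frames sampled every 2 s   -> pass fps=0.5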

The following code samples show how to understand online video frames specified by URLs. For more information, see how to upload local files.

OpenAI compatible

When you input a video as a list of images to the Qwen-VL model using the OpenAI SDK or HTTP, set the "type" parameter in the user message to "video".

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus", # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see: https://www.alibabacloud.com/help/en/model-studio/models-and-pricing
    messages=[{"role": "user","content": [
        # When passing a list of images, the "type" parameter in the user message is "video".
        # When using the OpenAI SDK, the image list is assumed to be extracted from the video at a rate of one frame every 0.5 seconds. This cannot be changed. To customize the frame rate, use the DashScope SDK.
        {"type": "video","video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
        {"type": "text","text": "Describe the specific process in this video"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

// Make sure you have specified "type": "module" in your package.json file.
import OpenAI from "openai";

const openai = new OpenAI({
    // API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    // If the environment variable is not configured, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    // The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen3-vl-plus", // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see: https://www.alibabacloud.com/help/en/model-studio/models-and-pricing
        messages: [{
            role: "user",
            content: [
                {
                    // When passing a list of images, the "type" parameter in the user message is "video".
                    // When using the OpenAI SDK, the image list is assumed to be extracted from the video at a rate of one frame every 0.5 seconds. This cannot be changed. To customize the frame rate, use the DashScope SDK.
                    type: "video",
                    video: [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ]
                },
                {
                    type: "text",
                    text: "Describe the specific process in this video"
                }
            ]
        }]
    });
    console.log(response.choices[0].message.content);
}

main();

curl

# ======= Important =======
# API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
# The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
                {"type": "text",
                "text": "Describe the specific process in this video"}]}]
}'

DashScope

Python

import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{"role": "user",
             "content": [
                  # If the model is from the Qwen2.5-VL or Qwen3-VL series and you pass a list of images, you can set the fps parameter. This indicates that the image list was extracted from the original video at a rate of one frame every 1/fps seconds.
                 {"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                   "fps":2},
                 {"text": "Describe the specific process in this video"}]}]
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see: https://www.alibabacloud.com/help/en/model-studio/models-and-pricing
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])

Java

// DashScope SDK version 2.18.3 or later is required.
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        // The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final String MODEL_NAME = "qwen3-vl-plus";  // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see: https://www.alibabacloud.com/help/en/model-studio/models-and-pricing
    public static void videoImageListSample() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // If the model is from the Qwen2.5-VL or Qwen3-VL series and you pass a list of images, you can set the fps parameter. This indicates that the image list was extracted from the original video at a rate of one frame every 1/fps seconds.
        Map<String, Object> params = Map.of(
                "video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"),
                "fps",2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "Describe the specific process in this video")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The following is the base_url for the Singapore region. If you use a model in the China (Beijing) region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# API keys for the Singapore and China (Beijing) regions are different. Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
            ],
            "fps":2
                 
          },
          {
            "text": "Describe the specific process in this video"
          }
        ]
      }
    ]
  }
}'

Upload local files (Base64 encoding or file path)

Qwen-VL provides two methods to upload local files:

  • Upload using Base64 encoding

  • Upload directly using a file path (more stable and recommended)

Upload methods

Upload using Base64 encoding

Convert the file to a Base64-encoded string and pass it to the model. This method is compatible with the OpenAI compatible mode, the DashScope software development kit (SDK), and HTTP requests.

Steps to pass a Base64-encoded string (image example)

  1. Encode the file: Convert the local image to a Base64-encoded string.

    Example code to convert an image to Base64 encoding

    # Encoding function: Converts a local file to a Base64-encoded string
    import base64
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    
    # Replace xxx/eagle.png with the absolute path of your local image
    base64_image = encode_image("xxx/eagle.png")
  2. Build a Data URL: Build a Data URL in the following format: data:[MIME_type];base64,{base64_image}.

    1. Replace MIME_type with the actual media type. The media type must match a MIME Type value in the Supported image formats table, such as image/jpeg or image/png.

    2. base64_image is the Base64 string generated in the previous step.

  3. Call the model: Pass the Data URL using the image or image_url parameter and call the model.
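
The following minimal sketch ties the three steps together, assuming a hypothetical local file eagle.png and the OpenAI compatible message format:

import base64

# 1. Encode the file to a Base64 string (hypothetical local path).
with open("eagle.png", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

# 2. Build the Data URL. The MIME type must match the actual image format.
data_url = f"data:image/png;base64,{base64_image}"

# 3. Pass the Data URL in the image_url field (OpenAI compatible) or the image field (DashScope).
content_item = {"type": "image_url", "image_url": {"url": data_url}}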

Upload using a file path

Directly pass the local file path to the model. This method is supported only by the DashScope Python and Java SDKs. It is not supported for DashScope HTTP requests or the OpenAI compatible mode.

Refer to the following table to specify the file path based on your programming language and operating system.

Specify a file path (image example)

  • Linux or macOS system (Python SDK or Java SDK): use file://{absolute path of the file}, for example file:///home/images/test.png.

  • Windows system, Python SDK: use file://{absolute path of the file}, for example file://D:/images/test.png.

  • Windows system, Java SDK: use file:///{absolute path of the file}, for example file:///D:/images/test.png.

Limits:

  • For better stability, we recommend uploading files using a file path. For files smaller than 1 MB, you can use Base64 encoding.

  • When you pass a file path directly, a single image or video frame (from an image list) must be smaller than 10 MB, and a single video must be smaller than 100 MB.

  • When you use Base64 encoding, the encoded data size increases. Ensure that a single encoded image or video is smaller than 10 MB.

To compress a file, see How do I compress an image or video to meet the size requirements?
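
The following minimal sketch (a helper of our own, not part of the SDK) applies the size guidance above when choosing between the two upload methods; the file name is only a placeholder:

import os

def choose_upload_method(path: str) -> str:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb < 1:
        return "base64"      # files smaller than about 1 MB can be Base64-encoded
    return "file_path"       # otherwise prefer the more stable file-path upload

print(choose_upload_method("eagle.png"))  # hypothetical local file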

Images

Upload using a file path

Passing a file path is supported only when you call the model using the DashScope Python and Java SDKs. This method is not supported for DashScope HTTP requests or the OpenAI compatible mode.

Python

import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace xxx/eagle.png with the absolute path of your local image
local_path = "xxx/eagle.png"
image_path = f"file://{local_path}"
messages = [
                {'role':'user',
                'content': [{'image': image_path},
                            {'text': 'What is depicted in the image?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>(){{put("image", filePath);}},
                        new HashMap<String, Object>(){{put("text", "What is depicted in the image?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/eagle.png with the absolute path of your local image
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Upload using Base64 encoding

OpenAI compatible

Python

from openai import OpenAI
import os
import base64


# Encoding function: Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxx/eagle.png")
client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus", # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # When passing Base64 image data, note that the image format (image/{format}) must match the Content Type in the list of supported images. "f" is a string formatting method.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
                {"type": "text", "text": "What is depicted in the image?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';


const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If the environment variable is not configured, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
// Replace xxx/eagle.png with the absolute path of your local image
const base64Image = encodeImage("xxx/eagle.png")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
        messages: [
            {"role": "user",
            "content": [{"type": "image_url",
                            // Note: When passing Base64 data, the image format (image/{format}) must match the Content Type in the list of supported images.
                           // PNG image:  data:image/png;base64,${base64Image}
                          // JPEG image: data:image/jpeg;base64,${base64Image}
                         // WEBP image: data:image/webp;base64,${base64Image}
                        "image_url": {"url": `data:image/png;base64,${base64Image}`},},
                        {"type": "text", "text": "What is depicted in the image?"}]}]
    });
    console.log(completion.choices[0].message.content);
} 

main();

curl

  • For the method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64 string in the code, "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, be sure to pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3-vl-plus",
  "messages": [
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA"}},
      {"type": "text", "text": "What is depicted in the image?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Encoding function: Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Replace xxxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxxx/eagle.png")

messages = [
    {
        "role": "user",
        "content": [
            # Note: When passing Base64 data, the image format (image/{format}) must match the Content Type in the list of supported images. "f" is a string formatting method.
            # PNG image:  f"data:image/png;base64,{base64_image}"
            # JPEG image: f"data:image/jpeg;base64,{base64_image}"
            # WEBP image: f"data:image/webp;base64,{base64_image}"
            {"image": f"data:image/png;base64,{base64_image}"},
            {"text": "What is depicted in the image?"},
        ],
    },
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-vl-plus",  # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages,
)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Base64;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void callWithLocalFile(String localPath) throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image = encodeImageToBase64(localPath); // Base64 encoding

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        new HashMap<String, Object>() {{ put("image", "data:image/png;base64," + base64Image); }},
                        new HashMap<String, Object>() {{ put("text", "What is depicted in the image?"); }}
                )).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/eagle.png with the absolute path of your local image
            callWithLocalFile("xxx/eagle.png");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For the method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64 string in the code, "data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, be sure to pass the complete encoded string.

# ======= Important =======
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"image": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "What is depicted in the image?"}
                ]
            }
        ]
    }
}'

Video files

This section uses a locally saved test.mp4 file as an example.

Upload using a file path

Passing a file path is supported only when you call the model using the DashScope Python and Java SDKs. This method is not supported for DashScope HTTP requests or the OpenAI compatible mode.

Python

import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace xxx/test.mp4 with the absolute path of your local video
local_path = "xxx/test.mp4"
video_path = f"file://{local_path}"
messages = [
                {'role':'user',
                # The fps parameter controls the number of frames extracted from the video. It means one frame is extracted every 1/fps seconds.
                'content': [{'video': video_path,"fps":2},
                            {'text': 'What does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',  
    messages=messages)
print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", filePath);// The fps parameter controls the number of frames extracted from the video. It means one frame is extracted every 1/fps seconds.
                                           put("fps", 2);
                                       }}, 
                        new HashMap<String, Object>(){{put("text", "What does this video depict?");}})).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")  
                .messages(Arrays.asList(userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.mp4 with the absolute path of your local video
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Upload using Base64 encoding

OpenAI compatible

Python

from openai import OpenAI
import os
import base64


# Encoding function: Converts a local file to a Base64-encoded string.
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus",  
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # When passing a video file directly, set the value of type to video_url.
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
                },
                {"type": "text", "text": "What does this video depict?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If the environment variable is not configured, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
  };
// Replace xxx/test.mp4 with the absolute path of your local video
const base64Video = encodeVideo("xxx/test.mp4")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  
        messages: [
            {"role": "user",
             "content": [{
                 // When passing a video file directly, set the value of type to video_url.
                "type": "video_url", 
                "video_url": {"url": `data:video/mp4;base64,${base64Video}`}},
                 {"type": "text", "text": "What does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

curl

  • For the method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64 string in the code, "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, be sure to pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen3-vl-plus",
  "messages": [
  {
    "role": "user",
    "content": [
      {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
      {"type": "text", "text": "What is depicted in the image?"}
    ]
  }]
}'

DashScope

Python

import base64
import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Encoding function: Converts a local file to a Base64-encoded string.
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxxx/test.mp4")

messages = [{'role':'user',
                # The fps parameter controls the number of frames extracted from the video. It means one frame is extracted every 1/fps seconds.
             'content': [{'video': f"data:video/mp4;base64,{base64_video}","fps":2},
                            {'text': 'What does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    private static String encodeVideoToBase64(String videoPath) throws IOException {
        Path path = Paths.get(videoPath);
        byte[] videoBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(videoBytes);
    }

    public static void callWithLocalFile(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Video = encodeVideoToBase64(localPath); // Base64 encoding

        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(new HashMap<String, Object>()
                                       {{
                                           put("video", "data:video/mp4;base64," + base64Video);// The fps parameter controls the number of frames extracted from the video. It means one frame is extracted every 1/fps seconds.
                                           put("fps", 2);
                                       }},
                        new HashMap<String, Object>(){{put("text", "What does this video depict?");}})).build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.mp4 with the absolute path of your local video
            callWithLocalFile("xxx/test.mp4");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For the method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64 string in the code, "f"data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, be sure to pass the complete encoded string.

# ======= Important =======
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"video": "data:video/mp4;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "What is depicted in the image?"}
                ]
            }
        ]
    }
}'

Image lists

This section uses the locally saved files football1.jpg, football2.jpg, football3.jpg, and football4.jpg as examples.

Upload using a file path

Passing a file path is supported only when you call the model using the DashScope Python and Java SDKs. This method is not supported for DashScope HTTP requests or the OpenAI compatible mode.

Python

import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

local_path1 = "football1.jpg"
local_path2 = "football2.jpg"
local_path3 = "football3.jpg"
local_path4 = "football4.jpg"

image_path1 = f"file://{local_path1}"
image_path2 = f"file://{local_path2}"
image_path3 = f"file://{local_path3}"
image_path4 = f"file://{local_path4}"

messages = [{'role':'user',
                # If the model is from the Qwen2.5-VL series and you pass an image list, you can set the fps parameter. This indicates that the image list is extracted from the original video every 1/fps seconds. This setting has no effect on other models.
             'content': [{'video': [image_path1,image_path2,image_path3,image_path4],"fps":2},
                         {'text': 'What does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

// DashScope SDK version 2.18.3 or later is required.
import java.util.Arrays;
import java.util.Map;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    private static final String MODEL_NAME = "qwen3-vl-plus";  // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    public static void videoImageListSample(String localPath1, String localPath2, String localPath3, String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        String filePath1 = "file://" + localPath1;
        String filePath2 = "file://" + localPath2;
        String filePath3 = "file://" + localPath3;
        String filePath4 = "file://" + localPath4;
        Map<String, Object> params = Map.of(
                "video", Arrays.asList(filePath1,filePath2,filePath3,filePath4),
                // If the model is from the Qwen2.5-VL series and you pass an image list, you can set the fps parameter. This indicates that the image list is extracted from the original video every 1/fps seconds. This setting has no effect on other models.
                "fps",2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "Describe the process shown in this video")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Upload using Base64 encoding

OpenAI compatible

Python

import os
from openai import OpenAI
import base64

# Encoding function: Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen3-vl-plus",  # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=[  
    {"role": "user","content": [
        {"type": "video","video": [
            f"data:image/jpeg;base64,{base64_image1}",
            f"data:image/jpeg;base64,{base64_image2}",
            f"data:image/jpeg;base64,{base64_image3}",
            f"data:image/jpeg;base64,{base64_image4}",]},
        {"type": "text","text": "Describe the process shown in this video"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
        // If the environment variable is not configured, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
  
const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen3-vl-plus",  // This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
        messages: [
            {"role": "user",
             "content": [{"type": "video",
                            // Note: When passing Base64 data, the image format (image/{format}) must match the Content Type in the list of supported images.
                           // PNG image:  data:image/png;base64,${base64Image}
                          // JPEG image: data:image/jpeg;base64,${base64Image}
                         // WEBP image: data:image/webp;base64,${base64Image}
                        "video": [
                            `data:image/jpeg;base64,${base64Image1}`,
                            `data:image/jpeg;base64,${base64Image2}`,
                            `data:image/jpeg;base64,${base64Image3}`,
                            `data:image/jpeg;base64,${base64Image4}`]},
                        {"type": "text", "text": "What does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

curl

  • For the method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64 string in the code, "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, be sure to pass the complete encoded string.

# ======= Important =======
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen3-vl-plus",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": [
                          "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                          "data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                          "data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                          "data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
                          ]},
                {"type": "text",
                "text": "Describe the process shown in this video"}]}]
}'

DashScope

Python

import base64
import os
import dashscope

# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Encoding function: Converts a local file to a Base64-encoded string.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")


messages = [{'role':'user',
            'content': [
                    {'video':
                         [f"data:image/png;base64,{base64_image1}",
                          f"data:image/png;base64,{base64_image2}",
                          f"data:image/png;base64,{base64_image3}",
                          f"data:image/png;base64,{base64_image4}"
                         ]
                    },
                    {'text': 'Please describe the process shown in this video.'}]}]
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen3-vl-plus',  # This example uses qwen3-vl-plus. You can replace it with another model name as needed. For a list of models, see https://www.alibabacloud.com/help/en/model-studio/getting-started/models
    messages=messages)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.io.IOException;
import java.util.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.alibaba.dashscope.aigc.multimodalconversation.*;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void videoImageListSample(String localPath1,String localPath2,String localPath3,String localPath4)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image1 = encodeImageToBase64(localPath1); // Base64 encoding
        String base64Image2 = encodeImageToBase64(localPath2);
        String base64Image3 = encodeImageToBase64(localPath3);
        String base64Image4 = encodeImageToBase64(localPath4);

        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> params = Map.of(
                "video", Arrays.asList(
                        "data:image/jpeg;base64," + base64Image1,
                        "data:image/jpeg;base64," + base64Image2,
                        "data:image/jpeg;base64," + base64Image3,
                        "data:image/jpeg;base64," + base64Image4),
                // If the model is from the Qwen2.5-VL series and you pass an image list, you can set the fps parameter. This indicates that the image list is extracted from the original video every 1/fps seconds. This setting has no effect on other models.
                    "fps",2
        );
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(params,
                        Collections.singletonMap("text", "Describe the process shown in this video")))
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-vl-plus")
                .messages(Arrays.asList(userMessage))
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/football1.jpg and other paths with the absolute paths of your local images
            videoImageListSample(
                    "xxx/football1.jpg",
                    "xxx/football2.jpg",
                    "xxx/football3.jpg",
                    "xxx/football4.jpg"
            );
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For the method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64 string in the code, "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, be sure to pass the complete encoded string.

# ======= Important =======
# The following is the base_url for the Singapore region. If you use a model in the Beijing region, replace the base_url with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-vl-plus",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
                      "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...",
                      "data:image/jpeg;base64,nEpp6jpnP57MoWSyOWwrkXMJhHRCWYeFYb...",
                      "data:image/jpeg;base64,JHWQnJPc40GwQ7zERAtRMK6iIhnWw4080s...",
                      "data:image/jpeg;base64,adB6QOU5HP7dAYBBOg/Fb7KIptlbyEOu58..."
            ],
            "fps":2     
          },
          {
            "text": "Describe the process shown in this video"
          }
        ]
      }
    ]
  }
}'

More features

Limits

Input file limits

Image file limits

  • Supported image formats

    • BMP: .bmp (MIME type: image/bmp)

    • JPEG: .jpe, .jpeg, .jpg (MIME type: image/jpeg)

    • PNG: .png (MIME type: image/png)

    • TIFF: .tif, .tiff (MIME type: image/tiff)

    • WEBP: .webp (MIME type: image/webp)

    • HEIC: .heic (MIME type: image/heic)

  • Image size: The size of a single image cannot exceed 10 MB. If you pass a Base64-encoded image, ensure that the encoded string is less than 10 MB. For more information, see Upload local files. To learn how to compress the file size, see Image or video compression methods.

  • Dimensions and aspect ratio: The width and height of the image must each be greater than 10 pixels. The aspect ratio of the image (the ratio of the long edge to the short edge) cannot exceed 200.

  • Total pixels: The model automatically scales the image, so there is no strict limit on the total number of pixels.
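
If you want to validate files before sending a request, the following is a minimal sketch (not part of any SDK; the helper name and constants are illustrative) that checks a local image against the limits above: the 10 MB cap, the 10-pixel minimum edge length, and the 200:1 aspect ratio limit.

import base64
import os
from PIL import Image  # pip install Pillow

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10 MB cap for a single image
MAX_ASPECT_RATIO = 200              # long edge / short edge must not exceed 200
MIN_EDGE_PIXELS = 10                # width and height must each be greater than 10 pixels

def check_image_limits(image_path):
    """Raise ValueError if the image violates the documented input limits."""
    if os.path.getsize(image_path) > MAX_IMAGE_BYTES:
        raise ValueError("Image file exceeds 10 MB.")
    # The Base64-encoded string must also stay under 10 MB.
    with open(image_path, "rb") as f:
        if len(base64.b64encode(f.read())) > MAX_IMAGE_BYTES:
            raise ValueError("Base64-encoded image exceeds 10 MB.")
    with Image.open(image_path) as img:
        width, height = img.size
    if width <= MIN_EDGE_PIXELS or height <= MIN_EDGE_PIXELS:
        raise ValueError("Width and height must each be greater than 10 pixels.")
    if max(width, height) / min(width, height) > MAX_ASPECT_RATIO:
        raise ValueError("Aspect ratio must not exceed 200:1.")

check_image_limits("football1.jpg")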

Video file limits

  • Video size

    • Internet URL:

      • Qwen3-VL and qwen-vl-max (including qwen-vl-max-latest and all versions after qwen-vl-max-2025-04-08): Cannot exceed 2 GB.

      • Other Qwen2.5-VL series and QVQ series models: Cannot exceed 1 GB.

      • Other models: Cannot exceed 150 MB.

    • Base64 encoding: The Base64-encoded video must be smaller than 10 MB.

    • Local file path: The video file must be smaller than 100 MB.

  • Video duration

    • Qwen3-VL, qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-2025-08-13, and qwen-vl-max-2025-04-08: 2 seconds to 20 minutes.

    • Other Qwen2.5-VL series and QVQ models: 2 seconds to 10 minutes.

    • Other models: 2 seconds to 40 seconds.

  • Video formats: MP4, AVI, MKV, MOV, FLV, and WMV.

  • Video dimensions: There are no specific limits. Before processing, the model resizes the video to about 600,000 pixels. Larger video files do not result in better understanding.

  • Audio understanding: Audio understanding for video files is not supported.
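
Similarly, the following is a minimal sketch for pre-checking a local video against the limits above. It assumes Base64 upload and the Qwen3-VL duration range; the helper name and thresholds are illustrative, not part of any SDK.

import os
import cv2  # pip install opencv-python

MAX_BASE64_VIDEO_BYTES = 10 * 1024 * 1024  # Base64-encoded video must be smaller than 10 MB
MIN_DURATION_SECONDS = 2
MAX_DURATION_SECONDS = 20 * 60             # 20 minutes for Qwen3-VL and recent qwen-vl-max versions

def check_video_limits(video_path):
    # Base64 encoding inflates the size by roughly 4/3, so estimate the encoded size.
    if os.path.getsize(video_path) * 4 / 3 > MAX_BASE64_VIDEO_BYTES:
        raise ValueError("Base64-encoded video would exceed 10 MB; compress it or pass a URL instead.")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    duration = frame_count / fps if fps else 0
    if not (MIN_DURATION_SECONDS <= duration <= MAX_DURATION_SECONDS):
        raise ValueError(f"Video duration {duration:.1f}s is outside the supported range.")

check_video_limits("test.mp4")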

File input methods

  • Internet URL: Provide a publicly accessible URL for the file. The HTTP and HTTPS protocols are supported.

  • Base64 encoding: Convert the file to a Base64-encoded string.

  • Local file path: Provide the path to the local file directly.
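
To make the three methods above concrete, here is a minimal sketch of how each one appears in a DashScope message for an image. The URL and paths are placeholders; the file path form works only with the DashScope Python and Java SDKs and uses the file:// prefix with an absolute path.

# 1. Internet URL (HTTP or HTTPS)
content_url = {"image": "https://example.com/dog_and_girl.jpeg"}

# 2. Base64 data URI (the MIME type must match the actual image format)
content_base64 = {"image": "data:image/jpeg;base64,<BASE64_STRING>"}

# 3. Local file path with the file:// prefix (DashScope Python and Java SDKs only)
content_local = {"image": "file:///absolute/path/to/dog_and_girl.jpeg"}

messages = [{"role": "user",
             "content": [content_url,  # swap in content_base64 or content_local as needed
                         {"text": "What is depicted in the image?"}]}]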

Using in a production environment

  • Image and video pre-processing: Qwen-VL has a size limit for input files. To learn how to compress files, see Image or video compression methods.

  • Processing text files: Qwen-VL supports only image files and cannot process text files directly. Consider the following alternatives:

    • Convert the text file to an image format. You can use an image processing library, such as pdf2image for Python, to convert the file into multiple high-quality images page by page. Then, you can pass the images to the model using multi-image input.

    • Qwen-Long can process text files and parse their content.

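    For the PDF-to-image route described in the first alternative, a minimal sketch (assuming the source file is a PDF and that pdf2image and its poppler dependency are installed; file name and prompt are placeholders) could look like this. It renders each page as a JPEG, Base64-encodes it, and builds an OpenAI-compatible multi-image message.

    import base64
    import io
    from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

    # Render each PDF page as a PIL image.
    pages = convert_from_path("report.pdf", dpi=200)

    image_parts = []
    for page in pages:
        buffer = io.BytesIO()
        page.save(buffer, format="JPEG")
        b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
        image_parts.append({"type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

    # Pass the pages together with a text instruction, as in the multi-image examples above.
    messages = [{"role": "user",
                 "content": image_parts + [{"type": "text", "text": "Summarize this document."}]}]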

  • Fault tolerance and stability

    • Timeout handling: In non-streaming calls, a timeout error occurs if the model fails to complete its output within 180 seconds. When a timeout occurs, the partially generated content is returned in the response body. A response header that contains x-dashscope-partialresponse: true indicates a timeout. You can use the partial mode feature, which is supported by some Qwen-VL models, to add the generated content to the messages array and resend the request. This allows the model to continue generating content. For more information, see Continue generation from incomplete output.

    • Retry mechanism: You can design a proper retry logic for API calls, such as exponential backoff. This logic can handle network fluctuations or temporary service unavailability.
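
    A minimal sketch of such retry logic, wrapping an OpenAI-compatible call with exponential backoff (the retry count and delays are illustrative; in real code, retry only errors that are actually transient, such as timeouts and 429/5xx responses):

    import os
    import time
    from openai import OpenAI

    client = OpenAI(api_key=os.getenv("DASHSCOPE_API_KEY"),
                    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1")

    def call_with_retries(messages, max_retries=3, base_delay=1.0):
        """Retry transient failures with exponentially increasing delays."""
        for attempt in range(max_retries + 1):
            try:
                return client.chat.completions.create(model="qwen3-vl-plus", messages=messages)
            except Exception as exc:  # narrow this to retryable error types in practice
                if attempt == max_retries:
                    raise
                delay = base_delay * (2 ** attempt)
                print(f"Request failed ({exc}); retrying in {delay:.0f}s...")
                time.sleep(delay)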

Billing and throttling

  • Throttling: For information about the throttling conditions for the Qwen-VL model, see Throttling.

  • Free quota (Singapore region only): The Qwen-VL model offers a free quota of 1 million tokens. The 90-day validity period starts when you activate Alibaba Cloud Model Studio or when your model request is approved.

  • Billing

    • Total cost = (Number of input tokens × Price per input token) + (Number of output tokens × Price per output token). For input and output prices, see Model List.

    • In thinking mode, the reasoning process (reasoning_content) is part of the output. It is counted as output tokens and billed accordingly. If the reasoning process is not included in the output in thinking mode, billing is based on the non-thinking mode price.
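
    A short worked example of this formula (the per-token prices below are placeholders for illustration only; use the actual rates from the Model List):

    # Placeholder prices: replace with the real rates from the Model List.
    input_price_per_1k_tokens = 0.0008   # hypothetical price per 1,000 input tokens
    output_price_per_1k_tokens = 0.0020  # hypothetical price per 1,000 output tokens

    input_tokens = 3000   # e.g., prompt text plus the tokens of one image
    output_tokens = 500

    total_cost = (input_tokens / 1000) * input_price_per_1k_tokens \
               + (output_tokens / 1000) * output_price_per_1k_tokens
    print(f"Estimated cost: {total_cost:.6f}")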

    Rules for converting images and videos to tokens

    Images

    • For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and later versions: Each 32 × 32 pixel block corresponds to one token. An image requires a minimum of 4 tokens.

    • For other models: Each 28 × 28 pixel block corresponds to one token. An image requires a minimum of 4 tokens.

    Use the following code to estimate the number of tokens for an image:

    import math
    # To install the Pillow library, run the following command: pip install Pillow
    from PIL import Image
    
    def token_calculate(image_path):
        # Open the specified PNG image file
        image = Image.open(image_path)
    
        # Get the original dimensions of the image
        height = image.height
        width = image.width
        
        # For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and later versions: Adjust both height and width to be multiples of 32
        # For other models: Adjust both height and width to be multiples of 28
        h_bar = round(height / 32) * 32 
        w_bar = round(width / 32) * 32
        
        # Lower limit for image tokens: 4 tokens
        min_pixels = 32 * 32 * 4
        # Upper limit for image tokens: 1280 tokens
        max_pixels = 1280 * 32 * 32
            
        # Scale the image to adjust the total number of pixels to be within the range of [min_pixels, max_pixels]
        if h_bar * w_bar > max_pixels:
            # Calculate the zoom factor beta so that the total number of pixels in the scaled image does not exceed max_pixels
            beta = math.sqrt((height * width) / max_pixels)
            # Recalculate the adjusted height and width. For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and later versions, ensure they are multiples of 32. For other models, ensure they are multiples of 28.
            h_bar = math.floor(height / beta / 32) * 32
            w_bar = math.floor(width / beta / 32) * 32
        elif h_bar * w_bar < min_pixels:
            # Calculate the zoom factor beta so that the total number of pixels in the scaled image is not less than min_pixels
            beta = math.sqrt(min_pixels / (height * width))
            # Recalculate the adjusted height and width. For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and later versions, ensure they are multiples of 32. For other models, ensure they are multiples of 28.
            h_bar = math.ceil(height * beta / 32) * 32
            w_bar = math.ceil(width * beta / 32) * 32
        return h_bar, w_bar
    
    # Replace test.png with the path to your local image
    h_bar, w_bar = token_calculate("path/to/your/test.png")
    print(f"Scaled image dimensions: height={h_bar}, width={w_bar}")
    
    # Calculate the number of image tokens. For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and later versions, the number of tokens = Total pixels / (32 * 32). For other models, the number of tokens = Total pixels / (28 * 28).
    token = int((h_bar * w_bar) / (32 * 32))
    
    # The system automatically adds the <vision_bos> and <vision_eos> visual marks, each counting as 1 token.
    print(f"Number of image tokens: {token + 2}")

    Videos

    Use the following code to estimate the number of tokens for a video:

    # Before use, install the library: pip install opencv-python
    import math
    import os
    import logging
    import cv2
    
    logger = logging.getLogger(__name__)
    
    FRAME_FACTOR = 2
    # For Qwen3-VL, qwen-vl-max-0813, qwen-vl-plus-0815, and later versions, set IMAGE_FACTOR to 32
    # For other models, set IMAGE_FACTOR to 28
    IMAGE_FACTOR = 28
    # Aspect ratio of video frames
    MAX_RATIO = 200
    
    # Lower limit for video frame tokens
    VIDEO_MIN_PIXELS = 128 * 32 * 32
    # Upper limit for video frame tokens
    VIDEO_MAX_PIXELS = 768 * 32 * 32
    
    # If the user does not pass the FPS parameter, the default value is used
    FPS = 2.0
    # Minimum number of extracted frames
    FPS_MIN_FRAMES = 4
    # Maximum number of extracted frames. When using Qwen3-VL and Qwen2.5-VL models, set FPS_MAX_FRAMES to 512. For other models, set it to 80.
    FPS_MAX_FRAMES = 512
    
    # Maximum pixel value for video input.
    # When using Qwen3-VL models, set VIDEO_TOTAL_PIXELS to 65536 * 32 * 32; when using Qwen2.5-VL models, set it to 65536 * 28 * 28. For other models, set it to 24576 * 28 * 28.
    VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 65536 * 28 * 28)))
    
    def round_by_factor(number: int, factor: int) -> int:
        """Returns the integer closest to 'number' that is divisible by 'factor'."""
        return round(number / factor) * factor
    
    def ceil_by_factor(number: int, factor: int) -> int:
        """Returns the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
        return math.ceil(number / factor) * factor
    
    def floor_by_factor(number: int, factor: int) -> int:
        """Returns the largest integer less than or equal to 'number' that is divisible by 'factor'."""
        return math.floor(number / factor) * factor
    
    def smart_nframes(ele,total_frames,video_fps):
        """Calculates the number of video frames to extract.
    
        Args:
            ele (dict): A dictionary containing the video configuration.
                - fps: Controls the number of frames extracted for model input.
            total_frames (int): The original total number of frames in the video.
            video_fps (int | float): The original frame rate of the video.
    
        Raises:
            Raises an error if nframes is not within the interval [FRAME_FACTOR, total_frames].
    
        Returns:
            The number of video frames for model input.
        """
        assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
        fps = ele.get("fps", FPS)
        min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
        max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
        duration = total_frames / video_fps if video_fps != 0 else 0
        if duration-int(duration)>(1/fps):
            total_frames = math.ceil(duration * video_fps)
        else:
            total_frames = math.ceil(int(duration)*video_fps)
        nframes = total_frames / video_fps * fps
        if nframes > total_frames:
            logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
        nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
        if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
            raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
    
        return nframes
    
    def get_video(video_path):
        # Get video information
        cap = cv2.VideoCapture(video_path)
    
        frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        # Get video height
        frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
        video_fps = cap.get(cv2.CAP_PROP_FPS)
        return frame_height,frame_width,total_frames,video_fps
    
    def smart_resize(ele,path,factor = IMAGE_FACTOR):
        # Get the original width and height of the video
        height, width, total_frames, video_fps = get_video(path)
        # Lower limit for video frame tokens
        min_pixels = VIDEO_MIN_PIXELS
        total_pixels = VIDEO_TOTAL_PIXELS
        # Number of extracted frames
        nframes = smart_nframes(ele, total_frames, video_fps)
        max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),int(min_pixels * 1.05))
    
        # The aspect ratio of the video should not exceed 200:1 or 1:200
        if max(height, width) / min(height, width) > MAX_RATIO:
            raise ValueError(
                f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
            )
    
        h_bar = max(factor, round_by_factor(height, factor))
        w_bar = max(factor, round_by_factor(width, factor))
        if h_bar * w_bar > max_pixels:
            beta = math.sqrt((height * width) / max_pixels)
            h_bar = floor_by_factor(height / beta, factor)
            w_bar = floor_by_factor(width / beta, factor)
        elif h_bar * w_bar < min_pixels:
            beta = math.sqrt(min_pixels / (height * width))
            h_bar = ceil_by_factor(height * beta, factor)
            w_bar = ceil_by_factor(width * beta, factor)
        return h_bar, w_bar
    
    
    def token_calculate(video_path, fps):
        # Pass the video path and the fps frame extraction parameter
        messages = [{"content": [{"video": video_path, "fps":fps}]}]
        vision_infos = extract_vision_info(messages)[0]
    
        resized_height, resized_width=smart_resize(vision_infos,video_path)
    
        height, width, total_frames,video_fps = get_video(video_path)
        num_frames = smart_nframes(vision_infos,total_frames,video_fps)
        print(f"Original video dimensions: {height}×{width}, Model input dimensions: {resized_height}×{resized_width}, Total video frames: {total_frames}, Total frames extracted when fps is {fps}: {num_frames}",end=", ")
        # Tokens per frame are based on IMAGE_FACTOR-sized patches (32 for Qwen3-VL and later versions, 28 for other models).
        video_token = int(math.ceil(num_frames / 2) * resized_height / IMAGE_FACTOR * resized_width / IMAGE_FACTOR)
        video_token += 2 # The system automatically adds the <|vision_bos|> and <|vision_eos|> visual marks, each counting as 1 token.
        return video_token
    
    def extract_vision_info(conversations):
        vision_infos = []
        if isinstance(conversations[0], dict):
            conversations = [conversations]
        for conversation in conversations:
            for message in conversation:
                if isinstance(message["content"], list):
                    for ele in message["content"]:
                        if (
                            "image" in ele
                            or "image_url" in ele
                            or "video" in ele
                            or ele.get("type","") in ("image", "image_url", "video")
                        ):
                            vision_infos.append(ele)
        return vision_infos
    
    
    video_token = token_calculate("path/to/your/test.mp4", 1)
    print("Video tokens:", video_token)
  • View bills: You can view bills or top up your account on the Expenses and Costs page in the Alibaba Cloud Management Console.

API reference

For more information about the input and output parameters of the Qwen-VL model, see Qwen.

FAQ

How do I compress an image or video to the required size?

Qwen-VL has a size limit for input files. You can use the following methods to compress your files.

Image compression methods

  • Online tools: Use online tools such as CompressJPEG or TinyPNG to compress images.

  • Local software: Use software such as Photoshop to adjust the quality during export.

  • Code implementation:

    # pip install pillow
    
    from PIL import Image
    def compress_image(input_path, output_path, quality=85):
        with Image.open(input_path) as img:
            img.save(output_path, "JPEG", optimize=True, quality=quality)
    
    # Pass in a local image.
    compress_image("/xxx/before-large.jpeg","/xxx/after-min.jpeg")

Video compression methods

  • Online tools: Use online tools such as FreeConvert to compress videos.

  • Local software: Use software such as HandBrake.

  • Code implementation: Use the FFmpeg tool. For more information, see the official FFmpeg website.

    # Basic conversion command.
    # -i: Specifies the input file path. Example: input.mp4.
    # -vcodec: Specifies the video encoder. Common values include libx264 (recommended for general use) and libx265 (higher compression ratio).
    # -crf: Controls the video quality. Typical values range from 18 to 28. A smaller value indicates higher quality and a larger file size.
    # -preset: Controls the balance between encoding speed and compression efficiency. Common values include slow, fast, and faster.
    # -y: Overwrites the existing file. This parameter does not require a value.
    # output.mp4: Specifies the output file path.
    
    ffmpeg -i input.mp4 -vcodec libx264 -crf 28 -preset slow output.mp4

After the model outputs the object localization results, how do I draw the detection boxes on the original image?

After the Qwen-VL model outputs the object localization results, you can refer to the following code to draw the detection boxes and their labels on the original image.

  • Qwen2.5-VL: The returned coordinates are absolute values in pixels, relative to the top-left corner of the scaled image. To draw the detection boxes, see the code in qwen2_5_vl_2d.py.

  • Qwen3-VL: The model returns relative coordinates. The coordinate values are normalized to a range of [0, 999]. To draw the detection boxes, see the code in qwen3_vl_2d.py (for 2D localization) or qwen3_vl_3d.zip (for 3D localization).
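
For a quick idea of the coordinate conversion for Qwen3-VL, the following sketch scales each [0, 999]-normalized box back to the original image size and draws it with Pillow. The detection format assumed here ({"bbox_2d": [x1, y1, x2, y2], "label": ...}) depends on how you prompt the model; use the scripts linked above for complete reference code.

import json
from PIL import Image, ImageDraw  # pip install Pillow

def draw_qwen3_vl_boxes(image_path, detections_json, output_path="annotated.jpg"):
    """Scale [0, 999]-normalized boxes to pixel coordinates and draw them on the image."""
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    draw = ImageDraw.Draw(image)

    # Assumed model output: [{"bbox_2d": [x1, y1, x2, y2], "label": "dog"}, ...]
    for det in json.loads(detections_json):
        x1, y1, x2, y2 = det["bbox_2d"]
        # Map the normalized range [0, 999] back onto the image dimensions.
        box = [x1 / 999 * width, y1 / 999 * height, x2 / 999 * width, y2 / 999 * height]
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0], max(box[1] - 12, 0)), det.get("label", ""), fill="red")

    image.save(output_path)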

Error codes

If a call fails, see Error messages for troubleshooting.