
Alibaba Cloud Model Studio: Qwen VL

Last Updated: Jun 05, 2025

The Qwen-VL models answer questions based on the images and videos you submit.

Visit the Playground to try image understanding.

Scenarios

  • Image question answering: Describe the content in images or classify and label them, such as identifying people, places, animals, and more.

  • Mathematical problem solving: Solve mathematical problems in images, suitable for education and training.

  • Video understanding: Analyze video content, such as locating specific events and obtaining timestamps, or generating summaries of key time segments.

  • Object localization: Locate objects in images, returning either the coordinates of the top-left and bottom-right corners of the bounding rectangle or the coordinates of the center point (see the example after this list).

  • Document parsing: Parse image-based documents (such as scanned documents or image-only PDFs) into the QwenVL HTML format, which accurately recognizes the text and also captures the position of elements such as images and tables.

  • OCR and information extraction: Recognize text and formulas in images, or extract information from receipts, certificates, and forms with formatted text output. Supported languages include Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.
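For example, an object localization request can be sent through the OpenAI-compatible interface shown later in this topic. The following is a minimal sketch: the prompt wording and the expected output format are illustrative rather than a fixed contract, and the image URL is the sample image reused from the examples below.

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-max",  # Object localization requires a Qwen2.5-VL based model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                },
                # Illustrative prompt: ask for the top-left and bottom-right pixel coordinates of the bounding box
                {"type": "text", "text": "Locate the dog in this image and return its bounding box as JSON with top-left and bottom-right pixel coordinates."},
            ],
        }
    ],
)
print(completion.choices[0].message.content)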

Models and pricing

Commercial models

Compared to the open-source models, the commercial models offer the latest capabilities and improvements.

All of the following commercial models share a context window of 131,072 tokens, a maximum input of 129,024 tokens (up to 16,384 tokens per image), a maximum output of 8,192 tokens, and a free quota of 1 million tokens each, valid for 180 days after activation. Prices are in USD per million tokens.

  • qwen-vl-max (Stable, currently qwen-vl-max-2025-04-08): Enhanced visual reasoning and instruction-following capabilities compared with qwen-vl-plus. Best for complex tasks. Input price: $0.8. Output price: $3.2.

  • qwen-vl-max-latest (Latest): Always the latest qwen-vl-max snapshot. Same prices as qwen-vl-max.

  • qwen-vl-max-2025-04-08 (Snapshot, also qwen-vl-max-0408): Qwen2.5-VL series, with a 128,000-token context window and enhanced mathematics and reasoning capabilities. Same prices as qwen-vl-max.

  • qwen-vl-plus (Stable, currently qwen-vl-plus-2025-01-25): Enhanced detail and text recognition capabilities, supporting images with over one million pixels and any aspect ratio. Exceptional performance for various visual tasks. Input price: $0.21. Output price: $0.63.

  • qwen-vl-plus-latest (Latest): Always the latest qwen-vl-plus snapshot. Same prices as qwen-vl-plus.

  • qwen-vl-plus-2025-05-07 (Snapshot, also qwen-vl-plus-0507): Significantly enhanced mathematics, inference, and understanding of surveillance video content. Same prices as qwen-vl-plus.

  • qwen-vl-plus-2025-01-25 (Snapshot, also qwen-vl-plus-0125): Qwen2.5-VL series, with a 128,000-token context window and enhanced mathematics and reasoning capabilities. Same prices as qwen-vl-plus.

Open-source models

All of the following open-source models share a context window of 131,072 tokens, a maximum input of 129,024 tokens (up to 16,384 tokens per image), a maximum output of 8,192 tokens, and a free quota of 100 million tokens each, valid for 180 days after activation. Prices are in USD per million tokens.

  • qwen2.5-vl-72b-instruct: Input price $2.8. Output price $8.4.

  • qwen2.5-vl-32b-instruct: Input price $1.4. Output price $4.2.

  • qwen2.5-vl-7b-instruct: Input price $0.35. Output price $1.05.

  • qwen2.5-vl-3b-instruct: Input price $0.21. Output price $0.63.

Calculate image and video tokens

Image

Each 28×28 pixel block corresponds to one token, and an image requires at least 4 tokens. You can estimate the token count for an image with the following code:

Python

import math
# Install Pillow library using: pip install Pillow
from PIL import Image

def token_calculate(image_path):
    # Open the specified PNG image file
    image = Image.open(image_path)

    # Get the original dimensions of the image
    height = image.height
    width = image.width
    
    # Adjust height to a multiple of 28
    h_bar = round(height / 28) * 28
    # Adjust width to a multiple of 28
    w_bar = round(width / 28) * 28
    
    # Image token lower limit: 4 tokens
    min_pixels = 28 * 28 * 4
    # Image token upper limit: 1280 tokens
    max_pixels = 1280 * 28 * 28
        
    # Process the image by scaling, adjusting total pixels within the range [min_pixels, max_pixels]
    if h_bar * w_bar > max_pixels:
        # Calculate scaling factor beta so that the scaled image's total pixels don't exceed max_pixels
        beta = math.sqrt((height * width) / max_pixels)
        # Recalculate adjusted height, ensuring it's a multiple of 28
        h_bar = math.floor(height / beta / 28) * 28
        # Recalculate adjusted width, ensuring it's a multiple of 28
        w_bar = math.floor(width / beta / 28) * 28
    elif h_bar * w_bar < min_pixels:
        # Calculate scaling factor beta so that the scaled image's total pixels aren't below min_pixels
        beta = math.sqrt(min_pixels / (height * width))
        # Recalculate adjusted height, ensuring it's a multiple of 28
        h_bar = math.ceil(height * beta / 28) * 28
        # Recalculate adjusted width, ensuring it's a multiple of 28
        w_bar = math.ceil(width * beta / 28) * 28
    return h_bar, w_bar

# Replace test.png with your local image path
h_bar, w_bar = token_calculate("test.png")
print(f"Scaled image dimensions: height {h_bar}, width {w_bar}")

# Calculate image tokens: total pixels divided by 28 * 28
token = int((h_bar * w_bar) / (28 * 28))

# The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
print(f"Image token count is {token + 2}")

Node.js

// Install sharp using: npm install sharp
import sharp from 'sharp';
import fs from 'fs';

async function tokenCalculate(imagePath) {
    // Open the specified PNG image file
    const image = sharp(imagePath);
    const metadata = await image.metadata();

    // Get the original dimensions of the image
    const height = metadata.height;
    const width = metadata.width;

    // Adjust height to a multiple of 28
    let hBar = Math.round(height / 28) * 28;
    // Adjust width to a multiple of 28
    let wBar = Math.round(width / 28) * 28;

    // Image token lower limit: 4 tokens
    const minPixels = 28 * 28 * 4;
    // Image token upper limit: 1280 tokens
    const maxPixels = 1280 * 28 * 28;

    // Process the image by scaling, adjusting total pixels within the range [min_pixels, max_pixels]
    if (hBar * wBar > maxPixels) {
        // Calculate scaling factor beta so that the scaled image's total pixels don't exceed max_pixels
        const beta = Math.sqrt((height * width) / maxPixels);
        // Recalculate adjusted height, ensuring it's a multiple of 28
        hBar = Math.floor(height / beta / 28) * 28;
        // Recalculate adjusted width, ensuring it's a multiple of 28
        wBar = Math.floor(width / beta / 28) * 28;
    } else if (hBar * wBar < minPixels) {
        // Calculate scaling factor beta so that the scaled image's total pixels aren't below min_pixels
        const beta = Math.sqrt(minPixels / (height * width));
        // Recalculate adjusted height, ensuring it's a multiple of 28
        hBar = Math.ceil(height * beta / 28) * 28;
        // Recalculate adjusted width, ensuring it's a multiple of 28
        wBar = Math.ceil(width * beta / 28) * 28;
    }

    return { hBar, wBar };
}

// Replace test.png with your local image path
const imagePath = 'test.png';
tokenCalculate(imagePath).then(({ hBar, wBar }) => {
    console.log(`Scaled image dimensions: height ${hBar}, width ${wBar}`);

    // Calculate image tokens: total pixels divided by 28 * 28
    const token = Math.floor((hBar * wBar) / (28 * 28));

    // The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
    console.log(`Total image tokens: ${token + 2}`);
}).catch(err => {
    console.error('Error processing image:', err);
});

Java

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class Main {

    // Custom class to store adjusted dimensions
    public static class ResizedSize {
        public final int height;
        public final int width;

        public ResizedSize(int height, int width) {
            this.height = height;
            this.width = width;
        }
    }

    public static ResizedSize smartResize(String imagePath) throws IOException {
        // 1. Load the image
        BufferedImage image = ImageIO.read(new File(imagePath));
        if (image == null) {
            throw new IOException("Cannot load image file: " + imagePath);
        }

        int originalHeight = image.getHeight();
        int originalWidth = image.getWidth();

        final int minPixels = 28 * 28 * 4;
        final int maxPixels = 1280 * 28 * 28;
        // 2. Initial adjustment to multiples of 28
        int hBar = (int) (Math.round(originalHeight / 28.0) * 28);
        int wBar = (int) (Math.round(originalWidth / 28.0) * 28);
        int currentPixels = hBar * wBar;

        // 3. Adjust dimensions based on conditions
        if (currentPixels > maxPixels) {
            // Current pixels exceed maximum, need to reduce
            double beta = Math.sqrt(
                    (originalHeight * (double) originalWidth) / maxPixels
            );
            double scaledHeight = originalHeight / beta;
            double scaledWidth = originalWidth / beta;

            hBar = (int) (Math.floor(scaledHeight / 28) * 28);
            wBar = (int) (Math.floor(scaledWidth / 28) * 28);
        } else if (currentPixels < minPixels) {
            // Current pixels below minimum, need to enlarge
            double beta = Math.sqrt(
                    (double) minPixels / (originalHeight * originalWidth)
            );
            double scaledHeight = originalHeight * beta;
            double scaledWidth = originalWidth * beta;

            hBar = (int) (Math.ceil(scaledHeight / 28) * 28);
            wBar = (int) (Math.ceil(scaledWidth / 28) * 28);
        }

        return new ResizedSize(hBar, wBar);
    }

    public static void main(String[] args) {
        try {
            ResizedSize size = smartResize(
                    // Replace xxx/test.png with your image path
                    "xxx/test.png"
            );

            System.out.printf("Scaled image dimensions: height %d, width %d%n", size.height, size.width);

            // Calculate tokens (total pixels / 28×28 + 2)
            int token = (size.height * size.width) / (28 * 28) + 2;
            System.out.printf("Total image tokens: %d%n", token);

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
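
For reference, working through the Python version above with an illustrative 1280×960 image: the dimensions are first rounded to 1288×952, which exceeds the 1,003,520-pixel (1,280-token) cap, so the image is scaled down to 1148×840. That gives 1148 × 840 / (28 × 28) = 1,230 tokens, or 1,232 tokens including the two visual markers.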

Video

You can estimate video tokens using the following code:

Python

# Install first: pip install opencv-python
import math
import os
import logging
import cv2

logger = logging.getLogger(__name__)

FRAME_FACTOR = 2
IMAGE_FACTOR = 28
# Video frame aspect ratio
MAX_RATIO = 200

# Video frame token lower limit
VIDEO_MIN_PIXELS = 128 * 28 * 28
# Video frame token upper limit
VIDEO_MAX_PIXELS = 768 * 28 * 28

# Default FPS value when user doesn't provide FPS parameter
FPS = 2.0
# Minimum number of frames to extract
FPS_MIN_FRAMES = 4
# Maximum number of frames to extract, set to 512 when using qwen2.5-vl model, 80 for other models
FPS_MAX_FRAMES = 512

# Maximum pixel value for video input
# Set VIDEO_TOTAL_PIXELS to 65536 * 28 * 28 when using qwen2.5-vl model, 24576 * 28 * 28 for other models
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 65536 * 28 * 28)))

def round_by_factor(number: int, factor: int) -> int:
    """Return the integer closest to "number" that is divisible by "factor"."""
    return round(number / factor) * factor

def ceil_by_factor(number: int, factor: int) -> int:
    """Return the smallest integer greater than or equal to "number" that is divisible by "factor"."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number: int, factor: int) -> int:
    """Return the largest integer less than or equal to "number" that is divisible by "factor"."""
    return math.floor(number / factor) * factor

def smart_nframes(ele,total_frames,video_fps):
    """Calculate the number of video frames to extract.

    Args:
        ele (dict): Dictionary containing video configuration
            - fps: fps used to control the number of frames extracted for model input.
        total_frames (int): Original total frames of the video.
        video_fps (int | float): Original frame rate of the video

    Raises:
        Error if nframes is not in the interval [FRAME_FACTOR, total_frames]

    Returns:
        Number of video frames for model input.
    """
    assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
    fps = ele.get("fps", FPS)
    min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
    max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration-int(duration)>(1/fps):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration)*video_fps)
    nframes = total_frames / video_fps * fps
    if nframes > total_frames:
        logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")

    return nframes

def get_video(video_path):
    # Get video information
    cap = cv2.VideoCapture(video_path)
    # Get video width
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    # Get video height
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Get total number of frames
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Get original frame rate
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return frame_height, frame_width, total_frames, video_fps

def smart_resize(ele,path,factor = IMAGE_FACTOR):
    # Get original video width and height
    height, width, total_frames, video_fps = get_video(path)
    # Video frame token lower limit
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    # Number of video frames to extract
    nframes = smart_nframes(ele, total_frames, video_fps)
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),int(min_pixels * 1.05))

    # Video aspect ratio should not exceed 200:1 or 1:200
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(
            f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
        )

    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


def token_calculate(video_path, fps):
    # Pass video path and fps frame extraction parameter
    messages = [{"content": [{"video": video_path, "fps":fps}]}]
    vision_infos = extract_vision_info(messages)[0]

    resized_height, resized_width=smart_resize(vision_infos,video_path)

    height, width, total_frames,video_fps = get_video(video_path)
    num_frames = smart_nframes(vision_infos,total_frames,video_fps)
    print(f"Original video size: {height}*{width}, size input to model: {resized_height}*{resized_width}, total video frames: {total_frames}, with fps={fps}, total extracted frames: {num_frames}", end=", ")
    video_token = int(math.ceil(num_frames / 2) * resized_height / 28 * resized_width / 28)
    video_token += 2 # System automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
    return video_token

def extract_vision_info(conversations):
    vision_infos = []
    if isinstance(conversations[0], dict):
        conversations = [conversations]
    for conversation in conversations:
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if (
                        "image" in ele
                        or "image_url" in ele
                        or "video" in ele
                        or ele.get("type","") in ("image", "image_url", "video")
                    ):
                        vision_infos.append(ele)
    return vision_infos


video_token = token_calculate("xxx/test.mp4", 1)
print("Video tokens:", video_token)

Node.js

// Install first: npm install node-ffprobe @ffprobe-installer/ffprobe
import ffprobeInstaller from '@ffprobe-installer/ffprobe';
import probe from 'node-ffprobe';
// Set ffprobe path (global configuration)
probe.FFPROBE_PATH = ffprobeInstaller.path;


// Get video information
async function getVideoInfo(videoPath) {
  try {
    const probeData = await probe(videoPath);
    const videoStream = probeData.streams.find(
      stream => stream.codec_type === 'video'
    );

    if (!videoStream) {
      throw new Error('No video stream found in video');
    }

    const width = videoStream.width;
    const height = videoStream.height;
    const totalFrames = parseInt(videoStream.nb_frames, 10);
    const [numerator, denominator] = videoStream.avg_frame_rate.split('/');
    const frameRate = parseFloat(numerator) / parseFloat(denominator);

    return {
      width,
      height,
      totalFrames,
      frameRate
    };
  } catch (error) {
    console.error('Failed to get video information:', error);
    throw error;
  }
}

// Configuration parameters
const FRAME_FACTOR = 2; 
const IMAGE_FACTOR = 28;
const MAX_RATIO = 200;
// Video frame token lower limit
const VIDEO_MIN_PIXELS = 128 * 28 * 28; 
// Video frame token upper limit
const VIDEO_MAX_PIXELS = 768 * 28 * 28; 
const FPS = 2.0; // Default FPS value when user doesn't provide FPS parameter
// Minimum number of frames to extract
const FPS_MIN_FRAMES = 4;
// Maximum number of frames to extract, set to 512 when using qwen2.5-vl model, 80 for other models
const FPS_MAX_FRAMES = 512; 
// Maximum pixel value for video input
// Set VIDEO_TOTAL_PIXELS to 65536 * 28 * 28 when using qwen2.5-vl model, 24576 * 28 * 28 for other models
const VIDEO_TOTAL_PIXELS = parseInt(process.env.VIDEO_MAX_PIXELS) || 65536 * 28 * 28;

// Math utility functions
function roundByFactor(number, factor) {
    return Math.round(number / factor) * factor;
}

function ceilByFactor(number, factor) {
    return Math.ceil(number / factor) * factor;
}

function floorByFactor(number, factor) {
    return Math.floor(number / factor) * factor;
}

// Calculate number of frames to extract
function smartNFrames(ele, totalFrames, frameRate) {
    const fps = ele.fps || FPS;
    const minFrames = ceilByFactor(ele.min_frames || FPS_MIN_FRAMES, FRAME_FACTOR);
    const maxFrames = floorByFactor(
        ele.max_frames || Math.min(FPS_MAX_FRAMES, totalFrames),
        FRAME_FACTOR
    );
    const duration = frameRate !== 0 ? parseFloat(totalFrames / frameRate) : 0;

    let totalFramesAdjusted = duration % 1 > (1 / fps)
        ? Math.ceil(duration * frameRate)
        : Math.ceil(Math.floor(parseInt(duration)) * frameRate);

    const nframes = (totalFramesAdjusted / frameRate) * fps;
    const finalNFrames = parseInt(Math.min(
        Math.max(nframes, minFrames),
        Math.min(maxFrames, totalFramesAdjusted)
    ));

    if (finalNFrames < FRAME_FACTOR || finalNFrames > totalFramesAdjusted) {
        throw new Error(
            `nframes should be between ${FRAME_FACTOR} and ${totalFramesAdjusted}, got ${finalNFrames}`
        );
    }
    return finalNFrames;
}

// Smart resize function
async function smartResize(ele, videoPath) {
    const {height, width, totalFrames, frameRate} = await getVideoInfo(videoPath);
    const minPixels = VIDEO_MIN_PIXELS;
    const nframes = smartNFrames(ele, totalFrames, frameRate)
    const maxPixels = Math.max(
        Math.min(VIDEO_MAX_PIXELS, VIDEO_TOTAL_PIXELS / nframes * FRAME_FACTOR),
        Math.floor(minPixels * 1.05)
    );

    // Check aspect ratio
    const ratio = Math.max(height, width) / Math.min(height, width);
    if (ratio > MAX_RATIO) {
        throw new Error(`Aspect ratio ${ratio} exceeds ${MAX_RATIO}`);
    }

    let hBar = Math.max(IMAGE_FACTOR, roundByFactor(height, IMAGE_FACTOR));
    let wBar = Math.max(IMAGE_FACTOR, roundByFactor(width, IMAGE_FACTOR));

    if (hBar * wBar > maxPixels) {
        const beta = Math.sqrt((height * width) / maxPixels);
        hBar = floorByFactor(height / beta, IMAGE_FACTOR);
        wBar = floorByFactor(width / beta, IMAGE_FACTOR);
    } else if (hBar * wBar < minPixels) {
        const beta = Math.sqrt(minPixels / (height * width));
        hBar = ceilByFactor(height * beta, IMAGE_FACTOR);
        wBar = ceilByFactor(width * beta, IMAGE_FACTOR);
    }

    return { hBar, wBar };
}

// Calculate token count
async function tokenCalculate(videoPath, fps) {
    const messages = [{ content: [{ video: videoPath, fps }] }];
    const visionInfos = extractVisionInfo(messages);

    const { hBar, wBar } = await smartResize(visionInfos[0], videoPath);
    const { height, width, totalFrames, frameRate  }= await getVideoInfo(videoPath);
    const numFrames = smartNFrames(visionInfos[0], totalFrames, frameRate);

    console.log(
        `Original video size: ${height}*${width}, size input to model: ${hBar}*${wBar}, total video frames: ${totalFrames}, with fps=${fps}, total extracted frames: ${numFrames}`
    );

    const videoToken = Math.ceil(numFrames / 2) * Math.floor(hBar / 28) * Math.floor(wBar / 28) + 2;
    return videoToken;
}

// Parse visual information
function extractVisionInfo(conversations) {
    const visionInfos = [];
    if (!Array.isArray(conversations)) {
        conversations = [conversations];
    }
    conversations.forEach(conversation => {
        if (!Array.isArray(conversation)) {
            conversation = [conversation];
        }
        conversation.forEach(message => {
            if (Array.isArray(message.content)) {
                message.content.forEach(ele => {
                    if (ele.image || ele.image_url || ele.video || ['image', 'image_url', 'video'].includes(ele.type)) {
                        visionInfos.push(ele);
                    }
                });
            }
        });
    });
    return visionInfos;
}

// Usage example
(async () => {
    try {
        const videoPath = "xxx/test.mp4"; // Replace with your local path
        const videoToken = await tokenCalculate(videoPath, 1);
        console.log('Video tokens:', videoToken);
    } catch (error) {
        console.error('Error:', error.message);
    }
})();
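
In both implementations, the final count comes down to: video tokens = ceil(extracted frames / 2) × (resized height / 28) × (resized width / 28) + 2. As an illustrative walkthrough of the code above, a 10-second 1280×720 video recorded at 30 FPS and processed with fps=2 yields 20 extracted frames, each resized to 1008×560, so the estimate is 10 × 36 × 20 + 2 = 7,202 tokens.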

How to use

You must first Obtain an API key and Set API key as an environment variable. To use the OpenAI SDK or DashScope SDK, you must Install the SDK.
The DashScope Python SDK version must be no lower than 1.20.7. The DashScope Java SDK version must be no lower than 2.18.3.
If you are a member of a sub-workspace, make sure that the Super Admin has authorized models for the sub-workspace.
You can use the recommended prompts to adapt to different scenarios. Note that only Qwen2.5-VL supports document parsing, object localization, and video understanding with timestamps.
When using qwen-vl-plus-latest and qwen-vl-plus-2025-01-25 for text extraction, set presence_penalty to 1.5 and repetition_penalty to 1.0 for better accuracy.
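
The following is a minimal sketch of that text-extraction tip using the OpenAI-compatible interface. presence_penalty is a standard parameter of the OpenAI SDK; passing repetition_penalty through extra_body is an assumption about how the compatible-mode endpoint accepts non-standard parameters, and the image URL and prompt are placeholders:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-plus-latest",
    messages=[
        {
            "role": "user",
            "content": [
                # Placeholder image URL: replace with the document or receipt you want to read
                {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
                {"type": "text", "text": "Extract all text in this image and keep the original formatting."},
            ],
        }
    ],
    presence_penalty=1.5,  # Recommended value for text extraction with qwen-vl-plus-latest / qwen-vl-plus-2025-01-25
    # Assumption: repetition_penalty is not part of the OpenAI API, so it is passed as an extra body field here
    extra_body={"repetition_penalty": 1.0},
)
print(completion.choices[0].message.content)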

Get started

Sample code for image understanding using image URLs.

Check the limitations on input images. To use local images, see Using local files.

OpenAI

Python

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

completion = client.chat.completions.create(
    model="qwen-vl-max",  # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
                    },
                },
                {"type": "text", "text": "What scene is depicted in this image?"},
            ],
        },
    ],
)

print(completion.choices[0].message.content)

Sample response

This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the ocean and sky in the background. The person and dog appear to be interacting, with the dog's front paws resting on the person's hand. Sunlight is shining from the right side of the frame, adding a warm atmosphere to the entire scene.

Node.js

import OpenAI from "openai";

const openai = new OpenAI({
  // If environment variable is not configured, replace the line below with: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
  const response = await openai.chat.completions.create({
    model: "qwen-vl-max",   // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models 
    messages: [{
        role: "system",
        content: [{
          type: "text",
          text: "You are a helpful assistant."
        }]
      },
      {
        role: "user",
        content: [{
            type: "image_url",
            image_url: {
              "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
            }
          },
          {
            type: "text",
            text: "What scene is depicted in this image?"
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}

main()

Sample response

This is a photo taken on a beach. In the photo, a woman wearing a plaid shirt is sitting on the sand, interacting with a yellow Labrador retriever wearing a collar. The background shows the ocean and sky, with sunlight shining on them, creating a warm atmosphere.

curl

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-max",
  "messages": [
    {"role":"system",
     "content":[
        {"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user",
     "content": [
        {"type": "image_url", "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
        {"type": "text", "text": "What scene is depicted in this image?"}
    ]
  }]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "This image shows a woman and a dog interacting on a beach. The woman is sitting on the sand, smiling and shaking hands with the dog. The background features the ocean and sunset sky, creating a very warm and harmonious atmosphere. The dog is wearing a collar and appears very gentle.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1270,
    "completion_tokens": 54,
    "total_tokens": 1324
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen-vl-max",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
{
    "role": "system",
    "content": [
    {"text": "You are a helpful assistant."}]
},
{
    "role": "user",
    "content": [
    {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
    {"text": "What scene is depicted in this image?"}]
}]

response = dashscope.MultiModalConversation.call(
    # If environment variable is not configured, replace the line below with: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max',   # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Sample response

This is a photo taken on a beach. The photo shows a woman and a dog. The woman is sitting on the sand, smiling and interacting with the dog. The dog is wearing a collar and appears to be shaking hands with the woman. The background features the ocean and sky, with sunlight shining on them, creating a warm atmosphere.

Java

import java.util.Arrays;
import java.util.Collections;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "What scene is depicted in this image?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If environment variable is not configured, replace the line below with: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max")  // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

This is a photo taken on a beach. The photo shows a person wearing a plaid shirt and a dog wearing a collar. They are sitting face to face, appearing to interact. The background is the ocean and sky, with sunlight shining on them, creating a warm and harmonious atmosphere.

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	         {"text": "You are a helpful assistant."}]},
            {
                "role": "user",
                "content": [
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                    {"text": "What scene is depicted in this image?"}
                ]
            }
        ]
    }
}'

Sample response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "This is a photo taken on a beach. The photo shows a person wearing a plaid shirt and a dog wearing a collar. They are sitting on the sand, with the ocean and sky in the background. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the entire scene."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 55,
    "input_tokens": 1271,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Multi-round conversation

Qwen-VL can reference conversation history when generating responses. Maintain a messages array and append each round's conversation history, followed by the new instruction, to that array.

OpenAI

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "What scene is depicted in the image?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "This is a girl and a dog."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Write a poem describing this scene"
        }
      ]
    }
  ]
}'

Response result

{
    "choices": [
        {
            "message": {
                "content": "Sea breeze gently caresses smiling faces,  \nOn sandy beach with canine companion.  \nSunset casts shadows short and sweet,  \nJoyful moments intoxicate the heart.",
                "role": "assistant"
            },
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null
        }
    ],
    "object": "chat.completion",
    "usage": {
        "prompt_tokens": 1295,
        "completion_tokens": 32,
        "total_tokens": 1327
    },
    "created": 1726324976,
    "system_fingerprint": null,
    "model": "qwen-vl-max",
    "id": "chatcmpl-3c953977-6107-96c5-9a13-c01e328b24ca"
}
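
If you use the OpenAI Python SDK instead of curl, the same flow appends the assistant reply and the next user turn to messages before the second call. A minimal sketch (the follow-up question is illustrative):

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
            {"type": "text", "text": "What scene is depicted in the image?"},
        ],
    }
]

# First round
completion = client.chat.completions.create(model="qwen-vl-max", messages=messages)
print("First round:", completion.choices[0].message.content)

# Append the first-round reply, then the new instruction
messages.append({"role": "assistant", "content": [{"type": "text", "text": completion.choices[0].message.content}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Write a poem describing this scene"}]})

# Second round
completion = client.chat.completions.create(model="qwen-vl-max", messages=messages)
print("Second round:", completion.choices[0].message.content)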

DashScope

Python

import os
from dashscope import MultiModalConversation
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
	"role": "system",
	"content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {
                "image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
            },
            {"text": "What scene is depicted in the image?"},
        ],
    }
]
response = MultiModalConversation.call(
    # If the environment variable is not configured, please replace the following line with: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max',  #  Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
    )
print(f"Model first round output: {response.output.choices[0].message.content[0]['text']}")
messages.append(response['output']['choices'][0]['message'])
user_msg = {"role": "user", "content": [{"text": "Write a poem describing this scene"}]}
messages.append(user_msg)
response = MultiModalConversation.call(
    # If the environment variable is not configured, please replace the following line with: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max',
    messages=messages
    )
print(f"Model second round output: {response.output.choices[0].message.content[0]['text']}")

Response result

Model first round output: This is a photo taken on a beach. In the photo, there is a person wearing a plaid shirt and a dog wearing a collar. The person and dog are sitting face to face, seemingly interacting. The background is the sea and sky, with sunlight shining on them, creating a warm atmosphere.
Model second round output: On the sun-drenched beach, person and dog share joyful moments together.

Java

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final String modelName = "qwen-vl-max";  //  Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    public static void MultiRoundConversationCall() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "What scene is depicted in the image?"))).build();
        List<MultiModalMessage> messages = new ArrayList<>();
        messages.add(systemMessage);
        messages.add(userMessage);
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If environment variable is not configured, replace the following line with: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))                
                .model(modelName)
                .messages(messages)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println("First round output: "+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));        // add the result to conversation
        messages.add(result.getOutput().getChoices().get(0).getMessage());
        MultiModalMessage msg = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "Write a poem describing this scene"))).build();
        messages.add(msg);
        param.setMessages((List)messages);
        result = conv.call(param);
        System.out.println("Second round output: "+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));    }

    public static void main(String[] args) {
        try {
            MultiRoundConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response result

First round output: This is a photo taken on a beach. In the photo, there is a person wearing a plaid shirt and a dog wearing a collar. The person and dog are sitting face to face, seemingly interacting. The background is the sea and sky, with sunlight shining on them, creating a warm atmosphere.
Second round output: On the sun-drenched beach, person and dog share joyful moments together.

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
            {
                "role": "user",
                "content": [
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                    {"text": "What scene is depicted in the image?"}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"text": "This is a woman and a Labrador retriever playing on the beach."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"text": "Write a seven-character quatrain describing this scene"}
                ]
            }
        ]
    }
}'

Response result

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "Waves gently lap the sandy shore, girl and dog frolic together. Sunlight falls on smiling faces, joyful moments forever remembered."
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "output_tokens": 27,
        "input_tokens": 1298,
        "image_tokens": 1247
    },
    "request_id": "bdf5ef59-c92e-92a6-9d69-a738ecee1590"
}

Stream

In streaming output mode, the model generates and returns intermediate results in real-time instead of one final response. This reduces the wait time for the complete response.

OpenAI

Simply set stream to true.

Python

from openai import OpenAI
import os

client = OpenAI(
    # If environment variable is not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-max",  # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {"role": "system",
         "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {"role": "user",
        "content": [{"type": "image_url",
                    "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "What scene is depicted in the image?"}]}],
    stream=True
)
full_content = ""
print("Streaming output content:")
for chunk in completion:
    if chunk.choices[0].delta.content is None:
        continue
    full_content += chunk.choices[0].delta.content
    print(chunk.choices[0].delta.content)
print(f"Complete content: {full_content}")

Sample response

Streaming output content:

The
image
depicts
a
woman
......
warm
harmonious
atmosphere
.
Complete content: The image depicts a woman and a dog interacting on a beach. The woman is sitting on the sand, smiling and shaking hands with the dog, looking very happy. The background shows the sea and sky, with sunlight shining on them, creating a warm harmonious atmosphere.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variable is not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const completion = await openai.chat.completions.create({
    model: "qwen-vl-max",  // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages: [
        {"role": "system",
         "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {"role": "user",
        "content": [{"type": "image_url",
                    "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
                    {"type": "text", "text": "What scene is depicted in the image?"}]}],
    stream: true,
});

let fullContent = ""
console.log("Stream output content: ")
for await (const chunk of completion) {
    if (chunk.choices[0].delta.content != null) {
        fullContent += chunk.choices[0].delta.content;
        console.log(chunk.choices[0].delta.content);
    }
}
console.log(`Full output content: ${fullContent}`)

Sample response

Streaming output content:

The image depicts
a woman and a
dog interacting on a beach.
......
shining on them,
creating a warm and harmonious
atmosphere.
Complete content: The image depicts a woman and a dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, smiling and shaking hands with the dog. The dog is wearing a collar and looks happy. The background shows the sea and sky, with sunlight shining on them, creating a warm and harmonious atmosphere.

curl

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-plus",
    "messages": [
    {
      "role": "system",
      "content": [{"type":"text","text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
          }
        },
        {
          "type": "text",
          "text": "What scene is depicted in the image?"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

Sample response

data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"finish_reason":null,"delta":{"content":"The"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"delta":{"content":" image"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

......

data: {"choices":[{"delta":{"content":" photo taken outdoors. The overall atmosphere appears very"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[{"finish_reason":"stop","delta":{"content":" harmonious and warm."},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":1276,"completion_tokens":85,"total_tokens":1361},"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}

data: [DONE]

DashScope

  • Python SDK: Set stream to True.

  • Java SDK: Use the streamCall interface.

  • HTTP: Set the X-DashScope-SSE header to enable.

By default, streaming output is non-incremental: each returned chunk contains all of the content generated so far. For example, if the full reply is "The image depicts a beach.", non-incremental chunks look like "The", "The image", "The image depicts a beach.", whereas incremental chunks look like "The", " image", " depicts a beach.". To use incremental output, set incremental_output to true in Python, or incrementalOutput to true in Java.

Python

import os
from dashscope import MultiModalConversation
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "system",
        "content": [{"text": "You are a helpful assistant."}]},

    {
        "role": "user",
        "content": [
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
            {"text": "What scene is depicted in the image?"}
        ]
    }
]
responses = MultiModalConversation.call(
    # If environment variable is not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen-vl-max',  # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages,
    stream=True,
    incremental_output=True)
full_content = ""
print("Streaming output content:")
for response in responses:
    try:
        print(response["output"]["choices"][0]["message"].content[0]["text"])
        full_content += response["output"]["choices"][0]["message"].content[0]["text"]
    except:
        pass
print(f"Complete content: {full_content}")

Sample response

Streaming output content:
The image depicts
a person and a dog
interacting on a beach
......
sunlight shining on them
, creating a
warm and harmonious atmosphere
.
Complete content: The image depicts a person and a dog interacting on a beach. The person is wearing a plaid shirt, sitting on the sand, shaking hands with a golden retriever wearing a collar. The background shows waves and sky, with sunlight shining on them, creating a warm and harmonious atmosphere.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void streamCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // must create mutable map.
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                        Collections.singletonMap("text", "What scene is depicted in the image?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If environment variable is not configured, replace the line below with: .apiKey("sk-xxx") using your Model Studio API Key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max")  // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            streamCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

The
image
depicts
a
woman
and
a
dog
on
a
beach
......
creating
a
warm
and
harmonious
atmosphere
.

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen-vl-max",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                    {"text": "What scene is depicted in the image?"}
                ]
            }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'

Sample response

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"This"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":1,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":" image"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":2,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

......

id:17
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":" appreciation. This is a heartwarming scene that shows"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":112,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:18
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":" the deep emotional bond between humans and animals."}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":120,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

id:19
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"input_tokens":1276,"output_tokens":121,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}

High resolution image understanding

Set vl_high_resolution_images to true to increase the token limit for a single image from 1,280 to 16,384:

  • True: The token limit per image is 16,384. Images that exceed this limit are scaled down so that their token count falls below 16,384. The model can directly process higher-resolution images and understand more details, at the cost of slower responses and higher token usage. Suitable for scenarios with rich content and details.

  • False (default): The token limit per image is 1,280. Images that exceed this limit are scaled down so that their token count falls below 1,280. The model responds faster and uses fewer tokens. Suitable for scenarios with fewer details, where the model only needs general information or where speed matters more.

vl_high_resolution_images is only supported by DashScope Python SDK and HTTP methods.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
            {"text": "What does this image show?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max', #  Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages,
    vl_high_resolution_images=True
)

print("Model response:\n ",response.output.choices[0].message.content[0]["text"])

print("Token usage information: ","Total input tokens: ",response.usage["input_tokens"] , "Image tokens: " , response.usage["image_tokens"])

Sample response

Model response:
  This image shows a cozy Christmas decoration scene. The following elements can be seen in the picture:

1. **Christmas trees**: Two small Christmas trees covered with white snow.
2. **Reindeer figurine**: A brown reindeer figurine with large antlers.
3. **Candles and candleholders**: Several wooden candleholders with lit candles that emit a warm glow.
4. **Christmas ornaments**: Including golden ball decorations, pinecones, red berry strings, etc.
5. **Christmas gift box**: A small golden gift box tied with a golden ribbon.
6. **Christmas lettering**: Wooden "MERRY CHRISTMAS" lettering that enhances the festive atmosphere.
7. **Background**: A wooden background that gives a natural and warm feeling.

The overall ambiance is very cozy and festive, filled with a strong Christmas spirit.
Token usage information:  Total input tokens:  5368, Image tokens:  5342

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
               {"text": "What does this image show?"}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true
    }
}'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "This image shows a cozy Christmas decoration scene. The picture includes the following elements:\n\n1. **Christmas trees**: Two small Christmas trees covered with white snow.\n2. **Reindeer figurine**: A brown reindeer figurine positioned in the center-right of the image.\n3. **Candles**: Several wooden candles, two of which are lit, emitting a warm glow.\n4. **Christmas ornaments**: Some gold and red decorative balls, pinecones, berries, and green pine branches.\n5. **Christmas gift**: A small golden gift box, with a bag featuring Christmas patterns next to it.\n6. **\"MERRY CHRISTMAS\" lettering**: Wooden letters spelling \"MERRY CHRISTMAS\" placed on the left side of the image.\n\nThe entire scene is arranged against a wooden background, creating a warm, festive atmosphere that's perfect for Christmas celebrations."
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "total_tokens": 5553,
        "output_tokens": 185,
        "input_tokens": 5368,
        "image_tokens": 5342
    },
    "request_id": "38cd5622-e78e-90f5-baa0-c6096ba39b04"
}

Multiple image input

Qwen-VL can process multiple images in a single request and respond based on all of them. You can input images as URLs, local files, or a combination of both. The sample code below uses URLs.

The total number of tokens in the input images must be less than the model's maximum input. To calculate the maximum number of images, see Image number limits.

OpenAI

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-vl-max",  # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {"role": "system","content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user","content": [
            # First image URL, if using a local file, replace the url value with the Base64 encoding of the image
            {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
            # Second image URL, if using a local file, replace the url value with the Base64 encoding of the image
            {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
            {"type": "text", "text": "What do these images depict?"},
            ],
        }
    ],
)

print(completion.choices[0].message.content)

Sample response

Image 1 shows a woman and a Labrador dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, and shaking hands with the dog. The background features ocean waves and sky, creating a warm and pleasant atmosphere.

Image 2 shows a tiger walking in a forest. The tiger has orange fur with black stripes, and it is stepping forward. It is surrounded by dense trees and vegetation, with fallen leaves covering the ground, giving the scene a wild, natural feeling.

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max",  // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [
            {role: "system",content:[{ type: "text", text: "You are a helpful assistant." }]},
            {role: "user",content: [
            // First image URL, if using a local file, replace the url value with the Base64 encoding of the image
            { type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
            // Second image URL, if using a local file, replace the url value with the Base64 encoding of the image
            { type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
            { type: "text", text: "What do these images depict?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

Sample response

In the first image, a person and a dog are interacting on a beach. The person is wearing a plaid shirt, and the dog is wearing a collar. They appear to be shaking hands or high-fiving.

In the second image, a tiger is walking in a forest. The tiger has orange fur with black stripes, and the background consists of green trees and vegetation.

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen-vl-max",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
          }
        },
        {
          "type": "text",
          "text": "What do these images depict?"
        }
      ]
    }
  ]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "Image 1 shows a woman and a Labrador dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, and shaking hands with the dog. The background features an ocean view and sunset sky, creating a very warm and harmonious scene.\n\nImage 2 shows a tiger walking in a forest. The tiger has orange fur with black stripes, and it is stepping forward. It is surrounded by dense trees and vegetation, with fallen leaves covering the ground, giving the entire scene a natural wildness and vitality.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2497,
    "completion_tokens": 109,
    "total_tokens": 2606
  },
  "created": 1725948561,
  "system_fingerprint": null,
  "model": "qwen-vl-max",
  "id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}

DashScope

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
	"role": "system",
	"content": [{"text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            # First image URL.
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
            # Second image URL
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
            # Third image URL
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
            {"text": "What do these images depict?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If environment variables are not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max', # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/zh/model-studio/getting-started/models
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Sample response

These images show some animals and natural scenes. The first image shows a person and a dog interacting on a beach. The second image is a tiger walking in a forest. The third image is a cartoon-style rabbit jumping on a grassy field.

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        // First image URL
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
                       // Second image URL
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                        // Third image URL
                        Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"),
                        Collections.singletonMap("text", "What do these images depict?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If environment variables are not configured, replace the line below with: .apiKey("sk-xxx") using your Model Studio API Key
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max")  // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Sample response

These images show some animals and natural scenes.

1. First image: A woman and a dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, and the dog is wearing a collar, extending its paw to shake hands with the woman.
2. Second image: A tiger walking in a forest. The tiger has orange fur with black stripes, and the background consists of trees and leaves.
3. Third image: A cartoon-style rabbit jumping on a grassy field. The rabbit is white with pink ears, and the background features blue sky and yellow flowers.

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen-vl-max",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": [{"text": "You are a helpful assistant."}]},
            {
                "role": "user",
                "content": [
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
                    {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
                    {"text": "What do these images show?"}
                ]
            }
        ]
    }
}'

Sample response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "This image shows a woman and her dog on a beach. They appear to be enjoying each other's company, with the dog sitting on the sand and extending its paw to shake hands or interact with the woman. The background features a beautiful sunset view, with waves gently lapping at the shoreline.\n\nPlease note that my description is based on what is visible in the image and does not include any information beyond the visual content. If you need more specific details about this scene, please let me know!"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "output_tokens": 81,
    "input_tokens": 1277,
    "image_tokens": 1247
  },
  "request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}

Video understanding

Some Qwen-VL models support video understanding. The input can be an image sequence (video frames) or a video file.

Video file

To pass video files to qwen-vl-plus, qwen2.5-vl-3b-instruct, and qwen2.5-vl-7b-instruct, you must first submit a ticket. You can use other models directly.

Video file limitations:

  • Video file size:

    • For video URL: Qwen2.5-VL series models support videos up to 1 GB, other models up to 150 MB.

    • For local file: When using OpenAI SDK, the Base64-encoded video must be less than 10 MB. When using DashScope SDK, the video must be less than 100 MB.

  • Video file formats: MP4, AVI, MKV, MOV, FLV, and WMV.

  • Video duration: Qwen2.5-VL models support videos from 2 seconds to 10 minutes. Other models support videos from 2 seconds to 40 seconds.

  • Video dimensions: No restrictions, but video files will be adjusted to approximately 600,000 pixels. Larger dimensions will not provide better understanding.

  • Currently, audio in video files is not supported for understanding.

Before Qwen VL processes video content, it extracts frames from the video file, generating several video frames for content understanding. You can set the fps parameter to control the frame extraction frequency:

  • Only the DashScope SDK supports this parameter. A frame is extracted every 1/fps seconds. A higher fps is suitable for high-speed motion scenarios (such as sporting events and action movies), while a lower fps is suitable for long videos or mostly static content.

  • The OpenAI SDK does not support this parameter. A frame is extracted every 0.5 seconds from the video file.

Below is example code that uses video URLs. For local video files, see Local files.

OpenAI

When using the OpenAI SDK or HTTP method to input video files to the Qwen-VL model, you need to set the "type" parameter in the user message to "video_url".

Python

import os
from openai import OpenAI

client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max",
    messages=[
        {"role": "system",
         "content": [{"type": "text","text": "You are a helpful assistant."}]},
        {"role": "user","content": [{
            # When directly providing a video file, set the type value to video_url
            # When using the OpenAI SDK, video frames are extracted every 0.5 seconds by default and cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
            "type": "video_url",            
            "video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {"type": "text","text": "What is the content of this video?"}]
         }]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen-vl-max",
        messages: [
        {role:"system",content:["You are a helpful assistant."]},
        {role: "user",content: [
            // When directly providing a video file, set the type value to video_url
            // When using the OpenAI SDK, video frames are extracted every 0.5 seconds by default and cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
            {type: "video_url", video_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
            {type: "text", text: "What is the content of this video?" },
        ]}]
    });
    console.log(response.choices[0].message.content);
}

main()

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max",
    "messages": [
    {"role": "system", "content": [{"type": "text","text": "You are a helpful assistant."}]},
    {"role": "user","content": [{"type": "video_url","video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
    {"type": "text","text": "What is the content of this video?"}]}]
}'

DashScope

Python

import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {"role":"system","content":[{"text": "You are a helpful assistant."}]},
    {"role": "user",
        "content": [
            # The fps parameter controls video frame extraction frequency, indicating that a frame is extracted every 1/fps seconds. For complete usage, see: https://www.alibabacloud.com/help/zh/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    # If environment variables are not configured, replace the following line with: api_key ="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-max',
    messages=messages
)

print(response.output.choices[0].message.content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;

public class Main {
   static {
            Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
        }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        // The fps parameter controls video frame extraction frequency, indicating that a frame is extracted every 1/fps seconds. For complete usage, see: https://www.alibabacloud.com/help/zh/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
        Map<String, Object> params = Map.of(
                "video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
                "fps",2);
        MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "You are a helpful assistant."))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "What is the content of this video?"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If environment variables are not configured, replace the following line with: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-max")
                .messages(Arrays.asList(systemMessage, userMessage))
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max",
    "input":{
        "messages":[
            {"role": "system","content": [{"text": "You are a helpful assistant."}]},
            {"role": "user","content": [{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
            {"text": "What is the content of this video?"}]}]}
}'

Image sequence

Image sequence limitations

  • Qwen2.5-VL models: at least 4 images and at most 512 images.

  • Other models: at least 4 images and at most 80 images.

When you provide an image sequence, it means that frames have already been extracted from the video in advance. When calling Qwen2.5-VL models, you can set the fps parameter, which helps the model perceive time information:

  • Only the DashScope SDK supports this parameter. It indicates that the image list was extracted from the original video every 1/fps seconds.

  • The OpenAI SDK does not support this parameter. The image list is treated as if it was extracted from the original video every 0.5 seconds.

Below is example code that uses image sequence URLs. For local files, see Local files.

OpenAI compatible

When using the OpenAI SDK or HTTP method to input image sequences as video to the Qwen-VL model, you need to set the "type" parameter in the user message to "video".

Python

import os
from openai import OpenAI

client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen2.5-vl-72b-instruct", # This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[{"role": "user","content": [
        # When providing an image list, the "type" parameter in the user message is "video"
        # When using the OpenAI SDK, image lists are treated as if they were extracted every 0.5 seconds from the video by default, and this cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
        {"type": "video","video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
        {"type": "text","text": "Describe the specific process in this video"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

// Make sure you've specified "type": "module" in your package.json
import OpenAI from "openai";

const openai = new OpenAI({
    // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx",
    apiKey: process.env.DASHSCOPE_API_KEY,
    baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

async function main() {
    const response = await openai.chat.completions.create({
        model: "qwen2.5-vl-72b-instruct", // This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [{
            role: "user",
            content: [
                {
                    // When providing an image list, the "type" parameter in the user message is "video"
                    // When using the OpenAI SDK, image lists are treated as if they were extracted every 0.5 seconds from the video by default, and this cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
                    type: "video",
                    video: [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ]
                },
                {
                    type: "text",
                    text: "Describe the specific process in this video"
                }
            ]
        }]
    });
    console.log(response.choices[0].message.content);
}

main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen2.5-vl-72b-instruct",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
                {"type": "text",
                "text": "Describe the specific process in this video"}]}]
}'

DashScope

Python

import os
# dashscope version must be at least 1.20.10
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{"role": "user",
             "content": [
                  # When providing an image list to a Qwen2.5-VL series model, you can set the fps parameter, indicating that the image list was extracted from the original video every 1/fps seconds
                 {"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                   "fps":2},
                 {"text": "Describe the specific process in this video"}]}]
response = dashscope.MultiModalConversation.call(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qwen2.5-vl-72b-instruct',  # This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=messages
)
print(response.output.choices[0].message.content[0]["text"])

Java

// DashScope SDK version must be at least 2.18.3
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final String MODEL_NAME = "qwen2.5-vl-72b-instruct";  // This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
    public static void videoImageListSample() throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant.")))
                .build();
        //  When providing an image list to a Qwen2.5-VL series model, you can set the fps parameter, indicating that the image list was extracted from the original video every 1/fps seconds
        Map<String, Object> params = Map.of(
                "video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"),
                "fps",2);
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        params,
                        Collections.singletonMap("text", "Describe the specific process in this video")))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If environment variables are not configured, replace the following line with: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model(MODEL_NAME)
                .messages(Arrays.asList(systemMessage, userMessage)).build();
        MultiModalConversationResult result = conv.call(param);
        System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            videoImageListSample();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen2.5-vl-72b-instruct",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
            ],
            "fps":2
                 
          },
          {
            "text": "Describe the specific process in this video"
          }
        ]
      }
    ]
  }
}'

Using local files (Input Base64 encoding)

Here are code samples for passing local files. Currently, only the OpenAI SDK and HTTP method support passing local files as Base64-encoded data.

Image

Using eagle.png saved locally as an example.

When using the OpenAI SDK, the Base64-encoded image must be less than 10 MB.

OpenAI

Take the following steps:

  1. Encode the image file: Read the local image and encode it in Base64 format

  2. Pass the Base64 data: Provide the encoded data in image_url in this format: data:image/{format};base64,{base64_image}.

    image/{format}: The format of the local image. image/{format} must match the Content Type in the image format table. For example, for a jpg image, use image/jpeg.
  3. Call the model: Call the Qwen-VL model and process the response.

Python

from openai import OpenAI
import os
import base64


#  base 64 encoding format
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxx/eagle.png")
client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max", # Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
        {
    	    "role": "system",
            "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # When passing Base64 image data, note that the image format (i.e., image/{format}) needs to be consistent with the Content Type in the supported image list. "f" is a string formatting method.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
                {"type": "text", "text": "What scene is depicted in the image?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';


const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
// Replace xxx/eagle.png with the absolute path of your local image
const base64Image = encodeImage("xxx/eagle.png")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max",  // Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [
            {"role": "system", 
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
            "content": [{"type": "image_url",
                            // Note that when passing Base64, the image format (i.e., image/{format}) needs to be consistent with the Content Type in the supported image list.
                           // PNG image:  data:image/png;base64,${base64Image}
                          // JPEG image: data:image/jpeg;base64,${base64Image}
                         // WEBP image: data:image/webp;base64,${base64Image}
                        "image_url": {"url": `data:image/png;base64,${base64Image}`},},
                        {"type": "text", "text": "What scene is depicted in the image?"}]}]
    });
    console.log(completion.choices[0].message.content);
} 

main();

Video

Image sequence

Using locally saved football1.jpg, football2.jpg, football3.jpg, football4.jpg as examples.

When using the OpenAI SDK, each Base64-encoded image must be less than 10 MB.

OpenAI

Python

import os
from openai import OpenAI
import base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max",  # Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
    messages=[
    {"role": "system",
     "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user","content": [
        {"type": "video","video": [
            f"data:image/jpeg;base64,{base64_image1}",
            f"data:image/jpeg;base64,{base64_image2}",
            f"data:image/jpeg;base64,{base64_image3}",
            f"data:image/jpeg;base64,{base64_image4}",]},
        {"type": "text","text": "Describe the specific process in this video"},
    ]}]
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
  };
  
const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max",  // Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
        messages: [
            {"role": "system",
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{"type": "video",
                            // Note that when passing Base64, the image format (i.e., image/{format}) needs to be consistent with the Content Type in the supported image list.
                           // PNG image:  data:image/png;base64,${base64Image}
                          // JPEG image: data:image/jpeg;base64,${base64Image}
                         // WEBP image: data:image/webp;base64,${base64Image}
                        "video": [
                            `data:image/jpeg;base64,${base64Image1}`,
                            `data:image/jpeg;base64,${base64Image2}`,
                            `data:image/jpeg;base64,${base64Image3}`,
                            `data:image/jpeg;base64,${base64Image4}`]},
                        {"type": "text", "text": "What scene does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

Video files

Using locally saved test.mp4 as an example.

When using the OpenAI SDK, the Base64-encoded local video must be less than 10 MB.

OpenAI

Python

from openai import OpenAI
import os
import base64


#  Base64 encoding format
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
    # If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max",  
    messages=[
        {
            "role": "system",
            "content": [{"type":"text","text": "You are a helpful assistant."}]},
        {
            "role": "user",
            "content": [
                {
                    # When passing a video file directly, set the type value to video_url
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
                },
                {"type": "text", "text": "What scene does this video depict?"},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using your Model Studio API Key
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
  };
// Replace xxxx/test.mp4 with the absolute path of your local video
const base64Video = encodeVideo("xxx/test.mp4")
async function main() {
    const completion = await openai.chat.completions.create({
        model: "qwen-vl-max",  
        messages: [
            {"role": "system",
             "content": [{"type":"text","text": "You are a helpful assistant."}]},
            {"role": "user",
             "content": [{
                 // When passing a video file directly, set the type value to video_url
                "type": "video_url", 
                "video_url": {"url": `data:video/mp4;base64,${base64Video}`}},
                 {"type": "text", "text": "What scene does this video depict?"}]}]
    });
    console.log(completion.choices[0].message.content);
}

main();

Usage notes

Supported image formats

The following image formats are supported. When using the OpenAI SDK to input local images, set image/{format} according to the Content Type column (a small helper sketch follows the table).

Image format      File name extension      Content Type
BMP               .bmp                     image/bmp
JPEG              .jpe, .jpeg, .jpg        image/jpeg
PNG               .png                     image/png
TIFF              .tif, .tiff              image/tiff
WEBP              .webp                    image/webp
HEIC              .heic                    image/heic
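
The helper below is a minimal sketch, not part of any SDK, for building the data:image/{format};base64,... URL from a local file. The extension-to-Content-Type mapping is copied from the table above, and the example path is a placeholder.

import base64
import os

# Extension-to-Content-Type mapping, copied from the table above.
CONTENT_TYPES = {
    ".bmp": "image/bmp",
    ".jpe": "image/jpeg", ".jpeg": "image/jpeg", ".jpg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff", ".tiff": "image/tiff",
    ".webp": "image/webp",
    ".heic": "image/heic",
}

def to_data_url(image_path):
    # Encode a local image as a data URL whose prefix matches its Content Type
    ext = os.path.splitext(image_path)[1].lower()
    content_type = CONTENT_TYPES[ext]  # raises KeyError for unsupported formats
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{content_type};base64,{encoded}"

# Example (placeholder path): pass the returned string as the image_url "url" field.
# print(to_data_url("xxx/eagle.png")[:60])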

Image size limits

  • The size of a single image must not exceed 10 MB. When using the OpenAI SDK, the Base64-encoded image must be less than 10 MB. See Local files.

  • The width and height of an image must both be greater than 10 pixels, and the aspect ratio must not exceed 200:1 or 1:200.

  • There is no pixel count limit for a single image, because the model scales and preprocesses the image before understanding it. Larger images do not necessarily improve understanding performance. Recommended pixel counts (a pre-check sketch follows this list):

    • For single image input to qwen-vl-max, the number of pixels should not exceed 12 million, which covers standard 4K images.

    • For single image input to qwen-vl-plus, the number of pixels should not exceed 1,003,520.
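
The sketch below is a rough illustration rather than an official validator. It uses Pillow to test a local image against these limits before sending a request; the 12-million-pixel threshold is the qwen-vl-max recommendation and is an adjustable assumption.

import os
# Install Pillow using: pip install Pillow
from PIL import Image

def check_image(image_path, max_pixels=12_000_000):
    # Collect any violations of the documented image limits
    issues = []
    if os.path.getsize(image_path) > 10 * 1024 * 1024:
        issues.append("file size exceeds 10 MB")
    with Image.open(image_path) as img:
        width, height = img.size
    if width <= 10 or height <= 10:
        issues.append("width and height must both be greater than 10 pixels")
    if max(width, height) / min(width, height) > 200:
        issues.append("aspect ratio exceeds 200:1 (or 1:200)")
    if width * height > max_pixels:
        issues.append(f"pixel count {width * height} exceeds the recommended {max_pixels}")
    return issues or ["image passes all checks"]

# Example (placeholder path):
# print(check_image("xxx/eagle.png"))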

Image input methods

  • Image URL: The URL must be accessible from the internet.

    Note

    You can upload images to OSS to obtain a public URL.

    • If the image is stored in an OSS bucket with a private access control list (ACL), you can generate a signed URL using the public endpoint. A signed URL grants other users temporary access to the file (see the sketch after this list).

    • Do not use OSS internal URLs because they do not interconnect with Model Studio.

  • Local image files: When using the OpenAI SDK, input the Base64-encoded image data.
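
The sketch below illustrates generating a signed URL with the oss2 Python SDK and is not part of Model Studio; the endpoint, bucket name, and object key are placeholders you must replace, and details may differ for your OSS SDK version.

# Install the OSS SDK using: pip install oss2
import os
import oss2

# Placeholders: replace with your own public endpoint, bucket name, and object key.
auth = oss2.Auth(os.getenv("OSS_ACCESS_KEY_ID"), os.getenv("OSS_ACCESS_KEY_SECRET"))
bucket = oss2.Bucket(auth, "https://oss-cn-beijing.aliyuncs.com", "your-bucket-name")

# The signed URL stays valid for 3,600 seconds; pass it as the image URL in your request.
signed_url = bucket.sign_url("GET", "images/eagle.png", 3600)
print(signed_url)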

Image number limits

When inputting multiple images, the number of images is limited by the model's total token limit for text and images (maximum input). The total token count of all images must be less than the model's maximum input.

For example, qwen-vl-max has a maximum input of 30,720 tokens. If your input images are all 1280 × 1280, the image tokens and maximum number of images are as follows (a short calculation sketch follows the table):

vl_high_resolution_images      Adjusted image dimensions      Image tokens      Maximum number of images
True                           1288 x 1288                    2,118             14
False                          980 x 980                      1,227             25
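
The maximum number of images in the table is simply the model's maximum input divided by the token cost of one image; the text prompt also consumes tokens, so leave some headroom in practice. A minimal calculation sketch using the example values above:

# Example values taken from the table above (qwen-vl-max with 1280 x 1280 images).
max_input_tokens = 30720

for setting, tokens_per_image in [("vl_high_resolution_images=True", 2118),
                                  ("vl_high_resolution_images=False", 1227)]:
    max_images = max_input_tokens // tokens_per_image
    print(f"{setting}: at most {max_images} images")
# Prints 14 and 25, matching the table. Reserve extra tokens for your text prompt.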

Prompt guide

Solving problems with images

Prompt tips: You can use the Chain-of-Thought method to solve complex mathematical problems. It guides the model to generate reasoning processes or break down complex tasks and reason step by step. This allows the model to generate more reasoning evidence before producing the final result, thus improving its performance on complex problems.

Input example

Sample code

Output example

Prompt: Please solve this problem step by step, showing your thinking process.

image

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i2/O1CN01e99Hxt1evMlWM6jUL_!!6000000003933-0-tps-1294-760.jpg"},
                    {"text": "Please solve this problem step by step, and output your thinking and judgment process for this problem."}
                ]
            }
        ]
    }
}'

image

Extracting information from invoice

Qwen VL can extract information from invoices, receipts, certificates, and forms, and return it in a structured format.

Prompt tips:

  • Use separators to emphasize fields that need to be extracted.

  • Specify the output format, such as JSON.

  • Explicitly prohibit code fences such as ```json``` in the prompt, for example: "Please output in JSON format, do not output ```json```".

Sample input

Sample code

Sample output

Prompt: Extract the following from the image: ['Invoice Code', 'Invoice Number', 'Destination', 'Fuel Surcharge', 'Fare', 'Travel Date', 'Departure Time', 'Train Number', 'Seat Number']. Please output in JSON format, do not output ```json```.

image

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"},
                    {"text": "Extract the following from the image: ['InvoiceCode','InvoiceNumber','Destination','FuelSurcharge','Fare','TravelDate','DepartureTime','TrainNumber','SeatNumber'], please output in English and in JSON format, without the ```json``` code block."}
                ]
            }
        ]
    }
}'
{
    "InvoiceCode": "221021325353",
    "InvoiceNumber": "10283819",
    "Destination": "Development Zone",
    "FuelSurcharge": "2.0",
    "Fare": "8.00",
    "TravelDate": "2013-06-29",
    "DepartureTime": "Continuous",
    "TrainNumber": "040",
    "SeatNumber": "371"
}
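
Because the extracted fields come back as plain text, you can parse them with Python's standard json module. The sketch below is for illustration only; it also strips stray Markdown code fences defensively in case the model ignores the "do not output ```json```" instruction.

import json

def parse_model_json(text):
    # Strip stray Markdown code fences, then parse the remaining text as JSON
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    return json.loads(cleaned)

# Example with a fragment of the sample output shown above:
sample = '{"InvoiceCode": "221021325353", "Fare": "8.00"}'
print(parse_model_json(sample)["Fare"])  # 8.00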

Locating objects in images (Qwen2.5-VL only)

Qwen2.5-VL supports two methods of object localization: Box localization and Point localization. Box localization returns the coordinates of the top-left and bottom-right corners of the rectangle, while Point localization returns the coordinates of the center point of the rectangle (both types of coordinates are absolute values relative to the top-left corner of the image, in pixels).

The model scales the image before performing image understanding. You can refer to the code in Qwen2.5-VL to map the returned coordinates back to the original image, or set vl_high_resolution_images to True to reduce scaling as much as possible, at the cost of higher token consumption.
For Qwen2.5-VL models, object localization is relatively robust in the resolution range of 480 × 480 to 2560 × 2560. Outside this range, occasional bounding box drift may occur.
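
Below is a minimal mapping sketch, not the official Qwen2.5-VL code. It assumes you already know the width and height of the image the model actually processed (for example, from your own preprocessing) and simply rescales the returned pixel coordinates to the original image.

def map_bbox_to_original(bbox, input_size, original_size):
    # bbox is [x1, y1, x2, y2] in pixels of the image the model processed;
    # input_size and original_size are (width, height) tuples.
    in_w, in_h = input_size
    orig_w, orig_h = original_size
    x1, y1, x2, y2 = bbox
    return [round(x1 * orig_w / in_w), round(y1 * orig_h / in_h),
            round(x2 * orig_w / in_w), round(y2 * orig_h / in_h)]

# Hypothetical example: the model saw a 1288 x 1288 version of a 2048 x 2048 original.
print(map_bbox_to_original([60, 395, 204, 578], (1288, 1288), (2048, 2048)))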
  1. Prompt tips

  • Box localization — supported output format: JSON or plain text. Recommended prompt: "Detect all {objects} in the image and output their bbox coordinates in {JSON/plain text} format."

  • Point localization — supported output format: JSON or XML. Recommended prompt: "Locate all {objects} in the image in point form, output their point coordinates in {JSON/XML} format."

  2. Prompt improvements

  • When detecting densely arranged objects, such as "Detect all people in the image", the model may output a single box that contains all people instead of an individual box for each person. You can use the following prompts to emphasize detecting each object:

    • Box localization: Locate each one of all {object type} in the image and describe their respective {feature}, output their bbox coordinates in {JSON/plain text} format.

    • Point localization: Locate each {object type} in the image as points and describe their respective {feature}, output their point coordinates in {JSON/XML} format.

  • The localization results may contain irrelevant content like ```json``` or ```xml```. You can explicitly prohibit such content in the Prompt, such as "Please output in JSON format, do not output ```json```".

Sample input

Sample code

Sample output

Box localization:

Prompt: Locate the position of each one of all cakes and describe their respective features, output all bbox coordinates in JSON format, do not output ```json```.

image

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i3/O1CN01I1CXf21UR0Ld20Yzs_!!6000000002513-2-tps-1024-1024.png"},
                    {"text":  "Locate the position of each one of all cakes and describe their respective features, output all bbox coordinates in JSON format, do not output ```json```."}
                ]
            }
        ]
    },
    "parameters": {
        "vl_high_resolution_images": true,
        "temperature": 0,
        "top_k": 1,
        "seed": 3407
    }
}'
[
  {
    "bbox": [60, 395, 204, 578],
    "description": "Chocolate cake with red frosting and colorful sprinkles on top"
  },
  {
    "bbox": [248, 381, 372, 542],
    "description": "Pink frosted cake with white and blue sprinkles on top"
  },
  {
    "bbox": [400, 368, 504, 504],
    "description": "Pink frosted cake with white and blue sprinkles on top"
  },
  {
    "bbox": [530, 355, 654, 526],
    "description": "Pink frosted cake with white and blue sprinkles on top"
  },
  {
    "bbox": [432, 445, 566, 606],
    "description": "Pink frosted cake with two black eyes on top"
  },
  {
    "bbox": [630, 475, 774, 646],
    "description": "Yellow frosted cake with multi-colored sprinkles on top"
  },
  {
    "bbox": [740, 380, 868, 539],
    "description": "Chocolate cake with brown frosting on top"
  },
  {
    "bbox": [796, 512, 960, 693],
    "description": "Yellow frosted cake with multi-colored sprinkles on top"
  },
  {
    "bbox": [39, 555, 200, 736],
    "description": "Yellow frosted cake with multi-colored sprinkles on top"
  },
  {
    "bbox": [292, 546, 446, 707],
    "description": "Black cake with white frosting and two black eyes on top"
  },
  {
    "bbox": [516, 564, 666, 715],
    "description": "Yellow frosted cake with two black eyes on top"
  },
  {
    "bbox": [352, 655, 516, 822],
    "description": "White frosted cake with two black eyes on top"
  },
  {
    "bbox": [130, 746, 304, 924],
    "description": "White frosted cake with two black eyes on top"
  }
]
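
To verify the result visually, you can overlay the returned boxes on a local copy of the image with Pillow. This is a minimal sketch; "cakes.png" is a hypothetical local file name for the downloaded input image, and the coordinates are assumed to apply to the image at this resolution (otherwise rescale them first, as shown above).

```python
# Minimal sketch: draw the returned bboxes on a local copy of the input image.
# "cakes.png" is a hypothetical file name; rescale coordinates first if the
# model processed a resized version of the image.
from PIL import Image, ImageDraw

detections = [
    {"bbox": [60, 395, 204, 578], "description": "Chocolate cake with red frosting"},
    {"bbox": [248, 381, 372, 542], "description": "Pink frosted cake"},
]

image = Image.open("cakes.png").convert("RGB")
draw = ImageDraw.Draw(image)
for det in detections:
    x1, y1, x2, y2 = det["bbox"]
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(0, y1 - 12)), det["description"][:30], fill="red")
image.save("cakes_annotated.png")
```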

Point localization:

Prompt: Locate the person who is acting bravely in the image as a point, and output the result in XML format, do not output ```xml```.

[Input image]

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://gw.alicdn.com/imgextra/i1/O1CN01ILRlNK1gvU5xqbaxb_!!6000000004204-49-tps-1138-640.webp"},
                    {"text": "Locate the person who is acting bravely in the image as a point, and output the result in XML format, do not output ```xml```."}
                ]
            }
        ]
    },
    "parameters":{
        "vl_high_resolution_images": true,
        "temperature": 0,
        "top_k": 1,
        "seed": 3407
    }
}'
<points x1="302" y1="258" alt="the person who is acting bravely">the person who is acting bravely</points>

Parsing documents into QwenVL HTML format (Qwen2.5-VL only)

Qwen2.5-VL supports parsing image-based documents (such as scanned documents or image PDFs) into QwenVL HTML format, which not only accurately recognizes the text but also captures the position information of elements such as images and tables.

Prompt tips: You need to guide the model to output QwenVL HTML in your prompt; otherwise, it parses the document into plain HTML text without position information:

  • Recommended system prompt: "You are an AI specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in QwenVL Document Parser HTML format using specified tags while maintaining user privacy and data integrity."

  • Recommended user prompt: "QwenVL HTML"

Input example

Sample code

Output example

[Input image]

Use streaming output to avoid timeouts.
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "system",
                "content": "You are an AI specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in QwenVL Document Parser HTML format using specified tags while maintaining user privacy and data integrity."
            },
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i3/O1CN01nVbWzy1vx3iInC3z0_!!6000000006238-0-tps-1430-2022.jpg"},
                    {"text": "QwenVL HTML"}
                ]
            }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'
Full output:
```html
<html><body>
<h2 data-bbox=\"91 95 223 120\"> 1 Introduction</h2> 
 <p data-bbox=\"91 128 742 296\">The sparks of artificial general intelligence (AGI) are increasingly visible through the fast development of large foundation models, notably large language models (LLMs) (Brown et al., 2020; OpenAI, 2023; 2024; Gemini Team, 2024; Anthropic, 2023a,b; 2024; Bai et al., 2023; Yang et al., 2024a; Touvron et al., 2023a,b; Dubey et al., 2024). The continuous advancement in model and data scaling, combined with the paradigm of large-scale pre-training followed by high-quality supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), has enabled large language models (LLMs) to develop emergent capabilities in language understanding, generation, and reasoning. Building on this foundation, recent breakthroughs in inference time scaling, particularly demonstrated by o1 (OpenAI, 2024b), have enhanced LLMs’ capacity for deep thinking through step-by-step reasoning and reflection. These developments have elevated the potential of language models, suggesting they may achieve significant breakthroughs in scientific exploration as they continue to demonstrate emergent capabilities indicative of more general artificial intelligence.</p> 
 <p data-bbox=\"91 296 742 448\">Besides the fast development of model capabilities, the recent two years have witnessed a burst of open (open-weight) large language models in the LLM community, for example, the Llama series (Touvron et al., 2023a,b; Dubey et al., 2024), Mistral series (Jiang et al., 2023a; 2024a), and our Qwen series (Bai et al., 2023; Yang et al., 2024a; Qwen Team, 2024a; Hui et al., 2024; Qwen Team, 2024c; Yang et al., 2024b). The open-weight models have democratized the access of large language models to common users and developers, enabling broader research participation, fostering innovation through community collaboration, and accelerating the development of AI applications across diverse domains.</p> 
 <p data-bbox=\"91 448 742 586\">Recently, we release the details of our latest version of the Qwen series, Qwen2.5. In terms of the openweight part, we release pre-trained and instruction-tuned models of 7 sizes, including $0.5 \\mathrm{~B}, 1.5 \\mathrm{~B}, 3 \\mathrm{~B}, 7 \\mathrm{~B}$, $14 \\mathrm{~B}, 32 \\mathrm{~B}$, and $72 \\mathrm{~B}$, and we provide not only the original models in bfloat16 precision but also the quantized models in different precisions. Specifically, the flagship model Qwen2.5-72B-Instruct demonstrates competitive performance against the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Additionally, we also release the proprietary models of Mixture-of-Experts (MoE, Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022), namely Qwen2.5-Turbo and Qwen2.5-Plus ${ }^{1}$, which performs competitively against GPT-4o-mini and GPT-4o respectively.</p> 
 <p data-bbox=\"91 586 742 868\">In this technical report, we introduce Qwen2.5, the result of our continuous endeavor to create better LLMs. Below, we show the key features of the latest version of Qwen: </p> 
 <ul data-bbox=\"136 614 742 868\"><li data-bbox=\"136 614 742 712\">Better in Size: Compared with Qwen2, in addition to $0.5 \\mathrm{~B}, 1.5 \\mathrm{~B}, 7 \\mathrm{~B}$, and $72 \\mathrm{~B}$ models, Qwen2.5 brings back the $3 \\mathrm{~B}, 14 \\mathrm{~B}$, and $32 \\mathrm{~B}$ models, which are more cost-effective for resource-limited scenarios and are under-represented in the current field of open foundation models. Qwen2.5Turbo and Qwen2.5-Plus offer a great balance among accuracy, latency, and cost.</li><li data-bbox=\"136 708 742 784\">Better in Data: The pre-training and post-training data have been improved significantly. The pre-training data increased from 7 trillion tokens to 18 trillion tokens, with focus on knowledge, coding, and mathematics. The pre-training is staged to allow transitions among different mixtures. The post-training data amounts to 1 million examples, across the stage of supervised finetuning (SFT, Ouyang et al., 2022), direct preference optimization (DPO, Raffelov et al., 2023), and group relative policy optimization (GRPO, Shao et al., 2024).</li><li data-bbox=\"136 780 742 868\">Better in Use: Several key limitations of Qwen2 in use have been eliminated, including larger generation length (from 2K tokens to 8K tokens), better support for structured input and output, (e.g., tables and JSON), and easier tool use. In addition, Qwen2.5-Turbo supports a context length of up to 1 million tokens.</li></ul> 
 <h2 data-bbox=\"91 892 338 920\"> 2 Architecture &amp; Tokenizer</h2> 
 <p data-bbox=\"91 926 742 978\">Basically, the Qwen2.5 series include dense models for opensource, namely Qwen2.5-0.5B / 1.5B / 3B / $7 \\mathrm{~B} / 14 \\mathrm{~B} / 32 \\mathrm{~B} / 72 \\mathrm{~B}$, and MoE models for API service, namely Qwen2.5-Turbo and Qwen2.5-Plus. Below, we provide details about the architecture of models.</p> 
 <p data-bbox=\"91 982 742 1070\">For dense models, we maintain the Transformer-based decoder architecture (Vaswani et al., 2017; Radford et al., 2018) as Qwen2 (Yang et al., 2024a). The architecture incorporates several key components: Grouped Query Attention (GQA, Ainslie et al., 2023) for efficient KV cache utilization, SwiGLU activation function (Dauphin et al., 2017) for non-linear activation, Rotary Positional Embeddings (RoPE, Su</p> 
 <hr/> 
 <section class=\"footnotes\" data-bbox=\"91 1028 742 1070\"><ol class=\"footnotes-list\" data-bbox=\"91 1028 742 1070\"><li class=\"footnote-item\" data-bbox=\"91 1028 742 1070\"><p data-bbox=\"91 1028 742 1070\">${ }^{1}$ Qwen2.5-Turbo is identified as qwen-turbo-2024-11-01 and Qwen2.5-Plus is identified as qwen-plus-2024-xx-xx (to be released) in the API.</p></li></ol></section> 
</body>
```
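
To turn the QwenVL HTML back into structured data (each element's text plus its position), you can walk the tags and read their data-bbox attributes. A minimal sketch using the standard-library html.parser; it assumes the streamed output has already been concatenated and unescaped into plain HTML, and the class and field names are illustrative, not part of any SDK.

```python
from html.parser import HTMLParser

class BBoxExtractor(HTMLParser):
    """Collect {tag, bbox, text} records from QwenVL HTML output."""
    def __init__(self):
        super().__init__()
        self.items, self._stack = [], []

    def handle_starttag(self, tag, attrs):
        bbox = dict(attrs).get("data-bbox")
        if bbox:
            self._stack.append({"tag": tag, "bbox": [int(v) for v in bbox.split()], "text": ""})

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["text"] += data

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1]["tag"] == tag:
            item = self._stack.pop()
            item["text"] = item["text"].strip()
            self.items.append(item)

parser = BBoxExtractor()
parser.feed('<h2 data-bbox="91 95 223 120"> 1 Introduction</h2>')  # unescaped model output
print(parser.items)  # [{'tag': 'h2', 'bbox': [91, 95, 223, 120], 'text': '1 Introduction'}]
```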

Timestamp-related video understanding (Qwen2.5-VL only)

Qwen2.5-VL models can perceive time information, allowing them to search for specific events in videos or summarize key points from different time segments.

Prompt tips:

  • Clearly specify task requirements:

    • Specify the time range for video understanding, such as "Please describe the series of events in the following video" or "Please describe the series of events in the time range 00:05:00 to 00:10:00".

    • Event counting: such as "Count the number of occurrences and total duration of 'knowledge explanation' scenes in the video, and record the start and end timestamps of the events".

    • Action or scene localization: such as "Is there a 'player mistake' event within 5 seconds of 00:03:25 in the video? Please be accurate to the nearest 0.5 seconds".

    • Long video segmentation processing: such as "Generate a summary (with timestamps) for every 3 minutes of the following 2-hour meeting video, highlighting 'Q&A sessions' and 'resolution passing' events".

  • Clearly specify output requirements or format:

    • JSON structure constraints: "Return timestamps (start_time, end_time), event type (category), and specific event (event) in JSON format".

    • Time format representation: "Return timestamps using HH:mm:ss format or seconds (Example: 20 seconds)".

Input example

Sample code

Output example

Prompt: Please describe the series of events in the video, output the start time (start_time), end time (end_time), and event (event) in JSON format, do not output ```json```.

Use the fps parameter to control the frame extraction frequency: the video is sampled one frame every 1/fps seconds for content understanding. For instructions, see Video understanding.
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-max-latest",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"video": "https://cloud.video.taobao.com/vod/C6gCj5AJ3Qrd_UQ9kaMVRY9Ig9G-WToxVYSPRdNXCao.mp4","fps": 2.0},
                    {"text": "Please describe the series of events in the video, output the start time (start_time), end time (end_time), and event (event) in JSON format, do not output ```json```."}
                ]
            }
        ]
    }
}'
[
    {
        "start_time": "00:00:00.00",
        "end_time": "00:00:04.00",
        "event": "A person walks from the left side of the frame toward the table, carrying a cardboard box."
    },
    {
        "start_time": "00:00:04.00",
        "end_time": "00:00:06.00",
        "event": "The person places the cardboard box on the table."
    },
    {
        "start_time": "00:00:06.00",
        "end_time": "00:00:10.00",
        "event": "The person picks up a scanner gun with their right hand and scans the barcode on the cardboard box."
    },
    {
        "start_time": "00:00:10.00",
        "end_time": "00:00:12.00",
        "event": "The person puts the scanner gun back in its original position."
    },
    {
        "start_time": "00:00:12.00",
        "end_time": "00:00:15.00",
        "event": "The person picks up the cardboard box with both hands and moves it to the side."
    },
    {
        "start_time": "00:00:15.00",
        "end_time": "00:00:20.00",
        "event": "The person picks up a pen with their right hand and records information in a notebook on the table."
    }
]
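
If you need the timestamps in seconds afterwards, for example to cut clips or compute how long each event lasts, you can convert them after the call. A minimal Python sketch over output in the format above:

```python
import json

raw = '''[
    {"start_time": "00:00:00.00", "end_time": "00:00:04.00",
     "event": "A person walks toward the table carrying a cardboard box."},
    {"start_time": "00:00:04.00", "end_time": "00:00:06.00",
     "event": "The person places the cardboard box on the table."}
]'''

def to_seconds(timestamp):
    """Convert 'HH:mm:ss' or 'HH:mm:ss.ff' to seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

for item in json.loads(raw):
    duration = to_seconds(item["end_time"]) - to_seconds(item["start_time"])
    print(f'{duration:.1f}s  {item["event"]}')
```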

API references

For input and output parameters of Qwen-VL, see Qwen.

FAQ

  • Do I need to manually delete uploaded images?

    No. The server automatically deletes images after the model completes text generation.

  • Can Qwen-VL process PDF, XLSX, XLS, DOC, and other text files?

    No, Qwen-VL is designed for visual understanding and only processes image files, not text files.

  • Can Qwen-VL understand video content?

    Yes, please refer to Video understanding.

Error codes

If a call fails and an error message is returned, see Error messages.