
Alibaba Cloud Model Studio: Visual reasoning (QVQ)

Last Updated:Jun 13, 2025

The QVQ model has strong visual reasoning capabilities. It first outputs its thinking process and then the response content. Currently, QVQ supports only streaming output.

Supported models

QVQ is a visual reasoning model that supports visual input and chain-of-thought output. It demonstrates stronger capabilities in mathematics, programming, visual analysis, creative tasks, and general tasks.

| Name | Version | Context window (tokens) | Maximum input (tokens) | Maximum CoT (tokens) | Maximum response (tokens) | Input price (per million tokens) | Output price (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qvq-max (same performance as qvq-max-2025-03-25) | Stable | 131,072 | 106,496 (up to 16,384 per image) | 16,384 | 8,192 | Time-limited free trial | Time-limited free trial | 1 million tokens each, valid for 180 days after activation |
| qvq-max-latest (always same performance as the latest snapshot) | Latest | 131,072 | 106,496 (up to 16,384 per image) | 16,384 | 8,192 | Time-limited free trial | Time-limited free trial | 1 million tokens each, valid for 180 days after activation |
| qvq-max-2025-03-25 (also qvq-max-0325) | Snapshot | 131,072 | 106,496 (up to 16,384 per image) | 16,384 | 8,192 | Time-limited free trial | Time-limited free trial | 1 million tokens each, valid for 180 days after activation |

Note: After the free quota runs out, you cannot access this model. Please stay tuned for updates.

Calculate image and video tokens

Image

Each 28 × 28 pixel block corresponds to one token, and an image requires at least 4 tokens. You can estimate the token count of an image with the following code:

Python

import math
# Use the following command to install the Pillow library: pip install Pillow
from PIL import Image

def token_calculate(image_path):
    # Open the specified PNG image file
    image = Image.open(image_path)

    # Get the original dimensions of the image
    height = image.height
    width = image.width
    
    # Adjust the height to a multiple of 28
    h_bar = round(height / 28) * 28
    # Adjust the width to a multiple of 28
    w_bar = round(width / 28) * 28
    
    # Minimum token limit for an image: 4 tokens
    min_pixels = 28 * 28 * 4
    # Maximum token limit for an image: 1280 tokens
    max_pixels = 1280 * 28 * 28
        
    # Scale the image to ensure the total number of pixels is within the range [min_pixels, max_pixels]
    if h_bar * w_bar > max_pixels:
        # Calculate the scaling factor beta so that the scaled image's total pixels do not exceed max_pixels
        beta = math.sqrt((height * width) / max_pixels)
        # Recalculate the adjusted height, ensuring it's a multiple of 28
        h_bar = math.floor(height / beta / 28) * 28
        # Recalculate the adjusted width, ensuring it's a multiple of 28
        w_bar = math.floor(width / beta / 28) * 28
    elif h_bar * w_bar < min_pixels:
        # Calculate the scaling factor beta so that the scaled image's total pixels are not less than min_pixels
        beta = math.sqrt(min_pixels / (height * width))
        # Recalculate the adjusted height, ensuring it's a multiple of 28
        h_bar = math.ceil(height * beta / 28) * 28
        # Recalculate the adjusted width, ensuring it's a multiple of 28
        w_bar = math.ceil(width * beta / 28) * 28
    return h_bar, w_bar

# Replace test.png with the path to your local image
h_bar, w_bar = token_calculate("test.png")
print(f"Scaled image dimensions: height {h_bar}, width {w_bar}")

# Calculate the number of tokens for the image: total pixels divided by 28 * 28
token = int((h_bar * w_bar) / (28 * 28))

# The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
print(f"The image token count is {token + 2}")

Node.js

// Use the following command to install sharp: npm install sharp
import sharp from 'sharp';
import fs from 'fs';

async function tokenCalculate(imagePath) {
    // Open the specified PNG image file
    const image = sharp(imagePath);
    const metadata = await image.metadata();

    // Get the original dimensions of the image
    const height = metadata.height;
    const width = metadata.width;

    // Adjust the height to a multiple of 28
    let hBar = Math.round(height / 28) * 28;
    // Adjust the width to a multiple of 28
    let wBar = Math.round(width / 28) * 28;

    // Minimum token limit for an image: 4 tokens
    const minPixels = 28 * 28 * 4;
    // Maximum token limit for an image: 1280 tokens
    const maxPixels = 1280 * 28 * 28;

    // Scale the image to ensure the total number of pixels is within the range [min_pixels, max_pixels]
    if (hBar * wBar > maxPixels) {
        // Calculate the scaling factor beta so that the scaled image's total pixels do not exceed max_pixels
        const beta = Math.sqrt((height * width) / maxPixels);
        // Recalculate the adjusted height, ensuring it's a multiple of 28
        hBar = Math.floor(height / beta / 28) * 28;
        // Recalculate the adjusted width, ensuring it's a multiple of 28
        wBar = Math.floor(width / beta / 28) * 28;
    } else if (hBar * wBar < minPixels) {
        // Calculate the scaling factor beta so that the scaled image's total pixels are not less than min_pixels
        const beta = Math.sqrt(minPixels / (height * width));
        // Recalculate the adjusted height, ensuring it's a multiple of 28
        hBar = Math.ceil(height * beta / 28) * 28;
        // Recalculate the adjusted width, ensuring it's a multiple of 28
        wBar = Math.ceil(width * beta / 28) * 28;
    }

    return { hBar, wBar };
}

// Replace test.png with the path to your local image
const imagePath = 'test.png';
tokenCalculate(imagePath).then(({ hBar, wBar }) => {
    console.log(`Scaled image dimensions: height ${hBar}, width ${wBar}`);

    // Calculate the number of tokens for the image: total pixels divided by 28 * 28
    const token = Math.floor((hBar * wBar) / (28 * 28));

    // The system automatically adds <|vision_bos|> and <|vision_eos|> visual markers (1 token each)
    console.log(`The total image token count is ${token + 2}`);
}).catch(err => {
    console.error('Error processing image:', err);
});

Java

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class Main {

    // Custom class to store the resized dimensions
    public static class ResizedSize {
        public final int height;
        public final int width;

        public ResizedSize(int height, int width) {
            this.height = height;
            this.width = width;
        }
    }

    public static ResizedSize smartResize(String imagePath) throws IOException {
        // 1. Load the image
        BufferedImage image = ImageIO.read(new File(imagePath));
        if (image == null) {
            throw new IOException("Cannot load image file: " + imagePath);
        }

        int originalHeight = image.getHeight();
        int originalWidth = image.getWidth();

        final int minPixels = 28 * 28 * 4;
        final int maxPixels = 1280 * 28 * 28;
        // 2. Initial adjustment to multiples of 28
        int hBar = (int) (Math.round(originalHeight / 28.0) * 28);
        int wBar = (int) (Math.round(originalWidth / 28.0) * 28);
        int currentPixels = hBar * wBar;

        // 3. Adjust dimensions based on conditions
        if (currentPixels > maxPixels) {
            // Current pixels exceed the maximum, need to reduce
            double beta = Math.sqrt(
                    (originalHeight * (double) originalWidth) / maxPixels
            );
            double scaledHeight = originalHeight / beta;
            double scaledWidth = originalWidth / beta;

            hBar = (int) (Math.floor(scaledHeight / 28) * 28);
            wBar = (int) (Math.floor(scaledWidth / 28) * 28);
        } else if (currentPixels < minPixels) {
            // Current pixels below the minimum, need to enlarge
            double beta = Math.sqrt(
                    (double) minPixels / (originalHeight * originalWidth)
            );
            double scaledHeight = originalHeight * beta;
            double scaledWidth = originalWidth * beta;

            hBar = (int) (Math.ceil(scaledHeight / 28) * 28);
            wBar = (int) (Math.ceil(scaledWidth / 28) * 28);
        }

        return new ResizedSize(hBar, wBar);
    }

    public static void main(String[] args) {
        try {
            ResizedSize size = smartResize(
                    // Replace xxx/test.png with your image path
                    "xxx/test.png"
            );

            System.out.printf("Scaled image dimensions: height %d, width %d%n", size.height, size.width);

            // Calculate tokens (total pixels / 28×28 + 2)
            int token = (size.height * size.width) / (28 * 28) + 2;
            System.out.printf("Total image tokens: %d%n", token);

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
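
As a quick sanity check, the following applies the same rules to the 1024 × 406 sample image used in the examples later in this topic. This is a minimal sketch that assumes Python's round-half-to-even behavior; the result matches the image_tokens value reported in the streaming usage examples below.

# 1024 x 406 sample image: round each side to a multiple of 28, then divide by 28 x 28
h_bar = round(406 / 28) * 28    # 406 / 28 = 14.5 rounds to 14, so h_bar = 392
w_bar = round(1024 / 28) * 28   # 1024 / 28 is about 36.6, rounds to 37, so w_bar = 1036
image_tokens = (h_bar * w_bar) // (28 * 28) + 2  # +2 for <|vision_bos|> and <|vision_eos|>
print(image_tokens)  # 520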

Video

You can estimate the token count of a video with the following code:

Python

# Install before use: pip install opencv-python
import math
import os
import logging
import cv2

logger = logging.getLogger(__name__)

FRAME_FACTOR = 2
IMAGE_FACTOR = 28
# Maximum allowed aspect ratio of video frames (long side to short side)
MAX_RATIO = 200

# Minimum token limit for video frames
VIDEO_MIN_PIXELS = 128 * 28 * 28
# Maximum token limit for video frames
VIDEO_MAX_PIXELS = 768 * 28 * 28

# If the user does not pass the FPS parameter, use the default value
FPS = 2.0
# Minimum number of frames to extract
FPS_MIN_FRAMES = 4
# Maximum number of frames to extract. When using the qwen2.5-vl model, set FPS_MAX_FRAMES to 512; for other models, set it to 80
FPS_MAX_FRAMES = 512

# Maximum pixel value for video input,
# When using the qwen2.5-vl model, set VIDEO_TOTAL_PIXELS to 65536 * 28 * 28; for other models, set it to 24576 * 28 * 28
VIDEO_TOTAL_PIXELS = int(float(os.environ.get('VIDEO_MAX_PIXELS', 65536 * 28 * 28)))

def round_by_factor(number: int, factor: int) -> int:
    """Return the integer closest to 'number' that is divisible by 'factor'."""
    return round(number / factor) * factor

def ceil_by_factor(number: int, factor: int) -> int:
    """Return the smallest integer greater than or equal to 'number' that is divisible by 'factor'."""
    return math.ceil(number / factor) * factor

def floor_by_factor(number: int, factor: int) -> int:
    """Return the largest integer less than or equal to 'number' that is divisible by 'factor'."""
    return math.floor(number / factor) * factor

def smart_nframes(ele,total_frames,video_fps):
    """Calculate the number of video frames to extract.

    Args:
        ele (dict): Dictionary containing video configuration
            - fps: fps is used to control the number of input frames extracted by the model.
        total_frames (int): Total frames of the original video.
        video_fps (int | float): Original frame rate of the video

    Raises:
        ValueError: If the computed nframes is not within [FRAME_FACTOR, total_frames].

    Returns:
        The number of video frames for model input.
    """
    assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
    fps = ele.get("fps", FPS)
    min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
    max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
    duration = total_frames / video_fps if video_fps != 0 else 0
    if duration-int(duration) > (1/fps):
        total_frames = math.ceil(duration * video_fps)
    else:
        total_frames = math.ceil(int(duration) * video_fps)
    nframes = total_frames / video_fps * fps
    if nframes > total_frames:
        logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
    nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
    if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
        raise ValueError(f"nframes should be in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")

    return nframes

def get_video(video_path):
    # Get video information
    cap = cv2.VideoCapture(video_path)

    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    # Get video height
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    video_fps = cap.get(cv2.CAP_PROP_FPS)
    return frame_height, frame_width, total_frames, video_fps

def smart_resize(ele, path, factor=IMAGE_FACTOR):
    # Get the original video's width and height
    height, width, total_frames, video_fps = get_video(path)
    # Minimum token limit for video frames
    min_pixels = VIDEO_MIN_PIXELS
    total_pixels = VIDEO_TOTAL_PIXELS
    # Number of extracted video frames
    nframes = smart_nframes(ele, total_frames, video_fps)
    max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))

    # The aspect ratio of the video should not exceed 200:1 or 1:200
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(
            f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}"
        )

    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar
    

def token_calculate(video_path, fps):
    # Pass video path and fps extraction parameter
    messages = [{"content": [{"video": video_path, "fps": fps}]}]
    vision_infos = extract_vision_info(messages)[0]

    resized_height, resized_width = smart_resize(vision_infos, video_path)

    height, width, total_frames, video_fps = get_video(video_path)
    num_frames = smart_nframes(vision_infos, total_frames, video_fps)
    print(f"Original video size: {height}*{width}, size for model input: {resized_height}*{resized_width}, total frames of video: {total_frames}, when fps is {fps}, total frames extracted: {num_frames}", end=", ")
    video_token = int(math.ceil(num_frames / 2) * resized_height / 28 * resized_width / 28)
    video_token += 2  # The system will automatically add <|vision_bos|> and <|vision_eos|> visual markers (each counts as 1 Token)
    return video_token

def extract_vision_info(conversations):
    vision_infos = []
    if isinstance(conversations[0], dict):
        conversations = [conversations]
    for conversation in conversations:
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if (
                        "image" in ele
                        or "image_url" in ele
                        or "video" in ele
                        or ele.get("type", "") in ("image", "image_url", "video")
                    ):
                        vision_infos.append(ele)
    return vision_infos


video_token = token_calculate("xxx/test.mp4", 1)
print("Video tokens:", video_token)

Node.js

// Install before use: npm i node-ffprobe @ffprobe-installer/ffprobe
import ffprobeInstaller from '@ffprobe-installer/ffprobe';
import probe from 'node-ffprobe';
// Set the ffprobe path (global configuration)
probe.FFPROBE_PATH = ffprobeInstaller.path;

// Get video information
async function getVideoInfo(videoPath) {
  try {
    const probeData = await probe(videoPath);
    const videoStream = probeData.streams.find(
      stream => stream.codec_type === 'video'
    );

    if (!videoStream) {
      throw new Error('No video stream found in the video');
    }

    const width = videoStream.width;
    const height = videoStream.height;
    const totalFrames = videoStream.nb_frames;
    const [numerator, denominator] = videoStream.avg_frame_rate.split('/');
    const frameRate = parseFloat(numerator / denominator);

    return {
      width,
      height,
      totalFrames,
      frameRate
    };
  } catch (error) {
    console.error('Failed to get video information:', error);
    throw error;
  }
}

// Configuration parameters
const FRAME_FACTOR = 2; 
const IMAGE_FACTOR = 28;
const MAX_RATIO = 200;
// Minimum token limit for video frames
const VIDEO_MIN_PIXELS = 128 * 28 * 28; 
// Maximum token limit for video frames
const VIDEO_MAX_PIXELS = 768 * 28 * 28; 
const FPS = 2.0; // If the user does not pass the FPS parameter, use the default value
// Minimum number of frames to extract
const FPS_MIN_FRAMES = 4;
// Maximum number of frames to extract. When using the qwen2.5-vl model, set FPS_MAX_FRAMES to 512; for other models, set it to 80
const FPS_MAX_FRAMES = 512; 
// Maximum pixel value for video input,
// When using the qwen2.5-vl model, set VIDEO_TOTAL_PIXELS to 65536 * 28 * 28; for other models, set it to 24576 * 28 * 28
const VIDEO_TOTAL_PIXELS = parseInt(process.env.VIDEO_MAX_PIXELS) || 65536 * 28 * 28;

// Math utility functions
function roundByFactor(number, factor) {
    return Math.round(number / factor) * factor;
}

function ceilByFactor(number, factor) {
    return Math.ceil(number / factor) * factor;
}

function floorByFactor(number, factor) {
    return Math.floor(number / factor) * factor;
}

// Calculate the number of frames to extract
function smartNFrames(ele, totalFrames, frameRate) {
    const fps = ele.fps || FPS;
    const minFrames = ceilByFactor(ele.min_frames || FPS_MIN_FRAMES, FRAME_FACTOR);
    const maxFrames = floorByFactor(
        ele.max_frames || Math.min(FPS_MAX_FRAMES, totalFrames),
        FRAME_FACTOR
    );
    const duration = frameRate !== 0 ? parseFloat(totalFrames / frameRate) : 0;

    let totalFramesAdjusted = duration % 1 > (1 / fps)
        ? Math.ceil(duration * frameRate)
        : Math.ceil(Math.floor(parseInt(duration)) * frameRate);

    const nframes = (totalFramesAdjusted / frameRate) * fps;
    const finalNFrames = parseInt(Math.min(
        Math.max(nframes, minFrames),
        Math.min(maxFrames, totalFramesAdjusted)
    ));

    if (finalNFrames < FRAME_FACTOR || finalNFrames > totalFramesAdjusted) {
        throw new Error(
            `nframes should be between ${FRAME_FACTOR} and ${totalFramesAdjusted}, got ${finalNFrames}`
        );
    }
    return finalNFrames;
}

// Smartly adjust resolution
async function smartResize(ele, videoPath) {
    const { height, width, totalFrames, frameRate } = await getVideoInfo(videoPath);
    const minPixels = VIDEO_MIN_PIXELS;
    const nframes = smartNFrames(ele, totalFrames, frameRate);
    const maxPixels = Math.max(
        Math.min(VIDEO_MAX_PIXELS, VIDEO_TOTAL_PIXELS / nframes * FRAME_FACTOR),
        Math.floor(minPixels * 1.05)
    );

    // Check aspect ratio
    const ratio = Math.max(height, width) / Math.min(height, width);
    if (ratio > MAX_RATIO) {
        throw new Error(`Aspect ratio ${ratio} exceeds ${MAX_RATIO}`);
    }

    let hBar = Math.max(IMAGE_FACTOR, roundByFactor(height, IMAGE_FACTOR));
    let wBar = Math.max(IMAGE_FACTOR, roundByFactor(width, IMAGE_FACTOR));

    if (hBar * wBar > maxPixels) {
        const beta = Math.sqrt((height * width) / maxPixels);
        hBar = floorByFactor(height / beta, IMAGE_FACTOR);
        wBar = floorByFactor(width / beta, IMAGE_FACTOR);
    } else if (hBar * wBar < minPixels) {
        const beta = Math.sqrt(minPixels / (height * width));
        hBar = ceilByFactor(height * beta, IMAGE_FACTOR);
        wBar = ceilByFactor(width * beta, IMAGE_FACTOR);
    }

    return { hBar, wBar };
}

// Calculate token count
async function tokenCalculate(videoPath, fps) {
    const messages = [{ content: [{ video: videoPath, fps }] }];
    const visionInfos = extractVisionInfo(messages);

    const { hBar, wBar } = await smartResize(visionInfos[0], videoPath);
    const { height, width, totalFrames, frameRate } = await getVideoInfo(videoPath);
    const numFrames = smartNFrames(visionInfos[0], totalFrames, frameRate);

    console.log(
        `Original video size: ${height}*${width}, size for model input: ${hBar}*${wBar}, total frames of video: ${totalFrames}, when fps is ${fps}, total frames extracted: ${numFrames}`
    );

    const videoToken = Math.ceil(numFrames / 2) * Math.floor(hBar / 28) * Math.floor(wBar / 28) + 2;
    return videoToken;
}

// Extract vision information
function extractVisionInfo(conversations) {
    const visionInfos = [];
    if (!Array.isArray(conversations)) {
        conversations = [conversations];
    }
    conversations.forEach(conversation => {
        if (!Array.isArray(conversation)) {
            conversation = [conversation];
        }
        conversation.forEach(message => {
            if (Array.isArray(message.content)) {
                message.content.forEach(ele => {
                    if (ele.image || ele.image_url || ele.video || ['image', 'image_url', 'video'].includes(ele.type)) {
                        visionInfos.push(ele);
                    }
                });
            }
        });
    });
    return visionInfos;
}

// Example usage
(async () => {
    try {
        const videoPath = "xxx/test.mp4"; // Replace with your local path
        const videoToken = await tokenCalculate(videoPath, 1);
        console.log('Video tokens:', videoToken);
    } catch (error) {
        console.error('Error:', error.message);
    }
})();
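
For intuition, here is a worked example with assumed inputs: a hypothetical 1280 × 720 clip at 25 FPS with 250 frames and an fps parameter of 2. Stepping through the Python logic above, smart_nframes samples 20 frames and smart_resize maps each frame to 1008 × 560, so the token count is:

import math

# Hypothetical clip: 1280 x 720, 25 FPS, 250 frames, fps=2
# -> smart_nframes samples 20 frames; smart_resize yields 1008 x 560 per frame
num_frames, h_bar, w_bar = 20, 560, 1008
video_tokens = math.ceil(num_frames / 2) * (h_bar // 28) * (w_bar // 28) + 2  # +2 for the visual markers
print(video_tokens)  # 7202
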
For rate limits (including concurrency), see Rate limits.

Get started

Prerequisites: You must have obtained an API key and configured it as an environment variable. To use the SDKs, you must install the OpenAI or DashScope SDK. The DashScope SDK for Java must be version 2.19.0 or later.

Important
  • Due to the long reasoning time, QVQ currently supports only streaming output.

  • Thinking cannot be disabled for QVQ.

  • QVQ does not support System Message.

  • For the DashScope method:

    • incremental_output defaults to true and cannot be false.

    • result_format defaults to "message", and cannot be "text".

The following sample code uses an image URL for image understanding. For restrictions on input images, see the limitations on input images section. To use local images, see Using local files.
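
To pass a local image through the OpenAI-compatible interface, a common pattern is to encode it as a Base64 data URL. The sketch below is an illustration only and assumes this is supported as described in Using local files; local_image.png is a placeholder path.

import base64

# Encode a local image as a Base64 data URL (assumption: accepted as described in "Using local files")
def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

base64_image = encode_image("local_image.png")  # placeholder path
image_content = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
}
# Use image_content in place of the URL-based image entry in the sample messages below.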

OpenAI

Python

Sample code

from openai import OpenAI
import os

# Initialize OpenAI client
client = OpenAI(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""  # Define complete thinking process
answer_content = ""     # Define complete response
is_answering = False   # Determine if thinking process has ended and response has begun

# Create chat completion request
completion = client.chat.completions.create(
    model="qvq-max",  # Using qvq-max as an example, can be replaced with other model names as needed
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
                    },
                },
                {"type": "text", "text": "How to solve this problem?"},
            ],
        },
    ],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Example reasoning process and complete response

====================Reasoning Process====================

Okay, let's see. I need to find the surface area and volume for these two shapes. The first one is a rectangular prism, and the second is a cube. Hmm, right. Let me start with the first one.

For the rectangular prism, the formula for surface area is 2lw + 2lh + 2wh. The dimensions are 4 cm, 3 cm, and 2 cm. So plugging those in: 2*(4*3) + 2*(4*2) + 2*(3*2). Let me calculate each part. 4*3 is 12, times 2 is 24. Then 4*2 is 8, times 2 is 16. Then 3*2 is 6, times 2 is 12. Adding those up: 24 + 16 is 40, plus 12 is 52. So the surface area should be 52 cm². For the volume, it's length*width*height, so 4*3*2. That's 12*2=24 cm³. Okay, that seems right.

Now the cube. All sides are 3 cm. Surface area for a cube is 6 times one face's area. Each face is 3*3=9 cm², so 6*9=54 cm². Volume is side cubed, so 3*3*3=27 cm³. Let me double-check. Yeah, cube has all edges equal, so formulas make sense. I think that's it. Wait, did I mix up any numbers? Let me confirm the first one again. The rectangular prism's dimensions are 4, 3, 2. So surface area calculation was correct. Volume 4*3*2=24. Cube is straightforward. Alright, I think that's all.

====================Complete Response====================

To solve the problem, we calculate the surface area and volume for a rectangular prism and a cube using their respective formulas.

### Rectangular Prism (4 cm × 3 cm × 2 cm):
- **Surface Area**:  
  \(2(lw + lh + wh) = 2(4 \times 3 + 4 \times 2 + 3 \times 2) = 2(12 + 8 + 6) = 2 \times 26 = 52 \, \text{cm}^2\).  
- **Volume**:  
  \(l \times w \times h = 4 \times 3 \times 2 = 24 \, \text{cm}^3\).

### Cube (3 cm × 3 cm × 3 cm):
- **Surface Area**:  
  \(6s^2 = 6 \times 3^2 = 6 \times 9 = 54 \, \text{cm}^2\).  
- **Volume**:  
  \(s^3 = 3^3 = 27 \, \text{cm}^3\).

**Final Answers**:
1. Rectangular Prism: Surface Area = 52 cm², Volume = 24 cm³.
2. Cube: Surface Area = 54 cm², Volume = 27 cm³.

Node.js

Sample code

import OpenAI from "openai";
import process from 'process';

// Initialize openai client
const openai = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY, // Read from environment variable
    baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1'
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;

let messages = [
    {
        role: "user",
    content: [
        { type: "image_url", image_url: { "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg" } },
        { type: "text", text: "Solve this problem" },
    ]
}]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qvq-max',
            messages: messages,
            stream: true
        });

        console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process thinking process
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

Example reasoning process and complete response

====================Reasoning Process====================

Okay, let's see. I need to find the surface area and volume for these two shapes. The first one is a rectangular prism, and the second is a cube. Hmm, right. Let me start with the first figure.

For the rectangular prism, the formula for surface area is 2lw + 2lh + 2wh. The dimensions are given as length 4 cm, width 3 cm, and height 2 cm. So plugging those in: 2*(4*3) + 2*(4*2) + 2*(3*2). Let me calculate each part. 4*3 is 12, times 2 is 24. Then 4*2 is 8, times 2 is 16. Then 3*2 is 6, times 2 is 12. Adding those together: 24 + 16 is 40, plus 12 is 52. So the surface area should be 52 cm². For the volume, it's length*width*height, so 4*3*2. That's 12*2=24 cm³. Okay, that seems right.

Now the second figure is a cube with edge length 3 cm. Surface area for a cube is 6 times the area of one face. Since each face is 3x3, that's 9 cm². So 6*9=54 cm². Volume is edge length cubed, so 3*3*3=27 cm³. Let me double-check. Yeah, cube formulas are straightforward. 

Wait, did I mix up any numbers? For the rectangular prism, the dimensions are 4, 3, 2. So length, width, height. The surface area calculation was 2*(4*3 + 4*2 + 3*2). Which is 2*(12 + 8 + 6) = 2*26=52. Correct. Volume is 4*3*2=24. Cube is all sides equal, so 3^3=27. Surface area 6*3^2=54. Yep, that's right. I think that's all.

====================Complete Response====================

To solve the problem, we calculate the surface area and volume for both the rectangular prism and the cube using their respective formulas.

**1. Rectangular Prism (Length = 4 cm, Width = 3 cm, Height = 2 cm):**
- **Surface Area**:  
  \[
  2(lw + lh + wh) = 2(4 \times 3 + 4 \times 2 + 3 \times 2) = 2(12 + 8 + 6) = 2 \times 26 = 52 \, \text{cm}^2
  \]
- **Volume**:  
  \[
  l \times w \times h = 4 \times 3 \times 2 = 24 \, \text{cm}^3
  \]

**2. Cube (Edge Length = 3 cm):**
- **Surface Area**:  
  \[
  6a^2 = 6 \times 3^2 = 6 \times 9 = 54 \, \text{cm}^2
  \]
- **Volume**:  
  \[
  a^3 = 3^3 = 27 \, \text{cm}^3
  \]

**Final Answers:**
1. Rectangular Prism: Surface Area = 52 cm², Volume = 24 cm³
2. Cube: Surface Area = 54 cm², Volume = 27 cm³

HTTP

Sample code

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qvq-max",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
          }
        },
        {
          "type": "text",
          "text": "Solve this problem"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

Example reasoning process and complete response

data: {"choices":[{"delta":{"content":null,"role":"assistant","reasoning_content":""},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1742983020,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-ab4f3963-2c2a-9291-bda2-65d5b325f435"}

data: {"choices":[{"finish_reason":null,"delta":{"content":null,"reasoning_content":"Well"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983020,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-ab4f3963-2c2a-9291-bda2-65d5b325f435"}

data: {"choices":[{"delta":{"content":null,"reasoning_content":","},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983020,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-ab4f3963-2c2a-9291-bda2-65d5b325f435"}

data: {"choices":[{"delta":{"content":null,"reasoning_content":"I am"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983020,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-ab4f3963-2c2a-9291-bda2-65d5b325f435"}

data: {"choices":[{"delta":{"content":null,"reasoning_content":"going"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983020,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-ab4f3963-2c2a-9291-bda2-65d5b325f435"}

data: {"choices":[{"delta":{"content":null,"reasoning_content":"to solve"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983020,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-ab4f3963-2c2a-9291-bda2-65d5b325f435"}
.....
data: {"choices":[{"delta":{"content":"cm"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983095,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-23d30959-42b4-9f24-b7ab-1bb0f72ce265"}

data: {"choices":[{"delta":{"content":"³"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983095,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-23d30959-42b4-9f24-b7ab-1bb0f72ce265"}

data: {"choices":[{"finish_reason":"stop","delta":{"content":"","reasoning_content":null},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742983095,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-23d30959-42b4-9f24-b7ab-1bb0f72ce265"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":544,"completion_tokens":590,"total_tokens":1134,"completion_tokens_details":{"text_tokens":590},"prompt_tokens_details":{"text_tokens":24,"image_tokens":520}},"created":1742983095,"system_fingerprint":null,"model":"qvq-max","id":"chatcmpl-23d30959-42b4-9f24-b7ab-1bb0f72ce265"}

data: [DONE]

DashScope

Python

Sample code

import os
import dashscope
from dashscope import MultiModalConversation

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
            {"text": "Solve this problem."}
        ]
    }
]

response = MultiModalConversation.call(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qvq-max",  # Using qvq-max as an example, can be replaced with other model names as needed
    messages=messages,
    stream=True,
)

# Define complete thinking process
reasoning_content = ""
# Define complete response
answer_content = ""
# Determine if thinking process has ended and response has begun
is_answering = False

print("=" * 20 + "Reasoning Process" + "=" * 20)

for chunk in response:
    # If both thinking process and response are empty, ignore
    message = chunk.output.choices[0].message
    reasoning_content_chunk = message.get("reasoning_content", None)
    if (chunk.output.choices[0].message.content == [] and
        reasoning_content_chunk == ""):
        pass
    else:
        # If current is thinking process
        if reasoning_content_chunk is not None and chunk.output.choices[0].message.content == []:
            print(chunk.output.choices[0].message.reasoning_content, end="")
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If current is response
        elif chunk.output.choices[0].message.content != []:
            if not is_answering:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content[0]["text"], end="")
            answer_content += chunk.output.choices[0].message.content[0]["text"]

# If you need to print the complete thinking process and complete response, uncomment the following code
# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(f"{answer_content}")

Example reasoning process and complete response

====================Reasoning Process====================
Okay, let's tackle this problem step by step. So, there are two figures: a rectangular prism and a cube. I need to find the surface area and volume for each. Let me start with the first one, the rectangular prism.

First, the formula for the surface area of a rectangular prism is 2lw + 2lh + 2wh, right? Where l is length, w is width, and h is height. Looking at the diagram, the dimensions are 4 cm, 3 cm, and 2 cm. Wait, which is which? The problem says "Unit: cm" so all measurements are in centimeters. The prism is labeled with 4, 3, and 2. Let me confirm: the length is usually the longest side, so maybe 4 cm is the length, 3 cm the width, and 2 cm the height. But actually, the formula works regardless of the order as long as we use all three dimensions correctly. So plugging in the numbers: 2*(4*3) + 2*(4*2) + 2*(3*2). Let me compute each part. 4*3 is 12, times 2 is 24. 4*2 is 8, times 2 is 16. 3*2 is 6, times 2 is 12. Adding those together: 24 + 16 is 40, plus 12 is 52. So the surface area should be 52 cm². 

For the volume of the rectangular prism, it's length*width*height. So 4*3*2. 4*3 is 12, times 2 is 24. So volume is 24 cm³. That seems straightforward.

Now the cube. All sides are equal, so each edge is 3 cm. Surface area of a cube is 6 times the area of one face. Since each face is a square, area is 3*3=9 cm². Multiply by 6: 6*9=54 cm². Volume is edge length cubed, so 3*3*3=27 cm³. 

Wait, let me double-check. For the cube, surface area: each face is 3x3, six faces. Yes, 6*9=54. Volume is 3^3=27. That's correct. 

So both calculations seem right. Let me just make sure I didn't mix up any dimensions for the rectangular prism. The given dimensions are 4, 3, 2. The formula uses all three, so regardless of which is length, width, or height, the calculation should be the same. So 2*(4*3 + 4*2 + 3*2) = 2*(12 + 8 + 6) = 2*26=52. Yep, that's correct. Volume is 4*3*2=24. 

Alright, I think that's all. Both surface areas and volumes calculated correctly.

====================Complete Response====================
### Solution:

**1. Rectangular Prism (4 cm × 3 cm × 2 cm):**

- **Surface Area**:  
  \[
  2(lw + lh + wh) = 2(4 \times 3 + 4 \times 2 + 3 \times 2) = 2(12 + 8 + 6) = 2 \times 26 = 52 \, \text{cm}^2
  \]

- **Volume**:  
  \[
  l \times w \times h = 4 \times 3 \times 2 = 24 \, \text{cm}^3
  \]

**2. Cube (3 cm × 3 cm × 3 cm):**

- **Surface Area**:  
  \[
  6a^2 = 6 \times (3 \times 3) = 6 \times 9 = 54 \, \text{cm}^2
  \]

- **Volume**:  
  \[
  a^3 = 3 \times 3 \times 3 = 27 \, \text{cm}^3
  \]

**Final Answers:**  
1. Surface Area: \(52 \, \text{cm}^2\), Volume: \(24 \, \text{cm}^3\)  
2. Surface Area: \(54 \, \text{cm}^2\), Volume: \(27 \, \text{cm}^3\)

Java

Sample code

// dashscope SDK version >= 2.19.0
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re)?"":re; // Default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Reasoning Process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(content.get(0).get("text"));
            if (!isFirstPrint) {
                System.out.println("\n====================Complete Response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(MultiModalMessage Msg)  {
        return MultiModalConversationParam.builder()
                // If environment variable is not configured, replace with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Using qvq-max as an example, can be replaced with other model names as needed
                .model("qvq-max")
                .messages(Arrays.asList(Msg))
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, MultiModalMessage Msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(Msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(message -> {
            handleGenerationResult(message);
        });
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalMessage userMsg = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"),
                            Collections.singletonMap("text", "Solve this problem")))
                    .build();
            streamCallWithMessage(conv, userMsg);
//             Print final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Complete Response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}

Example reasoning process and complete response

====================Reasoning Process====================
Okay, let's tackle this problem step by step. So, there are two figures: a rectangular prism and a cube. I need to find the surface area and volume for each. Let me start with the first one, the rectangular prism.

First, the formula for the surface area of a rectangular prism is 2lw + 2lh + 2wh, right? Where l is length, w is width, and h is height. Looking at the diagram, the dimensions are 4 cm, 3 cm, and 2 cm. Wait, which is which? The problem says "Unit: cm" so all measurements are in centimeters. The prism is labeled with 4, 3, and 2. Let me confirm: the length is usually the longest side, so maybe 4 cm is the length, 3 cm the width, and 2 cm the height. But actually, the formula works regardless of the order as long as we use all three dimensions correctly. So plugging in the numbers: 2*(4*3) + 2*(4*2) + 2*(3*2). Let me compute each part. 4*3 is 12, times 2 is 24. 4*2 is 8, times 2 is 16. 3*2 is 6, times 2 is 12. Adding those together: 24 + 16 is 40, plus 12 is 52. So the surface area should be 52 cm². 

For the volume of the rectangular prism, it's length*width*height. So 4*3*2. 4*3 is 12, times 2 is 24. So volume is 24 cm³. That seems straightforward.

Now the cube. All sides are equal, so each edge is 3 cm. Surface area of a cube is 6 times the area of one face. Since each face is a square, area is 3*3=9 cm². Multiply by 6: 6*9=54 cm². Volume is edge length cubed, so 3*3*3=27 cm³. 

Wait, let me double-check. For the cube, surface area: each face is 3x3, six faces. Yes, 6*9=54. Volume is 3^3=27. That's correct. 

So both calculations seem right. Let me just make sure I didn't mix up any dimensions for the rectangular prism. The given dimensions are 4, 3, 2. The formula uses all three, so regardless of which is length, width, or height, the calculation should be the same. So 2*(4*3 + 4*2 + 3*2) = 2*(12 + 8 + 6) = 2*26=52. Yep, that's correct. Volume is 4*3*2=24. 

Alright, I think that's all. Both surface areas and volumes calculated correctly.

====================Complete Response====================
### Solution:

**1. Rectangular Prism (4 cm × 3 cm × 2 cm):**

- **Surface Area**:  
  \[
  2(lw + lh + wh) = 2(4 \times 3 + 4 \times 2 + 3 \times 2) = 2(12 + 8 + 6) = 2 \times 26 = 52 \, \text{cm}^2
  \]

- **Volume**:  
  \[
  l \times w \times h = 4 \times 3 \times 2 = 24 \, \text{cm}^3
  \]

**2. Cube (3 cm × 3 cm × 3 cm):**

- **Surface Area**:  
  \[
  6a^2 = 6 \times (3 \times 3) = 6 \times 9 = 54 \, \text{cm}^2
  \]

- **Volume**:  
  \[
  a^3 = 3 \times 3 \times 3 = 27 \, \text{cm}^3
  \]

**Final Answers:**  
1. Surface Area: \(52 \, \text{cm}^2\), Volume: \(24 \, \text{cm}^3\)  
2. Surface Area: \(54 \, \text{cm}^2\), Volume: \(27 \, \text{cm}^3\)

HTTP

Sample code

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qvq-max",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
                    {"text": "Please solve this problem"}
                ]
            }
        ]
    }
}'

Example thinking process and complete response

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"reasoning_content":"Well","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":547,"input_tokens_details":{"image_tokens":520,"text_tokens":24},"output_tokens":3,"input_tokens":544,"output_tokens_details":{"text_tokens":3},"image_tokens":520},"request_id":"f361ae45-fbef-9387-9f35-1269780e0864"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"reasoning_content":",","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":548,"input_tokens_details":{"image_tokens":520,"text_tokens":24},"output_tokens":4,"input_tokens":544,"output_tokens_details":{"text_tokens":4},"image_tokens":520},"request_id":"f361ae45-fbef-9387-9f35-1269780e0864"}

id:3
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"reasoning_content":"I will","role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":549,"input_tokens_details":{"image_tokens":520,"text_tokens":24},"output_tokens":5,"input_tokens":544,"output_tokens_details":{"text_tokens":5},"image_tokens":520},"request_id":"f361ae45-fbef-9387-9f35-1269780e0864"}
.....
id:566
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"square"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":1132,"input_tokens_details":{"image_tokens":520,"text_tokens":24},"output_tokens":588,"input_tokens":544,"output_tokens_details":{"text_tokens":588},"image_tokens":520},"request_id":"758b0356-653b-98ac-b4d3-f812437ba1ec"}

id:567
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"centimeters"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":1133,"input_tokens_details":{"image_tokens":520,"text_tokens":24},"output_tokens":589,"input_tokens":544,"output_tokens_details":{"text_tokens":589},"image_tokens":520},"request_id":"758b0356-653b-98ac-b4d3-f812437ba1ec"}

id:568
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":1134,"input_tokens_details":{"image_tokens":520,"text_tokens":24},"output_tokens":590,"input_tokens":544,"output_tokens_details":{"text_tokens":590},"image_tokens":520},"request_id":"758b0356-653b-98ac-b4d3-f812437ba1ec"}

Multi-round conversation

By default, the QVQ API does not store your conversation history. The multi-round conversation feature lets the model "remember" previous interactions, which is useful for follow-up questions and information gathering. QVQ returns both reasoning_content and content; you only need to add content to the context as {'role': 'assistant', 'content': concatenated streaming output content}. reasoning_content is not required, as shown in the sketch below.
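
In other words, after each streamed round the context grows as follows (a minimal sketch; answer_content is the concatenated content from the streaming output, exactly as in the samples below):

# Append only the final answer text, not reasoning_content
messages.append({"role": "assistant", "content": answer_content})
# Then append the next user turn
messages.append({"role": "user", "content": "your follow-up question"})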

OpenAI

Implement multi-round conversation through OpenAI SDK or OpenAI-compatible HTTP method.

Python

Sample code

from openai import OpenAI
import os

# Initialize OpenAI client
client = OpenAI(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
                },
            },
            {"type": "text", "text": "Solve this problem"},
        ],
    }
]
conversation_idx = 1
while True:
    reasoning_content = ""  # Define complete thinking process
    answer_content = ""     # Define complete response
    is_answering = False   # Determine if thinking process has ended and response has begun
    print("="*20+f"Round {conversation_idx}"+"="*20)
    conversation_idx += 1
    # Create chat completion request
    completion = client.chat.completions.create(
        model="qvq-max",  # Using qvq-max as an example, can be replaced with other model names as needed
        messages=messages,
        stream=True
    )
    print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")
    for chunk in completion:
        # If chunk.choices is empty, print usage
        if not chunk.choices:
            print("\nUsage:")
            print(chunk.usage)
        else:
            delta = chunk.choices[0].delta
            # Print thinking process
            if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
                print(delta.reasoning_content, end='', flush=True)
                reasoning_content += delta.reasoning_content
            else:
                # Start response
                if delta.content != "" and is_answering is False:
                    print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                    is_answering = True
                # Print response process
                print(delta.content, end='', flush=True)
                answer_content += delta.content
    messages.append({"role": "assistant", "content": answer_content})
    messages.append({
        "role": "user",
        "content": [
        {
            "type": "text",
            "text": input("\nEnter your message: ")
        }
        ]
    })
    print("\n")
    # print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
    # print(reasoning_content)
    # print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
    # print(answer_content)

Node.js

Sample code

import OpenAI from "openai";
import process from 'process';
import readline from 'readline/promises';

// Initialize readline interface
const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout
});

// Initialize openai client
const openai = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY, // Read from environment variable
    baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1'
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;
let messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
                },
            },
            {"type": "text", "text": "Solve this problem"},
        ],
    }
];
let conversationIdx = 1;

async function main() {
    while (true) {
        let reasoningContent = '';
        let answerContent = '';
        let isAnswering = false;
        console.log("=".repeat(20) + `Round ${conversationIdx}` + "=".repeat(20));
        conversationIdx++;


        try {
            const stream = await openai.chat.completions.create({
                model: 'qvq-max',
                messages: messages,
                stream: true
            });

            console.log("\n" + "=".repeat(20) + "Reasoning Process" + "=".repeat(20) + "\n");

            for await (const chunk of stream) {
                if (!chunk.choices?.length) {
                    console.log('\nUsage:');
                    console.log(chunk.usage);
                    continue;
                }

                const delta = chunk.choices[0].delta;

                // Process thinking process
                if (delta.reasoning_content) {
                    process.stdout.write(delta.reasoning_content);
                    reasoningContent += delta.reasoning_content;
                }

                // Process formal response
                if (delta.content) {
                    if (!isAnswering) {
                        console.log('\n' + "=".repeat(20) + "Complete Response" + "=".repeat(20) + "\n");
                        isAnswering = true;
                    }
                    process.stdout.write(delta.content);
                    answerContent += delta.content;
                }
            }

            // Add complete response to message history
            messages.push({ role: 'assistant', content: answerContent });
            const userInput = await rl.question("Enter your message: ");
            messages.push({"role": "user", "content":userInput});

            console.log("\n");

        } catch (error) {
            console.error('Error:', error);
        }
    }
}

// Start program
main().catch(console.error);

HTTP

Sample code

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qvq-max",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
          }
        },
        {
          "type": "text",
          "text": "Solve this problem"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Rectangular prism: surface area is 52, volume is 24. Cube: surface area is 54, volume is 27."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is the formula for the area of a triangle?"
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

DashScope

Implement multi-round conversations through the DashScope SDK or HTTP method.

Python

Sample code

import os
import dashscope
from dashscope import MultiModalConversation

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
            {"text": "Solve this problem"}
        ]
    }
]

conversation_idx = 1
while True:
    print("=" * 20 + f"Round {conversation_idx}" + "=" * 20)
    conversation_idx += 1
    response = MultiModalConversation.call(
        # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model="qvq-max",  # Using qvq-max as an example, can be replaced with other model names as needed
        messages=messages,
        stream=True,
    )
    # Define complete thinking process
    reasoning_content = ""
    # Define complete response
    answer_content = ""
    # Determine if thinking process has ended and response has begun
    is_answering = False
    print("=" * 20 + "Reasoning Process" + "=" * 20)
    for chunk in response:
        # If both thinking process and response are empty, ignore
        message = chunk.output.choices[0].message
        reasoning_content_chunk = message.get("reasoning_content", None)
        if (chunk.output.choices[0].message.content == [] and
                reasoning_content_chunk == ""):
            pass
        else:
            # If current is thinking process
            if reasoning_content_chunk != None and chunk.output.choices[0].message.content == []:
                print(chunk.output.choices[0].message.reasoning_content, end="")
                reasoning_content += chunk.output.choices[0].message.reasoning_content
            # If current is response
            elif chunk.output.choices[0].message.content != []:
                if not is_answering:
                    print("\n" + "=" * 20 + "Complete Response" + "=" * 20)
                    is_answering = True
                print(chunk.output.choices[0].message.content[0]["text"], end="")
                answer_content += chunk.output.choices[0].message.content[0]["text"]
    messages.append({"role": "assistant", "content": answer_content})
    messages.append({
        "role": "user",
        "content": [
        {
            "type": "text",
            "text": input("\nEnter your message: ")
        }
        ]
    })
    print("\n")
    # If you need to print the complete thinking process and complete response, uncomment the following code
    # print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
    # print(f"{reasoning_content}")
    # print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
    # print(f"{answer_content}")

Java

Sample code

// dashscope SDK version >= 2.19.0
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re)?"":re; // Default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Reasoning Process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(text);
            if (!isFirstPrint) {
                System.out.println("\n====================Complete Response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(List<MultiModalMessage> Msg)  {
        return MultiModalConversationParam.builder()
                // If environment variable is not configured, replace with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Using qvq-max as an example, can be replaced with other model names as needed
                .model("qvq-max")
                .messages(Msg)
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, List<MultiModalMessage> Msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(Msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(message -> {
            handleGenerationResult(message);
        });
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalMessage userMsg1 = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"),
                            Collections.singletonMap("text", "Solve this problem")))
                    .build();
            MultiModalMessage AssistantMsg = MultiModalMessage.builder()
                    .role(Role.ASSISTANT.getValue())
                    .content(Arrays.asList(Collections.singletonMap("text", "Rectangular prism: surface area is 52, volume is 24. Cube: surface area is 54, volume is 27.")))
                    .build();
            MultiModalMessage userMsg2 = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(Collections.singletonMap("text", "What is the formula for the area of a triangle?")))
                    .build();
            List<MultiModalMessage> Msg = Arrays.asList(userMsg1, AssistantMsg, userMsg2);
            streamCallWithMessage(conv, Msg);
//             Print final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Complete Response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}

HTTP

Sample code

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qvq-max",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
                    {"text": "Solve this problem"}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"text": "Rectangular prism: surface area is 52, volume is 24. Cube: surface area is 54, volume is 27."}
                ]
            },
            {
                "role": "user",
                "content": [
                    {"text": "What is the formula for the area of a triangle?"}
                ]
            }
        ]
    }
}'

Multiple image input

QVQ can process multiple images in a single request and responds based on all of them. You can input images as URLs, local files, or a combination of both. The following sample code uses URLs.

The total number of tokens across the input images must be less than the model's maximum input. To calculate the maximum number of images, see Image number limits.
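As a rough guide, you can estimate how many images fit into a single request by dividing the maximum input by the per-image token count. The sketch below is an illustration only: it assumes the 106,496-token maximum input of qvq-max, a per-image token count taken from the image token calculation above (including the 2 visual marker tokens), and a hypothetical reserved_for_text budget for the text part of the prompt.

# Rough estimate only: how many images of a given token size fit into one request
MAX_INPUT_TOKENS = 106496  # maximum input of qvq-max

def max_images(tokens_per_image, reserved_for_text=512):
    # reserved_for_text is a hypothetical budget kept aside for the text prompt
    available = MAX_INPUT_TOKENS - reserved_for_text
    return max(0, available // tokens_per_image)

# Example: images scaled to the 1,280-token cap, plus 2 visual marker tokens each
print(max_images(1280 + 2))  # about 82 images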

OpenAI

Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

reasoning_content = ""  # Define complete thinking process
answer_content = ""     # Define complete response
is_answering = False   # Determine if thinking process has ended and response has begun


completion = client.chat.completions.create(
    model="qvq-max",
    messages=[
        {"role": "user", "content": [
            # First image link, if passing a local file, replace the url value with the Base64 encoded format of the image
            {"type": "image_url", "image_url": {
                "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"}, },
            # Second image link, if passing a local file, replace the url value with the Base64 encoded format of the image
            {"type": "image_url",
             "image_url": {"url": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg"}, },
            {"type": "text", "text": "Answer the question in the first image, then interpret the article in the second image."},
        ],
         }
    ],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)
print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content != None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Node.js

import OpenAI from "openai";
import process from 'process';

// Initialize openai client
const openai = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY, // Read from environment variable
    baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1'
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;

let messages = [
        {role: "user",content: [
        // First image link, if passing a local file, replace the url value with the Base64 encoded format of the image
            {type: "image_url",image_url: {"url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"}},
            // Second image link, if passing a local file, replace the url value with the Base64 encoded format of the image
            {type: "image_url",image_url: {"url": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg"}},
            {type: "text", text: "Answer the question in the first image, then interpret the article in the second image." },
        ]}]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qvq-max',
            messages: messages,
            stream: true
        });

        console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process thinking process
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qvq-max",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg"
          }
        },
        {
          "type": "text",
          "text": "Answer the question in the first image, then interpret the article in the second image."
        }
      ]
    }
  ],
  "stream":true,
  "stream_options":{"include_usage":true}

}'

DashScope

Python

import os
import dashscope
from dashscope import MultiModalConversation

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {
        "role": "user",
        "content": [
            # First image link, if passing a local file, replace the url value with the Base64 encoded format of the image
            {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
            # Second image link, if passing a local file, replace the url value with the Base64 encoded format of the image
            {"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg"},
            {"text": "Answer the question in the first image, then interpret the article in the second image."}
        ]
    }
]

response = MultiModalConversation.call(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qvq-max",  # Using qvq-max as an example, can be replaced with other model names as needed
    messages=messages,
    stream=True,
)

# Define complete thinking process
reasoning_content = ""
# Define complete response
answer_content = ""
# Determine if thinking process has ended and response has begun
is_answering = False

print("=" * 20 + "Reasoning Process" + "=" * 20)


for chunk in response:
    # If both thinking process and response are empty, ignore
    message = chunk.output.choices[0].message
    reasoning_content_chunk = message.get("reasoning_content", None)

    if (chunk.output.choices[0].message.content == [] and
        reasoning_content_chunk == ""):
        pass
    else:
        # If current is thinking process
        if reasoning_content_chunk != None and chunk.output.choices[0].message.content == []:
            print(chunk.output.choices[0].message.reasoning_content, end="")
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If current is response
        elif chunk.output.choices[0].message.content != []:
            if not is_answering:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content[0]["text"], end="")
            answer_content += chunk.output.choices[0].message.content[0]["text"]

# If you need to print the complete thinking process and complete response, uncomment the following code
# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(f"{answer_content}")

Java

// dashscope SDK version >= 2.19.0
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re)?"":re; // Default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Reasoning Process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(text);
            if (!isFirstPrint) {
                System.out.println("\n====================Complete Response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(MultiModalMessage Msg)  {
        return MultiModalConversationParam.builder()
                // If environment variable is not configured, replace with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Using qvq-max as an example, can be replaced with other model names as needed
                .model("qvq-max")
                .messages(Arrays.asList(Msg))
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, MultiModalMessage Msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(Msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(message -> {
            handleGenerationResult(message);
        });
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                    .content(Arrays.asList(
                            // First image link
                            Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
                            // If using a local image, uncomment the line below
                            // new HashMap<String, Object>(){{put("image", filePath);}},
                            // Second image link
                            Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
                            // Third image link
                            Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/hbygyo/rabbit.jpg"),
                            Collections.singletonMap("text", "What do these images depict?"))).build();

            streamCallWithMessage(conv, userMessage);
//             Print final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Complete Response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
--data '{
    "model": "qvq-max",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01gDEY8M1W114Hi3XcN_!!6000000002727-0-tps-1024-406.jpg"},
                    {"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg"},
                    {"text": "Answer the question in the first image, then interpret the article in the second image."}
                ]
            }
        ]
    }
}'
 

Video understanding

Input videos as image lists or video files.

Image list

An image list must contain at least 4 and at most 512 images.

The following sample code passes the image sequence as URLs. To pass a local video, see Using local files (Base64 encoded).
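If you sample the frames from a video yourself, a quick check such as the sketch below helps confirm that the list stays within the 4 to 512 image limit. The duration and fps values are illustrative; fps here means the sampling rate, that is, one frame every 1/fps seconds.

# Quick check: number of frames obtained by sampling a video at a given fps,
# and whether it satisfies the 4-512 image limit for an image list
def frame_count(duration_seconds, fps):
    return int(duration_seconds * fps)

def is_valid_image_list(n_frames):
    return 4 <= n_frames <= 512

n = frame_count(duration_seconds=60, fps=2)  # a 60-second clip sampled at fps=2 gives 120 frames
print(n, is_valid_image_list(n))             # 120 True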

OpenAI

Python

import os
from openai import OpenAI

client = OpenAI(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

reasoning_content = ""  # Define complete thinking process
answer_content = ""     # Define complete response
is_answering = False   # Determine if thinking process has ended and response has begun


completion = client.chat.completions.create(
    model="qvq-max",
    messages=[{"role": "user","content": [
        # When passing an image list, the "type" parameter in the user message is "video"
        # When using the OpenAI SDK, the image sequence is by default extracted from the video at intervals of 0.5 seconds and does not support modification. If you need to customize the frame extraction frequency, please use the DashScope SDK.
        {"type": "video","video": [
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
        {"type": "text","text": "Describe the specific process of this video"},
    ]}],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)
print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content != None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Node.js

import OpenAI from "openai";
import process from 'process';

// Initialize openai client
const openai = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY, // Read from environment variable
    baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1'
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;

let messages = [{
            role: "user",
            content: [
                {
                    // When passing an image list, the "type" parameter in the user message is "video"
                    // When using the OpenAI SDK, the image sequence is by default extracted from the video at intervals of 0.5 seconds and does not support modification. If you need to customize the frame extraction frequency, please use the DashScope SDK.
                    type: "video",
                    video: [
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                        "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
                    ]
                },
                {
                    type: "text",
                    text: "Describe the specific process of this video"
                }
            ]
        }]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qvq-max',
            messages: messages,
            stream: true
        });

        console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process thinking process
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

HTTP

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qvq-max",
    "messages": [{"role": "user",
                "content": [{"type": "video",
                "video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
                {"type": "text",
                "text": "Describe the specific process of this video"}]}],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

DashScope

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{"role": "user",
             "content": [
                 {"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                           "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
                   "fps":2}, # When using qvq, you can specify the fps parameter. It indicates that the image sequence is extracted from the video at intervals of 1/fps seconds.
                 {"text": "Describe the specific process of this video"}]}]
response = dashscope.MultiModalConversation.call(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model='qvq-max',
    messages=messages,
    stream=True
)

# Define complete thinking process
reasoning_content = ""
# Define complete response
answer_content = ""
# Determine if thinking process has ended and response has begun
is_answering = False

print("=" * 20 + "Reasoning Process" + "=" * 20)


for chunk in response:
    # If both thinking process and response are empty, ignore
    message = chunk.output.choices[0].message
    reasoning_content_chunk = message.get("reasoning_content", None)

    if (chunk.output.choices[0].message.content == [] and
        reasoning_content_chunk == ""):
        pass
    else:
        # If current is thinking process
        if reasoning_content_chunk != None and chunk.output.choices[0].message.content == []:
            print(chunk.output.choices[0].message.reasoning_content, end="")
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If current is response
        elif chunk.output.choices[0].message.content != []:
            if not is_answering:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content[0]["text"], end="")
            answer_content += chunk.output.choices[0].message.content[0]["text"]

# If you need to print the complete thinking process and complete response, uncomment the following code
# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(f"{answer_content}")

Java

// dashscope SDK version >= 2.19.0
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re)?"":re; // Default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Reasoning Process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(text);
            if (!isFirstPrint) {
                System.out.println("\n====================Complete Response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(MultiModalMessage Msg)  {
        return MultiModalConversationParam.builder()
                // If environment variable is not configured, replace with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Using qvq-max as an example, can be replaced with other model names as needed
                .model("qvq-max")
                .messages(Arrays.asList(Msg))
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, MultiModalMessage Msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(Msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(message -> {
            handleGenerationResult(message);
        });
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            Map<String, Object> params = Map.of(
                    "video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg")
            );
            MultiModalMessage userMessage = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(
                            params,
                            Collections.singletonMap("text", "Describe the specific process of this video")))
                    .build();
            streamCallWithMessage(conv, userMessage);
//             Print final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Complete Response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }}

HTTP

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
  "model": "qvq-max",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "video": [
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
              "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
            ]

          },
          {
            "text": "Describe the specific process of this video"
          }
        ]
      }
    ]
  }
}'

Video file

  • Size:

  • Format: MP4, AVI, MKV, MOV, FLV, WMV.

  • Duration: From 2 seconds to 10 minutes.

  • Dimensions: No restrictions. However, video frames are resized so that their total pixel count is approximately 600,000, so larger dimensions do not provide better understanding (a rough estimate of the resulting size is sketched after this list).

  • Currently, audio understanding of video files is not supported.
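To get a feel for the 600,000-pixel adjustment, the sketch below estimates the size a frame might be scaled to. It is a rough approximation that simply assumes frames are scaled so that width × height is about 600,000 pixels; it is not the service's actual resizing algorithm.

import math

# Rough approximation: assume frames are scaled so that width * height is about 600,000 pixels
MAX_FRAME_PIXELS = 600000

def approx_scaled_size(width, height):
    if width * height <= MAX_FRAME_PIXELS:
        return width, height
    scale = math.sqrt(MAX_FRAME_PIXELS / (width * height))
    return int(width * scale), int(height * scale)

print(approx_scaled_size(1920, 1080))  # a 1080p frame shrinks to roughly (1032, 580)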

The following sample codes are for video URL. To pass local video, see Using local files (Base64 encoded).

OpenAI

Python

import os
from openai import OpenAI

client = OpenAI(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

reasoning_content = ""  # Define complete thinking process
answer_content = ""     # Define complete response
is_answering = False   # Determine if thinking process has ended and response has begun

completion = client.chat.completions.create(
    model="qvq-max",
    messages=[{"role": "user","content": [
        # When passing a video file, the "type" parameter in the user message is "video_url"
        # When using the OpenAI SDK, the image sequence is by default extracted from the video at intervals of 0.5 seconds and does not support modification. If you need to customize the frame extraction frequency, please use the DashScope SDK.
        {"type": "video_url","video_url":{"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"} },
        {"type": "text","text": "This is the beginning part of the video. Please analyze and guess what knowledge the video is explaining."},
    ]}],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)
print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content != None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Node.js

import OpenAI from "openai";
import process from 'process';

// Initialize openai client
const openai = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY, // Read from environment variable
    baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1'
});

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;

let messages = [
    {
        role: "user",
    content: [
        // When using the OpenAI SDK, the image sequence is by default extracted from the video at intervals of 0.5 seconds and does not support modification. If you need to customize the frame extraction frequency, please use the DashScope SDK.
        { type: "video_url", video_url: { "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov" } },
        { type: "text", text: "This is the beginning part of the video. Please analyze and guess what knowledge the video is explaining." },
    ]
}]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qvq-max',
            messages: messages,
            stream: true
        });

        console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process thinking process
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

HTTP

curl

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "qvq-max",
    "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "video_url",
          "video_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"
          }
        },
        {
          "type": "text",
          "text": "This is the beginning part of the video. Please analyze and guess what knowledge the video is explaining."
        }
      ]
    }
  ],
    "stream":true,
    "stream_options":{"include_usage":true}
}'

DashScope

Python

import os
import dashscope
from dashscope import MultiModalConversation

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {
        "role": "user",
        "content": [
            # You can specify the fps parameter. It indicates that the image sequence is extracted from the video at intervals of 1/fps seconds. For instructions, see https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
            {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov,"fps":2"},
            {"text": "This is the beginning part of the video. Please analyze and guess what knowledge the video is explaining."}
        ]
    }
]

response = MultiModalConversation.call(
    # If environment variable is not configured, replace with your Model Studio API Key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qvq-max",  # Using qvq-max as an example, can be replaced with other model names as needed
    messages=messages,
    stream=True,
)

# Define complete thinking process
reasoning_content = ""
# Define complete response
answer_content = ""
# Determine if thinking process has ended and response has begun
is_answering = False

print("=" * 20 + "Reasoning Process" + "=" * 20)

for chunk in response:
    # If both thinking process and response are empty, ignore
    message = chunk.output.choices[0].message
    reasoning_content_chunk = message.get("reasoning_content", None)
    if (chunk.output.choices[0].message.content == [] and
        reasoning_content_chunk == ""):
        pass
    else:
        # If current is thinking process
        if reasoning_content_chunk != None and chunk.output.choices[0].message.content == []:
            print(chunk.output.choices[0].message.reasoning_content, end="")
            reasoning_content += chunk.output.choices[0].message.reasoning_content
        # If current is response
        elif chunk.output.choices[0].message.content != []:
            if not is_answering:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20)
                is_answering = True
            print(chunk.output.choices[0].message.content[0]["text"], end="")
            answer_content += chunk.output.choices[0].message.content[0]["text"]

# If you need to print the complete thinking process and complete response, uncomment the following code
# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(f"{reasoning_content}")
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(f"{answer_content}")

Java

// dashscope SDK version >= 2.19.0
import java.util.*;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.Flowable;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.exception.InputRequiredException;
import java.lang.System;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    private static final Logger logger = LoggerFactory.getLogger(Main.class);
    private static StringBuilder reasoningContent = new StringBuilder();
    private static StringBuilder finalContent = new StringBuilder();
    private static boolean isFirstPrint = true;

    private static void handleGenerationResult(MultiModalConversationResult message) {
        String re = message.getOutput().getChoices().get(0).getMessage().getReasoningContent();
        String reasoning = Objects.isNull(re)?"":re; // Default value

        List<Map<String, Object>> content = message.getOutput().getChoices().get(0).getMessage().getContent();
        if (!reasoning.isEmpty()) {
            reasoningContent.append(reasoning);
            if (isFirstPrint) {
                System.out.println("====================Reasoning Process====================");
                isFirstPrint = false;
            }
            System.out.print(reasoning);
        }

        if (Objects.nonNull(content) && !content.isEmpty()) {
            Object text = content.get(0).get("text");
            finalContent.append(text);
            if (!isFirstPrint) {
                System.out.println("\n====================Complete Response====================");
                isFirstPrint = true;
            }
            System.out.print(text);
        }
    }
    public static MultiModalConversationParam buildMultiModalConversationParam(MultiModalMessage Msg)  {
        return MultiModalConversationParam.builder()
                // If environment variable is not configured, replace with your Model Studio API Key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                // Using qvq-max as an example, can be replaced with other model names as needed
                .model("qvq-max")
                .messages(Arrays.asList(Msg))
                .incrementalOutput(true)
                .build();
    }

    public static void streamCallWithMessage(MultiModalConversation conv, MultiModalMessage Msg)
            throws NoApiKeyException, ApiException, InputRequiredException, UploadFileException {
        MultiModalConversationParam param = buildMultiModalConversationParam(Msg);
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(message -> {
            handleGenerationResult(message);
        });
    }
    public static void main(String[] args) {
        try {
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalMessage userMsg = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(Collections.singletonMap("video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"),
                            Collections.singletonMap("text", "This is the beginning part of the video. Please analyze and guess what knowledge the video is explaining.")))
                    .build();
            streamCallWithMessage(conv, userMsg);
//             Print final result
//            if (reasoningContent.length() > 0) {
//                System.out.println("\n====================Complete Response====================");
//                System.out.println(finalContent.toString());
//            }
        } catch (ApiException | NoApiKeyException | UploadFileException | InputRequiredException e) {
            logger.error("An exception occurred: {}", e.getMessage());
        }
        System.exit(0);
    }}

HTTP

curl

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
    "model": "qvq-max",
    "input":{
        "messages":[
            {
                "role": "user",
                "content": [
                    {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"},
                    {"text": "This is the beginning part of the video. Please analyze and guess what knowledge the video is explaining."}
                ]
            }
        ]
    }
}'

Using local files (Base64 encoded input)

Here are sample codes for passing local files (images and videos) using Base64 encoding. Currently, only the OpenAI SDK and HTTP method support Base64-encoded local files.

Image

For constraints, see Limits on input images. To pass an image URL, see Get started.

The Base64-encoded image must be less than 10 MB.
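Because Base64 encoding inflates data by roughly one third, an image of about 7.5 MB on disk already reaches the 10 MB limit after encoding. A minimal pre-check (the helper name and path are illustrative) might look like this:

import os

# Check that a local image stays under the 10 MB Base64 limit before encoding it.
# Base64 output is exactly ceil(raw_size / 3) * 4 bytes.
MAX_BASE64_BYTES = 10 * 1024 * 1024

def base64_size_ok(path):
    raw_size = os.path.getsize(path)
    encoded_size = (raw_size + 2) // 3 * 4
    return encoded_size < MAX_BASE64_BYTES

# Replace xxx/test.jpg with the path to your local image
# print(base64_size_ok("xxx/test.jpg"))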

OpenAI

Python

from openai import OpenAI
import os
import base64


# Base64 encoding function
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/test.jpg with the absolute path of your local image
base64_image = encode_image("xxx/test.jpg")

# Initialize OpenAI client
client = OpenAI(
    # If environment variable is not configured, replace with Alibaba Cloud Model Studio API Key: api_key="sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""  # Define complete reasoning process
answer_content = ""     # Define complete response
is_answering = False   # Determine whether the reasoning process is finished and the response has started

# Create chat completion request
completion = client.chat.completions.create(
    model="qvq-max",  # Using qvq-max as an example, you can change the model name as needed
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Note: when passing Base64 data, the image format (image/{format}) must match a Content Type in the supported image list. The "f" prefix denotes a Python f-string.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                },
                {"type": "text", "text": "How do I solve this problem?"},
            ],
        }
    ],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print reasoning process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content != None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variable is not configured, replace the following line with: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
};
// Replace xxxx/test.jpg with the absolute path of your local image
const base64Image = encodeImage("xxx/test.jpg")
let reasoningContent = '';
let answerContent = '';
let isAnswering = false;

let messages = [
    {
        role: "user",
        content: [
            // The image format in the data URL (image/jpeg here) must match the Content Type of your local file
            { type: "image_url", image_url: { "url": `data:image/jpeg;base64,${base64Image}` } },
            { type: "text", text: "Please solve this problem" },
        ]
    }
];

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qvq-max',
            messages: messages,
            stream: true
        });

        console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process reasoning
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

Video

Video file

The Base64-encoded video must be less than 10 MB.

Python

from openai import OpenAI
import os
import base64


#  base64 encoding format
def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")

# Replace xxxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxx/test.mp4")

# Initialize OpenAI client
client = OpenAI(
    # If environment variable is not configured, replace with Alibaba Cloud Model Studio API Key: api_key="sk-xxx"
    api_key = os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""  # Define complete reasoning process
answer_content = ""     # Define complete response
is_answering = False   # Determine whether the reasoning process is finished and the response has started

# Create chat completion request
completion = client.chat.completions.create(
    model="qvq-max",  # Using qvq-max as an example, you can change the model name as needed
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    # Note that when passing Base64, change video/mp4 to match your local video file
                    "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
                },
                {"type": "text", "text": "What is this video about?"},
            ],
        }
    ],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print reasoning process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variable is not configured, replace the following line with: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeVideo = (videoPath) => {
    const videoFile = readFileSync(videoPath);
    return videoFile.toString('base64');
};
// Replace xxxx/test.mp4 with the absolute path of your local video
const base64Video = encodeVideo("xxx/test.mp4")

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;

let messages = [
    {
        role: "user",
        content: [
            // Note that when passing Base64, change video/mp4 to match your local video file
            { type: "video_url", video_url: { "url": `data:video/mp4;base64,${base64Video}` } },
            { type: "text", text: "What is this video about?" },
        ]
    }
];

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qvq-max',
            messages: messages,
            stream: true
        });

        console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process reasoning
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

Image list

Each Base64-encoded video frame must be less than 10 MB.

Python

import os
from openai import OpenAI
import base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")


client = OpenAI(
    # If environment variable is not configured, replace the following line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

reasoning_content = ""  # Define complete reasoning process
answer_content = ""     # Define complete response
is_answering = False   # Determine whether the reasoning process is finished and the response has started


completion = client.chat.completions.create(
    model="qvq-max",
    messages=[{"role": "user","content": [
        # When passing an image list, the "type" parameter in the user message is "video"
        {"type": "video","video": [
            f"data:image/png;base64,{base64_image1}",
            f"data:image/png;base64,{base64_image2}",
            f"data:image/png;base64,{base64_image3}",
            f"data:image/png;base64,{base64_image4}",]},
        {"type": "text","text": "Describe the specific process in this video?"},
    ]}],
    stream=True,
    # Uncomment the following to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)
print("\n" + "=" * 20 + "Reasoning Process" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print reasoning process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            # Start response
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "Complete Response" + "=" * 20 + "\n")
                is_answering = True
            # Print response process
            print(delta.content, end='', flush=True)
            answer_content += delta.content

# print("=" * 20 + "Complete Reasoning Process" + "=" * 20 + "\n")
# print(reasoning_content)
# print("=" * 20 + "Complete Response" + "=" * 20 + "\n")
# print(answer_content)

Node.js

import OpenAI from "openai";
import { readFileSync } from 'fs';

const openai = new OpenAI(
    {
        // If environment variable is not configured, replace the following line with: apiKey: "sk-xxx"
        apiKey: process.env.DASHSCOPE_API_KEY,
        baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
    }
);

const encodeImage = (imagePath) => {
    const imageFile = readFileSync(imagePath);
    return imageFile.toString('base64');
};

const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")

let reasoningContent = '';
let answerContent = '';
let isAnswering = false;
let messages = [{
            role: "user",
            content: [
                {
                    // When passing an image list, the "type" parameter in the user message is "video"
                    type: "video",
                    video: [
                            `data:image/jpeg;base64,${base64Image1}`,
                            `data:image/jpeg;base64,${base64Image2}`,
                            `data:image/jpeg;base64,${base64Image3}`,
                            `data:image/jpeg;base64,${base64Image4}`
                    ]
                },
                {
                    type: "text",
                    text: "Describe the specific process in this video"
                }
            ]
        }]

async function main() {
    try {
        const stream = await openai.chat.completions.create({
            model: 'qvq-max',
            messages: messages,
            stream: true
        });

        console.log('\n' + '='.repeat(20) + 'Reasoning Process' + '='.repeat(20) + '\n');

        for await (const chunk of stream) {
            if (!chunk.choices?.length) {
                console.log('\nUsage:');
                console.log(chunk.usage);
                continue;
            }

            const delta = chunk.choices[0].delta;

            // Process reasoning
            if (delta.reasoning_content) {
                process.stdout.write(delta.reasoning_content);
                reasoningContent += delta.reasoning_content;
            }
            // Process formal response
            else if (delta.content) {
                if (!isAnswering) {
                    console.log('\n' + '='.repeat(20) + 'Complete Response' + '='.repeat(20) + '\n');
                    isAnswering = true;
                }
                process.stdout.write(delta.content);
                answerContent += delta.content;
            }
        }
    } catch (error) {
        console.error('Error:', error);
    }
}

main();

Usage notes

Supported image formats

Here are the supported image formats. When using the OpenAI SDK to input local images, set image/{format} according to the Content Type column (a small helper sketch follows the table).

| Image format | File name extension | Content Type |
| --- | --- | --- |
| BMP | .bmp | image/bmp |
| JPEG | .jpe, .jpeg, .jpg | image/jpeg |
| PNG | .png | image/png |
| TIFF | .tif, .tiff | image/tiff |
| WEBP | .webp | image/webp |
| HEIC | .heic | image/heic |
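When building a Base64 data URL for a local file, the image/{format} prefix must match the Content Type above. Below is a minimal helper sketch; the mapping mirrors the table, and the function name data_url_prefix is illustrative rather than part of any SDK:

import os

# Content Type by file name extension, taken from the table above
CONTENT_TYPES = {
    ".bmp": "image/bmp",
    ".jpe": "image/jpeg", ".jpeg": "image/jpeg", ".jpg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff", ".tiff": "image/tiff",
    ".webp": "image/webp",
    ".heic": "image/heic",
}

def data_url_prefix(image_path):
    # Look up the Content Type by file extension and build the data URL prefix
    ext = os.path.splitext(image_path)[1].lower()
    if ext not in CONTENT_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")
    return f"data:{CONTENT_TYPES[ext]};base64,"

# Example: prepend the prefix to a Base64-encoded JPEG image
# image_url = {"url": data_url_prefix("test.jpg") + base64_image}
print(data_url_prefix("test.jpg"))  # data:image/jpeg;base64,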

Image size limitations

  • The size of a single image file must not exceed 10 MB. When using the OpenAI SDK, the Base64-encoded image must not exceed 10 MB either.

  • The width and height of an image must both be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.

  • No pixel count limit applies to a single image, because the model scales and preprocesses the image before understanding it. Larger images do not necessarily improve understanding performance. Recommended pixel values (see the validation sketch after this list):

    • For a single image input to qvq-max, qvq-max-latest, or qvq-max-2025-03-25, the number of pixels should not exceed 1,003,520.
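The sketch below checks these limits locally before a request is sent. It is a minimal example that assumes Pillow is installed; the thresholds come from the list above, and the file path is a placeholder:

import os
# Use the following command to install the Pillow library: pip install Pillow
from PIL import Image

MAX_FILE_BYTES = 10 * 1024 * 1024   # A single image file must not exceed 10 MB (binary MB assumed)
MAX_ASPECT_RATIO = 200              # Aspect ratio must not exceed 200:1 or 1:200
MAX_PIXELS = 1_003_520              # Recommended pixel limit for qvq-max models

def check_image_limits(image_path):
    problems = []
    # File size check
    if os.path.getsize(image_path) > MAX_FILE_BYTES:
        problems.append("file is larger than 10 MB")
    with Image.open(image_path) as image:
        width, height = image.size
    # Dimension and aspect ratio checks
    if width <= 10 or height <= 10:
        problems.append("width and height must both be greater than 10 pixels")
    if max(width, height) / min(width, height) > MAX_ASPECT_RATIO:
        problems.append("aspect ratio exceeds 200:1 (or 1:200)")
    # Recommended pixel count check
    if width * height > MAX_PIXELS:
        problems.append("pixel count exceeds the recommended 1,003,520")
    return problems

# Replace test.jpg with the path to your local image
for issue in check_image_limits("test.jpg"):
    print("Warning:", issue)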

Image number limitations

In multi-image input, the number of images is limited by the model's total token limit for text and images (maximum input). The total token count of all images must be less than the model's maximum input.

For example, qvq-max has a maximum input of 106,496 tokens. The default token limit is 1,280 per image. You can set vl_high_resolution_images in DashScope to increase the token limit to 16,384 per image. If your input images are all 1280 × 1280:

| Token limit per image | Adjusted image | Image tokens | Maximum number of images |
| --- | --- | --- | --- |
| 1,280 (default) | 980 × 980 | 1,227 | 86 |
| 16,384 | 1288 × 1288 | 2,118 | 50 |
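As a cross-check, the values in this table can be reproduced with the rounding-and-scaling logic from Calculate image and video tokens. The following is a minimal sketch that assumes 1280 × 1280 inputs and the qvq-max maximum input of 106,496 tokens; the minimum-size branch is omitted because it does not apply here:

import math

MAX_INPUT_TOKENS = 106_496  # Maximum input of qvq-max

def image_tokens(width, height, token_limit_per_image):
    # Round each side to a multiple of 28
    max_pixels = token_limit_per_image * 28 * 28
    h_bar = round(height / 28) * 28
    w_bar = round(width / 28) * 28
    # Downscale if the adjusted image exceeds the per-image pixel budget
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / 28) * 28
        w_bar = math.floor(width / beta / 28) * 28
    # +2 for the <|vision_bos|> and <|vision_eos|> markers
    return (h_bar * w_bar) // (28 * 28) + 2

for limit in (1280, 16384):
    tokens = image_tokens(1280, 1280, limit)
    print(f"Token limit {limit}: {tokens} tokens per image, "
          f"up to {MAX_INPUT_TOKENS // tokens} images")

Running this prints 1,227 tokens per image (up to 86 images) at the default limit and 2,118 tokens per image (up to 50 images) at the 16,384 limit, matching the table.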

API reference

For input and output parameter details, see Qwen.

Error codes

If the call fails and an error message is returned, see Error messages.