The Qwen VL model provides answers based on the images you submit.
Visit Playground to experience image understanding.
Scenarios
Image question answering: Describe the content in images or classify and label them, such as identifying people, places, animals, and more.
Mathematical problem solving: Solve mathematical problems in images, suitable for education and training.
Video understanding: Analyze video content, such as locating specific events and obtaining timestamps, or generating summaries of key time segments.
Object localization: Locate objects in images, returning the coordinates of the top-left and bottom-right corners of the bounding rectangle or the coordinates of the center point.
Document parsing: Parse image-based documents (such as scanned documents/image PDFs) into QwenVL HTML format, which not only accurately recognizes text but also obtains position information of elements such as images and tables.
OCR and information extraction: Recognize text and formulas in images, or extract information from receipts, certificates, and forms with formatted text output. Supported languages include Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.
Models and pricing
How to use
You must first Obtain an API key and Set API key as an environment variable. To use the OpenAI SDK or DashScope SDK, you must Install the SDK.
The DashScope Python SDK version must be no lower than 1.20.7. The DashScope Java SDK version must be no lower than 2.18.3.
If you are a member of a sub-workspace, make sure that the Super Admin has authorized models for the sub-workspace.
You can use the recommended prompts to adapt to different scenarios. Note that only Qwen2.5-VL supports document parsing, object localization, and video understanding with timestamps.
When using qwen-vl-plus-latest and qwen-vl-plus-2025-01-25 for text extraction, set presence_penalty to 1.5 and repetition_penalty to 1.0 for better accuracy.
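As a minimal sketch of that recommendation (assuming the DashScope Python SDK forwards presence_penalty and repetition_penalty as generation parameters; the image URL and prompt here are placeholders):
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
            {"text": "Extract all text in the image."}
        ]
    }
]
response = dashscope.MultiModalConversation.call(
    # If the environment variable is not configured, replace the line below with: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-plus-latest',
    messages=messages,
    presence_penalty=1.5,    # recommended for text extraction
    repetition_penalty=1.0   # recommended for text extraction
)
print(response.output.choices[0].message.content[0]["text"])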
Get started
Sample code for image understanding using image URLs.
Check the limitations on input images. To use local images, see Using local files.
OpenAI
Python
from openai import OpenAI
import os
client = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)
completion = client.chat.completions.create(
model="qwen-vl-max", # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=[
{
"role": "system",
"content": [{"type": "text", "text": "You are a helpful assistant."}],
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
},
},
{"type": "text", "text": "What scene is depicted in this image?"},
],
},
],
)
print(completion.choices[0].message.content)
Sample response
This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the ocean and sky in the background. The person and dog appear to be interacting, with the dog's front paws resting on the person's hand. Sunlight is shining from the right side of the frame, adding a warm atmosphere to the entire scene.
Node.js
import OpenAI from "openai";
const openai = new OpenAI({
// If environment variable is not configured, replace the line below with: apiKey: "sk-xxx"
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});
async function main() {
const response = await openai.chat.completions.create({
model: "qwen-vl-max", // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages: [{
role: "system",
content: [{
type: "text",
text: "You are a helpful assistant."
}]
},
{
role: "user",
content: [{
type: "image_url",
image_url: {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
}
},
{
type: "text",
text: "What scene is depicted in this image?"
}
]
}
]
});
console.log(response.choices[0].message.content);
}
main()
Sample response
This is a photo taken on a beach. In the photo, a woman wearing a plaid shirt is sitting on the sand, interacting with a yellow Labrador retriever wearing a collar. The background shows the ocean and sky, with sunlight shining on them, creating a warm atmosphere.
curl
curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-max",
"messages": [
{"role":"system",
"content":[
{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
{"type": "text", "text": "What scene is depicted in this image?"}
]
}]
}'
Sample response
{
"choices": [
{
"message": {
"content": "This image shows a woman and a dog interacting on a beach. The woman is sitting on the sand, smiling and shaking hands with the dog. The background features the ocean and sunset sky, creating a very warm and harmonious atmosphere. The dog is wearing a collar and appears very gentle.",
"role": "assistant"
},
"finish_reason": "stop",
"index": 0,
"logprobs": null
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 1270,
"completion_tokens": 54,
"total_tokens": 1324
},
"created": 1725948561,
"system_fingerprint": null,
"model": "qwen-vl-max",
"id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}
DashScope
Python
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role": "system",
"content": [
{"text": "You are a helpful assistant."}]
},
{
"role": "user",
"content": [
{"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},
{"text": "What scene is depicted in this image?"}]
}]
response = dashscope.MultiModalConversation.call(
# If environment variable is not configured, replace the line below with: api_key="sk-xxx"
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-max', # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=messages
)
print(response.output.choices[0].message.content[0]["text"])
Sample response
This is a photo taken on a beach. The photo shows a woman and a dog. The woman is sitting on the sand, smiling and interacting with the dog. The dog is wearing a collar and appears to be shaking hands with the woman. The background features the ocean and sky, with sunlight shining on them, creating a warm atmosphere.
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.content(Arrays.asList(
Collections.singletonMap("text", "You are a helpful assistant."))).build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
Collections.singletonMap("text", "What scene is depicted in this image?"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If environment variable is not configured, replace the line below with: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-max") // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
.messages(Arrays.asList(systemMessage, userMessage))
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text")); }
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Sample response
This is a photo taken on a beach. The photo shows a person wearing a plaid shirt and a dog wearing a collar. They are sitting face to face, appearing to interact. The background is the ocean and sky, with sunlight shining on them, creating a warm and harmonious atmosphere.
curl
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-max",
"input":{
"messages":[
{"role": "system",
"content": [
{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"text": "What scene is depicted in this image?"}
]
}
]
}
}'
Sample response
{
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "This is a photo taken on a beach. The photo shows a person wearing a plaid shirt and a dog wearing a collar. They are sitting on the sand, with the ocean and sky in the background. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the entire scene."
}
]
}
}
]
},
"usage": {
"output_tokens": 55,
"input_tokens": 1271,
"image_tokens": 1247
},
"request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}
Multi-round conversation
Qwen-VL can reference conversation history when generating responses. To do this, maintain a messages array and append each round's conversation history and any new instructions to it.
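As a minimal sketch of this pattern with the OpenAI-compatible Python SDK (the official samples below use curl and the DashScope SDKs; the model and image match the other examples):
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
        {"type": "text", "text": "What scene is depicted in the image?"}
    ]}
]
first = client.chat.completions.create(model="qwen-vl-max", messages=messages)
print("First round:", first.choices[0].message.content)
# Append the first-round reply and the new instruction before the second call.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": [{"type": "text", "text": "Write a poem describing this scene"}]})
second = client.chat.completions.create(model="qwen-vl-max", messages=messages)
print("Second round:", second.choices[0].message.content)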
OpenAI
curl
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-max",
"messages": [
{
"role": "system",
"content": [{"type": "text", "text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
}
},
{
"type": "text",
"text": "What scene is depicted in the image?"
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "This is a girl and a dog."
}
]
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Write a poem describing this scene"
}
]
}
]
}'
Sample response
{
"choices": [
{
"message": {
"content": "Sea breeze gently caresses smiling faces, \nOn sandy beach with canine companion. \nSunset casts shadows short and sweet, \nJoyful moments intoxicate the heart.",
"role": "assistant"
},
"finish_reason": "stop",
"index": 0,
"logprobs": null
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 1295,
"completion_tokens": 32,
"total_tokens": 1327
},
"created": 1726324976,
"system_fingerprint": null,
"model": "qwen-vl-max",
"id": "chatcmpl-3c953977-6107-96c5-9a13-c01e328b24ca"
}
DashScope
Python
import os
from dashscope import MultiModalConversation
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role": "system",
"content": [{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{
"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
},
{"text": "What scene is depicted in the image?"},
],
}
]
response = MultiModalConversation.call(
# If the environment variable is not configured, please replace the following line with: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-max', # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=messages
)
print(f"Model first round output: {response.output.choices[0].message.content[0]['text']}")
messages.append(response['output']['choices'][0]['message'])
user_msg = {"role": "user", "content": [{"text": "Write a poem describing this scene"}]}
messages.append(user_msg)
response = MultiModalConversation.call(
# If the environment variable is not configured, please replace the following line with: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-max',
messages=messages
)
print(f"Model second round output: {response.output.choices[0].message.content[0]['text']}")
Sample response
Model first round output: This is a photo taken on a beach. In the photo, there is a person wearing a plaid shirt and a dog wearing a collar. The person and dog are sitting face to face, seemingly interacting. The background is the sea and sky, with sunlight shining on them, creating a warm atmosphere.
Model second round output: On the sun-drenched beach, person and dog share joyful moments together.
Java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
private static final String modelName = "qwen-vl-max"; // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
public static void MultiRoundConversationCall() throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
Collections.singletonMap("text", "What scene is depicted in the image?"))).build();
List<MultiModalMessage> messages = new ArrayList<>();
messages.add(systemMessage);
messages.add(userMessage);
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If environment variable is not configured, replace the following line with: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model(modelName)
.messages(messages)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println("First round output: "+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text")); // add the result to conversation
messages.add(result.getOutput().getChoices().get(0).getMessage());
MultiModalMessage msg = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(Collections.singletonMap("text", "Write a poem describing this scene"))).build();
messages.add(msg);
param.setMessages((List)messages);
result = conv.call(param);
System.out.println("Second round output: "+result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text")); }
public static void main(String[] args) {
try {
MultiRoundConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Sample response
First round output: This is a photo taken on a beach. In the photo, there is a person wearing a plaid shirt and a dog wearing a collar. The person and dog are sitting face to face, seemingly interacting. The background is the sea and sky, with sunlight shining on them, creating a warm atmosphere.
Second round output: On the sun-drenched beach, person and dog share joyful moments together.
curl
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-max",
"input":{
"messages":[
{
"role": "system",
"content": [{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"text": "What scene is depicted in the image?"}
]
},
{
"role": "assistant",
"content": [
{"text": "This is a woman and a Labrador retriever playing on the beach."}
]
},
{
"role": "user",
"content": [
{"text": "Write a seven-character quatrain describing this scene"}
]
}
]
}
}'
Sample response
{
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "Waves gently lap the sandy shore, girl and dog frolic together. Sunlight falls on smiling faces, joyful moments forever remembered."
}
]
}
}
]
},
"usage": {
"output_tokens": 27,
"input_tokens": 1298,
"image_tokens": 1247
},
"request_id": "bdf5ef59-c92e-92a6-9d69-a738ecee1590"
}
Stream
In streaming output mode, the model generates and returns intermediate results in real-time instead of one final response. This reduces the wait time for the complete response.
OpenAI
Simply set stream to true.
Python
from openai import OpenAI
import os
client = OpenAI(
# If environment variable is not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-max", # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=[
{"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{"role": "user",
"content": [{"type": "image_url",
"image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
{"type": "text", "text": "What scene is depicted in the image?"}]}],
stream=True
)
full_content = ""
print("Streaming output content:")
for chunk in completion:
    if chunk.choices[0].delta.content is None:
        continue
    full_content += chunk.choices[0].delta.content
    print(chunk.choices[0].delta.content)
print(f"Complete content: {full_content}")
Sample response
Streaming output content:
The
image
depicts
a
woman
......
warm
harmonious
atmosphere
.
Complete content: The image depicts a woman and a dog interacting on a beach. The woman is sitting on the sand, smiling and shaking hands with the dog, looking very happy. The background shows the sea and sky, with sunlight shining on them, creating a warm harmonious atmosphere.
Node.js
import OpenAI from "openai";
const openai = new OpenAI(
{
// If environment variable is not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
}
);
const completion = await openai.chat.completions.create({
model: "qwen-vl-max", // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages: [
{"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{"role": "user",
"content": [{"type": "image_url",
"image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
{"type": "text", "text": "What scene is depicted in the image?"}]}],
stream: true,
});
let fullContent = ""
console.log("Stream output content: ")
for await (const chunk of completion) {
if (chunk.choices[0].delta.content != null) {
fullContent += chunk.choices[0].delta.content;
console.log(chunk.choices[0].delta.content);
}
}
console.log(`Full output content: ${fullContent}`)
Sample response
Streaming output content:
The image depicts
a woman and a
dog interacting on a beach.
......
shining on them,
creating a warm and harmonious
atmosphere.
Complete content: The image depicts a woman and a dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, smiling and shaking hands with the dog. The dog is wearing a collar and looks happy. The background shows the sea and sky, with sunlight shining on them, creating a warm and harmonious atmosphere.
curl
curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-plus",
"messages": [
{
"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
}
},
{
"type": "text",
"text": "What scene is depicted in the image?"
}
]
}
],
"stream":true,
"stream_options":{"include_usage":true}
}'
Sample response
data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}
data: {"choices":[{"finish_reason":null,"delta":{"content":"The"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}
data: {"choices":[{"delta":{"content":" image"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}
......
data: {"choices":[{"delta":{"content":" photo taken outdoors. The overall atmosphere appears very"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}
data: {"choices":[{"finish_reason":"stop","delta":{"content":" harmonious and warm."},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}
data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":1276,"completion_tokens":85,"total_tokens":1361},"created":1721823635,"system_fingerprint":null,"model":"qwen-vl-plus","id":"chatcmpl-9a9ec75a-3109-9910-b79e-7bcbce81c8f9"}
data: [DONE]
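The stream_options field from the curl example is also available in the OpenAI Python SDK. A minimal sketch (relying on the behavior shown above, where the final chunk carries usage and an empty choices list):
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-plus",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
            {"type": "text", "text": "What scene is depicted in the image?"}
        ]}
    ],
    stream=True,
    stream_options={"include_usage": True}
)
for chunk in completion:
    if chunk.choices:
        # Content chunks: print incremental text as it arrives.
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:
        # The final chunk has no choices and carries token usage.
        print("\nTotal tokens:", chunk.usage.total_tokens)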
DashScope
Python SDK: Set stream to True.
Java SDK: Use the streamCall interface.
HTTP: Specify X-DashScope-SSE as enable in the request header.
By default, streaming output is non-incremental, that is, each returned chunk contains all of the content generated so far. To use incremental streaming output, set incremental_output to true for Python, or incrementalOutput to true for Java.
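As a minimal sketch of the difference (not an official sample; it reuses the message format from the examples below), the final text can be assembled in either mode like this:
import os
import dashscope
from dashscope import MultiModalConversation
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
    {"role": "user", "content": [
        {"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
        {"text": "What scene is depicted in the image?"}
    ]}
]
def stream_reply(incremental: bool) -> str:
    responses = MultiModalConversation.call(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        model="qwen-vl-max",
        messages=messages,
        stream=True,
        incremental_output=incremental)
    text = ""
    for response in responses:
        parts = response["output"]["choices"][0]["message"].content
        if not parts:
            continue
        if incremental:
            text += parts[0]["text"]   # each chunk is a new fragment
        else:
            text = parts[0]["text"]    # each chunk repeats everything so far
    return text
print(stream_reply(incremental=True))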
Python
import os
from dashscope import MultiModalConversation
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role": "system",
"content": [{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"text": "What scene is depicted in the image?"}
]
}
]
responses = MultiModalConversation.call(
# If environment variable is not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
api_key=os.getenv("DASHSCOPE_API_KEY"),
model='qwen-vl-max', # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=messages,
stream=True,
incremental_output=True)
full_content = ""
print("Streaming output content:")
for response in responses:
    try:
        print(response["output"]["choices"][0]["message"].content[0]["text"])
        full_content += response["output"]["choices"][0]["message"].content[0]["text"]
    except:
        pass
print(f"Complete content: {full_content}")
Sample response
Streaming output content:
The image depicts
a person and a dog
interacting on a beach
......
sunlight shining on them
, creating a
warm and harmonious atmosphere
.
Complete content: The image depicts a person and a dog interacting on a beach. The person is wearing a plaid shirt, sitting on the sand, shaking hands with a golden retriever wearing a collar. The background shows waves and sky, with sunlight shining on them, creating a warm and harmonious atmosphere.
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void streamCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
// must create mutable map.
MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant."))).build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(Collections.singletonMap("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"),
Collections.singletonMap("text", "What scene is depicted in the image?"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If environment variable is not configured, replace the line below with: .apiKey("sk-xxx") using your Model Studio API Key
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-max") // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
.messages(Arrays.asList(systemMessage, userMessage))
.incrementalOutput(true)
.build();
Flowable<MultiModalConversationResult> result = conv.streamCall(param);
result.blockingForEach(item -> {
try {
System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
streamCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Sample response
The
image
depicts
a
woman
and
a
dog
on
a
beach
......
creating
a
warm
and
harmonious
atmosphere
.
curl
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
"model": "qwen-vl-max",
"input":{
"messages":[
{
"role": "system",
"content": [
{"text": "You are a helpful assistant."}
]
},
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"text": "What scene is depicted in the image?"}
]
}
]
},
"parameters": {
"incremental_output": true
}
}'
Sample response
id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"This"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":1,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}
id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":" image"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":2,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}
......
id:17
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":" appreciation. This is a heartwarming scene that shows"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":112,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}
id:18
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":" the deep emotional bond between humans and animals."}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"input_tokens":1276,"output_tokens":120,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}
id:19
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"input_tokens":1276,"output_tokens":121,"image_tokens":1247},"request_id":"00917f72-d927-9344-8417-2c4088d64c16"}
High resolution image understanding
Set vl_high_resolution_images to true to increase the token limit for a single image from 1,280 to 16,384:
Parameter value | Token limit per image | Scenarios
True | 16,384 | Scenarios with rich content and details.
False (default) | 1,280 | Scenarios with fewer details, where the model only needs to understand general information or where speed is more important.
vl_high_resolution_images is only supported by the DashScope Python SDK and HTTP method.
Python
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role": "user",
"content": [
{"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
{"text": "What does this image show?"}
]
}
]
response = dashscope.MultiModalConversation.call(
# If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-max', # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=messages,
vl_high_resolution_images=True
)
print("Model response:\n ",response.output.choices[0].message.content[0]["text"])
print("Token usage information: ","Total input tokens: ",response.usage["input_tokens"] , "Image tokens: " , response.usage["image_tokens"])
Sample response
Model response:
This image shows a cozy Christmas decoration scene. The following elements can be seen in the picture:
1. **Christmas trees**: Two small Christmas trees covered with white snow.
2. **Reindeer figurine**: A brown reindeer figurine with large antlers.
3. **Candles and candleholders**: Several wooden candleholders with lit candles that emit a warm glow.
4. **Christmas ornaments**: Including golden ball decorations, pinecones, red berry strings, etc.
5. **Christmas gift box**: A small golden gift box tied with a golden ribbon.
6. **Christmas lettering**: Wooden "MERRY CHRISTMAS" lettering that enhances the festive atmosphere.
7. **Background**: A wooden background that gives a natural and warm feeling.
The overall ambiance is very cozy and festive, filled with a strong Christmas spirit.
Token usage information: Total input tokens: 5368, Image tokens: 5342
curl
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-max",
"input":{
"messages":[
{"role": "system",
"content": [
{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
{"text": "What does this image show?"}
]
}
]
},
"parameters": {
"vl_high_resolution_images": true
}
}'
Sample response
{
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "This image shows a cozy Christmas decoration scene. The picture includes the following elements:\n\n1. **Christmas trees**: Two small Christmas trees covered with white snow.\n2. **Reindeer figurine**: A brown reindeer figurine positioned in the center-right of the image.\n3. **Candles**: Several wooden candles, two of which are lit, emitting a warm glow.\n4. **Christmas ornaments**: Some gold and red decorative balls, pinecones, berries, and green pine branches.\n5. **Christmas gift**: A small golden gift box, with a bag featuring Christmas patterns next to it.\n6. **\"MERRY CHRISTMAS\" lettering**: Wooden letters spelling \"MERRY CHRISTMAS\" placed on the left side of the image.\n\nThe entire scene is arranged against a wooden background, creating a warm, festive atmosphere that's perfect for Christmas celebrations."
}
]
}
}
]
},
"usage": {
"total_tokens": 5553,
"output_tokens": 185,
"input_tokens": 5368,
"image_tokens": 5342
},
"request_id": "38cd5622-e78e-90f5-baa0-c6096ba39b04"
}
Multiple image input
Qwen-VL can process multiple images in a single request, and the model responds based on all of them. You can input images as URLs, local files, or a combination of both. The sample code below uses URLs.
The total number of tokens in the input images must be less than the model's maximum input length. Calculate the maximum number of images based on Image number limitations.
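The comments in the samples below mention replacing a URL with the Base64 encoding of a local image. A minimal sketch of that substitution with the OpenAI-compatible endpoint (assuming a hypothetical local file ./dog_and_girl.jpeg and the usual data URL form for Base64 images):
import base64
import os
from openai import OpenAI
def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
base64_image = encode_image("./dog_and_girl.jpeg")  # hypothetical local file
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-max",
    messages=[
        {"role": "user", "content": [
            # A local image passed as a Base64 data URL instead of an http(s) URL
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
            {"type": "text", "text": "What does this image depict?"}
        ]}
    ]
)
print(completion.choices[0].message.content)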
OpenAI
Python
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-max", # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=[
{"role": "system","content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user","content": [
# First image URL, if using a local file, replace the url value with the Base64 encoding of the image
{"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},},
# Second image URL, if using a local file, replace the url value with the Base64 encoding of the image
{"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
{"type": "text", "text": "What do these images depict?"},
],
}
],
)
print(completion.choices[0].message.content)
Sample response
Image 1 shows a woman and a Labrador dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, and shaking hands with the dog. The background features ocean waves and sky, creating a warm and pleasant atmosphere.
Image 2 shows a tiger walking in a forest. The tiger has orange fur with black stripes, and it is stepping forward. It is surrounded by dense trees and vegetation, with fallen leaves covering the ground, giving the scene a wild, natural feeling.
Node.js
import OpenAI from "openai";
const openai = new OpenAI(
{
// If environment variables are not configured, replace the line below with: apiKey: "sk-xxx" using your Model Studio API Key
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
}
);
async function main() {
const response = await openai.chat.completions.create({
model: "qwen-vl-max", // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages: [
{role: "system",content:[{ type: "text", text: "You are a helpful assistant." }]},
{role: "user",content: [
// First image URL, if using a local file, replace the url value with the Base64 encoding of the image
{ type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"}},
// Second image URL, if using a local file, replace the url value with the Base64 encoding of the image
{ type: "image_url",image_url: {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"}},
{ type: "text", text: "What do these images depict?" },
]}]
});
console.log(response.choices[0].message.content);
}
main()
Sample response
In the first image, a person and a dog are interacting on a beach. The person is wearing a plaid shirt, and the dog is wearing a collar. They appear to be shaking hands or high-fiving.
In the second image, a tiger is walking in a forest. The tiger has orange fur with black stripes, and the background consists of green trees and vegetation.
curl
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-max",
"messages": [
{
"role": "system",
"content": [{"type": "text", "text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
}
},
{
"type": "image_url",
"image_url": {
"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"
}
},
{
"type": "text",
"text": "What do these images depict?"
}
]
}
]
}'
Sample response
{
"choices": [
{
"message": {
"content": "Image 1 shows a woman and a Labrador dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, and shaking hands with the dog. The background features an ocean view and sunset sky, creating a very warm and harmonious scene.\n\nImage 2 shows a tiger walking in a forest. The tiger has orange fur with black stripes, and it is stepping forward. It is surrounded by dense trees and vegetation, with fallen leaves covering the ground, giving the entire scene a natural wildness and vitality.",
"role": "assistant"
},
"finish_reason": "stop",
"index": 0,
"logprobs": null
}
],
"object": "chat.completion",
"usage": {
"prompt_tokens": 2497,
"completion_tokens": 109,
"total_tokens": 2606
},
"created": 1725948561,
"system_fingerprint": null,
"model": "qwen-vl-max",
"id": "chatcmpl-0fd66f46-b09e-9164-a84f-3ebbbedbac15"
}
DashScope
Python
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role": "system",
"content": [{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
# First image URL.
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
# Second image URL
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
# Third image URL
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
{"text": "What do these images depict?"}
]
}
]
response = dashscope.MultiModalConversation.call(
# If environment variables are not configured, replace the line below with: api_key="sk-xxx" using your Model Studio API Key
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-max', # Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/zh/model-studio/getting-started/models
messages=messages
)
print(response.output.choices[0].message.content[0]["text"])
Sample response
These images show some animals and natural scenes. The first image shows a person and a dog interacting on a beach. The second image is a tiger walking in a forest. The third image is a cartoon-style rabbit jumping on a grassy field.
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.content(Arrays.asList(
Collections.singletonMap("text", "You are a helpful assistant."))).build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
// First image URL
Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"),
// Second image URL
Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"),
// Third image URL
Collections.singletonMap("image", "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"),
Collections.singletonMap("text", "What do these images depict?"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If environment variables are not configured, replace the line below with: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-max") // Using qwen-vl-max as an example, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
.messages(Arrays.asList(systemMessage, userMessage))
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text")); }
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Sample response
These images show some animals and natural scenes.
1. First image: A woman and a dog interacting on a beach. The woman is wearing a plaid shirt, sitting on the sand, and the dog is wearing a collar, extending its paw to shake hands with the woman.
2. Second image: A tiger walking in a forest. The tiger has orange fur with black stripes, and the background consists of trees and leaves.
3. Third image: A cartoon-style rabbit jumping on a grassy field. The rabbit is white with pink ears, and the background features blue sky and yellow flowers.
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-max",
"input":{
"messages":[
{
"role": "system",
"content": [{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"},
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},
{"image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/rabbit.png"},
{"text": "What do these images show?"}
]
}
]
}
}'
Sample response
{
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "This image shows a woman and her dog on a beach. They appear to be enjoying each other's company, with the dog sitting on the sand and extending its paw to shake hands or interact with the woman. The background features a beautiful sunset view, with waves gently lapping at the shoreline.\n\nPlease note that my description is based on what is visible in the image and does not include any information beyond the visual content. If you need more specific details about this scene, please let me know!"
}
]
}
}
]
},
"usage": {
"output_tokens": 81,
"input_tokens": 1277,
"image_tokens": 1247
},
"request_id": "ccf845a3-dc33-9cda-b581-20fe7dc23f70"
}
Video understanding
Some Qwen-VL models support video understanding. The input can be an image sequence (video frames) or a video file.
Video file
To pass video files to qwen-vl-plus, qwen2.5-vl-3b-instruct, and qwen2.5-vl-7b-instruct, you must first submit a ticket. Other models can be used directly.
Video file limitations:
Video file size:
For video URL: Qwen2.5-VL series models support videos up to 1 GB, other models up to 150 MB.
For local file: When using OpenAI SDK, the Base64-encoded video must be less than 10 MB. When using DashScope SDK, the video must be less than 100 MB.
Video file formats: MP4, AVI, MKV, MOV, FLV, and WMV.
Video duration: Qwen2.5-VL models support videos from 2 seconds to 10 minutes. Other models support videos from 2 seconds to 40 seconds.
Video dimensions: No restrictions, but video files will be adjusted to approximately 600,000 pixels. Larger dimensions will not provide better understanding.
Currently, audio in video files is not supported for understanding.
Before Qwen-VL processes video content, it extracts frames from the video file, generating several video frames for content understanding. You can set the fps parameter to control the frame extraction frequency:
Only the DashScope SDK supports this parameter. A frame is extracted every 1/fps seconds. A higher fps is suitable for high-speed motion scenarios (such as sporting events and action movies), while a lower fps is suitable for long videos or static content.
The OpenAI SDK does not support this parameter. A frame is extracted every 0.5 seconds from the video file.
Below is sample code for using video URLs. To use local videos, see Local files.
OpenAI
When using the OpenAI SDK or HTTP method to input video files to the Qwen-VL model, you need to set the "type" parameter in the user message to "video_url".
Python
import os
from openai import OpenAI
client = OpenAI(
# If environment variables are not configured, replace the following line with: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-max",
messages=[
{"role": "system",
"content": [{"type": "text","text": "You are a helpful assistant."}]},
{"role": "user","content": [{
# When directly providing a video file, set the type value to video_url
# When using the OpenAI SDK, video frames are extracted every 0.5 seconds by default and cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
"type": "video_url",
"video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
{"type": "text","text": "What is the content of this video?"}]
}]
)
print(completion.choices[0].message.content)
Node.js
import OpenAI from "openai";
const openai = new OpenAI(
{
// If environment variables are not configured, replace the following line with: apiKey: "sk-xxx"
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
}
);
async function main() {
const response = await openai.chat.completions.create({
model: "qwen-vl-max",
messages: [
{role:"system",content:["You are a helpful assistant."]},
{role: "user",content: [
// When directly providing a video file, set the type value to video_url
// When using the OpenAI SDK, video frames are extracted every 0.5 seconds by default and cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
{type: "video_url", video_url: {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
{type: "text", text: "What is the content of this video?" },
]}]
});
console.log(response.choices[0].message.content);
}
main()
curl
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-max",
"messages": [
{"role": "system", "content": [{"type": "text","text": "You are a helpful assistant."}]},
{"role": "user","content": [{"type": "video_url","video_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"}},
{"type": "text","text": "What is the content of this video?"}]}]
}'
DashScope
Python
import dashscope
import os
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{"role":"system","content":[{"text": "You are a helpful assistant."}]},
{"role": "user",
"content": [
# The fps parameter controls video frame extraction frequency, indicating that a frame is extracted every 1/fps seconds. For complete usage, see: https://www.alibabacloud.com/help/zh/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
{"text": "What is the content of this video?"}
]
}
]
response = dashscope.MultiModalConversation.call(
# If environment variables are not configured, replace the following line with: api_key ="sk-xxx"
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-max',
messages=messages
)
print(response.output.choices[0].message.content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
// The fps parameter controls video frame extraction frequency, indicating that a frame is extracted every 1/fps seconds. For complete usage, see: https://www.alibabacloud.com/help/zh/model-studio/use-qwen-by-calling-api?#2ed5ee7377fum
Map<String, Object> params = Map.of(
"video", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4",
"fps",2);
MultiModalMessage systemMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.content(Arrays.asList(
Collections.singletonMap("text", "You are a helpful assistant."))).build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
params,
Collections.singletonMap("text", "What is the content of this video?"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If environment variables are not configured, replace the following line with: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-max")
.messages(Arrays.asList(systemMessage, userMessage))
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-max",
"input":{
"messages":[
{"role": "system","content": [{"text": "You are a helpful assistant."}]},
{"role": "user","content": [{"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4","fps":2},
{"text": "What is the content of this video?"}]}]}
}'
Image sequence
Image sequence limitations
Qwen2.5-VL models: at least 4 images and at most 512 images.
Other models: at least 4 images and at most 80 images.
When an image sequence is provided, it means that frames have already been extracted from the video file. When calling Qwen2.5-VL models, you can set the fps parameter, which helps the model perceive time information:
Only the DashScope SDK supports this parameter. It indicates that a frame was extracted from the original video every 1/fps seconds.
The OpenAI SDK does not support this parameter. The image sequence is treated as if frames were extracted every 0.5 seconds.
Below is an example code for using image sequence URLs. For using local videos, see Local files.
OpenAI compatible
When using the OpenAI SDK or HTTP method to input an image sequence as video to the Qwen-VL model, you need to set the "type" parameter in the user message to "video".
Python
import os
from openai import OpenAI
client = OpenAI(
# If environment variables are not configured, replace the following line with: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen2.5-vl-72b-instruct", # This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=[{"role": "user","content": [
# When providing an image list, the "type" parameter in the user message is "video"
# When using the OpenAI SDK, image lists are treated as if they were extracted every 0.5 seconds from the video by default, and this cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
{"type": "video","video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
{"type": "text","text": "Describe the specific process in this video"},
]}]
)
print(completion.choices[0].message.content)
Node.js
// Make sure you've specified "type": "module" in your package.json
import OpenAI from "openai";
const openai = new OpenAI({
// If environment variables are not configured, replace the following line with: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});
async function main() {
const response = await openai.chat.completions.create({
model: "qwen2.5-vl-72b-instruct", // This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages: [{
role: "user",
content: [
{
// When providing an image list, the "type" parameter in the user message is "video"
// When using the OpenAI SDK, image lists are treated as if they were extracted every 0.5 seconds from the video by default, and this cannot be modified. To customize the frame extraction frequency, please use the DashScope SDK.
type: "video",
video: [
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
]
},
{
type: "text",
text: "Describe the specific process in this video"
}
]
}]
});
console.log(response.choices[0].message.content);
}
main();
curl
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen2.5-vl-72b-instruct",
"messages": [{"role": "user",
"content": [{"type": "video",
"video": ["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"]},
{"type": "text",
"text": "Describe the specific process in this video"}]}]
}'
DashScope
Python
import os
# dashscope version must be at least 1.20.10
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{"role": "user",
"content": [
# When providing an image list to a Qwen2.5-VL series model, you can set the fps parameter, indicating that the image list was extracted from the original video every 1/fps seconds
{"video":["https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"],
"fps":2},
{"text": "Describe the specific process in this video"}]}]
response = dashscope.MultiModalConversation.call(
# If environment variables are not configured, replace the following line with: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model='qwen2.5-vl-72b-instruct', # This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
// DashScope SDK version must be at least 2.18.3
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
private static final String MODEL_NAME = "qwen2.5-vl-72b-instruct"; // This example uses qwen2.5-vl-72b-instruct, you can change the model name as needed. Model list: https://www.alibabacloud.com/help/model-studio/getting-started/models
public static void videoImageListSample() throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage systemMessage = MultiModalMessage.builder()
.role(Role.SYSTEM.getValue())
.content(Arrays.asList(Collections.singletonMap("text", "You are a helpful assistant.")))
.build();
// When providing an image list to a Qwen2.5-VL series model, you can set the fps parameter, indicating that the image list was extracted from the original video every 1/fps seconds
Map<String, Object> params = Map.of(
"video", Arrays.asList("https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"),
"fps",2);
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
params,
Collections.singletonMap("text", "Describe the specific process in this video")))
.build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If environment variables are not configured, replace the following line with: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model(MODEL_NAME)
.messages(Arrays.asList(systemMessage, userMessage)).build();
MultiModalConversationResult result = conv.call(param);
System.out.print(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
videoImageListSample();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen2.5-vl-72b-instruct",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"video": [
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg"
],
"fps":2
},
{
"text": "Describe the specific process in this video"
}
]
}
]
}
}'
Using local files (Input Base64 encoding)
Here is sample code for passing local files. Currently, only the OpenAI SDK and the HTTP method support local files.
Image
Using eagle.png saved locally as an example.
When using the OpenAI SDK, the Base64-encoded image must be less than 10 MB.
OpenAI
Take the following steps:
Encode the image file: Read the local image and encode it in Base64 format.
Pass the Base64 data: Provide the encoded data in image_url in the format data:image/{format};base64,{base64_image}. Here, image/{format} is the format of the local image and must match the Content Type in the image format table; for example, use image/jpeg for a .jpg image.
Call the model: Call Qwen-VL and process the response.
Python
from openai import OpenAI
import os
import base64
# base 64 encoding format
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
# Replace xxxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxx/eagle.png")
client = OpenAI(
# If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
api_key=os.getenv('DASHSCOPE_API_KEY'),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-max", # Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=[
{
"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{
"type": "image_url",
# When passing Base64 image data, note that the image format (i.e., image/{format}) needs to be consistent with the Content Type in the supported image list. "f" is a string formatting method.
# PNG image: f"data:image/png;base64,{base64_image}"
# JPEG image: f"data:image/jpeg;base64,{base64_image}"
# WEBP image: f"data:image/webp;base64,{base64_image}"
"image_url": {"url": f"data:image/png;base64,{base64_image}"},
},
{"type": "text", "text": "What scene is depicted in the image?"},
],
}
],
)
print(completion.choices[0].message.content)
Node.js
import OpenAI from "openai";
import { readFileSync } from 'fs';
const openai = new OpenAI(
{
// If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using your Model Studio API Key
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
}
);
const encodeImage = (imagePath) => {
const imageFile = readFileSync(imagePath);
return imageFile.toString('base64');
};
// Replace xxx/eagle.png with the absolute path of your local image
const base64Image = encodeImage("xxx/eagle.png")
async function main() {
const completion = await openai.chat.completions.create({
model: "qwen-vl-max", // Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages: [
{"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{"role": "user",
"content": [{"type": "image_url",
// Note that when passing Base64, the image format (i.e., image/{format}) needs to be consistent with the Content Type in the supported image list.
// PNG image: data:image/png;base64,${base64Image}
// JPEG image: data:image/jpeg;base64,${base64Image}
// WEBP image: data:image/webp;base64,${base64Image}
"image_url": {"url": `data:image/png;base64,${base64Image}`},},
{"type": "text", "text": "What scene is depicted in the image?"}]}]
});
console.log(completion.choices[0].message.content);
}
main();
Video
Image sequence
Using locally saved football1.jpg, football2.jpg, football3.jpg, football4.jpg as examples.
When using the OpenAI SDK, each Base64-encoded image must be less than 10 MB.
OpenAI
Python
import os
from openai import OpenAI
import base64
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
# If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-max", # Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages=[
{"role": "system",
"content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user","content": [
{"type": "video","video": [
f"data:image/jpeg;base64,{base64_image1}",
f"data:image/jpeg;base64,{base64_image2}",
f"data:image/jpeg;base64,{base64_image3}",
f"data:image/jpeg;base64,{base64_image4}",]},
{"type": "text","text": "Describe the specific process in this video"},
]}]
)
print(completion.choices[0].message.content)
Node.js
import OpenAI from "openai";
import { readFileSync } from 'fs';
const openai = new OpenAI(
{
// If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using your Model Studio API Key
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
}
);
const encodeImage = (imagePath) => {
const imageFile = readFileSync(imagePath);
return imageFile.toString('base64');
};
const base64Image1 = encodeImage("football1.jpg")
const base64Image2 = encodeImage("football2.jpg")
const base64Image3 = encodeImage("football3.jpg")
const base64Image4 = encodeImage("football4.jpg")
async function main() {
const completion = await openai.chat.completions.create({
model: "qwen-vl-max", // Using qwen-vl-max as an example here, you can change the model name as needed. Model List: https://www.alibabacloud.com/help/model-studio/getting-started/models
messages: [
{"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{"role": "user",
"content": [{"type": "video",
// Note that when passing Base64, the image format (i.e., image/{format}) needs to be consistent with the Content Type in the supported image list.
// PNG image: data:image/png;base64,${base64Image}
// JPEG image: data:image/jpeg;base64,${base64Image}
// WEBP image: data:image/webp;base64,${base64Image}
"video": [
`data:image/jpeg;base64,${base64Image1}`,
`data:image/jpeg;base64,${base64Image2}`,
`data:image/jpeg;base64,${base64Image3}`,
`data:image/jpeg;base64,${base64Image4}`]},
{"type": "text", "text": "What scene does this video depict?"}]}]
});
console.log(completion.choices[0].message.content);
}
main();
Video files
Using locally saved test.mp4 as an example.
When using the OpenAI SDK, the Base64-encoded local video must be less than 10 MB.
OpenAI
Python
from openai import OpenAI
import os
import base64
# Base64 encoding format
def encode_video(video_path):
with open(video_path, "rb") as video_file:
return base64.b64encode(video_file.read()).decode("utf-8")
# Replace xxxx/test.mp4 with the absolute path of your local video
base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
# If environment variables are not configured, replace the following line with: api_key="sk-xxx" using your Model Studio API Key
api_key=os.getenv('DASHSCOPE_API_KEY'),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-max",
messages=[
{
"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{
# When passing a video file directly, set the type value to video_url
"type": "video_url",
"video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
},
{"type": "text", "text": "What scene does this video depict?"},
],
}
],
)
print(completion.choices[0].message.content)
Node.js
import OpenAI from "openai";
import { readFileSync } from 'fs';
const openai = new OpenAI(
{
// If environment variables are not configured, replace the following line with: apiKey: "sk-xxx" using your Model Studio API Key
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
}
);
const encodeVideo = (videoPath) => {
const videoFile = readFileSync(videoPath);
return videoFile.toString('base64');
};
// Replace xxxx/test.mp4 with the absolute path of your local video
const base64Video = encodeVideo("xxx/test.mp4")
async function main() {
const completion = await openai.chat.completions.create({
model: "qwen-vl-max",
messages: [
{"role": "system",
"content": [{"type":"text","text": "You are a helpful assistant."}]},
{"role": "user",
"content": [{
// When passing a video file directly, set the type value to video_url
"type": "video_url",
"video_url": {"url": `data:video/mp4;base64,${base64Video}`}},
{"type": "text", "text": "What scene does this video depict?"}]}]
});
console.log(completion.choices[0].message.content);
}
main();
Usage notes
Supported image formats
Here are the supported image formats. When using the OpenAI SDK to input local images, set image/{format} according to the Content Type column.
Image format | File name extension | Content Type |
BMP | .bmp | image/bmp |
JPEG | .jpe, .jpeg, .jpg | image/jpeg |
PNG | .png | image/png |
TIFF | .tif, .tiff | image/tiff |
WEBP | .webp | image/webp |
HEIC | .heic | image/heic |
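If you build the data URL programmatically, the following is a small sketch (an illustration, not part of the official SDKs) that maps a file name extension to the Content Type column above and produces the data:image/{format};base64,... string.
# Illustrative only: map a file name extension to the Content Type column above
# and build the data URL expected by the OpenAI-compatible image_url field.
import base64
import os

CONTENT_TYPES = {
    ".bmp": "image/bmp",
    ".jpe": "image/jpeg", ".jpeg": "image/jpeg", ".jpg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff", ".tiff": "image/tiff",
    ".webp": "image/webp",
    ".heic": "image/heic",
}

def to_data_url(image_path):
    ext = os.path.splitext(image_path)[1].lower()
    content_type = CONTENT_TYPES[ext]  # raises KeyError for unsupported formats
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{content_type};base64,{b64}"

# Example: to_data_url("eagle.png") returns a string starting with "data:image/png;base64,"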
Image size limits
The size of a single image must not exceed 10 MB. When using the OpenAI SDK, the Base64-encoded image must also be less than 10 MB; see Using local files.
The width and height of an image must both be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.
There is no pixel count limit for a single image, because the model scales and preprocesses the image before understanding it. Larger images do not necessarily improve understanding performance. Recommended pixel values:
For single image input to qwen-vl-max, the recommended number of pixels should not exceed 12 million, which supports standard 4K images.
For single image input to qwen-vl-plus, the number of pixels should not exceed 1,003,520.
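The following is an illustrative pre-flight check against these limits, assuming Pillow is installed; the 12-million and 1,003,520 pixel values are the recommendations above, not hard API limits.
# Illustrative pre-flight check against the limits above. Assumption: Pillow is installed.
import os
from PIL import Image

RECOMMENDED_PIXELS = {"qwen-vl-max": 12_000_000, "qwen-vl-plus": 1_003_520}

def check_image(path, model="qwen-vl-max"):
    if os.path.getsize(path) > 10 * 1024 * 1024:
        raise ValueError("image file exceeds 10 MB")
    with Image.open(path) as img:
        width, height = img.size
    if width <= 10 or height <= 10:
        raise ValueError("width and height must both be greater than 10 pixels")
    if max(width, height) / min(width, height) > 200:
        raise ValueError("aspect ratio exceeds 200:1")
    if width * height > RECOMMENDED_PIXELS.get(model, 12_000_000):
        print(f"warning: {width * height} pixels exceeds the recommendation for {model}")

check_image("eagle.png", model="qwen-vl-max")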
Image input methods
Image URL: The URL must be accessible from the internet.
Note: You can upload images to OSS to obtain a public URL.
If the image is stored in an OSS bucket with a private access control list (ACL), you can generate a signed URL using the public endpoint. A signed URL grants other users temporary access to the file.
Do not use OSS internal URLs because they do not interconnect with Model Studio.
Local image files: When using the OpenAI SDK, input the Base64-encoded image data.
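As a rough sketch of the OSS route (assuming the oss2 Python package is installed; the bucket name, endpoint, and credential environment variables are placeholders), you could upload a local image and generate a signed URL like this:
# Illustrative only: upload a local image to OSS and create a signed URL that
# grants temporary access. Bucket name, endpoint, and credentials are placeholders.
import os
import oss2

auth = oss2.Auth(os.getenv("OSS_ACCESS_KEY_ID"), os.getenv("OSS_ACCESS_KEY_SECRET"))
# Use the bucket's public endpoint, not the internal one, so Model Studio can reach the URL.
bucket = oss2.Bucket(auth, "https://oss-ap-southeast-1.aliyuncs.com", "my-bucket")

bucket.put_object_from_file("images/eagle.png", "eagle.png")
signed_url = bucket.sign_url("GET", "images/eagle.png", 3600)  # valid for 1 hour
print(signed_url)  # pass this URL in the image_url field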
Image number limits
When inputting multiple images, the number of images is limited by the model's total token limit for text and images (maximum input). The total token count of all images must be less than the model's maximum input.
For example, qwen-vl-max has a maximum input of 30,720 tokens. If your input images are all 1280 × 1280, the image tokens and maximum number of images are as follows:
vl_high_resolution_images | Adjusted image dimensions | Image tokens | Maximum number of images |
True | 1288 x 1288 | 2,118 | 14 |
False | 980 x 980 | 1,227 | 25 |
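The maximum-image counts in the table follow from dividing the maximum input by the tokens per image; a short sketch of that arithmetic (ignoring the text prompt's own tokens, which also count toward the same limit):
# The "Maximum number of images" column is the maximum input divided by the
# tokens per image; the text prompt's tokens also count toward the same limit.
MAX_INPUT_TOKENS = 30_720  # qwen-vl-max
TOKENS_PER_IMAGE = {True: 2_118, False: 1_227}  # keyed by vl_high_resolution_images

for high_res, per_image in TOKENS_PER_IMAGE.items():
    print(high_res, MAX_INPUT_TOKENS // per_image)  # True -> 14, False -> 25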
Prompt guide
API references
For input and output parameters of Qwen-VL, see Qwen.
FAQ
Do I need to manually delete uploaded images?
No. The server automatically deletes images after the model completes text generation.
Can Qwen-VL process PDF, XLSX, XLS, DOC, and other text files?
No. Qwen-VL is designed for visual understanding and accepts only image and video input, not text documents such as PDF or Office files.
Can Qwen-VL understand video content?
Yes, please refer to Video understanding.
Error codes
If a call fails and returns an error message, see Error messages.