Qwen-OCR is a visual understanding model that extracts text and parses structured data from images such as scanned documents, tables, and receipts. It supports multiple languages and can perform advanced functions, including information extraction, table parsing, and formula recognition, by using task-specific instructions.
Try it online: Alibaba Cloud Model Studio (Singapore), Alibaba Cloud Model Studio (Virginia), or Alibaba Cloud Model Studio (Beijing)
Examples
| Scenario | Recognition result |
| --- | --- |
| Recognize multiple languages | (sample image and result not reproduced here) |
| Recognize skewed images | Product Introduction: This product is made from imported fiber filaments from South Korea. It does not shrink, deform, grow mold, or harbor bacteria, and it does not damage surfaces. It is truly non-stick to oil, has strong water absorption, is resistant to water immersion, cleans thoroughly, is non-toxic, leaves no residue, and is easy to dry. Seller's experience: Stainless steel, ceramic products, bathtubs, and integrated bathrooms mostly have white, smooth surfaces. It is difficult to wash off stains with other cloths, and sharp objects can easily cause scratches. Using this simulated loofah cloth with a small amount of neutral detergent to create foam makes it easy to scrub these surface stains clean. 6941990612023 Item No.: 2023 |
| Locate text position | The high-precision recognition task supports text localization. For more information, see the FAQ on how to draw the bounding box of each text line onto the original image. |
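For reference, the following is a minimal sketch of drawing such coordinates onto an image with Pillow. The coordinate format (four corner points per text line) and the file names are assumptions for illustration; see the FAQ for the exact output schema.

from PIL import Image, ImageDraw

# Hypothetical localization output: each text line given as the four
# (x, y) corner points of its rotated bounding rectangle.
text_lines = [
    [(120, 40), (480, 42), (479, 88), (119, 86)],
]

image = Image.open("input.jpg")  # hypothetical local copy of the original image
draw = ImageDraw.Draw(image)
for corners in text_lines:
    # Close the polygon by repeating the first corner, then draw its outline.
    draw.line(corners + [corners[0]], fill="red", width=3)
image.save("annotated.jpg")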
Scope
Supported regions
Supported models
Global
In the global deployment mode, both the endpoint and data storage are located in the US (Virginia) region, and inference compute resources are dynamically scheduled globally.
| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input price (per million tokens) | Output price (per million tokens) |
| --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-ocr (provides the same capabilities as qwen-vl-ocr-2025-11-20) | Stable | 34,096 | 30,000 (max per image) | 4,096 | $0.07 | $0.16 |
| qwen-vl-ocr-2025-11-20 (also known as qwen-vl-ocr-1120; based on the Qwen3-VL architecture, with significantly improved document parsing and text localization) | Snapshot | 38,192 | 30,000 (max per image) | 8,192 | $0.07 | $0.16 |
International
In the International deployment mode, endpoints and data storage are located in the Singapore region, and inference compute resources are dynamically scheduled globally (excluding Mainland China).
| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input price (per million tokens) | Output price (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-ocr | Stable | 34,096 | 30,000 (max per image) | 4,096 | $0.72 | $0.72 | 1 million input tokens and 1 million output tokens, valid for 90 days after you activate Model Studio |
| qwen-vl-ocr-2025-11-20 (also known as qwen-vl-ocr-1120; based on the Qwen3-VL architecture, with significantly improved document parsing and text localization) | Snapshot | 38,192 | 30,000 (max per image) | 8,192 | $0.07 | $0.16 | Same as above |
Mainland China
In Mainland China deployment mode, endpoints and data storage are located in the Beijing region, and inference compute resources are limited to Mainland China.
| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input price (per million tokens) | Output price (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-ocr (equivalent to qwen-vl-ocr-2025-08-28) | Stable | 34,096 | 30,000 (max per image) | 4,096 | $0.717 | $0.717 | No free quota |
| qwen-vl-ocr-latest (always has the capabilities of the latest version) | Latest | 38,192 | 30,000 (max per image) | 8,192 | $0.043 | $0.072 | No free quota |
| qwen-vl-ocr-2025-11-20 (also known as qwen-vl-ocr-1120; based on the Qwen3-VL architecture, with significantly improved document parsing and text localization) | Snapshot | 38,192 | 30,000 (max per image) | 8,192 | $0.043 | $0.072 | No free quota |
| qwen-vl-ocr-2025-08-28 (also known as qwen-vl-ocr-0828) | Snapshot | 34,096 | 30,000 (max per image) | 4,096 | $0.717 | $0.717 | No free quota |
| qwen-vl-ocr-2025-04-13 (also known as qwen-vl-ocr-0413) | Snapshot | 34,096 | 30,000 (max per image) | 4,096 | $0.717 | $0.717 | No free quota |
| qwen-vl-ocr-2024-10-28 (also known as qwen-vl-ocr-1028) | Snapshot | 34,096 | 30,000 (max per image) | 4,096 | $0.717 | $0.717 | No free quota |
For the qwen-vl-ocr, qwen-vl-ocr-2025-04-13, and qwen-vl-ocr-2025-08-28 models, the max_tokens parameter (maximum output length) defaults to 4096. To increase this value to a range of 4097 to 8192, send an email to modelstudio@service.aliyun.com with the following information: your Alibaba Cloud account ID, the image type (such as document images, e-commerce images, or contracts), the model name, the estimated queries per second (QPS) and total daily requests, and the percentage of requests where the model output length exceeds 4096 tokens.
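For reference, a minimal sketch of setting max_tokens on a call through the OpenAI-compatible interface (the image URL is a placeholder):

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    # Maximum output length; 4096 is the default and the ceiling unless a higher limit is approved.
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder image URL
            {"type": "text", "text": "Please output only the text content from the image without any additional descriptions or formatting."},
        ],
    }],
)
print(completion.choices[0].message.content)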
Preparations
You have configured an API key and added it to an environment variable.
If you call the model using the OpenAI SDK or DashScope SDK, you must install the latest version of the SDK. The minimum version for the DashScope Python SDK is 1.22.2, and the minimum version for the Java SDK is 2.21.8.
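For example, the Python SDKs can be installed or upgraded as follows; for Java, add the com.alibaba:dashscope-sdk-java dependency (version 2.21.8 or later) to your build file. Verify the latest versions before use.

# Install or upgrade the OpenAI and DashScope Python SDKs.
pip install -U openai
pip install -U "dashscope>=1.22.2"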
DashScope SDK
Advantages: Supports all advanced features, such as automatic image rotation and built-in OCR tasks. It provides a comprehensive feature set and a simple method for calling the model.
Scenarios: Ideal for projects that require full functionality.
OpenAI compatible SDK
Advantages: Eases migration for users who already use the OpenAI SDK or its ecosystem tools.
Limitations: Does not support calling advanced features, such as automatic image rotation and built-in OCR tasks, directly with parameters. You must manually simulate these features by creating complex prompts and then parsing the output.
Scenarios: Ideal for projects that already have an OpenAI integration and do not rely on advanced features that are exclusive to DashScope.
Getting started
The following example extracts key information from a train ticket image (URL) and returns the information in JSON format. For more information, see how to pass a local file and image limitations.
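If your image is a local file instead of a URL, a common approach with the OpenAI-compatible interface is to pass it as a Base64 data URL. A minimal sketch (the file name is hypothetical; adjust the MIME type to match your image format):

import base64

def encode_image(path):
    # Read a local image and return its Base64-encoded contents.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

base64_image = encode_image("ticket.jpg")  # hypothetical local file
# Use this dict in place of the remote URL in the examples below:
image_url = {"url": f"data:image/jpeg;base64,{base64_image}"}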
OpenAI compatible
Python
from openai import OpenAI
import os
PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image.
Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?).
Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""
try:
client = OpenAI(
# API keys are region-specific. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace base_url with https://dashscope-us.aliyuncs.com/compatible-mode/v1
# If you use a model in the China (Beijing) region, replace base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr-2025-11-20",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
"max_pixels": 32 * 32 * 8192
},
# The model supports passing a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{"type": "text",
"text": PROMPT_TICKET_EXTRACTION}
]
}
])
print(completion.choices[0].message.content)
except Exception as e:
print(f"Error message: {e}")Node.js
import OpenAI from 'openai';
// Define the prompt to extract train ticket information.
const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image.
Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?).
Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
`;
const openai = new OpenAI({
// API keys are region-specific. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace baseURL with https://dashscope-us.aliyuncs.com/compatible-mode/v1
// If you use a model in the China (Beijing) region, replace baseURL with https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});
async function main() {
const response = await openai.chat.completions.create({
model: 'qwen-vl-ocr-2025-11-20',
messages: [
{
role: 'user',
content: [
// The model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{ type: 'text', text: PROMPT_TICKET_EXTRACTION},
{
type: 'image_url',
image_url: {
url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
},
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
min_pixels: 32 * 32 * 3,
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
max_pixels: 32 * 32 * 8192
}
]
}
],
});
console.log(response.choices[0].message.content)
}
main();

curl
# ======= Important =======
# API keys are region-specific. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base URL with https://dashscope-us.aliyuncs.com/compatible-mode/v1/chat/completions
# If you use a model in the China (Beijing) region, replace the base URL with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-vl-ocr-2025-11-20",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
"min_pixels": 3072,
"max_pixels": 8388608
},
{"type": "text", "text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"}
]
}
]
}'

Example response
DashScope
Python
import os
import dashscope
PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image.
Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?).
Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
"max_pixels": 32 * 32 * 8192,
# Specifies whether to enable automatic image rotation.
"enable_rotate": False
},
# When no built-in task is set, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{"type": "text", "text": PROMPT_TICKET_EXTRACTION}]
}]
try:
response = dashscope.MultiModalConversation.call(
# API keys are region-specific. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
except Exception as e:
print(f"An error occurred: {e}")Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base URL with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base URL with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
map.put("min_pixels", 3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// When no built-in task is set, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys are region-specific. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}

curl
# ======= Important =======
# API keys are region-specific. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base URL with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base URL with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
},
{
"text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"
}
]
}
]
}
}'

Use built-in tasks
To simplify calls in specific scenarios, the models (except for qwen-vl-ocr-2024-10-28) include several built-in tasks.
How to use:
- DashScope SDK: You do not need to design and pass a prompt. The model uses a fixed prompt internally; you set the ocr_options parameter to call the built-in task.
- OpenAI-compatible SDK: You must manually enter the prompt specified for the task.
The following table lists the value of task, the specified Prompt, the output format, and an example for each built-in task:
High-precision recognition
We recommend that you use model versions later than qwen-vl-ocr-2025-08-28 or the latest version to call the high-precision recognition task. This task has the following features:
Recognizes and extracts text content.
Detects the position of text by locating text lines and outputting their coordinates.
After you obtain the coordinates of the text bounding box, see the FAQ for instructions on drawing the bounding box on the original image.
| Value of task | Specified prompt | Output format and example |
| --- | --- | --- |
| advanced_recognition | Locate all text lines and return the coordinates of the rotated rectangle | Plain text: each recognized text line together with the coordinates of its rotated bounding rectangle |

The following are code examples for calling the model using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
"max_pixels": 32 * 32 * 8192,
# Specifies whether to enable automatic image rotation.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
# Set the built-in task to high-precision recognition.
ocr_options={"task": "advanced_recognition"}
)
# The high-precision recognition task returns the result as plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])// dashscope SDK version >= 2.21.8
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
map.put("min_pixels", 3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.ADVANCED_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}

# ======= Important =======
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "advanced_recognition"
}
}
}
'

Information extraction
The model supports extracting structured information from documents such as receipts, certificates, and forms, and returns the results in JSON format. You can choose between two modes:
- Custom field extraction: Specify the fields to extract by providing a custom JSON template (result_schema) in the ocr_options.task_config parameter. The template defines the field names (keys) to extract, and the model automatically populates the corresponding values. The template supports up to three nested layers (see the example schema below).
- Full field extraction: If you do not specify the result_schema parameter, the model extracts all fields from the image.
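For example, a nested template might look like the following. The field names are hypothetical and only illustrate the nesting; the model fills in each value:

{
    "Passenger": {
        "Name": "The passenger name printed on the ticket",
        "ID Card Number": "The ID card number on the ticket"
    },
    "Ticket Price": "The fare, including the currency symbol"
}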
The prompts for the two modes are different:
| Value of task | Specified prompt | Output format and example |
| --- | --- | --- |
| key_information_extraction | Custom field extraction: a fixed prompt that embeds your result_schema. Full field extraction: a fixed prompt that requests all key-value fields. | JSON object containing the extracted key-value pairs |
The following are code examples for calling the model using the DashScope SDK and HTTP:
# Use "pip install -U dashscope" to update the SDK.
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role":"user",
"content":[
{
"image":"http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": False
}
]
}
]
params = {
"ocr_options":{
"task": "key_information_extraction",
"task_config": {
"result_schema": {
"Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
"Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
"Invoice Number": "Extract the number from the invoice, usually composed of only digits."
}
}
}
}
response = dashscope.MultiModalConversation.call(
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
**params)
print(response.output.choices[0].message.content[0]["ocr_result"])

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.google.gson.JsonObject;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
map.put("min_pixels", 3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
// Create the main JSON object.
JsonObject resultSchema = new JsonObject();
resultSchema.addProperty("Ride Date", "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05");
resultSchema.addProperty("Invoice Code", "Extract the invoice code from the image, usually a combination of numbers or letters");
resultSchema.addProperty("Invoice Number", "Extract the number from the invoice, usually composed of only digits.");
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
.taskConfig(OcrOptions.TaskConfig.builder()
.resultSchema(resultSchema)
.build())
.build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}

# ======= Important =======
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "key_information_extraction",
"task_config": {
"result_schema": {
"Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
"Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
"Invoice Number": "Extract the number from the invoice, usually composed of only digits."
}
}
}
}
}
'

If you use the OpenAI SDK or HTTP, you must append the custom JSON schema to the end of the prompt string, as shown in the following code example:
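The exact wording of the task's internal prompt is not published, so the instruction text in this sketch is an approximation:

from openai import OpenAI
import os, json

# Custom JSON template: the keys to extract; the model fills in the values.
result_schema = {
    "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD",
    "Invoice Code": "Extract the invoice code from the image",
    "Invoice Number": "Extract the number from the invoice"
}
# Approximate the built-in task by appending the schema to the prompt string.
prompt = ("Extract the key information from the image and return it as JSON "
          "following this template: " + json.dumps(result_schema, ensure_ascii=False))

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr-2025-11-20",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"}},
            {"type": "text", "text": prompt},
        ],
    }],
)
print(completion.choices[0].message.content)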
Table parsing
The model parses the table elements in the image and returns the recognition result as text in HTML format.
| Value of task | Specified prompt | Output format and example |
| --- | --- | --- |
| table_parsing | (fixed internal prompt) | HTML markup of the recognized table |
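Because the result is standard HTML, it can be post-processed with common tools. A minimal sketch (assuming the model's reply is stored in html_text) that loads the first recognized table into a pandas DataFrame:

import io
import pandas as pd  # pd.read_html also requires lxml or html5lib

# Placeholder for the HTML returned by the table_parsing task.
html_text = "<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Pen</td><td>2</td></tr></table>"
tables = pd.read_html(io.StringIO(html_text))  # parses every <table> element
print(tables[0])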
The following are code examples for calling the model using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
"max_pixels": 32 * 32 * 8192,
# Specifies whether to enable automatic image rotation.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
# Set the built-in task to table parsing.
ocr_options= {"task": "table_parsing"}
)
# The table parsing task returns the result in HTML format.
print(response["output"]["choices"][0]["message"].content[0]["text"])import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
map.put("min_pixels",3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.TABLE_PARSING)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}

# ======= Important =======
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "table_parsing"
}
}
}
'

Document parsing
The model can parse scanned documents or PDF pages that are stored as images. It recognizes elements such as titles, summaries, and labels in the file and returns the recognition results as text in LaTeX format.
| Value of task | Specified prompt | Output format and example |
| --- | --- | --- |
| document_parsing | (fixed internal prompt) | LaTeX-formatted text for the parsed document |
The following are code examples for calling the model using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
"max_pixels": 32 * 32 * 8192,
# Specifies whether to enable automatic image rotation.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
# Set the built-in task to document parsing.
ocr_options= {"task": "document_parsing"}
)
# The document parsing task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
map.put("min_pixels", 3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.DOCUMENT_PARSING)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}

# ======= Important =======
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "document_parsing"
}
}
}
'

Formula recognition
The model parses formulas in images and returns the recognition results as text in LaTeX format.
| Value of task | Specified prompt | Output format and example |
| --- | --- | --- |
| formula_recognition | (fixed internal prompt) | LaTeX source of the recognized formula |
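To spot-check a recognized formula, you can render the returned LaTeX locally. A minimal sketch using Matplotlib's mathtext, which supports a subset of LaTeX (complex output may need a full LaTeX toolchain):

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

latex_result = r"\sum_{i=1}^{n} x_i^2"  # placeholder for the model output
fig = plt.figure(figsize=(4, 1))
fig.text(0.05, 0.4, f"${latex_result}$", fontsize=16)
fig.savefig("formula_preview.png", bbox_inches="tight")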
The following are code examples for calling the model using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
"max_pixels": 32 * 32 * 8192,
# Specifies whether to enable automatic image rotation.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
# Set the built-in task to formula recognition.
ocr_options= {"task": "formula_recognition"}
)
# The formula recognition task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
map.put("min_pixels", 3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.FORMULA_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}

# ======= Important =======
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "formula_recognition"
}
}
}
'

General text recognition
General text recognition is primarily used for Chinese and English scenarios and returns recognition results in plain text format.
| Value of task | Specified prompt | Output format and example |
| --- | --- | --- |
| text_recognition | (fixed internal prompt) | Plain text |
The following are code examples for calling the model using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
"max_pixels": 32 * 32 * 8192,
# Specifies whether to enable automatic image rotation.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
# Set the built-in task to general text recognition.
ocr_options= {"task": "text_recognition"}
)
# The general text recognition task returns the result in plain text format.
print(response["output"]["choices"][0]["message"].content[0]["text"])import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
map.put("min_pixels", 3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
// Configure the built-in task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.TEXT_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
# ======= Important =======
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "text_recognition"
}
}
}'
Multilingual recognition
Multilingual recognition is used for scenarios that involve languages other than Chinese and English. Supported languages are Arabic, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Vietnamese. The recognition results are returned in plain text format.
Value of task | Specified prompt | Output format and example
multi_lan | | Plain text
The following are code examples for calling the model using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
"max_pixels": 32 * 32 * 8192,
# Specifies whether to enable automatic image rotation.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
# Set the built-in task to multilingual recognition.
ocr_options={"task": "multi_lan"}
)
# The multilingual recognition task returns the result as plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels are greater than min_pixels.
map.put("min_pixels", 3072);
// Specifies whether to enable automatic image rotation.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.MULTI_LAN)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
# ======= Important =======
# API keys vary by region. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "multi_lan"
}
}
}
'
Pass a local file (Base64 encoding or file path)
Qwen-VL provides two methods for uploading local files: Base64 encoding and direct file path. You can select an upload method based on the file size and SDK type. For more information, see How to select a file upload method. Both methods must meet the file requirements described in Image limits.
Upload using Base64 encoding
You can convert the file to a Base64-encoded string and then pass it to the model. This method works with the OpenAI and DashScope SDKs as well as direct HTTP requests.
Upload using a file path
You can pass the local file path directly to the model. This method is supported only by the DashScope Python and Java SDKs. It is not supported by DashScope HTTP or OpenAI-compatible methods.
Specify the file path in the file:// format that matches your programming language and operating system. The sketch below shows one portable way to build this URI in Python, and the code examples that follow show the full call.
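A minimal sketch, assuming a Python environment, that builds a file:// URI portably with pathlib (the path is a placeholder):
from pathlib import Path

# Resolve to an absolute path, then convert it to a file:// URI.
# On Linux/macOS this yields file:///home/user/test.jpg;
# on Windows, file:///D:/images/test.jpg.
image_uri = Path("xxx/test.jpg").resolve().as_uri()  # placeholder path
print(image_uri)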
Pass a file path
Passing a file path is supported only when you call the model using the DashScope Python and Java SDKs. This method is not supported for DashScope HTTP or OpenAI-compatible methods.
Python
import os
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Replace xxx/test.jpg with the absolute path of your local image.
local_path = "xxx/test.jpg"
image_path = f"file://{local_path}"
messages = [
{
"role": "user",
"content": [
{
"image": image_path,
# The minimum pixel threshold for the input image. If the image has fewer pixels than this value, it is scaled up until the total number of pixels is greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image has more pixels than this value, it is scaled down until the total number of pixels is less than max_pixels.
"max_pixels": 32 * 32 * 8192,
},
# If no built-in task is set for the model, you can pass a prompt in the text field. If you do not pass a prompt, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{
"text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"
},
],
}
]
response = dashscope.MultiModalConversation.call(
# API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr-2025-11-20",
messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall(String localPath)
throws ApiException, NoApiKeyException, UploadFileException {
String filePath = "file://"+localPath;
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", filePath);
// The maximum pixel threshold for the input image. If the image has more pixels than this value, it is scaled down until the total number of pixels is less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image has fewer pixels than this value, it is scaled up until the total number of pixels is greater than min_pixels.
map.put("min_pixels", 3072);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// If no built-in task is set for the model, you can pass a prompt in the text field. If you do not pass a prompt, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
// If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
// Replace xxx/test.jpg with the absolute path of your local image.
simpleMultiModalConversationCall("xxx/test.jpg");
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Pass a Base64-encoded string
OpenAI compatible
Python
from openai import OpenAI
import os
import base64
# Read a local file and encode it in Base64 format.
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
# Replace xxx/test.png with the absolute path of your local image.
base64_image = encode_image("xxx/test.png")
client = OpenAI(
# API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
api_key=os.getenv('DASHSCOPE_API_KEY'),
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/compatible-mode/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr-2025-11-20",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
# Note: When you pass a Base64-encoded string, the image format (image/{format}) must match the Content-Type in the list of supported images. The f prefix creates a Python f-string for formatting.
# PNG image: f"data:image/png;base64,{base64_image}"
# JPEG image: f"data:image/jpeg;base64,{base64_image}"
# WEBP image: f"data:image/webp;base64,{base64_image}"
"image_url": {"url": f"data:image/png;base64,{base64_image}"},
# The minimum pixel threshold for the input image. If the image has fewer pixels than this value, it is scaled up until the total number of pixels is greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image has more pixels than this value, it is scaled down until the total number of pixels is less than max_pixels.
"max_pixels": 32 * 32 * 8192
},
# The model supports passing a prompt in the following text field. If you do not pass a prompt, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{"type": "text", "text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"},
],
}
],
)
print(completion.choices[0].message.content)
Node.js
import OpenAI from "openai";
import {
readFileSync
} from 'fs';
const client = new OpenAI({
// API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
// If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/compatible-mode/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});
// Read a local file and encode it in Base64 format.
const encodeImage = (imagePath) => {
const imageFile = readFileSync(imagePath);
return imageFile.toString('base64');
};
// Replace xxx/test.jpg with the absolute path of your local image.
const base64Image = encodeImage("xxx/test.jpg")
async function main() {
const completion = await client.chat.completions.create({
model: "qwen-vl-ocr",
messages: [{
"role": "user",
"content": [{
"type": "image_url",
"image_url": {
// Note: When you pass a Base64-encoded string, the image format (image/{format}) must match the Content-Type in the list of supported images.
// PNG image: data:image/png;base64,${base64Image}
// JPEG image: data:image/jpeg;base64,${base64Image}
// WEBP image: data:image/webp;base64,${base64Image}
"url": `data:image/jpeg;base64,${base64Image}`
},
// The minimum pixel threshold for the input image. If the image has fewer pixels than this value, it is scaled up until the total number of pixels is greater than min_pixels.
"min_pixels": 32 * 32 * 3,
// The maximum pixel threshold for the input image. If the image has more pixels than this value, it is scaled down until the total number of pixels is less than max_pixels.
"max_pixels": 32 * 32 * 8192
},
// The model supports passing a prompt in the following text field. If you do not pass a prompt, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{
"type": "text",
"text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"
}
]
}]
});
console.log(completion.choices[0].message.content);
}
main();
curl
For information about how to convert a file to a Base64-encoded string, see the example code.
For demonstration purposes, the Base64-encoded string "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In practice, you must pass the complete encoded string.
# ======= Important =======
# API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/compatible-mode/v1/chat/completions
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before running ===
curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-ocr-latest",
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
{"type": "text", "text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"}
]
}]
}'
DashScope
Python
import os
import base64
import dashscope
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Base64 encoding format.
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
# Replace xxx/test.jpg with the absolute path of your local image.
base64_image = encode_image("xxx/test.jpg")
messages = [
{
"role": "user",
"content": [
{
# Note: When you pass a Base64-encoded string, the image format (image/{format}) must match the Content-Type in the list of supported images. The f prefix creates a Python f-string for formatting.
# PNG image: f"data:image/png;base64,{base64_image}"
# JPEG image: f"data:image/jpeg;base64,{base64_image}"
# WEBP image: f"data:image/webp;base64,{base64_image}"
"image": f"data:image/jpeg;base64,{base64_image}",
# The minimum pixel threshold for the input image. If the image has fewer pixels than this value, it is scaled up until the total number of pixels is greater than min_pixels.
"min_pixels": 32 * 32 * 3,
# The maximum pixel threshold for the input image. If the image has more pixels than this value, it is scaled down until the total number of pixels is less than max_pixels.
"max_pixels": 32 * 32 * 8192,
},
# If no built-in task is set for the model, you can pass a prompt in the text field. If you do not pass a prompt, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{
"text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"
},
],
}
]
response = dashscope.MultiModalConversation.call(
# API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr-2025-11-20",
messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1
// If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
// Base64 encoding format.
private static String encodeImageToBase64(String imagePath) throws IOException {
Path path = Paths.get(imagePath);
byte[] imageBytes = Files.readAllBytes(path);
return Base64.getEncoder().encodeToString(imageBytes);
}
public static void simpleMultiModalConversationCall(String localPath)
throws ApiException, NoApiKeyException, UploadFileException, IOException {
String base64Image = encodeImageToBase64(localPath); // Base64 encoding.
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "data:image/jpeg;base64," + base64Image);
// The maximum pixel threshold for the input image. If the image has more pixels than this value, it is scaled down until the total number of pixels is less than max_pixels.
map.put("max_pixels", 8388608);
// The minimum pixel threshold for the input image. If the image has fewer pixels than this value, it is scaled up until the total number of pixels is greater than min_pixels.
map.put("min_pixels", 3072);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// If no built-in task is set for the model, you can pass a prompt in the text field. If you do not pass a prompt, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
// If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
// Replace xxx/test.jpg with the absolute path of your local image.
simpleMultiModalConversationCall("xxx/test.jpg");
} catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
For information about how to convert a file to a Base64-encoded string, see the example code.
For demonstration purposes, the Base64-encoded string "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In practice, you must pass the complete encoded string.
# ======= Important =======
# API keys are different for different regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key.
# The following is the base URL for the Singapore region. If you use a model in the US (Virginia) region, replace the base_url with https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# If you use a model in the China (Beijing) region, replace the base_url with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-ocr-latest",
"input":{
"messages":[
{
"role": "user",
"content": [
{"image": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
{"text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"}
]
}
]
}
}'
More usage
Limits
Image limits
Image size:
If you provide an image using a public URL or a local path, the image cannot exceed 10 MB.
If you provide the data in Base64 encoding, the encoded string cannot exceed 10 MB.
For more information about how to compress the file size, see How to compress an image or video to the required size; a minimal compression sketch follows this list.
Dimensions and aspect ratio: The image width and height must both be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.
Total pixels: The model automatically scales images, so there is no strict limit on the total number of pixels. However, an image should not exceed 15.68 million pixels.
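A minimal sketch of one way to stay within these limits, assuming Pillow is installed (the paths, quality ladder, and JPEG re-encoding are illustrative choices, not requirements):
import math
from io import BytesIO
from PIL import Image

def shrink_image(src, dst, max_pixels=15_680_000, max_bytes=10 * 1024 * 1024):
    img = Image.open(src)
    # Downscale if the total pixel count exceeds the limit.
    if img.width * img.height > max_pixels:
        scale = math.sqrt(max_pixels / (img.width * img.height))
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    # Re-encode as JPEG, lowering quality until the file fits under max_bytes.
    buf = BytesIO()
    for quality in (95, 85, 75, 65, 50):
        buf = BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
        if buf.tell() <= max_bytes:
            break
    with open(dst, "wb") as f:
        f.write(buf.getvalue())

shrink_image("xxx/test.png", "xxx/test_small.jpg")  # placeholder paths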
Image format | Common extensions | MIME Type
BMP | .bmp | image/bmp
JPEG | .jpe, .jpeg, .jpg | image/jpeg
PNG | .png | image/png
TIFF | .tif, .tiff | image/tiff
WEBP | .webp | image/webp
HEIC | .heic | image/heic
Model limits
System message: This model does not support a custom system message because it uses a fixed internal system message. All instructions must be passed through the user message.
No multi-turn conversations: The model does not support multi-turn conversations and only answers the most recent question.
Hallucination risk: The model may hallucinate if text in an image is too small or has a low resolution. Additionally, the accuracy of answers to questions not related to text extraction is not guaranteed.
Cannot process text files:
For files that contain image data, follow the recommendations in Going live to transform them into an image sequence before processing.
For files with plain text or structured data, use Qwen-Long, a model that can parse long text.
Billing and rate limiting
Billing: Qwen-OCR is a multimodal model. The total cost is calculated as follows: (Number of input tokens × Unit price for input) + (Number of output tokens × Unit price for output). A worked sketch follows this list. For more information about how image tokens are calculated, see Image token conversion method. You can view your bills or top up your account on the Expenses and Costs page in the Alibaba Cloud Management Console.
Rate limiting: Rate limits for Qwen-OCR are set per model name and are measured in requests per minute (RPM) and tokens per minute (TPM); the TPM value includes both input and output tokens. Rate limiting is triggered if either value is exceeded. The service may also impose per-second limits, calculated as RPS = RPM/60 and TPS = TPM/60; for example, 600 RPM corresponds to 10 RPS. The per-minute limits are as follows: qwen-vl-ocr (600 RPM, 6,000,000 TPM), qwen-vl-ocr-latest (1,200 RPM), qwen-vl-ocr-2025-11-20, qwen-vl-ocr-2025-08-28 (600 RPM), and qwen-vl-ocr-2024-10-28.
Free quota (Singapore region only): Qwen-OCR provides a free quota of 1 million tokens. This quota is valid for 90 days, starting from the date you activate Model Studio or the date your request to use the model is approved.
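As a worked illustration of the billing formula above (the token counts and unit prices here are placeholders, not actual rates; check the pricing tables for your region):
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    # Unit prices are quoted per million tokens, so scale the token counts down.
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Hypothetical call: 20,000 input tokens and 1,000 output tokens at
# $0.10 and $0.20 per million tokens.
print(call_cost(20_000, 1_000, 0.10, 0.20))  # 0.002 + 0.0002 = 0.0022 USD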
Going live
Processing multi-page documents, such as PDFs:
Split: Use an image editing library, such as Python's pdf2image, to convert each page of a PDF file into a high-quality image.
Submit a request: Use the multi-image input method for recognition, as sketched below.
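A minimal sketch of this flow, assuming pdf2image (which requires poppler) is installed; the PDF path, DPI, and page file names are illustrative:
import os
import dashscope
from pdf2image import convert_from_path

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Split: render each PDF page to a high-quality image.
pages = convert_from_path("xxx/report.pdf", dpi=200)
content = []
for i, page in enumerate(pages):
    path = os.path.abspath(f"xxx/page_{i}.png")
    page.save(path, "PNG")
    content.append({"image": f"file://{path}"})

# Submit a request: pass all page images in one user message.
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr-2025-11-20",
    messages=[{"role": "user", "content": content}],
    ocr_options={"task": "text_recognition"},
)
print(response["output"]["choices"][0]["message"].content[0]["text"])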
Image pre-processing:
Ensure that input images are clear, evenly lit, and not overly compressed:
To prevent information loss, use lossless formats, such as PNG, for image storage and transmission.
To improve image definition, use denoising algorithms, such as mean or median filtering, to smooth noisy images.
To correct uneven lighting, use algorithms such as adaptive histogram equalization to adjust brightness and contrast.
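These techniques are named above without an implementation; the following is a minimal sketch, assuming OpenCV (cv2) is installed, that applies median filtering and adaptive histogram equalization (CLAHE):
import cv2

img = cv2.imread("xxx/test.jpg")  # placeholder path

# Denoise: median filtering smooths salt-and-pepper noise.
denoised = cv2.medianBlur(img, 3)

# Correct uneven lighting: equalize the lightness channel with CLAHE,
# leaving the color channels untouched.
lab = cv2.cvtColor(denoised, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
out = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

# Save losslessly (PNG) to avoid introducing new compression artifacts.
cv2.imwrite("xxx/test_clean.png", out)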
For skewed images: Set the DashScope SDK's enable_rotate parameter to true to significantly improve recognition performance.
For very small or very large images: Use the min_pixels and max_pixels parameters to control how images are scaled before recognition.
min_pixels: Enlarges small images to improve detail detection. Keep the default value.
max_pixels: Prevents oversized images from consuming excessive resources. For most scenarios, the default value is sufficient. If small text is not recognized clearly, increase the max_pixels value. Note that this increases token consumption.
Result validation: The model's recognition results may contain errors. For critical business operations, implement a manual review process or add validation rules to verify the accuracy of the model's output. For example, use format validation for ID card and bank card numbers.
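One way to add such validation rules (a minimal sketch; the field names and patterns are illustrative assumptions, not part of the API):
import re

# Illustrative format rules; adapt the patterns to your own documents.
RULES = {
    "ID Card Number": re.compile(r"^\d{17}[\dXx]$"),   # 18-character Chinese ID format
    "Ticket Price": re.compile(r"^\d+(\.\d{1,2})?$"),  # numeric price, up to 2 decimals
}

def failed_fields(extracted: dict) -> list:
    # Return the fields whose values do not match their format rule.
    return [field for field, pattern in RULES.items()
            if field in extracted and not pattern.match(str(extracted[field]))]

result = failed_fields({"ID Card Number": "1101011990030712??", "Ticket Price": "553.0"})
print(result)  # ['ID Card Number'] -> route these records to manual review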
Batch calls: In large-scale, non-real-time scenarios, use the Batch API to asynchronously process batch jobs at a lower cost.
FAQ
How to choose a file upload method?
How do I draw detection boxes on the original image after the model outputs text localization results?
API reference
The Qwen-OCR API reference describes the input and output parameters of the Qwen-OCR model.
Error codes
If a call fails, see Error messages for troubleshooting.