Qwen-OCR is a visual understanding model that specializes in text extraction. It can extract text or parse structured data from various images, such as scanned documents, tables, and receipts. It supports multiple languages and can perform advanced functions, such as information extraction, table parsing, and formula recognition, using specific task instructions.
You can try Qwen-OCR online in the Playground (Singapore or Beijing).
Examples
Input image | Recognition result |
Recognize multiple languages (sample image not shown) | (result not shown) |
Recognize skewed images (sample image not shown) | Product Introduction This product is made of imported fiber filaments from South Korea. It does not shrink, deform, mold, or grow bacteria, and will not damage surfaces. It is truly non-stick, highly absorbent, water-resistant, easy to clean, non-toxic, residue-free, and quick-drying. Store experience: Stainless steel, ceramic products, bathtubs, and integrated bathrooms mostly have white, smooth surfaces. Stains are difficult to remove with other cloths, and sharp objects can easily cause scratches. Use this simulated loofah sponge with a small amount of neutral detergent to create a lather, and you can easily wipe these surface stains clean. 6941990612023 Item No.: 2023 |
Locate text position (sample image not shown) | Visualization of location results. For more information about how to draw the bounding box for each line of text on the original image, see FAQ. |
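The FAQ itself is not reproduced here. As a rough illustration only (not from the official documentation), the following sketch shows how you might draw one box per text line with Pillow, assuming you have already parsed the model output into four [x, y] corner points per line in original-image pixel coordinates; the file names and coordinates below are hypothetical.
from PIL import Image, ImageDraw

def draw_text_boxes(image_path, quads, output_path="annotated.jpg"):
    """Draw one closed polygon per recognized text line and save the annotated image."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for quad in quads:
        points = [tuple(point) for point in quad]
        # Close the polygon by repeating the first corner, then draw it as a thick red outline.
        draw.line(points + [points[0]], fill=(255, 0, 0), width=3)
    image.save(output_path)

# Hypothetical coordinates for two text lines, for illustration only:
draw_text_boxes("input.jpg", [
    [[35, 40], [420, 40], [420, 90], [35, 90]],
    [[35, 110], [300, 112], [300, 160], [35, 158]],
])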
Models and pricing
International (Singapore)
Model | Version | Context window (tokens) | Maximum input (tokens) | Maximum output (tokens) | Input/Output price (per million tokens) | Free quota |
qwen-vl-ocr | Stable | 34,096 | 30,000 (max 30,000 for a single image) | 4,096 | $0.72 | 1 million tokens each. Valid for 90 days after activating Model Studio. |
Chinese mainland (Beijing)
Model | Version | Context window (tokens) | Maximum input (tokens) | Maximum output (tokens) | Input/Output price (per million tokens) |
qwen-vl-ocr (currently has the same capabilities as qwen-vl-ocr-2025-04-13) | Stable | 34,096 | 30,000 (max 30,000 for a single image) | 4,096 | $0.717 |
qwen-vl-ocr-latest (always has the same capabilities as the latest snapshot version) | Latest | | | | |
qwen-vl-ocr-2025-04-13 (also known as qwen-vl-ocr-0413; significantly improves text recognition, adds six built-in OCR tasks, and supports custom prompts and image rotation correction) | Snapshot | | | | |
qwen-vl-ocr-2024-10-28 (also known as qwen-vl-ocr-1028) | Snapshot | | | | |
For qwen-vl-ocr, the max_tokens parameter (maximum output length) defaults to 4096. To increase this value to a range of 4097 to 8192, send an email to modelstudio@service.aliyun.com and include the following information: your Alibaba Cloud account ID, the image type (such as document image, e-commerce image, or contract), the model name, the estimated queries per second (QPS) and total daily requests, and the percentage of requests where the model output exceeds 4096 tokens.
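For reference, max_tokens is passed directly in the request. The following is a minimal sketch (not an official sample) using the OpenAI compatible SDK with the Singapore base URL, the demo image from the Getting started section, and the default prompt; values above 4096 only take effect after the request described above is approved.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
            {"type": "text",
             "text": "Please output only the text content from the image without any additional descriptions or formatting."},
        ],
    }],
    # Maximum output length. Defaults to 4096; the 4097-8192 range requires the approval described above.
    max_tokens=4096,
)
print(completion.choices[0].message.content)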
Preparations
If you call the model using the OpenAI SDK or DashScope SDK, install the latest version of the SDK. The minimum version for the DashScope Python SDK is 1.22.2, and for the Java SDK is 2.21.8.
DashScope SDK
Pros: Supports all advanced features, such as image rotation correction and built-in OCR tasks. It offers more comprehensive functionality and a simpler call method.
Scenarios: Projects that require full functionality.
OpenAI compatible SDK
Pros: Convenient for users who are already using the OpenAI SDK or its ecosystem tools to migrate quickly.
Limits: Advanced features such as image rotation correction and built-in OCR tasks are not directly supported through parameters. You must simulate them by constructing the corresponding prompts yourself and parsing the output results.
Scenarios: Projects with an existing OpenAI integration that do not rely on DashScope's exclusive advanced features.
Getting started
The following example extracts key information from a train ticket image (URL) and returns it in JSON format. For more information, see the sections on how to pass a local file and image limitations.
OpenAI compatible
Python
from openai import OpenAI
import os
PROMPT_TICKET_EXTRACTION = """
Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?).
Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""
try:
client = OpenAI(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr-latest",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192
},
# qwen-vl-ocr supports passing a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
# If you call qwen-vl-ocr-1028, the model uses a fixed prompt: Read all the text in the image. Custom prompts in the text field are not supported.
{"type": "text",
"text": PROMPT_TICKET_EXTRACTION}
]
}
])
print(completion.choices[0].message.content)
except Exception as e:
print(f"Error message: {e}")
Node.js
import OpenAI from 'openai';
// Define the prompt for extracting train ticket information.
const PROMPT_TICKET_EXTRACTION = `
Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?).
Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx','Ticket Price':'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
`;
const openai = new OpenAI({
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If the environment variable is not set, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});
async function main() {
const response = await openai.chat.completions.create({
model: 'qwen-vl-ocr',
messages: [
{
role: 'user',
content: [
// qwen-vl-ocr supports passing a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{ type: 'text', text: PROMPT_TICKET_EXTRACTION},
{
type: 'image_url',
image_url: {
url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
},
// Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
// Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192
}
]
}
],
});
console.log(response.choices[0].message.content)
}
main();
curl
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-vl-ocr",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528
},
{"type": "text", "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx','Ticket Price':'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"}
]
}
]
}'
Response example
DashScope
Python
import os
import dashscope
PROMPT_TICKET_EXTRACTION = """
Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?).
Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192,
# Enable automatic image rotation correction.
"enable_rotate": False
},
# When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{"type": "text", "text": PROMPT_TICKET_EXTRACTION}]
}]
try:
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr',
messages=messages
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
except Exception as e:
print(f"An error occurred: {e}")
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
// Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
// Enable automatic image rotation correction.
map.put("enable_rotate", false);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
Collections.singletonMap("text", "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr",
"input": {
"messages": [
{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": false
},
{
"text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"
}
]
}
]
}
}'
Use built-in tasks
To simplify calls in specific scenarios, the qwen-vl-ocr models include several built-in tasks.
How to use:
DashScope SDK: You do not need to design or pass in a prompt. You can set the ocr_options parameter to call a built-in task.
OpenAI compatible SDK: You must manually enter the prompt specified for the task.
The following table lists the task value, prompt, output format, and example for each built-in task:
High-precision recognition
We recommend that you use the qwen-vl-ocr-2025-08-28 snapshot or qwen-vl-ocr-latest, because this version includes a comprehensive upgrade to text localization. The high-precision recognition task provides the following features:
Recognize text content (extract text)
Detect text position (locate text lines and output coordinates)
After you obtain the coordinates of the text bounding box, see the FAQ for instructions on how to draw the bounding box on the original image.
Value of task | Specified prompt | Output format and example |
advanced_recognition | Locate all text lines and return the coordinates of the rotated rectangle | Text content and text-line coordinates in plain text (example omitted) |
import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
# Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192,
# Enable automatic image rotation correction.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-latest',
messages=messages,
# Set the built-in task to high-precision recognition.
ocr_options={"task": "advanced_recognition"}
)
# The high-precision recognition task returns the result as plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])
// dashscope SDK version >= 2.21.8
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
// Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
// Enable automatic image rotation correction.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.ADVANCED_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-08-28")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-08-28",
"input": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "advanced_recognition"
}
}
}
'
Information extraction
The model supports extracting structured information from documents such as receipts, certificates, and forms, and returns the results in JSON format. You can choose from two modes:
Custom field extraction: Extracts specific fields using a custom JSON template ({result_schema}) that you specify in the ocr_options.task_config parameter. The template defines the field names (keys), and the model automatically fills in the corresponding values. The template supports a maximum of three nested layers (see the sketch after the table below).
Full field extraction: Automatically extracts all recognizable fields from the image.
The prompts for the two modes are different:
Value of task | Specified prompt | Output format and example |
key_information_extraction | Custom field extraction: (prompt omitted; it contains the {result_schema} placeholder) Full field extraction: (prompt omitted) | JSON (example omitted) |
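Built-in tasks are not exposed as parameters in the OpenAI compatible SDK. If you use it, place the task's specified prompt in the text field yourself and replace {result_schema} with your JSON template. The following sketch is illustrative only: the prompt wording is a placeholder rather than the official task prompt, and the "Charges" field is a made-up example of one level of nesting.
import json
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
# Custom JSON template: keys are the fields to extract; the model fills in the values.
# Up to three nested layers are supported.
result_schema = {"Date": "", "Time": "", "Charges": {"Fuel Surcharge": ""}}
# Placeholder prompt for illustration; substitute the specified prompt from the table above,
# with {result_schema} replaced by your JSON template.
prompt = ("Extract the following fields from the receipt image and return the result as JSON: "
          + json.dumps(result_schema, ensure_ascii=False))
completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"},
            {"type": "text", "text": prompt},
        ],
    }],
)
print(completion.choices[0].message.content)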
The following are sample codes for making calls using the DashScope SDK and HTTP:
# use [pip install -U dashscope] to update sdk
import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role":"user",
"content":[
{
"image":"http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": False
}
]
}
]
params = {
"ocr_options":{
"task": "key_information_extraction",
"task_config": {
"result_schema": {
"Date": "",
"Time": "",
"Fuel Surcharge": ""
}
}
}
}
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr',
messages=messages,
**params)
print(response.output.choices[0].message.content[0]["ocr_result"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.google.gson.JsonObject;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg");
// Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
// Enable automatic image rotation correction.
map.put("enable_rotate", false);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
// Create the main JSON object.
JsonObject resultSchema = new JsonObject();
resultSchema.addProperty("Date", "");
resultSchema.addProperty("Time", "");
resultSchema.addProperty("Fuel Surcharge", "");
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
.taskConfig(OcrOptions.TaskConfig.builder()
.resultSchema(resultSchema)
.build())
.build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "key_information_extraction",
"task_config": {
"result_schema": {
"Date": "",
"Time": "",
"Fuel Surcharge": ""
}
}
}
}
}
'
If you call the model using the OpenAI SDK or over HTTP, you must also replace {result_schema} in the specified prompt with the JSON object to extract, as shown in the OpenAI-compatible sketch earlier in this section.
Table parsing
The model parses table elements in an image and returns the recognition result as text in HTML format.
Value of task | Specified prompt | Output format and example |
table_parsing | In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin. | HTML (example omitted) |
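The samples below return the parsed table as an HTML string. As an optional post-processing sketch (not part of the official samples; it assumes pandas and a parser such as lxml are installed), you can convert that string into DataFrames:
from io import StringIO

import pandas as pd

def html_tables_to_dataframes(html_text):
    """Parse every table in the model output into a pandas DataFrame."""
    # The model may emit only <tr>/<td> rows; wrap them in a <table> so pandas can find them.
    if "<table" not in html_text:
        html_text = f"<table>{html_text}</table>"
    return pd.read_html(StringIO(html_text))

# Hypothetical usage with a DashScope response from the samples below:
# html_text = response["output"]["choices"][0]["message"].content[0]["text"]
# for table in html_tables_to_dataframes(html_text):
#     print(table)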
The following are sample codes for making calls using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
# Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192,
# Enable automatic image rotation correction.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr',
messages=messages,
# Set the built-in task to table parsing.
ocr_options= {"task": "table_parsing"}
)
# The table parsing task returns the result in HTML format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
// Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
// Enable automatic image rotation correction.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.TABLE_PARSING)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// When the task field in ocr_options is set to table parsing, the model uses the content of the following text field as the prompt. Custom prompts are not supported.
Collections.singletonMap("text", "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "table_parsing"
}
}
}
'
Document parsing
The model supports parsing scanned documents or PDF documents stored as images. It can identify elements such as titles, summaries, and labels, and returns the recognition result as text in LaTeX format.
Value of task | Specified prompt | Output format and example |
document_parsing | (prompt omitted) | LaTeX (example omitted) |
The following are sample codes for making calls using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
# Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192,
# Enable automatic image rotation correction.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr',
messages=messages,
# Set the built-in task to document parsing.
ocr_options= {"task": "document_parsing"}
)
# The document parsing task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
// Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
// Enable automatic image rotation correction.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.DOCUMENT_PARSING)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr",
"input": {
"messages": [{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [{
"type": "image",
"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "document_parsing"
}
}
}
'
Formula recognition
The model supports parsing formulas in images and returns the recognition result as text in LaTeX format.
Value of task | Specified prompt | Output format and example |
formula_recognition | (prompt omitted) | LaTeX (example omitted) |
The following are sample codes for making calls using the DashScope SDK and HTTP:
import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
# Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192,
# Enable automatic image rotation correction.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr',
messages=messages,
# Set the built-in task to formula recognition.
ocr_options= {"task": "formula_recognition"}
)
# The formula recognition task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
// Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
// Enable automatic image rotation correction.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.FORMULA_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
// If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr",
"input": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "formula_recognition"
}
}
}
'
General text recognition
General text recognition is mainly used for Chinese and English scenarios and returns recognition results in plain text format.
Value of task | Specified prompt | Output format and example |
text_recognition | (prompt omitted) | Plain text (example omitted) |
The following are code examples for making calls using the DashScope SDK and HTTP:
import os
import dashscope
# The following URL is for the Singapore region. To use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
# The minimum pixel threshold for the input image. If an image's total pixel count is below this value, the image is scaled up proportionally until it exceeds min_pixels.
"min_pixels": 28 * 28 * 4,
# The maximum pixel threshold for the input image. If an image's total pixel count exceeds this value, the image is scaled down proportionally until it is below max_pixels.
"max_pixels": 28 * 28 * 8192,
# Enable the automatic image rotation feature.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys differ between the Singapore and Beijing regions. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr',
messages=messages,
# Set the built-in task to text recognition.
ocr_options= {"task": "text_recognition"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following URL is for the Singapore region. To use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
// The maximum pixel threshold for the input image. If an image's total pixel count exceeds this value, the image is scaled down proportionally until it is below max_pixels.
map.put("max_pixels", "6422528");
// The minimum pixel threshold for the input image. If an image's total pixel count is below this value, the image is scaled up proportionally until it exceeds min_pixels.
map.put("min_pixels", "3136");
// Enable the automatic image rotation feature.
map.put("enable_rotate", false);
// Configure the built-in task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.TEXT_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys differ between the Singapore and Beijing regions. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
// If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
# ======= Important Notes =======
# API keys differ between the Singapore and Beijing regions. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following URL is for the Singapore region. To use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete these comments before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
--header "Authorization: Bearer $DASHSCOPE_API_KEY"\
--header 'Content-Type: application/json'\
--data '{
"model": "qwen-vl-ocr",
"input": {
"messages": [
{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 3136,
"max_pixels": 6422528,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "text_recognition"
}
}
}'
Multilingual recognition
You can use multilingual recognition for scenarios that involve languages other than Chinese and English. Supported languages include Arabic, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Vietnamese. The recognition results are returned in plain text format.
Value of task | Specified prompt | Output format and example |
multi_lan | (prompt omitted) | Plain text (example omitted) |
The following code samples show how to make calls using the DashScope SDK and HTTP.
import os
import dashscope
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
# The minimum pixel threshold for the input image. If the image has fewer pixels, it is scaled up proportionally until its total pixel count exceeds min_pixels.
"min_pixels": 28 * 28 * 4,
# The maximum pixel threshold for the input image. If the image has more pixels, it is scaled down proportionally until its total pixel count is below max_pixels.
"max_pixels": 28 * 28 * 8192,
# Enable the automatic image orientation correction feature.
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr',
messages=messages,
# Set the built-in task to multilingual recognition.
ocr_options={"task": "multi_lan"}
)
# The multilingual recognition task returns results in plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
// The maximum pixel threshold for the input image. If the image has more pixels, it is scaled down proportionally until its total pixel count is below max_pixels.
map.put("max_pixels", "6422528");
// The minimum pixel threshold for the input image. If the image has fewer pixels, it is scaled up proportionally until its total pixel count exceeds min_pixels.
map.put("min_pixels", "3136");
// Specifies whether to enable automatic image orientation correction.
map.put("enable_rotate", false);
// Configure the built-in OCR task.
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.MULTI_LAN)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
// If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr",
"input": {
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
"min_pixels": 401408,
"max_pixels": 6422528,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "multi_lan"
}
}
}
'
Streaming output
A large model generates its final result in parts. Streaming output sends these parts to you as they are created, instead of waiting for the complete response. You can use streaming output for requests that might take a long time to prevent timeouts.
OpenAI compatible
To enable streaming output, set the stream parameter to true in your code.
Python
import os
from openai import OpenAI
PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""
client = OpenAI(
# API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
# The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192
},
# The qwen-vl-ocr model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{"type": "text","text": PROMPT_TICKET_EXTRACTION}
]
}
],
stream=True,
stream_options={"include_usage": True}
)
full_content = ""
print("Streaming output content:")
for chunk in completion:
# If stream_options.include_usage is True, the choices field of the last chunk is an empty list and must be skipped. You can get the token usage from chunk.usage.
if chunk.choices and chunk.choices[0].delta.content != "":
full_content += chunk.choices[0].delta.content
print(chunk.choices[0].delta.content)
print(f"Full content: {full_content}")
Node.js
import OpenAI from 'openai';
// Define the prompt for extracting ticket information.
const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
`;
const openai = new OpenAI({
// API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
// If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});
async function main() {
const response = await openai.chat.completions.create({
model: 'qwen-vl-ocr',
messages: [
{
role: 'user',
content: [
// The qwen-vl-ocr model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{ type: 'text', text: PROMPT_TICKET_EXTRACTION},
{
type: 'image_url',
image_url: {
url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
},
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192
}
]
}
],
stream: true,
stream_options:{"include_usage": true}
});
let fullContent = ""
console.log("Streaming output content:")
for await (const chunk of response) {
if (chunk.choices[0] && chunk.choices[0].delta.content != null) {
fullContent += chunk.choices[0].delta.content;
console.log(chunk.choices[0].delta.content);
}
}
console.log(`Full output content: ${fullContent}`)
}
main();
curl
# ======= Important =======
# API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-vl-ocr",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528
},
{"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'train_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\': \'xxx\', \'seat_number\': \'xxx\', \'class_type\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"}
]
}
],
"stream": true,
"stream_options": {"include_usage": true}
}'
DashScope
Enable streaming output by setting the corresponding parameter for each call method:
Python SDK: Set the stream parameter to True.
Java SDK: Call the streamCall interface.
HTTP: In the request header, set X-DashScope-SSE to enable.
By default, streaming output is non-incremental, which means each response includes all previously generated content. To use incremental streaming output, set the incremental_output parameter (incrementalOutput for Java) to true.
Python
import os
import dashscope
PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""
# The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
# The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192},
# When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{
"type": "text",
"text": PROMPT_TICKET_EXTRACTION,
},
],
}
]
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr",
messages=messages,
stream=True,
incremental_output=True,
)
full_content = ""
print("Streaming output content:")
for response in response:
try:
print(response["output"]["choices"][0]["message"].content[0]["text"])
full_content += response["output"]["choices"][0]["message"].content[0]["text"]
except:
pass
print(f"Full content: {full_content}")
Java
import java.util.*;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
// The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
// If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.incrementalOutput(true)
.build();
Flowable<MultiModalConversationResult> result = conv.streamCall(param);
result.blockingForEach(item -> {
try {
List<Map<String, Object>> contentList = item.getOutput().getChoices().get(0).getMessage().getContent();
if (!contentList.isEmpty()){
System.out.println(contentList.get(0).get("text"));
}
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
# ======= Important =======
# API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
--data '{
"model": "qwen-vl-ocr",
"input":{
"messages":[
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3136,
"max_pixels": 6422528
},
{"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'train_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\': \'xxx\', \'seat_number\': \'xxx\', \'class_type\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"}
]
}
]
},
"parameters": {
"incremental_output": true
}
}'
Upload local files (Base64 encoding or file path)
The model supports two methods for uploading local files:
Direct upload using a file path (more stable transfer, recommended)
Upload using Base64 encoding
Upload using a file path
You can pass the local file path directly to the model. This method is supported only by the DashScope Python and Java SDKs. It is not supported for DashScope HTTP or OpenAI-compatible calls.
Specify the file path in the format required by your programming language and operating system.
Upload using Base64 encoding
You can convert the file to a Base64-encoded string, then pass it to the model. This method is applicable for OpenAI, DashScope SDK, and HTTP calls.
Limits
Uploading using a file path is recommended for higher stability. You can also use Base64 encoding for files smaller than 1 MB.
When passing a file path directly, each image must be smaller than 10 MB.
When passing a file using Base64 encoding, the encoded image must be smaller than 10 MB because Base64 encoding increases the data size.
For more information about how to compress a file, see How do I compress an image to the required size?
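Because Base64 encoding inflates the payload by roughly one third, you can estimate the encoded size before choosing an upload method. The following is a minimal sketch; the file path is a placeholder:
import os

def base64_size_bytes(path):
    # Base64 encodes every 3 raw bytes as 4 ASCII characters (with padding),
    # so the encoded payload is roughly 4/3 of the original file size.
    raw_size = os.path.getsize(path)
    return 4 * ((raw_size + 2) // 3)

# Replace xxx/test.jpg with the absolute path of your local image.
encoded_size = base64_size_bytes("xxx/test.jpg")
print(f"Estimated Base64 size: {encoded_size / 1024 / 1024:.2f} MB")
if encoded_size >= 10 * 1024 * 1024:
    print("Encoded image would exceed 10 MB. Compress it or upload by file path instead.")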
Pass a file path
Passing a file path is supported only for calls made using the DashScope Python and Java SDKs. It is not supported for DashScope HTTP or OpenAI-compatible calls.
Python
import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Replace xxx/test.jpg with the absolute path of your local image.
local_path = "xxx/test.jpg"
image_path = f"file://{local_path}"
messages = [
{
"role": "user",
"content": [
{
"image": image_path,
# The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192,
},
# If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{
"text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"
},
],
}
]
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr",
messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall(String localPath)
throws ApiException, NoApiKeyException, UploadFileException {
String filePath = "file://"+localPath;
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", filePath);
// The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
Collections.singletonMap("text", "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
// If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.topP(0.001)
.temperature(0.1f)
.maxLength(8192)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
// Replace xxx/test.jpg with the absolute path of your local image.
simpleMultiModalConversationCall("xxx/test.jpg");
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
Pass using Base64 encoding
OpenAI compatible
Python
from openai import OpenAI
import os
import base64
# Read a local file and encode it in Base64 format.
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
# Replace xxx/test.png with the absolute path of your local image.
base64_image = encode_image("xxx/test.png")
client = OpenAI(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
api_key=os.getenv('DASHSCOPE_API_KEY'),
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen-vl-ocr",
messages=[
{
"role": "user",
"content": [
{
"type": "image_url",
# Note: When passing a Base64-encoded image, the image format (image/{format}) must match the Content Type in the list of supported images. "f" is a string formatting method.
# PNG image: f"data:image/png;base64,{base64_image}"
# JPEG image: f"data:image/jpeg;base64,{base64_image}"
# WEBP image: f"data:image/webp;base64,{base64_image}"
"image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
# The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192
},
# If you use qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{"type": "text", "text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"},
],
}
],
)
print(completion.choices[0].message.content)
Node.js
import OpenAI from "openai";
import {
readFileSync
} from 'fs';
const openai = new OpenAI({
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
// If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
apiKey: process.env.DASHSCOPE_API_KEY,
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});
// Read a local file and encode it in Base64 format.
const encodeImage = (imagePath) => {
const imageFile = readFileSync(imagePath);
return imageFile.toString('base64');
};
// Replace xxx/test.jpg with the absolute path of your local image.
const base64Image = encodeImage("xxx/test.jpg")
async function main() {
const completion = await openai.chat.completions.create({
model: "qwen-vl-ocr",
messages: [{
"role": "user",
"content": [{
"type": "image_url",
"image_url": {
// Note: When passing a Base64-encoded image, the image format (image/{format}) must match the Content Type in the list of supported images.
// PNG image: data:image/png;base64,${base64Image}
// JPEG image: data:image/jpeg;base64,${base64Image}
// WEBP image: data:image/webp;base64,${base64Image}
"url": `data:image/jpeg;base64,${base64Image}`
},
// The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
// The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192
},
// If you use qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{
"type": "text",
"text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"
}
]
}]
});
console.log(completion.choices[0].message.content);
}
main();
curl
For a method to convert a file to a Base64-encoded string, see the example code.
For demonstration purposes, the Base64-encoded string in the code, "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, pass the complete encoded string.
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen-vl-ocr-latest",
"messages": [
{"role":"system",
"content":[
{"type": "text", "text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
{"type": "text", "text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"}
]
}]
}'
DashScope
Python
import os
import base64
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Base64 encoding format
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode("utf-8")
# Replace xxx/test.jpg with the absolute path of your local image.
base64_image = encode_image("xxx/test.jpg")
messages = [
{
"role": "user",
"content": [
{
# Note: When passing a Base64-encoded image, the image format (image/{format}) must match the Content Type in the list of supported images. "f" is a string formatting method.
# PNG image: f"data:image/png;base64,{base64_image}"
# JPEG image: f"data:image/jpeg;base64,{base64_image}"
# WEBP image: f"data:image/webp;base64,{base64_image}"
"image": f"data:image/jpeg;base64,{base64_image}",
# The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
"min_pixels": 28 * 28 * 4,
# The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
"max_pixels": 28 * 28 * 8192,
},
# If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
{
"text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"
},
],
}
]
response = dashscope.MultiModalConversation.call(
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr",
messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
// The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
// Base64 encoding format
private static String encodeImageToBase64(String imagePath) throws IOException {
Path path = Paths.get(imagePath);
byte[] imageBytes = Files.readAllBytes(path);
return Base64.getEncoder().encodeToString(imageBytes);
}
public static void simpleMultiModalConversationCall(String localPath)
throws ApiException, NoApiKeyException, UploadFileException, IOException {
String base64Image = encodeImageToBase64(localPath); // Base64 encoding
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "data:image/jpeg;base64," + base64Image);
// The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
map.put("max_pixels", "6422528");
// The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
map.put("min_pixels", "3136");
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
// If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
Collections.singletonMap("text", "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
// If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr")
.message(userMessage)
.topP(0.001)
.temperature(0.1f)
.maxLength(8192)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
// Replace xxx/test.jpg with the absolute path of your local image.
simpleMultiModalConversationCall("xxx/test.jpg");
} catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
For a method to convert a file to a Base64-encoded string, see the example code.
For demonstration purposes, the Base64-encoded string in the code, "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, pass the complete encoded string.
# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-vl-ocr-latest",
"input":{
"messages":[
{"role": "system",
"content": [
{"text": "You are a helpful assistant."}]},
{
"role": "user",
"content": [
{"image": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
{"text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"}
]
}
]
}
}'
Limits
Image limits
File size: The size of a single image file, or the encoded string if you use Base64 encoding, cannot exceed 10 MB. For more information, see local files.
Dimensions and aspect ratio: The width and height of the image must be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.
Total pixels: The model automatically scales the image, so there is no strict limit on the total number of pixels. By default, an image cannot have more than 15.68 million pixels. If an image exceeds this value, you can increase the max_pixels parameter to a maximum of 23.52 million pixels. This adjustment increases token consumption and processing time. A client-side pre-check sketch follows the format table below.
Supported formats:
Image format | Common extensions | MIME Type |
BMP | .bmp | image/bmp |
JPEG | .jpe, .jpeg, .jpg | image/jpeg |
PNG | .png | image/png |
TIFF | .tif, .tiff | image/tiff |
WEBP | .webp | image/webp |
HEIC | .heic | image/heic |
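The dimension, aspect ratio, and pixel limits above can be checked on the client before a request is sent. The following is a minimal sketch that assumes Pillow is installed; the file path is a placeholder:
from PIL import Image

# Replace xxx/test.jpg with the absolute path of your local image.
with Image.open("xxx/test.jpg") as img:
    width, height = img.size

issues = []
if width <= 10 or height <= 10:
    issues.append("width and height must both be greater than 10 pixels")
if max(width, height) / min(width, height) > 200:
    issues.append("aspect ratio must not exceed 200:1 or 1:200")
if width * height > 15_680_000:
    issues.append("more than 15.68 million pixels; downscale the image or raise max_pixels")
print(issues if issues else "Image passes the basic checks.")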
Model limits
No multi-turn conversation: The model does not support multi-turn conversations and only answers the most recent question.
Hallucination risk: The model may hallucinate if the text in an image is too small or the resolution is low. The accuracy of answers to questions not related to text extraction is not guaranteed.
Cannot process text files:
For files that contain image data, convert them into an image sequence before processing. For more information, see the recommendations in Going live.
For files with plain text or structured data, use a model that can parse long text, such as Qwen-Long.
Billing and rate limiting
Billing: Qwen-OCR is a visual understanding model, and its total cost is calculated based on the number of input and output tokens: (Number of input tokens × Unit price for input) + (Number of output tokens × Unit price for output). Each 28×28 pixel block corresponds to one token, and an image costs a minimum of four tokens. You can view bills or add funds on the Expenses and Costs page in the Alibaba Cloud Management Console. A rough token and cost estimation sketch follows this list.
Rate limiting: For the rate limits of the Qwen-OCR model, see Rate limits.
Free quota (Singapore region only): The 90-day validity period starts on the date you activate Alibaba Cloud Model Studio or your model request is approved. Within this period, the Qwen-OCR model provides a free quota of 1 million tokens.
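As a rough illustration of the billing formula above, you can estimate the image token count from the scaled image area: each 28×28 pixel block corresponds to one token, with a minimum of four tokens per image. The following sketch is an approximation only (it ignores aspect-ratio rounding), and the price is a placeholder that you should take from the pricing table for your region:
import math

def estimate_image_tokens(width, height, min_pixels=28 * 28 * 4, max_pixels=28 * 28 * 8192):
    # The model scales the image so its area falls between min_pixels and max_pixels,
    # then counts roughly one token per 28 x 28 pixel block, with a minimum of four tokens.
    area = min(max(width * height, min_pixels), max_pixels)
    return max(4, math.ceil(area / (28 * 28)))

input_tokens = estimate_image_tokens(1000, 700)   # example: a 1000 x 700 ticket image
output_tokens = 300                               # assumed output length
price_per_million = 0.72                          # placeholder: per-million-token price for your region
cost = (input_tokens + output_tokens) / 1_000_000 * price_per_million
print(f"Estimated tokens: {input_tokens + output_tokens}, estimated cost: ${cost:.6f}")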
Going live
Processing multi-page documents, such as PDF files:
Split: Use an image processing library, such as pdf2image for Python, to convert each page of the PDF file into a separate, high-quality image.
Submit requests: Send API requests to process the images.
Merge: On the client, merge the recognition results for each page in the correct order.
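The following is a minimal sketch of this split-submit-merge workflow. It assumes the pdf2image library (which requires Poppler) is installed and uses the DashScope Python SDK's file path upload; the PDF path is a placeholder:
import os
import dashscope
from pdf2image import convert_from_path

# The following URL is for the Singapore region. For the Beijing region, use: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Split: render each PDF page as a high-quality image. Replace xxx/document.pdf with your own file.
pages = convert_from_path("xxx/document.pdf", dpi=200)
results = []
for i, page in enumerate(pages):
    image_path = os.path.abspath(f"page_{i}.png")
    page.save(image_path, "PNG")
    # Submit requests: one request per page image, passed as a local file path.
    response = dashscope.MultiModalConversation.call(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        model="qwen-vl-ocr",
        messages=[{"role": "user", "content": [{"image": f"file://{image_path}"}]}],
    )
    results.append(response["output"]["choices"][0]["message"].content[0]["text"])

# Merge: concatenate the per-page results in order on the client.
print("\n".join(results))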
Image pre-processing:
Ensure that input images are clear, evenly lit, and not excessively compressed:
To prevent information loss, use a lossless format, such as PNG, to store and transfer images.
To improve image definition, use a noise reduction algorithm, such as mean or median filtering, to reduce image noise.
For images with uneven lighting, use an algorithm such as adaptive histogram equalization to adjust brightness and contrast.
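The noise reduction and contrast adjustments above can be sketched with OpenCV. This is only an illustration; the filters and parameters depend on your images, and the file paths are placeholders:
import cv2

# Replace xxx/receipt.jpg with the absolute path of your local image.
img = cv2.imread("xxx/receipt.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Median filtering reduces salt-and-pepper noise while preserving edges reasonably well.
denoised = cv2.medianBlur(gray, 3)

# CLAHE (adaptive histogram equalization) evens out uneven lighting.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(denoised)

# Save losslessly as PNG to avoid further compression artifacts.
cv2.imwrite("xxx/receipt_preprocessed.png", enhanced)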
For skewed images: You can set the enable_rotate parameter to true in the DashScope SDK to significantly improve recognition performance.
For images that are too small or too large: You can use the min_pixels and max_pixels parameters to control scaling behavior before processing.
min_pixels: Ensures that small images are enlarged so that details can be recognized. The default value is sufficient for most scenarios.
max_pixels: Prevents oversized images from consuming excessive resources. The default value is sufficient for most scenarios. If some small text is not recognized clearly, you can increase the max_pixels value. Note that this increases token consumption.
Result verification: Recognition results from the model may contain errors. For critical business operations, you can implement a manual review step or add validation rules to verify the accuracy of the model's output. For example, you can use format checks for ID card numbers and bank card numbers.
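The following is a minimal sketch of such format checks. The patterns are illustrative examples only, not official validation rules:
import re

def looks_like_cn_id_number(value):
    # Mainland China ID card numbers have 17 digits followed by a digit or the letter X.
    return re.fullmatch(r"\d{17}[\dXx]", value) is not None

def looks_like_bank_card_number(value):
    # Most bank card numbers are 13 to 19 digits; this checks the format only, not validity.
    return re.fullmatch(r"\d{13,19}", value) is not None

extracted = {"id_card_number": "11010119900101123X", "bank_card_number": "6222020200112233445"}
print(looks_like_cn_id_number(extracted["id_card_number"]))
print(looks_like_bank_card_number(extracted["bank_card_number"]))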
Batch calls: For large-scale, non-real-time scenarios, you can use the Batch API to process batch tasks asynchronously at a lower cost.
FAQ
After the model outputs text localization results, how do I draw the detection boxes on the original image?
API reference
For more information about the request and response parameters of Qwen-OCR, see Qwen.
Error codes
If a call fails, see Error messages for troubleshooting.