Extract text, structured data, and key information from images using the Qwen-OCR model. Qwen-OCR supports two API protocols: the OpenAI-compatible API and the DashScope API.
For use cases and getting-started guidance, see Text extraction (Qwen-OCR).
OpenAI-compatible API
Endpoints
| Region | SDK (base_url) | HTTP endpoint |
| --- | --- | --- |
| Singapore | https://dashscope-intl.aliyuncs.com/compatible-mode/v1 | https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions |
| US (Virginia) | https://dashscope-us.aliyuncs.com/compatible-mode/v1 | https://dashscope-us.aliyuncs.com/compatible-mode/v1/chat/completions |
| China (Beijing) | https://dashscope.aliyuncs.com/compatible-mode/v1 | https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions |
Prerequisites
Get an API key and set it as an environment variable. If you use the OpenAI SDK, install the SDK.
Quick start
Use the OpenAI-compatible chat completions endpoint. Send a user message with an image URL and text prompt. The model extracts text and returns it in choices[0].message.content.
Non-streaming
Python
from openai import OpenAI
import os

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""

try:
    client = OpenAI(
        # If the environment variable is not configured, replace with: api_key="sk-xxx"
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # Singapore region. For US (Virginia), use https://dashscope-us.aliyuncs.com/compatible-mode/v1
        # For China (Beijing), use https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-ocr-2025-11-20",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                        # Minimum pixel count. Images below this are upscaled.
                        "min_pixels": 32 * 32 * 3,
                        # Maximum pixel count. Images above this are downscaled.
                        "max_pixels": 32 * 32 * 8192
                    },
                    # Custom prompt. Without this, the model uses: "Please output only the text content from the image without any additional descriptions or formatting."
                    {"type": "text", "text": PROMPT_TICKET_EXTRACTION}
                ]
            }
        ]
    )
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error message: {e}")
Node.js
import OpenAI from 'openai';

const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
`;

const client = new OpenAI({
    // If the environment variable is not configured, replace with: apiKey: "sk-xxx"
    apiKey: process.env.DASHSCOPE_API_KEY,
    // For China (Beijing), use https://dashscope.aliyuncs.com/compatible-mode/v1
    baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});

async function main() {
    const response = await client.chat.completions.create({
        model: 'qwen-vl-ocr-2025-11-20',
        messages: [
            {
                role: 'user',
                content: [
                    { type: 'text', text: PROMPT_TICKET_EXTRACTION },
                    {
                        type: 'image_url',
                        image_url: {
                            url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
                        },
                        // Minimum pixel count. Images below this are upscaled.
                        min_pixels: 32 * 32 * 3,
                        // Maximum pixel count. Images above this are downscaled.
                        max_pixels: 32 * 32 * 8192
                    }
                ]
            }
        ],
    });
    console.log(response.choices[0].message.content);
}

main();
curl
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen-vl-ocr-2025-11-20",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                    "min_pixels": 3072,
                    "max_pixels": 8388608
                },
                {"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\"invoice_number\": \"xxx\", \"departure_station\": \"xxx\", \"arrival_station\": \"xxx\", \"departure_date_and_time\": \"xxx\", \"seat_number\": \"xxx\", \"ticket_price\": \"xxx\", \"id_card_number\": \"xxx\", \"passenger_name\": \"xxx\"}"}
            ]
        }
    ]
}'
Streaming
Set stream to true to receive results incrementally as the model generates them.
Python
import os
from openai import OpenAI

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx','departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr-2025-11-20",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                    "min_pixels": 32 * 32 * 3,
                    "max_pixels": 32 * 32 * 8192
                },
                {"type": "text", "text": PROMPT_TICKET_EXTRACTION}
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True}
)
for chunk in completion:
    print(chunk.model_dump_json())
Node.js
import OpenAI from 'openai';

const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
`;

const openai = new OpenAI({
    apiKey: process.env.DASHSCOPE_API_KEY,
    baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});

async function main() {
    const response = await openai.chat.completions.create({
        model: 'qwen-vl-ocr-2025-11-20',
        messages: [
            {
                role: 'user',
                content: [
                    { type: 'text', text: PROMPT_TICKET_EXTRACTION },
                    {
                        type: 'image_url',
                        image_url: {
                            url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
                        },
                        min_pixels: 32 * 32 * 3,
                        max_pixels: 32 * 32 * 8192
                    }
                ]
            }
        ],
        stream: true,
        stream_options: { include_usage: true }
    });
    let fullContent = "";
    console.log("Streaming output content:");
    for await (const chunk of response) {
        if (chunk.choices[0] && chunk.choices[0].delta.content != null) {
            fullContent += chunk.choices[0].delta.content;
            console.log(chunk.choices[0].delta.content);
        }
    }
    console.log(`Full output content: ${fullContent}`);
}

main();
curl
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "qwen-vl-ocr-2025-11-20",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                    "min_pixels": 3072,
                    "max_pixels": 8388608
                },
                {"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\"invoice_number\": \"xxx\", \"departure_station\": \"xxx\", \"arrival_station\": \"xxx\", \"departure_date_and_time\": \"xxx\", \"seat_number\": \"xxx\", \"ticket_price\": \"xxx\", \"id_card_number\": \"xxx\", \"passenger_name\": \"xxx\"}"}
            ]
        }
    ],
    "stream": true,
    "stream_options": {"include_usage": true}
}'
Request parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name. See Qwen-OCR for supported models. |
| messages | array | Yes | An array of message objects that provides context to the model. |
Message object
Each message requires a role (must be user) and a content array with these element types:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Content element type: "text" or "image_url". |
| text | string | No | The text prompt. Default: "Please output only the text content from the image without any additional descriptions or formatting." |
| image_url.url | string | Yes (when type is "image_url") | URL or Base64-encoded Data URL of the image. For local files, see Text extraction. |
| min_pixels | integer | No | Minimum pixel threshold. Images below this value are upscaled. See Image resolution control. |
| max_pixels | integer | No | Maximum pixel threshold. Images above this value are downscaled. See Image resolution control. |
Generation parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| stream | boolean | false | Set to true to receive the response incrementally as it is generated. |
| stream_options.include_usage | boolean | false | When true, the final streamed chunk reports token usage. Effective only when stream is true. |
| max_tokens | integer | Varies | Maximum tokens in the output. Exceeding this truncates the response. See Output token limits. |
| temperature | float |  | Controls output diversity. Higher values produce more varied text. Range: [0, 2). |
| top_p | float |  | Nucleus sampling threshold. Higher values increase diversity. Range: (0, 1.0]. Adjust either top_p or temperature, not both. |
| top_k | integer |  | Limits the candidate token set during sampling. If the value is None or greater than 100, the top_k policy is not enabled, and only the top_p policy takes effect. Must be >= 0. Not a standard OpenAI parameter -- pass via extra_body. |
| repetition_penalty | float |  | Penalty for repeated sequences. Values above 1.0 reduce repetition. Not a standard OpenAI parameter -- pass via extra_body. |
| presence_penalty | float |  | Controls content repetition. Range: [-2.0, 2.0]. Positive values reduce repetition. |
| seed | integer | -- | Ensures reproducible results when the same value is used with identical parameters. Range: [0, 2^31 - 1]. |
| logprobs | boolean | false | Set to true to return log probabilities of the output tokens. |
| top_logprobs | integer |  | Number of most likely tokens to return per step. Range: [0, 5]. Only effective when logprobs is true. |
| stop | string or array | -- | Stop words or token IDs. Generation stops when a specified string or token ID is about to be produced. |
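As the table notes, top_k and repetition_penalty are not standard OpenAI parameters, so the OpenAI SDKs reject them as top-level keyword arguments; extra_body merges additional fields into the request body verbatim. A minimal sketch of how such a request could be assembled (the sampling values here are illustrative, not recommended defaults):

```python
# Sketch: standard parameters are plain keyword arguments; DashScope-specific
# ones travel inside extra_body and are merged into the JSON request body.
request_kwargs = {
    "model": "qwen-vl-ocr-2025-11-20",
    "presence_penalty": 1.0,        # standard OpenAI parameter
    "extra_body": {                 # non-standard parameters, passed through
        "top_k": 40,
        "repetition_penalty": 1.05,
    },
}
# Usage with a configured client and messages list:
# completion = client.chat.completions.create(messages=messages, **request_kwargs)
```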
Response
Non-streaming response (chat.completion)
{
  "id": "chatcmpl-ba21fa91-dcd6-4dad-90cc-6d49c3c39094",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "```json\n{\n \"seller_name\": \"null\",\n \"buyer_name\": \"Cai Yingshi\",\n \"price_excluding_tax\": \"230769.23\",\n \"organization_code\": \"null\",\n \"invoice_code\": \"142011726001\"\n}\n```",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1763283287,
  "model": "qwen-vl-ocr-latest",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 72,
    "prompt_tokens": 1185,
    "total_tokens": 1257,
    "completion_tokens_details": {
      "accepted_prediction_tokens": null,
      "audio_tokens": null,
      "reasoning_tokens": null,
      "rejected_prediction_tokens": null,
      "text_tokens": 72
    },
    "prompt_tokens_details": {
      "audio_tokens": null,
      "cached_tokens": null,
      "image_tokens": 1001,
      "text_tokens": 184
    }
  }
}
| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique request identifier. |
| choices | array | Model-generated content. |
| choices[].finish_reason | string | Why generation stopped: stop for natural completion, length if the max_tokens limit was reached. |
| choices[].index | integer | Position in the choices array. |
| choices[].message.content | string | Extracted text or structured output from the model. |
| choices[].message.role | string | Always assistant. |
| choices[].message.refusal | string | Always null. |
| choices[].message.audio | object | Always null. |
| choices[].message.function_call | object | Always null. |
| choices[].message.tool_calls | array | Always null. |
| created | integer | UNIX timestamp of the request. |
| model | string | Model used. |
| object | string | Always chat.completion. |
| service_tier | string | Always null. |
| system_fingerprint | string | Always null. |
| usage.completion_tokens | integer | Output token count. |
| usage.prompt_tokens | integer | Input token count. |
| usage.total_tokens | integer | Sum of completion_tokens and prompt_tokens. |
| usage.completion_tokens_details.text_tokens | integer | Text output tokens. Other fields in completion_tokens_details are always null. |
| usage.prompt_tokens_details.image_tokens | integer | Image input tokens. |
| usage.prompt_tokens_details.text_tokens | integer | Text input tokens. Other fields in prompt_tokens_details are always null. |
Streaming response (chat.completion.chunk)
When stream is true, the response is delivered as a series of Server-Sent Event (SSE) chunks. Each chunk follows the same structure as the non-streaming response, with these differences:
- object is always chat.completion.chunk.
- choices[].delta replaces choices[].message. The delta object has the same fields as message.
- choices[].delta.role is returned only in the first chunk.
- finish_reason is null during generation, stop on completion, or length if truncated.
- When include_usage is true, the last chunk has an empty choices array and includes the usage object.
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"","function_call":null,"refusal":null,"role":"assistant","tool_calls":null},"finish_reason":null,"index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"```","function_call":null,"refusal":null,"role":null,"tool_calls":null},"finish_reason":null,"index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"json","function_call":null,"refusal":null,"role":null,"tool_calls":null},"finish_reason":null,"index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
......
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"","function_call":null,"refusal":null,"role":null,"tool_calls":null},"finish_reason":"stop","index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":141,"prompt_tokens":513,"total_tokens":654,"completion_tokens_details":{"accepted_prediction_tokens":null,"audio_tokens":null,"reasoning_tokens":null,"rejected_prediction_tokens":null,"text_tokens":141},"prompt_tokens_details":{"audio_tokens":null,"cached_tokens":null,"image_tokens":332,"text_tokens":181}}}
Image resolution control
min_pixels and max_pixels control image resizing before processing. The token-to-pixel ratio depends on the model version:
| Model | Pixels per token | Default min_pixels | Default max_pixels | Maximum max_pixels |
| --- | --- | --- | --- | --- |
| qwen-vl-ocr-2025-11-20 | 32 x 32 = 1,024 | 3,072 (3 tokens) | 8,388,608 (8,192 tokens) | 30,720,000 (30,000 tokens) |
| Earlier model versions | 28 x 28 = 784 | 3,136 (4 tokens) | 6,422,528 (8,192 tokens) | 23,520,000 (30,000 tokens) |
Resizing behavior:
- If the image pixel count is below min_pixels, the image is upscaled until it exceeds min_pixels.
- If the image pixel count is within [min_pixels, max_pixels], the original image is used without resizing.
- If the image pixel count exceeds max_pixels, the image is downscaled to below max_pixels.
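The rules above also determine the image's token cost. A rough estimate for qwen-vl-ocr-2025-11-20 (1,024 pixels per token); this is an approximation, since the service additionally preserves aspect ratio and rounds to patch multiples when resizing:

```python
def approx_image_tokens(width, height, min_pixels=3072, max_pixels=8388608,
                        pixels_per_token=32 * 32):
    # Clamp the pixel count into [min_pixels, max_pixels], mirroring the
    # service-side upscaling/downscaling, then divide by the patch size.
    pixels = min(max(width * height, min_pixels), max_pixels)
    return pixels // pixels_per_token

# The 689 x 487 sample ticket image falls inside the default bounds.
print(approx_image_tokens(689, 487))
```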
Output token limits
| Model | Default and maximum max_tokens |
| --- | --- |
| qwen-vl-ocr-2025-11-20 | Same as the model's maximum output length. See Availability. |
| Earlier model versions | 4,096 |
To increase max_tokens to a value between 4,097 and 8,192, email modelstudio@service.aliyun.com with the following details: your Alibaba Cloud account ID, the image type (such as document, e-commerce, or contract), the model name, your estimated QPS and daily request volume, and the percentage of requests where output exceeds 4,096 tokens.
DashScope API
Endpoints
| Region | HTTP endpoint |
| --- | --- |
| Singapore | https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation |
| US (Virginia) | https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation |
| China (Beijing) | https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation |
SDK base URL configuration:
Python:
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
Java (Method 1 -- constructor):
import com.alibaba.dashscope.protocol.Protocol;
MultiModalConversation conv = new MultiModalConversation(Protocol.HTTP.getValue(), "https://dashscope-intl.aliyuncs.com/api/v1");
Java (Method 2 -- static block):
import com.alibaba.dashscope.utils.Constants;
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
Replace the domain with dashscope-us.aliyuncs.com for the US (Virginia) region or dashscope.aliyuncs.com for the China (Beijing) region. For the China (Beijing) region, you do not need to set base_url for SDK calls.
Get an API key and set it as an environment variable. If you use the DashScope SDK, install it.
Built-in tasks
The DashScope API provides built-in OCR tasks via the ocr_options parameter. Each task uses an optimized default prompt, eliminating the need for a text message.
| Task | task value | Output format |
| --- | --- | --- |
| General text recognition |  | Plain text |
| High-precision recognition | advanced_recognition | Plain text with bounding boxes |
| Information extraction | key_information_extraction | Structured key-value pairs |
| Table parsing | table_parsing | Table structure |
| Document parsing | document_parsing | Document structure |
| Formula recognition | formula_recognition | LaTeX formulas |
| Multilingual recognition |  | Multilingual text |
High-precision recognition
Returns text with positional data for each recognized line.
Python
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
    "role": "user",
    "content": [{
        "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
        "min_pixels": 32 * 32 * 3,
        "max_pixels": 32 * 32 * 8192,
        "enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "advanced_recognition"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
// dashscope SDK version >= 2.21.8
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.ADVANCED_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(map))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "advanced_recognition"
}
}
}
'
Information extraction
Extracts structured key-value data from images. Specify fields to extract in task_config.result_schema.
Python
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
                "min_pixels": 3072,
                "max_pixels": 8388608,
                "enable_rotate": False
            }
        ]
    }
]
params = {
    "ocr_options": {
        "task": "key_information_extraction",
        "task_config": {
            "result_schema": {
                "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
                "Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
                "Invoice Number": "Extract the number from the invoice, usually composed of only digits."
            }
        }
    }
}
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    **params)
print(response.output.choices[0].message.content[0]["ocr_result"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.JsonObject;

public class Main {
    static {
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);
        JsonObject resultSchema = new JsonObject();
        resultSchema.addProperty("Ride Date", "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05");
        resultSchema.addProperty("Invoice Code", "Extract the invoice code from the image, usually a combination of numbers or letters");
        resultSchema.addProperty("Invoice Number", "Extract the number from the invoice, usually composed of only digits.");
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
                .taskConfig(OcrOptions.TaskConfig.builder().resultSchema(resultSchema).build())
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(map))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "key_information_extraction",
"task_config": {
"result_schema": {
"Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
"Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
"Invoice Number": "Extract the number from the invoice, usually composed of only digits."
}
}
}
}
}
'
Table parsing
Extracts table structure from images.
Python
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
    "role": "user",
    "content": [{
        "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
        "min_pixels": 32 * 32 * 3,
        "max_pixels": 32 * 32 * 8192,
        "enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "table_parsing"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TABLE_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(map))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "table_parsing"
}
}
}
'
Document parsing
Extracts the structural layout and text from documents.
Python
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
    "role": "user",
    "content": [{
        "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
        "min_pixels": 32 * 32 * 3,
        "max_pixels": 32 * 32 * 8192,
        "enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "document_parsing"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {
    static {
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.DOCUMENT_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(map))
                .build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "document_parsing"
}
}
}
'
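The Python examples above write the pixel thresholds as expressions, while the Java and curl examples use literals; they are the same values. A quick check of the arithmetic:

```python
# min_pixels / max_pixels as written in the Python examples...
min_pixels = 32 * 32 * 3
max_pixels = 32 * 32 * 8192

# ...match the literals used in the Java and curl examples.
assert min_pixels == 3072
assert max_pixels == 8388608
print(min_pixels, max_pixels)  # 3072 8388608
```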
Formula recognition
Extracts mathematical formulas from images and returns them in LaTeX format.
Python
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
"min_pixels": 32 * 32 * 3,
"max_pixels": 32 * 32 * 8192,
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
ocr_options={"task": "formula_recognition"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
map.put("max_pixels", 8388608);
map.put("min_pixels", 3072);
map.put("enable_rotate", false);
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.FORMULA_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "formula_recognition"
}
}
}
'
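The formula_recognition task returns LaTeX source as plain text. A small helper, purely illustrative and not part of the SDK, that wraps the returned string for embedding in a Markdown or LaTeX document:

```python
def wrap_latex(latex_src: str, inline: bool = False) -> str:
    """Wrap LaTeX source returned by the formula_recognition task.

    Hypothetical convenience helper; `latex_src` stands in for the
    "text" field of the model response.
    """
    latex_src = latex_src.strip()
    if inline:
        return f"${latex_src}$"
    return f"$$\n{latex_src}\n$$"

# Stand-in string, not real model output:
print(wrap_latex(r"\frac{a}{b}", inline=True))  # $\frac{a}{b}$
```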
General text recognition
Extracts plain text from images without structural formatting.
Python
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 32 * 32 * 3,
"max_pixels": 32 * 32 * 8192,
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
ocr_options={"task": "text_recognition"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
map.put("max_pixels", 8388608);
map.put("min_pixels", 3072);
map.put("enable_rotate", false);
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.TEXT_RECOGNITION)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "text_recognition"
}
}
}
'
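The Python, Java, and curl variants above all build the same request shape. A sketch that assembles the DashScope HTTP payload for any built-in task (field names are taken from the curl example; the helper function itself is hypothetical):

```python
def build_ocr_payload(image_url: str, task: str,
                      model: str = "qwen-vl-ocr-2025-11-20",
                      min_pixels: int = 3072,
                      max_pixels: int = 8388608) -> dict:
    """Assemble the JSON body for the multimodal-generation endpoint."""
    return {
        "model": model,
        "input": {
            "messages": [{
                "role": "user",
                "content": [{
                    "image": image_url,
                    "min_pixels": min_pixels,
                    "max_pixels": max_pixels,
                    "enable_rotate": False,
                }],
            }],
        },
        "parameters": {"ocr_options": {"task": task}},
    }

payload = build_ocr_payload("https://example.com/doc.jpg", "text_recognition")
print(payload["parameters"]["ocr_options"]["task"])  # text_recognition
```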
Multilingual recognition
Recognizes text in multiple languages from images.
Python
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
"role": "user",
"content": [{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
"min_pixels": 32 * 32 * 3,
"max_pixels": 32 * 32 * 8192,
"enable_rotate": False}]
}]
response = dashscope.MultiModalConversation.call(
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='qwen-vl-ocr-2025-11-20',
messages=messages,
ocr_options={"task": "multi_lan"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
map.put("max_pixels", 8388608);
map.put("min_pixels", 3072);
map.put("enable_rotate", false);
OcrOptions ocrOptions = OcrOptions.builder()
.task(OcrOptions.Task.MULTI_LAN)
.build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.ocrOptions(ocrOptions)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
"min_pixels": 3072,
"max_pixels": 8388608,
"enable_rotate": false
}
]
}
]
},
"parameters": {
"ocr_options": {
"task": "multi_lan"
}
}
}
'
Streaming (DashScope)
Enable streaming output to receive results incrementally. The method varies by interface:
- Python SDK: Set stream=True and incremental_output=True.
- Java SDK: Use the streamCall interface.
- HTTP: Set the X-DashScope-SSE: enable header.
Python
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx','departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'},
"""
messages = [
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 32 * 32 * 3,
"max_pixels": 32 * 32 * 8192},
{
"type": "text",
"text": PROMPT_TICKET_EXTRACTION
}
]
}
]
response = dashscope.MultiModalConversation.call(
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen-vl-ocr-2025-11-20",
messages=messages,
stream=True,
incremental_output=True,
)
full_content = ""
print("Streaming output content:")
for chunk in response:
    try:
        text = chunk["output"]["choices"][0]["message"].content[0]["text"]
        print(text)
        full_content += text
    except (IndexError, KeyError, TypeError):
        # Some chunks (for example, the final one) may carry no text content
        pass
print(f"Full content: {full_content}")
Java
import java.util.*;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;
public class Main {
static {
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
}
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
Map<String, Object> map = new HashMap<>();
map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
map.put("max_pixels", 8388608);
map.put("min_pixels", 3072);
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
map,
Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\':\'xxx\', \'seat_number\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen-vl-ocr-2025-11-20")
.message(userMessage)
.incrementalOutput(true)
.build();
Flowable<MultiModalConversationResult> result = conv.streamCall(param);
result.blockingForEach(item -> {
try {
List<Map<String, Object>> contentList = item.getOutput().getChoices().get(0).getMessage().getContent();
if (!contentList.isEmpty()){
System.out.println(contentList.get(0).get("text"));
}
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-SSE: enable' \
--data '
{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
"min_pixels": 3072,
"max_pixels": 8388608
},
{"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\':\'xxx\', \'seat_number\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"}
]
}
]
},
"parameters": {
"incremental_output": true
}
}'
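With X-DashScope-SSE: enable, the endpoint streams server-sent events whose data: lines each carry a JSON payload. A minimal, hypothetical parser for such a stream (the sample event text below is illustrative, not captured API output):

```python
import json

def iter_sse_data(raw: str):
    """Yield parsed JSON objects from the data: lines of an SSE stream."""
    for line in raw.splitlines():
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Illustrative stream text (not real API output):
sample = 'id:1\nevent:result\ndata:{"output": {"text": "partial"}}\n\n'
for event in iter_sse_data(sample):
    print(event["output"]["text"])  # partial
```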
Request parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name. See Qwen-OCR for supported models. |
| messages | array | Yes | An array of message objects. |
Message object
Each message requires a role (must be user) and a content field (string or array). Use a string for text-only input. Use an array if the input includes image data, with these fields:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| image | string | No | URL, Base64 Data URL, or local path of the image. See Passing local files. |
| text | string | No | The text prompt. If omitted, a default recognition prompt is used. |
| enable_rotate | boolean | No | Set to true to automatically correct the orientation of rotated images. Default: false. |
| min_pixels | integer | No | Minimum pixel threshold. See Image resolution control. |
| max_pixels | integer | No | Maximum pixel threshold. See Image resolution control. |
Generation parameters
Set these in the parameters object for HTTP calls.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens | integer | Varies | Maximum tokens in the output. See Output token limits. In the Java SDK, use maxLength. |
| stream | boolean | | Enable streaming output. Python SDK only. For Java, use the streamCall interface. |
| incremental_output | boolean | | When set to true with stream=True, each chunk contains only newly generated content; when false, each chunk contains the full content generated so far. |
| temperature | float | | Controls output diversity. Range: [0, 2). |
| top_p | float | | Nucleus sampling threshold. Range: (0, 1.0]. Adjust either top_p or temperature, not both. |
| top_k | integer | | Limits the candidate token set during sampling. If the value is None or greater than 100, the top_k policy is not enabled, and only the top_p policy takes effect. Must be >= 0. |
| repetition_penalty | float | | Penalty for repeated sequences. Values above 1.0 reduce repetition. |
| presence_penalty | float | | Controls content repetition. Range: [-2.0, 2.0]. |
| seed | integer | -- | Ensures reproducible results. Range: [0, 2^31 - 1]. |
| logprobs | boolean | | Set to true to return log probabilities of the output tokens. |
| top_logprobs | integer | | Number of most likely tokens per step. Range: [0, 5]. Only effective when logprobs is true. |
| stop | string or array | -- | Stop words or token IDs. Generation stops when a specified string or token ID is generated. |
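The documented ranges above can be checked before a request is sent. A small, hypothetical validator for the sampling parameters (ranges taken from this table: temperature in [0, 2), top_p in (0, 1.0], presence_penalty in [-2.0, 2.0]):

```python
def validate_sampling(params: dict) -> list:
    """Return a list of range violations for sampling parameters.

    Hypothetical helper; not part of the DashScope SDK.
    """
    errors = []
    t = params.get("temperature")
    if t is not None and not (0 <= t < 2):
        errors.append("temperature out of [0, 2)")
    p = params.get("top_p")
    if p is not None and not (0 < p <= 1.0):
        errors.append("top_p out of (0, 1.0]")
    pp = params.get("presence_penalty")
    if pp is not None and not (-2.0 <= pp <= 2.0):
        errors.append("presence_penalty out of [-2.0, 2.0]")
    return errors

print(validate_sampling({"temperature": 0.1, "top_p": 0.8}))  # []
print(validate_sampling({"temperature": 2.0}))  # ['temperature out of [0, 2)']
```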
Built-in task parameters (ocr_options)
When using a built-in task, pass ocr_options in parameters (HTTP), as a keyword argument (Python SDK), or via the OcrOptions builder (Java SDK).
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task | string | Yes | Built-in task name. The tasks demonstrated on this page are document_parsing, formula_recognition, text_recognition, and multi_lan; advanced_recognition and key_information_extraction (configured through result_schema) are also supported. |
| task_config | object | No | Configuration object for the selected task. |
| task_config.result_schema | object | No | JSON object specifying fields to extract. Keys are field names, values are optional descriptions for improved accuracy. Supports up to three nesting levels. |
result_schema example:
"result_schema": {
"invoice_number": "The unique identification number of the invoice, usually a combination of numbers and letters.",
"issue_date": "The date the invoice was issued. Extract it in YYYY-MM-DD format, for example, 2023-10-26.",
"seller_name": "The full company name of the seller shown on the invoice.",
"total_amount": "The total amount on the invoice, including tax. Extract the numerical value and keep two decimal places, for example, 123.45."
}
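result_schema supports at most three nesting levels. A quick, hypothetical check for that constraint before sending a request:

```python
def schema_depth(schema) -> int:
    """Return the nesting depth of a result_schema-style dict."""
    if not isinstance(schema, dict) or not schema:
        return 0
    return 1 + max(schema_depth(v) for v in schema.values())

flat = {"invoice_number": "The unique identification number of the invoice."}
nested = {"seller": {"address": {"city": "Seller city", "street": "Street name"}}}
too_deep = {"a": {"b": {"c": {"d": "four levels"}}}}

print(schema_depth(flat))            # 1
print(schema_depth(nested))          # 3
print(schema_depth(too_deep) <= 3)   # False
```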
In the Java SDK, this parameter is OcrOptions. The minimum DashScope Python SDK version is 1.22.2, and the minimum Java SDK version is 2.18.4. For advanced_recognition, Java SDK 2.21.8 or later is required.
Response
The DashScope API uses the same response format for streaming and non-streaming output.
{
"status_code": 200,
"request_id": "8f8c0f6e-6805-4056-bb65-d26d66080a41",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"ocr_result": {
"kv_result": {
"price_excluding_tax": "230769.23",
"invoice_code": "142011726001",
"organization_code": "null",
"buyer_name": "Cai Yingshi",
"seller_name": "null"
}
},
"text": "```json\n{\n \"price_excluding_tax\": \"230769.23\",\n \"invoice_code\": \"142011726001\",\n \"organization_code\": \"null\",\n \"buyer_name\": \"Cai Yingshi\",\n \"seller_name\": \"null\"\n}\n```"
}
]
}
}
],
"audio": null
},
"usage": {
"input_tokens": 926,
"output_tokens": 72,
"characters": 0,
"image_tokens": 754,
"input_tokens_details": {
"image_tokens": 754,
"text_tokens": 172
},
"output_tokens_details": {
"text_tokens": 72
},
"total_tokens": 998
}
}
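A sketch that pulls the extracted text and key-value results out of a response shaped like the sample above, with a plain dict standing in for the SDK response object (fields reuse the sample's values):

```python
# Abbreviated stand-in for the sample response shown above.
sample_response = {
    "status_code": 200,
    "output": {
        "choices": [{
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": [{
                    "ocr_result": {"kv_result": {"invoice_code": "142011726001"}},
                    "text": "...",
                }],
            },
        }],
    },
    "usage": {"input_tokens": 926, "output_tokens": 72, "total_tokens": 998},
}

content = sample_response["output"]["choices"][0]["message"]["content"][0]
print(content["ocr_result"]["kv_result"]["invoice_code"])  # 142011726001

# Token accounting in the sample: input + output = total.
usage = sample_response["usage"]
print(usage["input_tokens"] + usage["output_tokens"] == usage["total_tokens"])  # True
```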
| Field | Type | Description |
| --- | --- | --- |
| status_code | integer | 200 indicates success; any other value indicates failure. |
| request_id | string | Unique request identifier. In the Java SDK, this is requestId. |
| code | string | Error code. Empty on success. Only the Python SDK returns this field. |
| output.text | string | Always null. |
| output.finish_reason | string | |
| choices[i].finish_reason | string | Same values as output.finish_reason; "stop" when generation completes normally. |
| message.role | string | Always assistant. |
| content[i].text | string | Extracted text or formatted output from the model. |
| content[i].ocr_result | object | Returned for built-in tasks configured through ocr_options. |
| ocr_result.kv_result | object | Key-value extraction results (for key information extraction). |
| ocr_result.words_info | array | Text line results with positional data (for advanced_recognition). |
| words_info[i].text | string | Content of the text line. |
| logprobs | object | Log probability information, returned when logprobs is enabled. |
| Field | Type | Description |
| --- | --- | --- |
| input_tokens | integer | Input token count. |
| output_tokens | integer | Output token count. |
| characters | integer | Fixed to 0. |
| total_tokens | integer | Sum of input_tokens and output_tokens. |
| image_tokens | integer | Tokens corresponding to the image input. |
| input_tokens_details.image_tokens | integer | Image input tokens. |
| input_tokens_details.text_tokens | integer | Text input tokens. |
| output_tokens_details.text_tokens | integer | Text output tokens. |
Supported models
| Model | Description |
| --- | --- |
| qwen-vl-ocr-latest | Always points to the latest version. |
| qwen-vl-ocr-2025-11-20 | Latest dated snapshot (used in the examples on this page). |
| | Previous version. |
| | Previous version. |
| | Previous version. |
| qwen-vl-ocr | Base model. |
Error codes
If a model call returns an error, see Error messages to resolve the issue.