Alibaba Cloud Model Studio: Qwen-OCR API reference

Last Updated: Mar 15, 2026

Extract text, structured data, and key information from images using the Qwen-OCR model. Qwen-OCR supports two API protocols: the OpenAI-compatible API and the DashScope API.

For use cases and getting-started guidance, see Text extraction (Qwen-OCR).

OpenAI-compatible API

Endpoints

| Region | SDK base_url | HTTP endpoint |
| --- | --- | --- |
| Singapore | https://dashscope-intl.aliyuncs.com/compatible-mode/v1 | POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions |
| US (Virginia) | https://dashscope-us.aliyuncs.com/compatible-mode/v1 | POST https://dashscope-us.aliyuncs.com/compatible-mode/v1/chat/completions |
| China (Beijing) | https://dashscope.aliyuncs.com/compatible-mode/v1 | POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions |

Prerequisites

Get an API key and set it as an environment variable. If you use the OpenAI SDK, install the SDK.

Quick start

Use the OpenAI-compatible chat completions endpoint. Send a user message with an image URL and text prompt. The model extracts text and returns it in choices[0].message.content.

Non-streaming

Python

from openai import OpenAI
import os

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""

try:
    client = OpenAI(
        # If the environment variable is not configured, replace with: api_key="sk-xxx"
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # Singapore region. For US (Virginia), use https://dashscope-us.aliyuncs.com/compatible-mode/v1
        # For China (Beijing), use https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-ocr-2025-11-20",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                        # Minimum pixel count. Images below this are upscaled.
                        "min_pixels": 32 * 32 * 3,
                        # Maximum pixel count. Images above this are downscaled.
                        "max_pixels": 32 * 32 * 8192
                    },
                    # Custom prompt. Without this, the model uses: "Please output only the text content from the image without any additional descriptions or formatting."
                    {"type": "text",
                     "text": PROMPT_TICKET_EXTRACTION}
                ]
            }
        ])
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error message: {e}")

Node.js

import OpenAI from 'openai';

const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
`;

const client = new OpenAI({
  // If the environment variable is not configured, replace with: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // For China (Beijing), use https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});

async function main() {
  const response = await client.chat.completions.create({
    model: 'qwen-vl-ocr-2025-11-20',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: PROMPT_TICKET_EXTRACTION},
          {
            type: 'image_url',
            image_url: {
              url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
            },
            // Minimum pixel count. Images below this are upscaled.
            min_pixels: 32 * 32 * 3,
            // Maximum pixel count. Images above this are downscaled.
            max_pixels: 32 * 32 * 8192
          }
        ]
      }
    ],
  });
  console.log(response.choices[0].message.content)
}

main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr-2025-11-20",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                    "min_pixels": 3072,
                    "max_pixels": 8388608
                },
                {"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\':\'xxx\', \'seat_number\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"}
            ]
        }
    ]
}'

Streaming

Set stream to true to receive results incrementally as the model generates them.

Python

import os
from openai import OpenAI

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx','departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr-2025-11-20",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                    "min_pixels": 32 * 32 * 3,
                    "max_pixels": 32 * 32 * 8192
                },
                {"type": "text","text": PROMPT_TICKET_EXTRACTION}
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True}
)

for chunk in completion:
    print(chunk.model_dump_json())

Node.js

import OpenAI from 'openai';

const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
`;

const openai = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr-2025-11-20',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: PROMPT_TICKET_EXTRACTION},
          {
            type: 'image_url',
            image_url: {
              url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
            },
            min_pixels: 32 * 32 * 3,
            max_pixels: 32 * 32 * 8192
          }
        ]
      }
    ],
    stream: true,
    stream_options: { include_usage: true }
  });
  let fullContent = ""
  console.log("Streaming output content:")
  for await (const chunk of response) {
    if (chunk.choices[0] && chunk.choices[0].delta.content != null) {
      fullContent += chunk.choices[0].delta.content;
      console.log(chunk.choices[0].delta.content);
    }
  }
  console.log(`Full output content: ${fullContent}`)
}

main();

curl

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr-2025-11-20",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                    "min_pixels": 3072,
                    "max_pixels": 8388608
                },
                {"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\':\'xxx\', \'seat_number\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"}
            ]
        }
    ],
    "stream": true,
    "stream_options": {"include_usage": true}
}'

Request parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name. See Qwen-OCR for supported models. |
| messages | array | Yes | An array of message objects that provides context to the model. |

Message object

Each message requires a role (must be user) and a content array with these element types:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | text for text input, image_url for image input. |
| text | string | No | The text prompt. Default: "Please output only the text content from the image without any additional descriptions or formatting". |
| image_url.url | string | Yes (when type is image_url) | URL or Base64-encoded Data URL of the image. For local files, see Text extraction. |
| min_pixels | integer | No | Minimum pixel threshold. Images below this value are upscaled. See Image resolution control. |
| max_pixels | integer | No | Maximum pixel threshold. Images above this value are downscaled. See Image resolution control. |
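As noted for image_url.url, a local file can be passed as a Base64-encoded Data URL instead of a public URL. A minimal encoding sketch (the file path is hypothetical):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a Base64 Data URL for image_url.url."""
    mime, _ = mimetypes.guess_type(path)  # e.g. "image/jpeg" for .jpg
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Usage (hypothetical local file):
# url = to_data_url("ticket.jpg")
# content element: {"type": "image_url", "image_url": {"url": url}}
```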

Generation parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| stream | boolean | false | Set to true to receive incremental responses as the model generates output. |
| stream_options.include_usage | boolean | false | When stream is true, set this to true to include token usage in the last chunk. |
| max_tokens | integer | Varies | Maximum tokens in the output. Exceeding this truncates the response. See Output token limits. |
| temperature | float | 0.01 | Controls output diversity. Higher values produce more varied text. Range: [0, 2). |
| top_p | float | 0.001 | Nucleus sampling threshold. Higher values increase diversity. Range: (0, 1.0]. Set either temperature or top_p, not both. |
| top_k | integer | 1 | Limits the candidate token set during sampling. If the value is None or greater than 100, the top_k policy is disabled and only the top_p policy takes effect. Must be >= 0. Not a standard OpenAI parameter: pass it via extra_body in the Python SDK (extra_body={"top_k": xxx}); in the Node.js SDK or HTTP, pass it at the top level. |
| repetition_penalty | float | 1.0 | Penalty for repeated sequences. Values above 1.0 reduce repetition. Not a standard OpenAI parameter: pass it via extra_body in the Python SDK. |
| presence_penalty | float | 0.0 | Controls content repetition. Range: [-2.0, 2.0]. Positive values reduce repetition. |
| seed | integer | -- | Ensures reproducible results when the same value is used with identical parameters. Range: [0, 2^31 - 1]. |
| logprobs | boolean | false | Set to true to return log probabilities of output tokens. |
| top_logprobs | integer | 0 | Number of most likely tokens to return per step. Range: [0, 5]. Only effective when logprobs is true. |
| stop | string or array | -- | Stop words or token IDs. Generation stops when a specified string or token_id appears. Do not mix strings and token_ids in the same array. |
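Because top_k and repetition_penalty are not part of the standard OpenAI schema, the Python SDK rejects them as top-level arguments and they must travel under extra_body. The sketch below shows one way to assemble the request keyword arguments; the prompt, image URL, and all parameter values are illustrative, not recommendations:

```python
# Sketch: building chat.completions.create() kwargs for the OpenAI Python SDK.
# Standard parameters (seed, max_tokens) go at the top level; non-standard
# DashScope parameters (top_k, repetition_penalty) go under extra_body,
# which the SDK forwards verbatim in the HTTP request body.
def build_request_kwargs(prompt: str, image_url: str) -> dict:
    return {
        "model": "qwen-vl-ocr-2025-11-20",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "seed": 1234,        # standard: reproducible sampling
        "max_tokens": 2048,  # standard: output cap
        "extra_body": {      # non-standard DashScope parameters
            "top_k": 50,
            "repetition_penalty": 1.05,
        },
    }

# completion = client.chat.completions.create(**build_request_kwargs(prompt, url))
```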

Response

Non-streaming response (chat.completion)

{
  "id": "chatcmpl-ba21fa91-dcd6-4dad-90cc-6d49c3c39094",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "```json\n{\n    \"seller_name\": \"null\",\n    \"buyer_name\": \"Cai Yingshi\",\n    \"price_excluding_tax\": \"230769.23\",\n    \"organization_code\": \"null\",\n    \"invoice_code\": \"142011726001\"\n}\n```",
        "refusal": null,
        "role": "assistant",
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1763283287,
  "model": "qwen-vl-ocr-latest",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 72,
    "prompt_tokens": 1185,
    "total_tokens": 1257,
    "completion_tokens_details": {
      "accepted_prediction_tokens": null,
      "audio_tokens": null,
      "reasoning_tokens": null,
      "rejected_prediction_tokens": null,
      "text_tokens": 72
    },
    "prompt_tokens_details": {
      "audio_tokens": null,
      "cached_tokens": null,
      "image_tokens": 1001,
      "text_tokens": 184
    }
  }
}

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique request identifier. |
| choices | array | Model-generated content. |
| choices[].finish_reason | string | stop when generation completed normally, length when truncated due to token limit. |
| choices[].index | integer | Position in the choices array. |
| choices[].message.content | string | Extracted text or structured output from the model. |
| choices[].message.role | string | Always assistant. |
| choices[].message.refusal | string | Always null. |
| choices[].message.audio | object | Always null. |
| choices[].message.function_call | object | Always null. |
| choices[].message.tool_calls | array | Always null. |
| created | integer | UNIX timestamp of the request. |
| model | string | Model used. |
| object | string | Always chat.completion. |
| service_tier | string | Always null. |
| system_fingerprint | string | Always null. |
| usage.completion_tokens | integer | Output token count. |
| usage.prompt_tokens | integer | Input token count. |
| usage.total_tokens | integer | Sum of prompt_tokens and completion_tokens. |
| usage.completion_tokens_details.text_tokens | integer | Text output tokens. Other fields in completion_tokens_details are always null. |
| usage.prompt_tokens_details.image_tokens | integer | Image input tokens. |
| usage.prompt_tokens_details.text_tokens | integer | Text input tokens. Other fields in prompt_tokens_details are always null. |

Streaming response (chat.completion.chunk)

When stream is true, the response is delivered as a series of Server-Sent Event (SSE) chunks. Each chunk follows the same structure as the non-streaming response, with these differences:

  • object is always chat.completion.chunk.

  • choices[].delta replaces choices[].message. The delta object has the same fields as message.

  • choices[].delta.role is returned only in the first chunk.

  • finish_reason is null during generation, stop on completion, or length if truncated.

  • When include_usage is true, the last chunk has an empty choices array and includes the usage object.

{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"","function_call":null,"refusal":null,"role":"assistant","tool_calls":null},"finish_reason":null,"index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"```","function_call":null,"refusal":null,"role":null,"tool_calls":null},"finish_reason":null,"index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"json","function_call":null,"refusal":null,"role":null,"tool_calls":null},"finish_reason":null,"index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
......
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[{"delta":{"content":"","function_call":null,"refusal":null,"role":null,"tool_calls":null},"finish_reason":"stop","index":0,"logprobs":null}],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":null}
{"id":"chatcmpl-f6fbdc0d-78d6-418f-856f-f099c2e4859b","choices":[],"created":1764139204,"model":"qwen-vl-ocr-latest","object":"chat.completion.chunk","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":141,"prompt_tokens":513,"total_tokens":654,"completion_tokens_details":{"accepted_prediction_tokens":null,"audio_tokens":null,"reasoning_tokens":null,"rejected_prediction_tokens":null,"text_tokens":141},"prompt_tokens_details":{"audio_tokens":null,"cached_tokens":null,"image_tokens":332,"text_tokens":181}}}
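A client can reassemble the full output by applying the rules above to the parsed chunk objects: concatenate choices[0].delta.content while choices is non-empty, and read usage from the final, choices-less chunk. A minimal sketch:

```python
def assemble_stream(chunks):
    """Concatenate streamed delta content and capture the trailing usage object.

    Per the chunk format above: content chunks carry choices[0].delta.content,
    and when include_usage is true the last chunk has an empty choices array
    and a populated usage field.
    """
    parts, usage = [], None
    for chunk in chunks:
        if chunk.choices:
            delta = chunk.choices[0].delta.content
            if delta:  # role-only or finish chunks may carry empty content
                parts.append(delta)
        elif chunk.usage is not None:
            usage = chunk.usage
    return "".join(parts), usage

# text, usage = assemble_stream(completion)  # completion from stream=True call
```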

Image resolution control

min_pixels and max_pixels control image resizing before processing. Token-to-pixel ratio depends on model version:

| Model | Pixels per token | min_pixels default (minimum) | max_pixels default | max_pixels maximum |
| --- | --- | --- | --- | --- |
| qwen-vl-ocr-latest, qwen-vl-ocr-2025-11-20 | 32 x 32 = 1,024 | 3,072 (3 tokens) | 8,388,608 (8,192 tokens) | 30,720,000 (30,000 tokens) |
| qwen-vl-ocr, qwen-vl-ocr-2025-08-28, and earlier | 28 x 28 = 784 | 3,136 (4 tokens) | 6,422,528 (8,192 tokens) | 23,520,000 (30,000 tokens) |

Resizing behavior:

  • If the image pixel count is below min_pixels, the image is upscaled until it exceeds min_pixels.

  • If the image pixel count is within [min_pixels, max_pixels], the original image is used without resizing.

  • If the image pixel count exceeds max_pixels, the image is downscaled below max_pixels.
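The clamping described above determines the image's contribution to prompt_tokens. A small helper for the 1,024-pixels-per-token models sketches the arithmetic (an approximation; the service's exact dimension rounding may differ):

```python
def effective_pixels(width: int, height: int,
                     min_pixels: int = 3072,
                     max_pixels: int = 8388608) -> int:
    """Clamp an image's pixel count to [min_pixels, max_pixels] per the rules above."""
    return max(min_pixels, min(width * height, max_pixels))

def estimate_image_tokens(width: int, height: int,
                          pixels_per_token: int = 1024) -> int:
    """Rough image-token estimate for qwen-vl-ocr-2025-11-20 (32 x 32 = 1,024 px/token)."""
    return effective_pixels(width, height) // pixels_per_token

# A 689 x 487 image (335,543 pixels) falls inside the default range, so it is
# used as-is: roughly 335543 // 1024 = 327 tokens.
```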

Output token limits

| Model | Default and maximum max_tokens |
| --- | --- |
| qwen-vl-ocr-latest, qwen-vl-ocr-2025-11-20, qwen-vl-ocr-2024-10-28 | Same as the model's maximum output length. See Availability. |
| qwen-vl-ocr, qwen-vl-ocr-2025-04-13, qwen-vl-ocr-2025-08-28 | 4,096 |

To increase max_tokens to a value between 4,097 and 8,192, email modelstudio@service.aliyun.com with the following details: your Alibaba Cloud account ID, the image type (such as document, e-commerce, or contract), the model name, your estimated QPS and daily request volume, and the percentage of requests where output exceeds 4,096 tokens.

DashScope API

Endpoints

| Region | HTTP endpoint |
| --- | --- |
| Singapore | POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation |
| US (Virginia) | POST https://dashscope-us.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation |
| China (Beijing) | POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation |

SDK base URL configuration:

Python:

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

Java (Method 1 -- constructor):

import com.alibaba.dashscope.protocol.Protocol;
MultiModalConversation conv = new MultiModalConversation(Protocol.HTTP.getValue(), "https://dashscope-intl.aliyuncs.com/api/v1");

Java (Method 2 -- static block):

import com.alibaba.dashscope.utils.Constants;
Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";

Replace the domain with dashscope-us.aliyuncs.com for the US (Virginia) region or dashscope.aliyuncs.com for the China (Beijing) region. For the China (Beijing) region, you do not need to set the base URL for SDK calls.

Get an API key and set it as an environment variable. If you use the DashScope SDK, install it.

Built-in tasks

The DashScope API provides built-in OCR tasks via the ocr_options parameter. Each task uses an optimized default prompt, eliminating the need for a text message.

| Task | ocr_options.task value | Output format |
| --- | --- | --- |
| General text recognition | text_recognition | Plain text |
| High-precision recognition | advanced_recognition | Plain text with bounding boxes |
| Information extraction | key_information_extraction | Structured key-value pairs |
| Table parsing | table_parsing | Table structure |
| Document parsing | document_parsing | Document structure |
| Formula recognition | formula_recognition | LaTeX formulas |
| Multilingual recognition | multi_lan | Multilingual text |

High-precision recognition

Returns text with positional data for each recognized line.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
    "role": "user",
    "content": [{
        "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
        "min_pixels": 32 * 32 * 3,
        "max_pixels": 32 * 32 * 8192,
        "enable_rotate": False
    }]
}]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "advanced_recognition"}
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

// dashscope SDK version >= 2.21.8
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);

        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.ADVANCED_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "advanced_recognition"
    }
  }
}
'

Information extraction

Extracts structured key-value data from images. Specify fields to extract in task_config.result_schema.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
      {
        "role":"user",
        "content":[
          {
              "image":"http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
              "min_pixels": 3072,
              "max_pixels": 8388608,
              "enable_rotate": False
          }
        ]
      }
    ]

params = {
  "ocr_options":{
    "task": "key_information_extraction",
    "task_config": {
      "result_schema": {
          "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
          "Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
          "Invoice Number": "Extract the number from the invoice, usually composed of only digits."
      }
    }
  }
}

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    **params)

print(response.output.choices[0].message.content[0]["ocr_result"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.JsonObject;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);

        JsonObject resultSchema = new JsonObject();
        resultSchema.addProperty("Ride Date", "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05");
        resultSchema.addProperty("Invoice Code", "Extract the invoice code from the image, usually a combination of numbers or letters");
        resultSchema.addProperty("Invoice Number", "Extract the number from the invoice, usually composed of only digits.");

        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
                .taskConfig(OcrOptions.TaskConfig.builder().resultSchema(resultSchema).build())
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "key_information_extraction",
      "task_config": {
        "result_schema": {
          "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
          "Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
          "Invoice Number": "Extract the number from the invoice, usually composed of only digits."
        }
      }
    }
  }
}
'

Table parsing

Extracts table structure from images.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
                "min_pixels": 32 * 32 * 3,
                "max_pixels": 32 * 32 * 8192,
                "enable_rotate": False}]
            }]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "table_parsing"}
)
print(response["output"]["choices"][0]["message"]["content"][0]["text"])

Java

import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);

        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TABLE_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "table_parsing"
    }
  }
}
'

Document parsing

Extracts the structural layout and text from documents.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
                "min_pixels": 32 * 32 * 3,
                "max_pixels": 32 * 32 * 8192,
                "enable_rotate": False}]
            }]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "document_parsing"}
)
print(response["output"]["choices"][0]["message"]["content"][0]["text"])

Java

import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);

        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.DOCUMENT_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "document_parsing"
    }
  }
}
'

Formula recognition

Extracts mathematical formulas from images and returns them in LaTeX format.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
                "min_pixels": 32 * 32 * 3,
                "max_pixels": 32 * 32 * 8192,
                "enable_rotate": False}]
            }]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "formula_recognition"}
)
print(response["output"]["choices"][0]["message"]["content"][0]["text"])

Java

import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);

        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.FORMULA_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "formula_recognition"
    }
  }
}
'

General text recognition

Extracts plain text from images without structural formatting.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
                "min_pixels": 32 * 32 * 3,
                "max_pixels": 32 * 32 * 8192,
                "enable_rotate": False}]
            }]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "text_recognition"}
)
print(response["output"]["choices"][0]["message"]["content"][0]["text"])

Java

import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);

        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TEXT_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "text_recognition"
    }
  }
}
'

Multilingual recognition

Recognizes text in multiple languages from images.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
                "min_pixels": 32 * 32 * 3,
                "max_pixels": 32 * 32 * 8192,
                "enable_rotate": False}]
            }]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    ocr_options={"task": "multi_lan"}
)
print(response["output"]["choices"][0]["message"]["content"][0]["text"])

Java

import java.util.Arrays;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        map.put("enable_rotate", false);

        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.MULTI_LAN)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "multi_lan"
    }
  }
}
'

Streaming (DashScope)

Enable streaming output to receive results incrementally. The method varies by SDK:

  • Python SDK: Set stream=True and incremental_output=True.

  • Java SDK: Use the streamCall interface.

  • HTTP: Set the X-DashScope-SSE: enable header.

Python

import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx','departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time':'xxx', 'seat_number': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'},
"""

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                "min_pixels": 32 * 32 * 3,
                "max_pixels": 32 * 32 * 8192},
            {
                "type": "text",
                "text": PROMPT_TICKET_EXTRACTION
            }
        ]
    }
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr-2025-11-20",
    messages=messages,
    stream=True,
    incremental_output=True,
)
full_content = ""
print("Streaming output content:")
for chunk in response:
    try:
        text = chunk["output"]["choices"][0]["message"]["content"][0]["text"]
        print(text)
        full_content += text
    except (IndexError, KeyError, TypeError):
        # Some chunks carry no text content; skip them.
        pass
print(f"Full content: {full_content}")

Java

import java.util.*;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
        map.put("max_pixels", 8388608);
        map.put("min_pixels", 3072);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\':\'xxx\', \'seat_number\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                List<Map<String, Object>> contentList = item.getOutput().getChoices().get(0).getMessage().getContent();
                if (!contentList.isEmpty()) {
                    System.out.println(contentList.get(0).get("text"));
                }
            } catch (Exception e) {
                System.out.println(e.getMessage());
            }
        });
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--header 'X-DashScope-SSE: enable' \
--data '
{
    "model": "qwen-vl-ocr-2025-11-20",
    "input": {
        "messages": [
            {
              "role": "user",
              "content": [
                  {
                      "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                      "min_pixels": 3072,
                      "max_pixels": 8388608
                  },
                  {"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. You must accurately extract the key information. Do not omit or fabricate information. Replace any single character that is blurry or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\"invoice_number\": \"xxx\", \"departure_station\": \"xxx\", \"arrival_station\": \"xxx\", \"departure_date_and_time\":\"xxx\", \"seat_number\": \"xxx\",\"ticket_price\":\"xxx\", \"id_card_number\": \"xxx\", \"passenger_name\": \"xxx\"}"}
              ]
            }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'

Request parameters

Parameter

Type

Required

Description

model

string

Yes

Model name. See Qwen-OCR for supported models.

input.messages

array

Yes

An array of message objects.

Message object

Each message requires a role (must be user) and a content field (string or array). Use a string for text-only input. Use an array if the input includes image data, with these fields:

Parameter

Type

Required

Description

image

string

No

URL, Base64 Data URL, or local path of the image. See Passing local files.

text

string

No

The text prompt. Default: "Please output only the text content from the image without any additional descriptions or formatting". Not required when using a built-in task.

enable_rotate

boolean

No

Set to true to correct skewed images. Default: false.

min_pixels

integer

No

Minimum pixel threshold. See Image resolution control.

max_pixels

integer

No

Maximum pixel threshold. See Image resolution control.

Generation parameters

Set these in the parameters object for HTTP calls.

Parameter

Type

Default

Description

max_tokens

integer

Varies

Maximum tokens in the output. See Output token limits. In the Java SDK, use maxTokens.

stream

boolean

false

Enable streaming output. Python SDK only. For Java, use streamCall. For HTTP, set X-DashScope-SSE: enable.

incremental_output

boolean

false

When true (recommended), each chunk contains only new content. When false, each chunk contains the full sequence so far. In the Java SDK, use incrementalOutput.

temperature

float

0.01

Controls output diversity. Range: [0, 2).

top_p

float

0.001

Nucleus sampling threshold. Range: (0, 1.0]. Set either temperature or top_p, not both.

top_k

integer

1

Limits the candidate token set during sampling. If the value is None or greater than 100, the top_k policy is not enabled, and only the top_p policy takes effect. Must be >= 0.

repetition_penalty

float

1.0

Penalty for repeated sequences. Values above 1.0 reduce repetition.

presence_penalty

float

0.0

Controls content repetition. Range: [-2.0, 2.0].

seed

integer

--

Ensures reproducible results. Range: [0, 2^31 - 1].

logprobs

boolean

false

Set to true to return log probabilities. Supported models: qwen-vl-ocr-2025-04-13 and later. In the Java SDK, use the same name. For HTTP, place in parameters.

top_logprobs

integer

0

Number of most likely tokens per step. Range: [0, 5]. Only effective when logprobs is true. In the Java SDK, use topLogprobs. For HTTP, place in parameters.

stop

string or array

--

Stop words or token IDs. Generation stops when a specified string or token_id appears. Do not mix strings and token_ids in the same array.
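Taken together, the generation parameters above map onto the parameters object of an HTTP request body as in the following sketch. The values shown are illustrative placeholders, not tuned recommendations:

```python
# Assembling the "parameters" object for an HTTP request body.
# All values here are illustrative, not recommendations.
parameters = {
    "max_tokens": 2000,    # cap the output length
    "temperature": 0.01,   # near-deterministic sampling suits OCR
    "seed": 42,            # reproducible output across runs
    "stop": ["###"],       # stop generation at this string (illustrative)
}

payload = {
    "model": "qwen-vl-ocr-2025-11-20",
    "input": {"messages": []},  # messages as shown in the examples above
    "parameters": parameters,
}
```

When calling through the Python SDK, pass the same fields as keyword arguments to MultiModalConversation.call instead of nesting them under parameters.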

Built-in task parameters (ocr_options)

When using a built-in task, pass ocr_options in parameters (HTTP), as a keyword argument (Python SDK), or via the OcrOptions builder (Java SDK).

Parameter

Type

Required

Description

ocr_options.task

string

Yes

Built-in task name. Valid values: text_recognition, key_information_extraction, document_parsing, table_parsing, formula_recognition, multi_lan, advanced_recognition.

ocr_options.task_config

object

No

Configuration for key_information_extraction.

ocr_options.task_config.result_schema

object

No

JSON object specifying the fields to extract. Keys are field names; values are optional descriptions that improve extraction accuracy. Supports up to three nesting levels.

result_schema example:

"result_schema": {
     "invoice_number": "The unique identification number of the invoice, usually a combination of numbers and letters.",
     "issue_date": "The date the invoice was issued. Extract it in YYYY-MM-DD format, for example, 2023-10-26.",
     "seller_name": "The full company name of the seller shown on the invoice.",
     "total_amount": "The total amount on the invoice, including tax. Extract the numerical value and keep two decimal places, for example, 123.45."
}
In the Java SDK, configure this parameter through the OcrOptions class. ocr_options requires DashScope Python SDK 1.22.2 or later and Java SDK 2.18.4 or later. The advanced_recognition task requires Java SDK 2.21.8 or later.
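For reference, the schema above plugs into the ocr_options dictionary that the Python SDK examples in this topic pass to MultiModalConversation.call:

```python
# The result_schema example above, wired into the ocr_options argument
# used by the Python SDK examples in this topic.
result_schema = {
    "invoice_number": "The unique identification number of the invoice, usually a combination of numbers and letters.",
    "issue_date": "The date the invoice was issued. Extract it in YYYY-MM-DD format, for example, 2023-10-26.",
    "seller_name": "The full company name of the seller shown on the invoice.",
    "total_amount": "The total amount on the invoice, including tax. Extract the numerical value and keep two decimal places, for example, 123.45.",
}

ocr_options = {
    "task": "key_information_extraction",
    "task_config": {"result_schema": result_schema},
}
# Pass ocr_options=ocr_options to dashscope.MultiModalConversation.call,
# as in the key information extraction examples above.
```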

Response

The DashScope API uses the same response format for streaming and non-streaming output.

{
  "status_code": 200,
  "request_id": "8f8c0f6e-6805-4056-bb65-d26d66080a41",
  "code": "",
  "message": "",
  "output": {
    "text": null,
    "finish_reason": null,
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "ocr_result": {
                "kv_result": {
                  "price_excluding_tax": "230769.23",
                  "invoice_code": "142011726001",
                  "organization_code": "null",
                  "buyer_name": "Cai Yingshi",
                  "seller_name": "null"
                }
              },
              "text": "```json\n{\n    \"price_excluding_tax\": \"230769.23\",\n    \"invoice_code\": \"142011726001\",\n    \"organization_code\": \"null\",\n    \"buyer_name\": \"Cai Yingshi\",\n    \"seller_name\": \"null\"\n}\n```"
            }
          ]
        }
      }
    ],
    "audio": null
  },
  "usage": {
    "input_tokens": 926,
    "output_tokens": 72,
    "characters": 0,
    "image_tokens": 754,
    "input_tokens_details": {
      "image_tokens": 754,
      "text_tokens": 172
    },
    "output_tokens_details": {
      "text_tokens": 72
    },
    "total_tokens": 998
  }
}
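To consume the structured results programmatically, navigate to ocr_result.kv_result inside the first content item, as in this sketch (the dict below mirrors the sample response above, trimmed to the fields used):

```python
# Sample response, reduced to the fields this sketch reads.
response = {
    "status_code": 200,
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "ocr_result": {
                                "kv_result": {
                                    "invoice_code": "142011726001",
                                    "buyer_name": "Cai Yingshi",
                                }
                            }
                        }
                    ],
                },
            }
        ]
    },
}

# Drill down to the key-value extraction results.
content = response["output"]["choices"][0]["message"]["content"][0]
kv = content["ocr_result"]["kv_result"]
print(kv["invoice_code"])  # 142011726001
```

The text field of the same content item carries the equivalent result as a JSON-formatted string; kv_result saves you from parsing it.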

Field

Type

Description

status_code

integer

200 indicates success. The Java SDK throws an exception instead of returning this field.

request_id

string

Unique request identifier. In the Java SDK, this is requestId.

code

string

Error code. Empty on success. Only the Python SDK returns this field.

output.text

string

Always null.

output.finish_reason

string

null during generation, stop when complete, length when truncated.

output.choices[].finish_reason

string

Same values as output.finish_reason.

output.choices[].message.role

string

Always assistant.

output.choices[].message.content[].text

string

Extracted text or formatted output from the model.

output.choices[].message.content[].ocr_result

object

Returned for built-in tasks (key_information_extraction, advanced_recognition).

output.choices[].message.content[].ocr_result.kv_result

object

Key-value extraction results (for key_information_extraction).

output.choices[].message.content[].ocr_result.words_info

array

Text line results with positional data (for advanced_recognition).

output.choices[].message.content[].ocr_result.words_info[].rotate_rect

array

[center_x, center_y, width, height, angle] -- rotated bounding rectangle.

output.choices[].message.content[].ocr_result.words_info[].location

array

[x1, y1, x2, y2, x3, y3, x4, y4] -- four vertices clockwise from the top-left.

output.choices[].message.content[].ocr_result.words_info[].text

string

Content of the text line.

output.choices[].message.logprobs

object

Log probability information, returned when logprobs is true.

usage.input_tokens

integer

Input token count.

usage.output_tokens

integer

Output token count.

usage.characters

integer

Fixed to 0.

usage.total_tokens

integer

Sum of input_tokens and output_tokens.

usage.image_tokens

integer

Tokens corresponding to the image input.

usage.input_tokens_details.image_tokens

integer

Image input tokens.

usage.input_tokens_details.text_tokens

integer

Text input tokens.

usage.output_tokens_details.text_tokens

integer

Text output tokens.
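For positional results from advanced_recognition, rotate_rect and location describe the same text line. As a sketch, the four location vertices can be derived from rotate_rect; this assumes the angle is in degrees and rotates the box about its center, with vertex order clockwise from the top-left as documented for location (the angle convention is an assumption, not stated above):

```python
import math

def rotate_rect_to_vertices(rect):
    """Derive four (x, y) vertices from [center_x, center_y, width, height, angle].

    Assumption: angle is in degrees, applied about the rectangle center.
    Vertex order is clockwise from the top-left, matching `location`.
    """
    cx, cy, w, h, angle = rect
    rad = math.radians(angle)
    cos_a, sin_a = math.cos(rad), math.sin(rad)
    # Corner offsets from the center, clockwise from top-left.
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [
        (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
        for dx, dy in corners
    ]

# An axis-aligned 100x40 box centered at (60, 30):
print(rotate_rect_to_vertices([60, 30, 100, 40, 0]))
# [(10.0, 10.0), (110.0, 10.0), (110.0, 50.0), (10.0, 50.0)]
```

When you only need axis-aligned crops, the flat location array is usually simpler to consume directly.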

Supported models

Model

Description

qwen-vl-ocr-latest

Always points to the latest version.

qwen-vl-ocr-2025-11-20

Latest dated snapshot.

qwen-vl-ocr-2025-08-28

Previous version.

qwen-vl-ocr-2025-04-13

Previous version.

qwen-vl-ocr-2024-10-28

Previous version.

qwen-vl-ocr

Base model.

Error codes

If a model call returns an error, see Error messages to resolve the issue.