Alibaba Cloud Model Studio: Text extraction (Qwen-OCR)

Last Updated: Nov 05, 2025

Qwen-OCR is a visual understanding model that specializes in text extraction. It can extract text or parse structured data from various images, such as scanned documents, tables, and receipts. It supports multiple languages and can perform advanced functions, such as information extraction, table parsing, and formula recognition, using specific task instructions.

You can try Qwen-OCR online in the Playground (Singapore or Beijing).

Examples

Recognize multiple languages

Input image: [image]

Recognition result:

INTERNATIONAL
MOTHER LANGUAGE
DAY
Привет!
Hello!
Bonjour!
Merhaba!
Ciao!
Hello!
Ola!
בר מולד
Salam!

Recognize skewed images

Input image: [image]

Recognition result:

Product Introduction

This product is made of imported fiber filaments from South Korea. It does not shrink, deform, mold, or grow bacteria, and will not damage surfaces. It is truly non-stick, highly absorbent, water-resistant, easy to clean, non-toxic, residue-free, and quick-drying.

Store experience: Stainless steel, ceramic products, bathtubs, and integrated bathrooms mostly have white, smooth surfaces. Stains are difficult to remove with other cloths, and sharp objects can easily cause scratches. Use this simulated loofah sponge with a small amount of neutral detergent to create a lather, and you can easily wipe these surface stains clean.

6941990612023

Item No.: 2023

Locate text position

Input image: [image]

Recognition result (visualization of the location results): [image]

For more information about how to draw the bounding box for each line of text on the original image, see FAQ.

Models and pricing

International (Singapore)

qwen-vl-ocr (Stable)

  • Context window: 34,096 tokens

  • Maximum input: 30,000 tokens (max 30,000 tokens for a single image)

  • Maximum output: 4,096 tokens

  • Input/Output price: $0.72 per million tokens

  • Free quota: 1 million tokens each, valid for 90 days after activating Model Studio.

Chinese mainland (Beijing)

All of the following versions share the same specifications: a 34,096-token context window, a maximum input of 30,000 tokens (max 30,000 tokens for a single image), a maximum output of 4,096 tokens, and an input/output price of $0.717 per million tokens.

  • qwen-vl-ocr (Stable): Currently has the same capabilities as qwen-vl-ocr-2025-04-13.

  • qwen-vl-ocr-latest (Latest): Always has the same capabilities as the latest snapshot version.

  • qwen-vl-ocr-2025-04-13 (Snapshot): Also known as qwen-vl-ocr-0413. Significantly improves text recognition capabilities and adds six built-in OCR tasks and features such as custom prompts and image rotation correction.

  • qwen-vl-ocr-2024-10-28 (Snapshot): Also known as qwen-vl-ocr-1028.

For qwen-vl-ocr models, the max_tokens parameter (maximum output length) defaults to 4,096. To increase this value to a range of 4,097 to 8,192, send an email to modelstudio@service.aliyun.com. Include the following information: your Alibaba Cloud account ID, the image type (such as document image, e-commerce image, or contract), the model name, the estimated queries per second (QPS) and total daily requests, and the percentage of requests where the model output exceeds 4,096 tokens.
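For reference, the following minimal sketch (OpenAI-compatible SDK, Singapore endpoint) shows where max_tokens is passed on a call. The request otherwise mirrors the Getting started example below; values above 4,096 only take effect after the increase described above is approved.

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # Singapore endpoint. For the Beijing region, use https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    # 4096 is the default maximum output length; raise it only after your quota increase is approved.
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
            {"type": "text",
             "text": "Please output only the text content from the image without any additional descriptions or formatting."}
        ],
    }],
)
print(completion.choices[0].message.content)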

Sample code for estimating image tokens (for budget reference only)

Tokens are the basic unit for model billing. The following code demonstrates the model's internal scaling logic and can be used for a rough cost estimate. For actual billing, refer to the API response. The total number of tokens consists of both input and output tokens.

  • Text tokens: For text content passed through the text field, the number of tokens is calculated according to standard large language model rules.

  • Image tokens:

    • The model pre-processes the image by scaling it to a specific size before processing.

    • Formula: Image tokens = (Resized width × Resized height) / (28 × 28) + 2.

    • You do not need to implement complex image scaling and token calculation on the client side. The most accurate token consumption is provided in the usage field returned in each API call.

    import math
    from PIL import Image
    
    
    def smart_resize(image_path, min_pixels, max_pixels):
        """
        Pre-processes the image the same way the model does before token counting.

        Parameters:
            image_path: The path to the image.
            min_pixels: Lower bound on the total number of pixels after resizing.
            max_pixels: Upper bound on the total number of pixels after resizing.
        """
        # Open the image file.
        image = Image.open(image_path)
    
        # Get the original dimensions of the image.
        height = image.height
        width = image.width
        # Adjust the height to be an integer multiple of 28.
        h_bar = round(height / 28) * 28
        # Adjust the width to be an integer multiple of 28.
        w_bar = round(width / 28) * 28
    
        # Scale the image to keep the total number of pixels within the range [min_pixels, max_pixels].
        if h_bar * w_bar > max_pixels:
            beta = math.sqrt((height * width) / max_pixels)
            h_bar = math.floor(height / beta / 28) * 28
            w_bar = math.floor(width / beta / 28) * 28
        elif h_bar * w_bar < min_pixels:
            beta = math.sqrt(min_pixels / (height * width))
            h_bar = math.ceil(height * beta / 28) * 28
            w_bar = math.ceil(width * beta / 28) * 28
        return h_bar, w_bar
    
    
    # Replace xxx/test.png with the path to your local image.
    h_bar, w_bar = smart_resize("xxx/test.png", min_pixels=28 * 28 * 4, max_pixels=8192 * 28 * 28)
    print(f"Resized image dimensions: height={h_bar}, width={w_bar}")
    
    # Calculate the number of image tokens: total pixels / (28 * 28).
    token = int((h_bar * w_bar) / (28 * 28))
    
    # <|vision_bos|> and <|vision_eos|> are visual markers, each counted as 1 token.
    print(f"Total image tokens: {token + 2}")

Preparations

  • Create an API key and set it as an environment variable (see the sample commands after this list).

  • If you call the model using the OpenAI SDK or DashScope SDK, install the latest version of the SDK. The minimum version for the DashScope Python SDK is 1.22.2, and for the Java SDK is 2.21.8.

    • DashScope SDK

      • Pros: Supports all advanced features, such as image rotation correction and built-in OCR tasks. It offers more comprehensive functionality and a simpler call method.

      • Scenarios: Projects that require full functionality.

    • OpenAI compatible SDK

      • Pros: Lets users who already use the OpenAI SDK or its ecosystem tools migrate quickly.

      • Limits: Advanced features such as image rotation correction and built-in OCR tasks are not directly supported through parameters. You must simulate them manually by constructing the corresponding prompts and parsing the output results yourself.

      • Scenarios: Projects with an existing OpenAI integration that do not rely on DashScope's exclusive advanced features.
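For reference, a minimal setup on Linux or macOS might look like the following. The API key value is a placeholder, and the DashScope Java SDK is added as a Maven or Gradle dependency rather than through pip.

# Set the API key as an environment variable (replace the placeholder with your own key).
export DASHSCOPE_API_KEY="sk-xxx"
# Install or upgrade the Python SDKs (the DashScope Python SDK must be 1.22.2 or later).
pip install -U dashscope openai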

Getting started

The following example extracts key information from a train ticket image (URL) and returns it in JSON format. For more information, see the sections on how to pass a local file and image limitations.

OpenAI compatible

Python

from openai import OpenAI
import os

PROMPT_TICKET_EXTRACTION = """
Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?).
Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""

try:
    client = OpenAI(
        # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-ocr-latest",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                        # Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
                        "min_pixels": 28 * 28 * 4,
                        # Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
                        "max_pixels": 28 * 28 * 8192
                    },
                    # qwen-vl-ocr supports passing a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                    # If you call qwen-vl-ocr-1028, the model uses a fixed prompt: Read all the text in the image. Custom prompts in the text field are not supported.
                    {"type": "text",
                     "text": PROMPT_TICKET_EXTRACTION}
                ]
            }
        ])
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error message: {e}")

Node.js

import OpenAI from 'openai';

// Define the prompt for extracting train ticket information.
const PROMPT_TICKET_EXTRACTION = `
Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?).
Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx','Ticket Price':'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
`;

const openai = new OpenAI({
  // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
  // If the environment variable is not set, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr',
    messages: [
      {
        role: 'user',
        content: [
          // qwen-vl-ocr supports passing a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
          { type: 'text', text: PROMPT_TICKET_EXTRACTION},
          {
            type: 'image_url',
            image_url: {
              url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
            },
              //  Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
              "min_pixels": 28 * 28 * 4,
              // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
              "max_pixels": 28 * 28 * 8192
          }
        ]
      }
    ],
  });
  console.log(response.choices[0].message.content)
}

main();

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx','Ticket Price':'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"}
            ]
        }
    ]
}'

Response example

{
  "choices": [{
    "message": {
      "content": "```json\n{\n    \"Invoice Number\": \"24329116804000\",\n    \"Train Number\": \"G1948\",\n    \"Departure Station\": \"Nanjing South Station\",\n    \"Arrival Station\": \"Zhengzhou East Station\",\n    \"Departure Date and Time\": \"2024-11-14 11:46\",\n    \"Seat Number\": \"Car 04, Seat 12A\",\n    \"Seat Class\": \"Second Class\",\n    \"Ticket Price\": \"¥337.50\",\n    \"ID Card Number\": \"4107281991****5515\",\n    \"Passenger Name\": \"Du Xiaoguang\"\n}\n```",
      "role": "assistant"
    },
    "finish_reason": "stop",
    "index": 0,
    "logprobs": null
  }],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 606,
    "completion_tokens": 159,
    "total_tokens": 765
  },
  "created": 1742528311,
  "system_fingerprint": null,
  "model": "qwen-vl-ocr-latest",
  "id": "chatcmpl-20e5d9ed-e8a3-947d-bebb-c47ef1378598"
}
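The content field wraps the extracted JSON in a Markdown code fence. The following minimal sketch, which assumes completion is the response object from the Python example above, strips the fence, parses the result, and reads the billed token counts from the usage field:

import json
import re

# `completion` is the response object returned by the Python example above.
raw = completion.choices[0].message.content
# Strip the optional ```json ... ``` fence before parsing.
match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
ticket = json.loads(match.group(1) if match else raw)
print(ticket.get("Train Number"))

# The usage field holds the authoritative token counts for billing.
print(completion.usage.prompt_tokens, completion.usage.completion_tokens)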

DashScope

Python

import os
import dashscope

PROMPT_TICKET_EXTRACTION = """
Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?).
Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                # Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192,
                # Whether to enable automatic image rotation correction.
                "enable_rotate": False
                },
                 # When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                {"type": "text", "text": PROMPT_TICKET_EXTRACTION}]
        }]
try:
    response = dashscope.MultiModalConversation.call(
        # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
        # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model='qwen-vl-ocr',
        messages=messages
    )
    print(response["output"]["choices"][0]["message"].content[0]["text"])
except Exception as e:
    print(f"An error occurred: {e}")

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
        }
        
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
        // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        // Whether to enable automatic image rotation correction.
        map.put("enable_rotate", false);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If not provided, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        Collections.singletonMap("text", "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx','Ticket Price':'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"\
  --header 'Content-Type: application/json'\
  --data '{
"model": "qwen-vl-ocr",
"input": {
  "messages": [
    {
      "role": "user",
      "content": [{
          "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
          "min_pixels": 3136,
          "max_pixels": 6422528,
          "enable_rotate": false
        },
        {
          "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information without omissions or fabrications. Replace any single character that is blurry or obscured by glare with an English question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx','Ticket Price':'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'"
        }
      ]
    }
  ]
}
}'

Response example

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```json\n{\n    \"Invoice Number\": \"24329116804000\",\n    \"Train Number\": \"G1948\",\n    \"Departure Station\": \"Nanjing South Station\",\n    \"Arrival Station\": \"Zhengzhou East Station\",\n    \"Departure Date and Time\": \"2024-11-14 11:46\",\n    \"Seat Number\": \"Car 04, Seat 12A\",\n    \"Seat Class\": \"Second Class\",\n    \"Ticket Price\": \"¥337.50\",\n    \"ID Card Number\": \"4107281991****5515\",\n    \"Passenger Name\": \"Du Xiaoguang\"\n}\n```"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 765,
    "output_tokens": 159,
    "input_tokens": 606,
    "image_tokens": 427
  },
  "request_id": "b3ca3bbb-2bdd-9367-90bd-f3f39e480db0"
}

Use built-in tasks

To simplify calls in specific scenarios, the qwen-vl-ocr models include several built-in tasks.

How to use:

  • DashScope SDK: You do not need to design or pass in a prompt. Set the ocr_options parameter to call a built-in task.

  • OpenAI compatible SDK: You must manually enter the prompt specified for the task.

The following table lists the task value, prompt, output format, and example for each built-in task:

High-precision recognition

We recommend using the qwen-vl-ocr-2025-08-28 model or the latest model, because this version includes a comprehensive upgrade to text localization capabilities. The high-precision recognition task provides the following features:

  • Recognize text content (extract text)

  • Detect text position (locate text lines and output coordinates)

After you obtain the coordinates of the text bounding box, see the FAQ for instructions on how to draw the bounding box on the original image.

Value of task: advanced_recognition

Specified prompt: Locate all text lines and return the coordinates of the rotated rectangle([cx, cy, width, height, angle]).

Output format and example:

  • Format: Plain text in JSON format, or obtain the JSON object directly from the ocr_result field.

  • Example:

    image

    • text: The text content of the line.

    • location:

      • Example value: [x1, y1, x2, y2, x3, y3, x4, y4]

      • Meaning: The absolute coordinates of the four vertices of the text box, with the origin (0,0) at the top-left corner of the original image. The vertex order is fixed: top-left → top-right → bottom-right → bottom-left. (See the drawing sketch after this list.)

    • rotate_rect:

      • Example value: [center_x, center_y, width, height, angle]

      • Meaning: An alternative representation of a text box, where center_x and center_y are the coordinates of the text box center, width and height are the box dimensions, and angle is the rotation angle of the text box relative to the horizontal direction, with a value range of [-90, 90].
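After you have the location array for a text line, you can draw it on the original image. The following minimal sketch uses Pillow; the location values and file names are placeholders (the coordinates are taken from the response example below).

from PIL import Image, ImageDraw

# One text line's `location` value from the model output (placeholder values).
location = [52, 54, 250, 57, 249, 106, 52, 103]

# Open the original image that was sent to the model (placeholder path).
image = Image.open("original.jpg")
draw = ImageDraw.Draw(image)
# Group the flat list into (x, y) vertices and close the polygon by repeating the first point.
points = list(zip(location[0::2], location[1::2]))
draw.line(points + [points[0]], fill="red", width=3)
image.save("with_boxes.jpg")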

Python

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
                # Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192,
                # Whether to enable automatic image rotation correction.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-latest',
    messages=messages,
    # Set the built-in task to high-precision recognition.
    ocr_options={"task": "advanced_recognition"}
)
# The high-precision recognition task returns the result as JSON-formatted plain text; the parsed object is also available in the ocr_result field.
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

// DashScope Java SDK version >= 2.21.8
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
        // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        // Whether to enable automatic image rotation correction.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.ADVANCED_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-08-28")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-08-28",
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
            "min_pixels": 401408,
            "max_pixels": 6422528,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "advanced_recognition"
    }
  }
}
'

Response example

{
  "output":{
    "choices":[
      {
        "finish_reason":"stop",
        "message":{
          "role":"assistant",
          "content":[
            {
              "text":"```json\n[{\"pos_list\": [{\"rotate_rect\": [740, 374, 599, 1459, 90]}]}```",
              "ocr_result":{
                "words_info":[
                  {
                    "rotate_rect":[150,80,49,197,-89],
                    "location":[52,54,250,57,249,106,52,103],
                    "text":"Target Audience"
                  },
                  {
                    "rotate_rect":[724,171,34,1346,-89],
                    "location":[51,146,1397,159,1397,194,51,181],
                    "text":"If you are a system administrator in a Linux environment, learning to write shell scripts will be very beneficial. This book does not detail every step of installing"
                  },
                  {
                    "rotate_rect":[745,216,34,1390,-89],
                    "location":[50,195,1440,202,1440,237,50,230],
                    "text":"a Linux system, but as long as the system has Linux installed and running, you can start thinking about how to automate some daily"
                  },
                  {
                    "rotate_rect":[748,263,34,1394,-89],
                    "location":[52,240,1446,249,1446,283,51,275],
                    "text":"system administration tasks. This is where shell scripting comes in, and this is exactly what this book is for. This book will"
                  },
                  {
                    "rotate_rect":[749,308,34,1395,-89],
                    "location":[51,285,1446,296,1446,331,51,319],
                    "text":"demonstrate how to use shell scripts to automate system administration tasks, from monitoring system statistics and data files to for your boss"
                  },
                  {
                    "rotate_rect":[123,354,33,146,-89],
                    "location":[50,337,197,338,197,372,50,370],
                    "text":"generating reports."
                  },
                  {
                    "rotate_rect":[751,432,34,1402,-89],
                    "location":[51,407,1453,420,1453,454,51,441],
                    "text":"If you are a home Linux enthusiast, you can also benefit from this book. Nowadays, users can easily get lost in a graphical environment made up of many components."
                  },
                  {
                    "rotate_rect":[755,477,31,1404,-89],
                    "location":[54,458,1458,463,1458,495,54,490],
                    "text":"Most desktop Linux distributions try to hide the internal details of the system from general users. But sometimes you really need to know what's going on inside."
                  },
                  {
                    "rotate_rect":[752,523,34,1401,-89],
                    "location":[52,500,1453,510,1453,545,52,535],
                    "text":"This book will show you how to start the Linux command line and what to do next. Usually, for simple tasks"
                  },
                  {
                    "rotate_rect":[747,569,34,1395,-89],
                    "location":[50,546,1445,556,1445,591,50,580],
                    "text":"(such as file management), it is much more convenient to operate on the command line than in a fancy graphical interface. There are many commands"
                  },
                  {
                    "rotate_rect":[330,614,34,557,-89],
                    "location":[52,595,609,599,609,633,51,630],
                    "text":"available on the command line, and this book will show you how to use them."
                  }
                ]
              }
            }
          ]
        }
      }
    ]
  },
  "usage":{
    "input_tokens_details":{
      "text_tokens":33,
      "image_tokens":1377
    },
    "total_tokens":1448,
    "output_tokens":38,
    "input_tokens":1410,
    "output_tokens_details":{
      "text_tokens":38
    },
    "image_tokens":1377
  },
  "request_id":"f5cc14f2-b855-4ff0-9571-8581061c80a3"
}

Information extraction

The model supports extracting structured information from documents such as receipts, certificates, and forms, and returns the results in JSON format. You can choose from two modes:

  • Custom field extraction: Extracts specific fields using a custom JSON template ({result_schema}) that you specify in the ocr_options.task_config parameter. The template defines the field names (keys), and the model fills in the corresponding values. The template supports a maximum of 3 nested layers (see the sample schema after this list).

  • Full field extraction: Automatically extracts all recognizable fields from the image.
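For illustration only, a custom template with nested objects and a list element might look like the following. The field names are hypothetical; replace them with the fields you actually need.

{
  "Merchant Name": "",
  "Totals": {
    "Subtotal": "",
    "Tax": ""
  },
  "Items": [
    {
      "Name": "",
      "Quantity": "",
      "Unit Price": ""
    }
  ]
}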

The prompts for the two modes are different:

Value of task: key_information_extraction

Specified prompt (custom field extraction): Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if a value is a list, the schema will provide a template for each element. This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language must be consistent with the image. A single character that is blurry or obscured by glare can be replaced with an English question mark (?). If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the results as required.

Output format and example:

  • Format: JSON object. You can directly retrieve it from ocr_result.kv_result.

  • Example:

    image

Specified prompt (full field extraction): You are an information extraction expert. Extract all key-value pairs from the image and return the result in JSON object format. Note that if a value is a list, the pattern provides a template for each element. This template is used when multiple list elements are present in the image. Finally, output only valid JSON. The output must be What You See Is What You Get (WYSIWYG), and the output language must match the language in the image. Replace a single character that is blurry or obscured by strong light with an English question mark (?). If there is no corresponding value, fill it with null. Do not provide any explanations. Provide the output strictly according to the requirements above:

Output format and example:

  • Format: JSON object

  • Example:

    image

The following sample code shows how to call the model by using the DashScope SDK and HTTP:

Python

# Use [pip install -U dashscope] to update the SDK.

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
      {
        "role":"user",
        "content":[
          {
              "image":"http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
              "min_pixels": 3136,
              "max_pixels": 6422528,
              "enable_rotate": False
          }
        ]
      }
    ]

params = {
  "ocr_options":{
    "task": "key_information_extraction",
    "task_config": {
      "result_schema": {
          "Date": "",
          "Time": "",
          "Fuel Surcharge": ""
      }
    }
  }
}

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr',
    messages=messages,
    **params)

print(response.output.choices[0].message.content[0]["ocr_result"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.google.gson.JsonObject;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg");
        // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        // Whether to enable automatic image rotation correction.
        map.put("enable_rotate", false);
        
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();

        // Create the main JSON object.
        JsonObject resultSchema = new JsonObject();
        resultSchema.addProperty("Date", "");
        resultSchema.addProperty("Time", "");
        resultSchema.addProperty("Fuel Surcharge", "");

        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
                .taskConfig(OcrOptions.TaskConfig.builder()
                        .resultSchema(resultSchema)
                        .build())
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===
curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
            "min_pixels": 3136,
            "max_pixels": 6422528,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "key_information_extraction",
    "task_config": {
      "result_schema": {
          "Date": "",
          "Time": "",
          "Fuel Surcharge": ""
      }
    }
    }
  }
}
'

If you call the model using the OpenAI SDK or HTTP, you must also replace {result_schema} in the specified prompt with the JSON object to be extracted. For more information, see the following sample code:

OpenAI compatible sample code

Python

import os
from openai import OpenAI

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
# Set the fields and format to be extracted.
result_schema = """
        {
          "Date": "",
          "Time": "",
          "Fuel Surcharge": ""
        }
        """
# Concatenate the prompt. 
prompt = f"""Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if a value is a list, the schema will provide a template for each element.
            This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. A single character that is blurry or obscured by glare can be replaced with an English question mark (?).
            If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the results as required. The content of the input JSON schema is as follows: 
            {result_schema}."""

completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
                    # Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
                    "min_pixels": 28 * 28 * 4,
                    # Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
                    "max_pixels": 28 * 28 * 8192
                },
                # Use the prompt specified for the task.
                {"type": "text", "text": prompt},
            ]
        }
    ])

print(completion.choices[0].message.content)

Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
  // If the environment variable is not set, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});
// Set the fields and format to be extracted.
const resultSchema = `{
          "Date": "",
          "Time": "",
          "Fuel Surcharge": ""
        }`;
// Concatenate the prompt.
const prompt = `Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if a value is a list, the schema will provide a template for each element. This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. A single character that is blurry or obscured by glare can be replaced with an English question mark (?). If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the results as required. The content of the input JSON schema is as follows: ${resultSchema}`;

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr',
    messages: [
      {
        role: 'user',
        content: [
           // You can customize the prompt. If not set, the default prompt is used.
          { type: 'text', text: prompt},
          {
            type: 'image_url',
            image_url: {
              url: 'http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg',
            },
              //  Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
              "min_pixels": 28 * 28 * 4,
              // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
              "max_pixels": 28 * 28 * 8192
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}

main();

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if a value is a list, the schema will provide a template for each element. This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. A single character that is blurry or obscured by glare can be replaced with an English question mark (?). If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the results as required. The content of the input JSON schema is as follows:{\"Date\": \"\",\"Time\": \"\",\"Fuel Surcharge\": \"\"}"}
            ]
        }
    ]
}'

Response example

{
  "choices": [
    {
      "message": {
        "content": "```json\n{\n    \"Date\": \"2013-06-29\",\n    \"Time\": \"null\",\n    \"Fuel Surcharge\": \"2.0\"\n}\n```",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 577,
    "completion_tokens": 42,
    "total_tokens": 619
  },
  "created": 1761036413,
  "system_fingerprint": null,
  "model": "qwen-vl-ocr",
  "id": "chatcmpl-7174548e-8993-46e8-bc5c-2f2096327e78"
}

Table parsing

The model parses table elements in an image and returns the recognition result as text in HTML format.
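Because the output is standard HTML table markup, you can post-process it with common tools. The following minimal sketch assumes that pandas and an HTML parser such as lxml are installed and that html_text holds the model's output:

from io import StringIO
import pandas as pd

# Placeholder for the HTML returned by the table parsing task.
html_text = "<table><tr><td>Item</td><td>Qty</td></tr><tr><td>Widget</td><td>2</td></tr></table>"
# read_html returns one DataFrame per <table> element in the input.
tables = pd.read_html(StringIO(html_text))
print(tables[0])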

Value of task: table_parsing

Specified prompt: In a safe, sandbox environment, you're tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image's layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin.

Output format and example:

  • Format: Text in HTML format

  • Example:

    image

The following sample code shows how to call the model by using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
                # Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192,
                # Whether to enable automatic image rotation correction.
                "enable_rotate": False}]
           }]
           
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr',
    messages=messages,
    # Set the built-in task to table parsing.
    ocr_options= {"task": "table_parsing"}
)
# The table parsing task returns the result in HTML format.
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
        // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TABLE_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When the task field in ocr_options is set to table parsing, the model uses the content of the following text field as the prompt. Custom prompts are not supported.
                        Collections.singletonMap("text", "In a safe, sandbox environment, you are tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin."))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
            "min_pixels": 401408,
            "max_pixels": 6422528,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "table_parsing"
    }
  }
}
'

Response example

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```html\n<table>\n  <tr>\n    <td>Case nameTest No.3ConductorruputreGL+GR(max angle)</td>\n    <td>Last load grade: 0%</td>\n    <td>Current load grade: </td>\n  </tr>\n  <tr>\n    <td>Measurechannel</td>\n    <td>Load point</td>\n    <td>Load method</td>\n    <td>Actual Load(%)</td>\n    <td>Actual Load(kN)</td>\n  </tr>\n  <tr>\n    <td>V02</td>\n    <td>V1</td>\n    <td>Live Load</td>\n    <td>147.95</td>\n    <td>0.815</td>\n  </tr>\n  <tr>\n    <td>V03</td>\n    <td>V2</td>\n    <td>Live Load</td>\n    <td>111.75</td>\n    <td>0.615</td>\n  </tr>\n  <tr>\n    <td>V04</td>\n    <td>V3</td>\n    <td>Live Load</td>\n    <td>9.74</td>\n    <td>1.007</td>\n  </tr>\n  <tr>\n    <td>V05</td>\n    <td>V4</td>\n    <td>Live Load</td>\n    <td>7.88</td>\n    <td>0.814</td>\n  </tr>\n  <tr>\n    <td>V06</td>\n    <td>V5</td>\n    <td>Live Load</td>\n    <td>8.11</td>\n    <td>0.780</td>\n  </tr>\n  <tr>\n    <td>V07</td>\n    <td>V6</td>\n    <td>Live Load</td>\n    <td>8.54</td>\n    <td>0.815</td>\n  </tr>\n  <tr>\n    <td>V08</td>\n    <td>V7</td>\n    <td>Live Load</td>\n    <td>6.77</td>\n    <td>0.700</td>\n  </tr>\n  <tr>\n    <td>V09</td>\n    <td>V8</td>\n    <td>Live Load</td>\n    <td>8.59</td>\n    <td>0.888</td>\n  </tr>\n  <tr>\n    <td>L01</td>\n    <td>L1</td>\n    <td>Live Load</td>\n    <td>13.33</td>\n    <td>3.089</td>\n  </tr>\n  <tr>\n    <td>L02</td>\n    <td>L2</td>\n    <td>Live Load</td>\n    <td>9.69</td>\n    <td>2.247</td>\n  </tr>\n  <tr>\n    <td>L03</td>\n    <td>L3</td>\n    <td></td>\n    <td>2.96</td>\n    <td>1.480</td>\n  </tr>\n  <tr>\n    <td>L04</td>\n    <td>L4</td>\n    <td></td>\n    <td>3.40</td>\n    <td>1.700</td>\n  </tr>\n  <tr>\n    <td>L05</td>\n    <td>L5</td>\n    <td></td>\n    <td>2.45</td>\n    <td>1.224</td>\n  </tr>\n  <tr>\n    <td>L06</td>\n    <td>L6</td>\n    <td></td>\n    <td>2.01</td>\n    <td>1.006</td>\n  </tr>\n  <tr>\n    <td>L07</td>\n    <td>L7</td>\n    <td></td>\n    <td>2.38</td>\n    <td>1.192</td>\n  </tr>\n  <tr>\n    <td>L08</td>\n    <td>L8</td>\n    <td></td>\n    <td>2.10</td>\n    <td>1.050</td>\n  </tr>\n  <tr>\n    <td>T01</td>\n    <td>T1</td>\n    <td>Live Load</td>\n    <td>25.29</td>\n    <td>3.073</td>\n  </tr>\n  <tr>\n    <td>T02</td>\n    <td>T2</td>\n    <td>Live Load</td>\n    <td>27.39</td>\n    <td>3.327</td>\n  </tr>\n  <tr>\n    <td>T03</td>\n    <td>T3</td>\n    <td>Live Load</td>\n    <td>8.03</td>\n    <td>2.543</td>\n  </tr>\n  <tr>\n    <td>T04</td>\n    <td>T4</td>\n    <td>Live Load</td>\n    <td>11.19</td>\n    <td>3.542</td>\n  </tr>\n  <tr>\n    <td>T05</td>\n    <td>T5</td>\n    <td>Live Load</td>\n    <td>11.34</td>\n    <td>3.592</td>\n  </tr>\n  <tr>\n    <td>T06</td>\n    <td>T6</td>\n    <td>Live Load</td>\n    <td>16.47</td>\n    <td>5.217</td>\n  </tr>\n  <tr>\n    <td>T07</td>\n    <td>T7</td>\n    <td>Live Load</td>\n    <td>11.05</td>\n    <td>3.498</td>\n  </tr>\n  <tr>\n    <td>T08</td>\n    <td>T8</td>\n    <td>Live Load</td>\n    <td>8.66</td>\n    <td>2.743</td>\n  </tr>\n  <tr>\n    <td>T09</td>\n    <td>WT1</td>\n    <td>Live Load</td>\n    <td>36.56</td>\n    <td>2.365</td>\n  </tr>\n  <tr>\n    <td>T10</td>\n    <td>WT2</td>\n    <td>Live Load</td>\n    <td>24.55</td>\n    <td>2.853</td>\n  </tr>\n  <tr>\n    <td>T11</td>\n    <td>WT3</td>\n    <td>Live Load</td>\n    <td>38.06</td>\n    <td>4.784</td>\n  </tr>\n  <tr>\n    <td>T12</td>\n    <td>WT4</td>\n    <td>Live Load</td>\n    <td>37.70</td>\n    <td>5.030</td>\n  </tr>\n  <tr>\n    
<td>T13</td>\n    <td>WT5</td>\n    <td>Live Load</td>\n    <td>30.48</td>\n    <td>4.524</td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </```"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 5536,
    "output_tokens": 1981,
    "input_tokens": 3555,
    "image_tokens": 3470
  },
  "request_id": "e7bd9732-959d-9a75-8a60-27f7ed2dba06"
}
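
The HTML in the text field can be post-processed with any standard parser. The following is a minimal sketch, not part of the official samples: it assumes the model output has already been read into html_text with the surrounding ```html fence markers removed, and uses Python's built-in html.parser to flatten the table into a list of rows.

from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collects the cell text of each <tr>/<td> in the returned HTML table."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag == "td" and self._row is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# html_text is assumed to hold the model output (shortened here).
html_text = "<table><tr><td>V02</td><td>V1</td><td>Live Load</td><td>147.95</td><td>0.815</td></tr></table>"
parser = TableRows()
parser.feed(html_text)
print(parser.rows)  # [['V02', 'V1', 'Live Load', '147.95', '0.815']]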

Document parsing

The model supports parsing scanned documents or PDF documents stored as images. It can identify elements such as titles, summaries, and labels, and returns the recognition result as text in LaTeX format.

Value of task: document_parsing

Specified prompt: In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin.

Output format and example:

  • Format: Text in LaTeX format

  • Example: image

The following is sample code for making calls using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
                # Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr',
    messages=messages,
    # Set the built-in task to document parsing.
    ocr_options= {"task": "document_parsing"}
)
# The document parsing task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
        // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.DOCUMENT_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===


curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"\
  --header 'Content-Type: application/json'\
  --data '{
"model": "qwen-vl-ocr",
"input": {
  "messages": [{
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [{
          "type": "image",
          "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
          "min_pixels": 401408,
          "max_pixels": 6422528,
          "enable_rotate": false
        }
      ]
    }
  ]
},
"parameters": {
  "ocr_options": {
    "task": "document_parsing"
  }
}
}
'

Response example

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "```latex\n\\documentclass{article}\n\n\\title{Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution}\n\\author{Peng Wang* Shuai Bai* Sinan Tan* Shijie Wang* Zhihao Fan* Jinze Bai$^\\dagger$\\\\ Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Yang Fan Kai Dang Mengfei Du Xuancheng Ren Rui Men Dayiheng Liu Chang Zhou Jingren Zhou Junyang Lin$^\\dagger$\\\\ Qwen Team Alibaba Group}\n\\date{}\n\n\\begin{document}\n\n\\maketitle\n\n\\section{Abstract}\n\nWe present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.\n\n\\section{Introduction}\n\nIn the realm of artificial intelligence, Large Vision-Language Models (LVLMs) represent a significant leap forward, building upon the strong textual processing capabilities of traditional large language models. These advanced models now encompass the ability to interpret and analyze a broader spectrum of data, including images, audio, and video. This expansion of capabilities has transformed LVLMs into indispensable tools for tackling a variety of real-world challenges. Recognized for their unique capacity to condense extensive and intricate knowledge into functional representations, LVLMs are paving the way for more comprehensive cognitive systems. By integrating diverse data forms, LVLMs aim to more closely mimic the nuanced ways in which humans perceive and interact with their environment. This allows these models to provide a more accurate representation of how we engage with and perceive our environment.\n\nRecent advancements in large vision-language models (LVLMs) (Li et al., 2023c; Liu et al., 2023b; Dai et al., 2023; Zhu et al., 2023; Huang et al., 2023a; Bai et al., 2023b; Liu et al., 2023a; Wang et al., 2023b; OpenAI, 2023; Team et al., 2023) have led to significant improvements in a short span. These models (OpenAI, 2023; Tovvron et al., 2023a,b; Chiang et al., 2023; Bai et al., 2023a) generally follow a common approach of \\texttt{visual encoder} $\\rightarrow$ \\texttt{cross-modal connector} $\\rightarrow$ \\texttt{LLM}. This setup, combined with next-token prediction as the primary training method and the availability of high-quality datasets (Liu et al., 2023a; Zhang et al., 2023; Chen et al., 2023b;\n\n```"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "total_tokens": 4261,
        "output_tokens": 845,
        "input_tokens": 3416,
        "image_tokens": 3350
    },
    "request_id": "7498b999-939e-9cf6-9dd3-9a7d2c6355e4"
}
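
The document_parsing result is returned inside a ```latex code fence. The snippet below is a minimal sketch, not part of the official samples: it assumes the text field has been read into latex_text and strips the fence markers before writing the content to a .tex file.

import re

# latex_text is assumed to hold the model output (shortened here).
latex_text = "```latex\n\\documentclass{article}\n\\begin{document}\nHello\n\\end{document}\n```"

# Remove the leading ```latex marker and the trailing ``` marker, if present.
body = re.sub(r"^```latex\s*|\s*```$", "", latex_text.strip())

with open("parsed_document.tex", "w", encoding="utf-8") as f:
    f.write(body)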

Formula recognition

The model supports parsing formulas in images and returns the recognition result as text in LaTeX format.

Value of task: formula_recognition

Specified prompt: Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions.

Output format and example:

  • Format: Text in LaTeX format

  • Example: image

The following is sample code for making calls using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
                # Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192,
                # Enable automatic image rotation correction.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr',
    messages=messages,
    # Set the built-in task to formula recognition.
    ocr_options= {"task": "formula_recognition"}
)
# The formula recognition task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
        // Maximum pixel threshold for the input image. If the image is larger, it will be scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // Minimum pixel threshold for the input image. If the image is smaller, it will be scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        // Enable automatic image rotation correction.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.FORMULA_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr",
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
            "min_pixels": 401408,
            "max_pixels": 6422528,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "formula_recognition"
    }
  }
}
'

Response example

{
  "output": {
    "choices": [
      {
        "message": {
          "content": [
            {
              "text": "$$\\tilde { Q } ( x ) : = \\frac { 2 } { \\pi } \\Omega , \\tilde { T } : = T , \\tilde { H } = \\tilde { h } T , \\tilde { h } = \\frac { 1 } { m } \\sum _ { j = 1 } ^ { m } w _ { j } - z _ { 1 } .$$"
            }
          ],
          "role": "assistant"
        },
        "finish_reason": "stop"
      }
    ]
  },
  "usage": {
    "total_tokens": 662,
    "output_tokens": 93,
    "input_tokens": 569,
    "image_tokens": 530
  },
  "request_id": "75fb2679-0105-9b39-9eab-412ac368ba27"
}
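
The formula_recognition result is returned between $$ delimiters, with spaces between LaTeX tokens. The following is a minimal sketch, not part of the official samples: it assumes the text field has been read into raw_formula and shows one way to strip the delimiters and normalize the spacing before embedding the formula in a LaTeX document.

import re

# raw_formula is assumed to hold the model output (shortened here).
raw_formula = "$$\\tilde { Q } ( x ) : = \\frac { 2 } { \\pi } \\Omega$$"

# Drop the surrounding $$ delimiters and collapse repeated whitespace.
formula = re.sub(r"\s+", " ", raw_formula.strip().strip("$")).strip()

print("\\begin{equation}\n" + formula + "\n\\end{equation}")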

General text recognition

General text recognition is mainly used for Chinese and English scenarios and returns recognition results in plain text format.

Value of task: text_recognition

Specified prompt: Please output only the text content from the image without any additional descriptions or formatting.

Output format and example:

  • Format: Plain text

  • Example: "Target Audience\n\nIf you are..."

The following are code examples for making calls using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following URL is for the Singapore region. To use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
                # The minimum pixel threshold for the input image. If an image's total pixel count is below this value, the image is scaled up proportionally until it exceeds min_pixels.
                "min_pixels": 28 * 28 * 4,
                # The maximum pixel threshold for the input image. If an image's total pixel count exceeds this value, the image is scaled down proportionally until it is below max_pixels.
                "max_pixels": 28 * 28 * 8192,
                # Enable the automatic image rotation feature.
                "enable_rotate": False}]
        }]
        
response = dashscope.MultiModalConversation.call(
    # API keys differ between the Singapore and Beijing regions. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr',
    messages=messages,
    # Set the built-in task to text recognition.
    ocr_options= {"task": "text_recognition"} 
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following URL is for the Singapore region. To use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
        // The maximum pixel threshold for the input image. If an image's total pixel count exceeds this value, the image is scaled down proportionally until it is below max_pixels.
        map.put("max_pixels", "6422528");
        // The minimum pixel threshold for the input image. If an image's total pixel count is below this value, the image is scaled up proportionally until it exceeds min_pixels.
        map.put("min_pixels", "3136");
        // Enable the automatic image rotation feature.
        map.put("enable_rotate", false);
        
        // Configure the built-in task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TEXT_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys differ between the Singapore and Beijing regions. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If the environment variable is not configured, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important Notes =======
# API keys differ between the Singapore and Beijing regions. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following URL is for the Singapore region. To use a model in the Beijing region, replace the URL with https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete these comments before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"\
  --header 'Content-Type: application/json'\
  --data '{
"model": "qwen-vl-ocr",
"input": {
  "messages": [
    {
      "role": "user",
      "content": [{
          "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
          "min_pixels": 3136,
          "max_pixels": 6422528,
          "enable_rotate": false
        }
      ]
    }
  ]
},
"parameters": {
  "ocr_options": {
      "task": "text_recognition"
    }
}
}'

Response example

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "Target Audience\nIf you are a system administrator for Linux, you will benefit greatly from learning to write shell scripts. This book does not detail the steps to install Linux. It assumes you have a running Linux system. You can then automate daily system administration tasks with shell scripts. This book shows you how. It demonstrates how to use shell scripts to automate tasks, from monitoring system statistics and data files to generating reports for your boss.\nIf you are a home Linux enthusiast, you can also benefit from this book. It is easy to get lost in today's complex graphical environments. Most desktop Linux distributions hide the system's internal workings from the average user. But sometimes you need to know what is happening under the hood. This book shows you how to open the Linux command line and what to do next. For simple tasks, such as file management, the command line is often more convenient than a graphical interface. The command line has many commands available, and this book shows you how to use them."
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 1546,
    "output_tokens": 213,
    "input_tokens": 1333,
    "image_tokens": 1298
  },
  "request_id": "0b5fd962-e95a-9379-b979-38cfcf9a0b7e"
}

Multilingual recognition

You can use multilingual recognition for scenarios that involve languages other than Chinese and English. Supported languages include Arabic, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Vietnamese. The recognition results are returned in plain text format.

Value of task: multi_lan

Specified prompt: Please output only the text content from the image without any additional descriptions or formatting.

Output format and example:

  • Format: Plain text

  • Example: "Привіт!, Hello!, Bonjour!"

The following code samples show how to make calls using the DashScope SDK and HTTP.

Python

import os
import dashscope
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
                # The minimum pixel threshold for the input image. If the image has fewer pixels, it is scaled up proportionally until its total pixel count exceeds min_pixels.
                "min_pixels": 28 * 28 * 4,
                # The maximum pixel threshold for the input image. If the image has more pixels, it is scaled down proportionally until its total pixel count is below max_pixels.
                "max_pixels": 28 * 28 * 8192,
                # Enable the automatic image orientation correction feature.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr',
    messages=messages,
    # Set the built-in task to multilingual recognition.
    ocr_options={"task": "multi_lan"}
)
# The multilingual recognition task returns results in plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
        // The maximum pixel threshold for the input image. If the image has more pixels, it is scaled down proportionally until its total pixel count is below max_pixels.
        map.put("max_pixels", "6422528");
        // The minimum pixel threshold for the input image. If the image has fewer pixels, it is scaled up proportionally until its total pixel count exceeds min_pixels.
        map.put("min_pixels", "3136");
        // Enable the automatic image rotation feature.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.MULTI_LAN)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, visit https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# The following URL is for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr",
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
            "min_pixels": 401408,
            "max_pixels": 6422528,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "multi_lan"
    }
  }
}
'

Response example

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "INTERNATIONAL\nMOTHER LANGUAGE\nDAY\nПривіт!\nHello!\nMerhaba!\nBonjour!\nCiao!\nHello!\nOla!\nSalam!\nבר מולדת!"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 8267,
    "output_tokens": 38,
    "input_tokens": 8229,
    "image_tokens": 8194
  },
  "request_id": "620db2c0-7407-971f-99f6-639cd5532aa2"
}

Streaming output

A large model generates its result in parts. With streaming output, these parts are returned as they are generated instead of all at once after generation finishes. Use streaming output for requests that might take a long time, to prevent timeouts.

OpenAI compatible

To enable streaming output, set the stream parameter to true in your code.

Python

import os
from openai import OpenAI

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""

client = OpenAI(
    # API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
                    "min_pixels": 28 * 28 * 4,
                    # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
                    "max_pixels": 28 * 28 * 8192
                },
                  # The qwen-vl-ocr model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                {"type": "text","text": PROMPT_TICKET_EXTRACTION}
            ]
        }
    ],
    stream=True,
    stream_options={"include_usage": True}
)

full_content = ""
print("Streaming output content:")
for chunk in completion:
    # If stream_options.include_usage is True, the choices field of the last chunk is an empty list and must be skipped. You can get the token usage from chunk.usage.
    if chunk.choices and chunk.choices[0].delta.content:
        full_content += chunk.choices[0].delta.content
        print(chunk.choices[0].delta.content)
print(f"Full content: {full_content}")

Node.js

import OpenAI from 'openai';


// Define the prompt for extracting ticket information.
const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
`;

const openai = new OpenAI({
  // API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
  // If you have not configured an environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr',
    messages: [
      {
        role: 'user',
        content: [
          // The qwen-vl-ocr model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
          { type: 'text', text: PROMPT_TICKET_EXTRACTION},
          {
            type: 'image_url',
            image_url: {
              url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
            },
              // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
              "min_pixels": 28 * 28 * 4,
              // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
              "max_pixels": 28 * 28 * 8192
          }
        ]
      }
    ],
    stream: true,
    stream_options:{"include_usage": true}
  });
  let fullContent = ""
  console.log("Streaming output content:")
  for await (const chunk of response) {
    if (chunk.choices[0] && chunk.choices[0].delta.content != null) {
      fullContent += chunk.choices[0].delta.content;
      console.log(chunk.choices[0].delta.content);
    }
  }
  console.log(`Full output content: ${fullContent}`)
}

main();

curl

# ======= Important =======
# API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===


curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'train_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\': \'xxx\', \'seat_number\': \'xxx\', \'class_type\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"}
            ]
        }
    ],
    "stream": true,
    "stream_options": {"include_usage": true}
}'

Response example

data: {"choices":[{"delta":{"content":"","role":"assistant"},"index":0,"logprobs":null,"finish_reason":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}

data: {"choices":[{"finish_reason":null,"delta":{"content":"```json\n{\n    \"invoice_number"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}
......
data: {"choices":[{"delta":{"content":" \"Du Xiaoguang\""},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}

data: {"choices":[{"finish_reason":"stop","delta":{"content":"\n}\n```"},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}

data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":606,"completion_tokens":159,"total_tokens":765},"created":1742528579,"system_fingerprint":null,"model":"qwen-vl-ocr-latest","id":"chatcmpl-e2da5fdf-7658-9379-8a59-a3f547ee811f"}
data: [DONE]
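
The ticket fields in this example are returned inside a ```json code fence. The following is a minimal sketch, not part of the official samples: it assumes the streamed fragments have already been concatenated into full_content (as in the code above) and parses the payload into a Python dict. The field values shown here are placeholders.

import json
import re

# full_content is assumed to hold the assembled streamed text (shortened, placeholder values).
full_content = '```json\n{\n    "invoice_number": "xxx",\n    "passenger_name": "Du Xiaoguang"\n}\n```'

# Strip the ```json fence markers and parse the remaining JSON object.
payload = re.sub(r"^```json\s*|\s*```$", "", full_content.strip())
ticket = json.loads(payload)
print(ticket["passenger_name"])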

DashScope

Enable streaming output by setting the corresponding parameters for different call methods:

  • Python SDK: Set the stream parameter to True.

  • Java SDK: Call the streamCall interface.

  • HTTP: In the header, set X-DashScope-SSE to enable.

By default, streaming output is non-incremental, which means each response includes all previously generated content. To use incremental streaming output, set the incremental_output parameter (incrementalOutput for Java) to true.
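
The difference matters when you assemble the final text: with incremental output you concatenate the fragments, whereas in the default non-incremental mode each chunk already repeats everything generated so far, so the last chunk is the complete result. The snippet below is a minimal sketch with made-up fragments to illustrate the two accumulation strategies; it is not part of the official samples.

# Incremental output (incremental_output=True): each chunk carries only new text.
incremental_chunks = ["```json\n{\n    \"invoice_number\"", ": \"xxx\"\n}\n```"]
full_text = "".join(incremental_chunks)

# Default (non-incremental) output: each chunk repeats all text generated so far,
# so only the last chunk is needed.
cumulative_chunks = ["```json", "```json\n{", "```json\n{\n    \"invoice_number\": \"xxx\"\n}\n```"]
full_text = cumulative_chunks[-1]

print(full_text)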

Python

import os
import dashscope

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image.
Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?).
Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}
"""

# The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192},
            # When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
            {
                "type": "text",
                "text": PROMPT_TICKET_EXTRACTION,
            },
        ],
    }
]
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
    # If you have not configured an environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr",
    messages=messages,
    stream=True,
    incremental_output=True,
)
full_content = ""
print("Streaming output content:")
for chunk in response:
    try:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
        full_content += chunk["output"]["choices"][0]["message"].content[0]["text"]
    except (IndexError, KeyError, TypeError):
        # Skip chunks that contain no text content.
        pass
print(f"Full content: {full_content}")

Java

import java.util.*;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When no built-in task is set for qwen-vl-ocr, you can pass a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'class_type': 'xxx','ticket_price':'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
                // If you have not configured an environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .incrementalOutput(true)
                .build();
        Flowable<MultiModalConversationResult> result = conv.streamCall(param);
        result.blockingForEach(item -> {
            try {
                List<Map<String, Object>> contentList = item.getOutput().getChoices().get(0).getMessage().getContent();
                if (!contentList.isEmpty()){
                    System.out.println(contentList.get(0).get("text"));
                }
            } catch (Exception e){
                System.exit(0);
            }
        });
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# API keys for the Singapore and China (Beijing) regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/latest/get-an-api-key
# The following is the URL for the Singapore region. If you use a model in the China (Beijing) region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
--data '{
    "model": "qwen-vl-ocr",
    "input":{
        "messages":[
          {
            "role": "user",
            "content": [
                {
                    "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                    "min_pixels": 3136,
                    "max_pixels": 6422528
                },
                {"type": "text", "text": "Please extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, class type, ticket price, ID card number, and passenger name from the train ticket image. Accurately extract the key information above. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {\'invoice_number\': \'xxx\', \'train_number\': \'xxx\', \'departure_station\': \'xxx\', \'arrival_station\': \'xxx\', \'departure_date_and_time\': \'xxx\', \'seat_number\': \'xxx\', \'class_type\': \'xxx\',\'ticket_price\':\'xxx\', \'id_card_number\': \'xxx\', \'passenger_name\': \'xxx\'}"}
            ]
          }
        ]
    },
    "parameters": {
        "incremental_output": true
    }
}'

Response example

id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"```json\n{\n    \"invoice_number\": \"24"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":618,"output_tokens":12,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}

id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"3291"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":622,"output_tokens":16,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}
......
id:33
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"Xiaoguang\"\n}"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"total_tokens":764,"output_tokens":158,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}

id:34
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"\n```"}],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"total_tokens":765,"output_tokens":159,"input_tokens":606,"image_tokens":427},"request_id":"8e553c5c-c0db-9cb5-8900-1fc452cafea7"}

Upload local files (Base64 encoding or file path)

The model supports two methods for uploading local files:

  • Direct upload using a file path (more stable transfer, recommended)

  • Upload using Base64 encoding

Upload using a file path

You can pass the local file path directly to the model. This method is supported only by the DashScope Python and Java SDKs. It is not supported for DashScope HTTP or OpenAI-compatible calls.

Use the following table to specify the file path based on your programming language and operating system.

Specify the file path (image example)

  • Linux or macOS

    • Python SDK: file://{absolute_path_of_the_file}, for example file:///home/images/test.png

    • Java SDK: file://{absolute_path_of_the_file}, for example file:///home/images/test.png

  • Windows

    • Python SDK: file://{absolute_path_of_the_file}, for example file://D:/images/test.png

    • Java SDK: file:///{absolute_path_of_the_file}, for example file:///D:/images/test.png

Upload using Base64 encoding

You can convert the file to a Base64-encoded string and pass the string to the model. This method works with OpenAI-compatible calls, DashScope SDK calls, and DashScope HTTP calls.

Steps to pass a Base64-encoded string

  1. Encode the file: Convert the local image to a Base64-encoded string.

    Example code to convert an image to Base64 encoding

    import base64

    # Encoding function: Converts a local file to a Base64-encoded string
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")

    # Replace xxx/eagle.png with the absolute path of your local image
    base64_image = encode_image("xxx/eagle.png")
  2. Build a Data URL: Use the following format: data:[MIME_type];base64,{base64_image}.

    1. Replace MIME_type with the actual media type. Ensure that it matches the MIME Type value in the Image limits table, such as image/jpeg or image/png.

    2. base64_image is the Base64 string generated in the previous step.

  3. Call the model: You can pass the Data URL in the image or image_url parameter.
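
    A minimal sketch that combines steps 2 and 3, assuming the encode_image helper from step 1 and a PNG image; the message structure follows the OpenAI-compatible examples later in this topic:

    # Step 1: encode the local image (encode_image is defined in the example above).
    base64_image = encode_image("xxx/eagle.png")
    # Step 2: build the Data URL; the MIME type must match the actual image format.
    data_url = f"data:image/png;base64,{base64_image}"
    # Step 3: pass the Data URL where an image is expected, for example in the image_url field.
    content = [
        {"type": "image_url", "image_url": {"url": data_url}},
        {"type": "text", "text": "Please output only the text content from the image without any additional descriptions or formatting."},
    ]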

Limits

  • Uploading using a file path is recommended for higher stability. You can also use Base64 encoding for files smaller than 1 MB.

  • When passing a file path directly, each image must be smaller than 10 MB.

  • When passing a file using Base64 encoding, the encoded string (not the original file) must be smaller than 10 MB. Because Base64 encoding increases the data size, choose a correspondingly smaller source image.

For more information about how to compress a file, see How do I compress an image to the required size?

Pass a file path

Passing a file path is supported only for calls made using the DashScope Python and Java SDKs. It is not supported for DashScope HTTP or OpenAI-compatible calls.

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace xxx/test.jpg with the absolute path of your local image.
local_path = "xxx/test.jpg"
image_path = f"file://{local_path}"
messages = [
    {
        "role": "user",
        "content": [
            {
                "image": image_path,
                # The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192,
            },
            # If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
            {
                "text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"
            },
        ],
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr",
    messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", filePath);
        // The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        Collections.singletonMap("text", "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .topP(0.001)
                .temperature(0.1f)
                .maxLength(8192)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.jpg with the absolute path of your local image.
            simpleMultiModalConversationCall("xxx/test.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Pass using Base64 encoding

OpenAI compatible

Python

from openai import OpenAI
import os
import base64

# Read a local file and encode it in Base64 format.
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/test.jpg with the absolute path of your local image.
base64_image = encode_image("xxx/test.jpg")

client = OpenAI(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Note: When passing a Base64-encoded image, the image format (image/{format}) must match the MIME Type in the supported formats table. The "f" prefix denotes a Python f-string.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    # The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
                    "min_pixels": 28 * 28 * 4,
                    # The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
                    "max_pixels": 28 * 28 * 8192
                },
                 # If you use qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                {"type": "text", "text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"},

            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import {
  readFileSync
} from 'fs';


const openai = new OpenAI({
  // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});
// Read a local file and encode it in Base64 format.
const encodeImage = (imagePath) => {
  const imageFile = readFileSync(imagePath);
  return imageFile.toString('base64');
};
// Replace xxx/test.jpg with the absolute path of your local image.
const base64Image = encodeImage("xxx/test.jpg")
async function main() {
  const completion = await openai.chat.completions.create({
    model: "qwen-vl-ocr",
    messages: [{
      "role": "user",
      "content": [{
          "type": "image_url",
          "image_url": {
            // Note: When passing a Base64-encoded image, the image format (image/{format}) must match the Content Type in the list of supported images.
            // PNG image:  data:image/png;base64,${base64Image}
            // JPEG image: data:image/jpeg;base64,${base64Image}
            // WEBP image: data:image/webp;base64,${base64Image}
            "url": `data:image/jpeg;base64,${base64Image}`
          },
          // The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
          "min_pixels": 28 * 28 * 4,
          // The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
          "max_pixels": 28 * 28 * 8192
        },
        // If you use qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
        {
          "type": "text",
          "text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"
        }
      ]
    }]
  });
  console.log(completion.choices[0].message.content);
}

main();

curl

  • For a method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64-encoded string in the code, "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-ocr-latest",
  "messages": [
  {"role":"system",
  "content":[
    {"type": "text", "text": "You are a helpful assistant."}]},
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
      {"type": "text", "text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"}
    ]
  }]
}'

DashScope

Python

import os
import base64
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Base64 encoding format
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Replace xxx/test.jpg with the absolute path of your local image.
base64_image = encode_image("xxx/test.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {
                # Note: When passing a Base64-encoded image, the image format (image/{format}) must match the MIME Type in the supported formats table. The "f" prefix denotes a Python f-string.
                # PNG image:  f"data:image/png;base64,{base64_image}"
                # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                # WEBP image: f"data:image/webp;base64,{base64_image}"
                "image":  f"data:image/jpeg;base64,{base64_image}",
                # The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
                "min_pixels": 28 * 28 * 4,
                # The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
                "max_pixels": 28 * 28 * 8192,
            },
            # If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
            {
                "text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"
            },
        ],
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr",
    messages=messages,
)

print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    // Base64 encoding format
    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }
    public static void simpleMultiModalConversationCall(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image = encodeImageToBase64(localPath); // Base64 encoding

        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "data:image/jpeg;base64," + base64Image);
        // The maximum pixel threshold for the input image. If the image is larger, it is scaled down proportionally until its total pixels are below max_pixels.
        map.put("max_pixels", "6422528");
        // The minimum pixel threshold for the input image. If the image is smaller, it is scaled up proportionally until its total pixels exceed min_pixels.
        map.put("min_pixels", "3136");
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // If no built-in task is set for qwen-vl-ocr, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        Collections.singletonMap("text", "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr")
                .message(userMessage)
                .topP(0.001)
                .temperature(0.1f)
                .maxLength(8192)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.jpg with the absolute path of your local image.
            simpleMultiModalConversationCall("xxx/test.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For a method to convert a file to a Base64-encoded string, see the example code.

  • For demonstration purposes, the Base64-encoded string in the code, "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA...", is truncated. In actual use, pass the complete encoded string.

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-ocr-latest",
    "input":{
        "messages":[
            {"role": "system",
	     "content": [
	       {"text": "You are a helpful assistant."}]},
            {
             "role": "user",
             "content": [
               {"image": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "Please extract the following information from the train ticket image: invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID card number, and passenger name. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurred or obscured by strong light with an English question mark (?). Return the data in JSON format as follows: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'arrival_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_class': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"}
                ]
            }
        ]
    }
}'

Limits

Image limits

  • File size: The size of a single image file, or the encoded string if you use Base64 encoding, cannot exceed 10 MB. For more information, see local files.

  • Dimensions and aspect ratio: The width and height of the image must be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.

  • Total pixels: The model automatically scales input images, so there is no strict limit on the original image's total pixel count. By default, an image is processed at no more than 15.68 million pixels. If you need more detail than this allows, you can raise the max_pixels parameter to a maximum of 23.52 million pixels. This adjustment increases token consumption and processing time.

  • Supported formats:

    Image format    Common extensions     MIME Type
    BMP             .bmp                  image/bmp
    JPEG            .jpe, .jpeg, .jpg     image/jpeg
    PNG             .png                  image/png
    TIFF            .tif, .tiff           image/tiff
    WEBP            .webp                 image/webp
    HEIC            .heic                 image/heic
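
The sketch below is a minimal pre-flight check against the limits above, assuming the Pillow library is installed; the thresholds mirror the default values in this section and the helper name is illustrative.

import os
from PIL import Image

def check_image(path, max_file_mb=10, min_side=10, max_ratio=200, default_max_pixels=15_680_000):
    # File size: a single image (or its Base64-encoded string) must be under 10 MB.
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > max_file_mb:
        raise ValueError(f"File is {size_mb:.1f} MB; it must be smaller than {max_file_mb} MB.")
    with Image.open(path) as img:
        width, height = img.size
    # Dimensions: width and height must both be greater than 10 pixels.
    if width <= min_side or height <= min_side:
        raise ValueError("Width and height must both be greater than 10 pixels.")
    # Aspect ratio: must not exceed 200:1 or 1:200.
    if max(width / height, height / width) > max_ratio:
        raise ValueError("Aspect ratio must not exceed 200:1 or 1:200.")
    # Total pixels: above 15.68 million pixels the image is scaled down unless max_pixels is raised.
    if width * height > default_max_pixels:
        print("Image exceeds 15.68 million pixels; it will be scaled down unless you raise max_pixels.")

check_image("xxx/test.jpg")  # Replace with the absolute path of your local image.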

Model limits

  • No multi-turn conversation: The model does not support multi-turn conversations and only answers the most recent question.

  • Hallucination risk: The model may hallucinate if the text in an image is too small or the resolution is low. The accuracy of answers to questions not related to text extraction is not guaranteed.

  • Cannot process document files directly: The model accepts only image input.

    • For files whose content consists of images, such as scanned documents, convert them into an image sequence before processing. For more information, see the recommendations in Going live.

    • For files with plain text or structured data, use a model that can parse long text, such as Qwen-Long.

Billing and rate limiting

  • Billing: Qwen-OCR is a visual understanding model, and its total cost is calculated based on the number of input and output tokens: (Number of input tokens × Unit price for input) + (Number of output tokens × Unit price for output); see the sketch after this list. Each 28×28 pixel block corresponds to one token, and an image costs a minimum of four tokens. You can view bills or add funds on the Expenses and Costs page in the Alibaba Cloud Management Console.

  • Rate limiting: For the rate limits of the Qwen-OCR model, see Rate limits.

  • Free quota (Singapore region only): The 90-day validity period starts on the date you activate Alibaba Cloud Model Studio or your model request is approved. Within this period, the Qwen-OCR model provides a free quota of 1 million tokens.
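
As a rough illustration of the billing formula, the following sketch computes the cost of a single request from the usage fields returned in the API response. The unit prices are placeholders; replace them with the prices listed for your model and region.

# Placeholder unit prices in USD per million tokens; replace with the prices for your region.
INPUT_PRICE_PER_MILLION = 0.72
OUTPUT_PRICE_PER_MILLION = 0.72

def request_cost(usage):
    # usage is the "usage" object from an API response, for example {"input_tokens": 606, "output_tokens": 159}.
    return (usage["input_tokens"] * INPUT_PRICE_PER_MILLION
            + usage["output_tokens"] * OUTPUT_PRICE_PER_MILLION) / 1_000_000

# Token counts taken from the streaming response example above.
print(request_cost({"input_tokens": 606, "output_tokens": 159}))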

Going live

  • Processing multi-page documents, such as PDF files (see the sketch after this list):

    1. Split: Use an image processing library, such as pdf2image for Python, to convert each page of the PDF file into a separate, high-quality image.

    2. Submit requests: Send API requests to process the images.

    3. Merge: On the client, merge the recognition results for each page in the correct order.

  • Image pre-processing:

    • Ensure that input images are clear, evenly lit, and not excessively compressed:

      • To prevent information loss, use a lossless format, such as PNG, to store and transfer images.

      • To improve image definition, use a noise reduction algorithm, such as mean or median filtering, to reduce image noise.

      • For images with uneven lighting, use an algorithm such as adaptive histogram equalization to adjust brightness and contrast.

    • For skewed images: Set the enable_rotate parameter to true in the DashScope SDK to significantly improve recognition performance.

    • For images that are too small or too large: You can use the min_pixels and max_pixels parameters to control scaling behavior before processing.

      • min_pixels: Ensures that small images are enlarged to recognize details. The default value is sufficient for most scenarios.

      • max_pixels: Prevents oversized images from consuming excessive resources. The default value is sufficient for most scenarios. If some small text is not recognized clearly, you can increase the max_pixels value. Note that this increases token consumption.

  • Result verification: Recognition results from the model may contain errors. For critical business operations, you can implement a manual review step or add validation rules to verify the accuracy of the model's output. For example, you can use format checks for ID card numbers and bank card numbers.

  • Batch calls: For large-scale, non-real-time scenarios, you can use the Batch API to process batch tasks asynchronously at a lower cost.
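
The following is a minimal sketch of the split, submit, and merge steps for a multi-page PDF. It assumes the pdf2image library (which requires Poppler), the DashScope Python SDK configured as in the examples above, and an illustrative file path and prompt.

import os
import tempfile
import dashscope
from pdf2image import convert_from_path

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

def ocr_pdf(pdf_path):
    results = []
    # 1. Split: render each PDF page as a separate, high-quality image.
    pages = convert_from_path(pdf_path, dpi=200)
    with tempfile.TemporaryDirectory() as tmp_dir:
        for index, page in enumerate(pages):
            image_path = os.path.join(tmp_dir, f"page_{index}.png")
            page.save(image_path, "PNG")
            # 2. Submit: send one request per page image (file path upload requires the Python or Java SDK).
            response = dashscope.MultiModalConversation.call(
                api_key=os.getenv("DASHSCOPE_API_KEY"),
                model="qwen-vl-ocr",
                messages=[{
                    "role": "user",
                    "content": [
                        {"image": f"file://{image_path}"},
                        {"text": "Please output only the text content from the image without any additional descriptions or formatting."},
                    ],
                }],
            )
            results.append(response["output"]["choices"][0]["message"].content[0]["text"])
    # 3. Merge: join the per-page results in page order.
    return "\n\n".join(results)

# print(ocr_pdf("xxx/document.pdf"))  # Replace with the absolute path of your PDF file.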

FAQ

After the model outputs text localization results, how do I draw the detection boxes on the original image?

You can refer to the draw_bbox.py code to draw the detection boxes and their labels on the original image.

API reference

For more information about the request and response parameters of Qwen-OCR, see Qwen.

Error codes

If a call fails, see Error messages for troubleshooting.