
Alibaba Cloud Model Studio: Text extraction (Qwen-OCR)

Updated: Nov 29, 2025

Qwen-OCR is a visual understanding model designed for text extraction. It extracts text and parses structured data from a variety of images, such as scanned documents, tables, and receipts. The model supports multiple languages and can perform advanced functions, including information extraction, table parsing, and formula recognition, through specific task instructions.

You can try Qwen-OCR online in the Playground (Singapore or Beijing).

Examples

Recognize multiple languages

Input image: (image omitted)

Recognition result:

INTERNATIONAL
MOTHER LANGUAGE
DAY
Привет!
你好!
Bonjour!
Merhaba!
Ciao!
Hello!
Ola!
בר מולד
Salam!

Recognize tilted images

Input image: (image omitted)

Recognition result:

Product Introduction
Fiber filament imported from South Korea.
6941990612023
Item No.: 2023

Locate text positions

Input image: (image omitted)

High-precision recognition supports text localization.

Localization visualization: (image omitted)

For more information, see the FAQ on how to draw the bounding box of each text line on the original image.
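As a companion to that FAQ, here is a minimal sketch (assuming Pillow is installed and that `words_info` follows the format returned by the high-precision recognition task described below; the file paths are placeholders) that draws each location quadrilateral on the original image:

from PIL import Image, ImageDraw

# A sample entry as returned in ocr_result.words_info by the advanced_recognition task.
words_info = [
    {"location": [52, 54, 250, 57, 249, 106, 52, 103], "text": "Audience"},
]

image = Image.open("original.png")  # Placeholder path to the original image.
draw = ImageDraw.Draw(image)
for word in words_info:
    xy = word["location"]
    # location holds four corner points in fixed order:
    # top-left, top-right, bottom-right, bottom-left.
    points = list(zip(xy[0::2], xy[1::2]))
    # Close the quadrilateral by appending the first point again.
    draw.line(points + [points[0]], fill="red", width=2)
image.save("annotated.png")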

Models and pricing

International (Singapore)

| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input price (per million tokens) | Output price (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-ocr | Stable | 34,096 | 30,000 (up to 30,000 per image) | 4,096 | $0.72 | $0.72 | 1 million tokens each, valid for 90 days after activation |
| qwen-vl-ocr-2025-11-20 (also qwen-vl-ocr-1120). Based on Qwen3-VL; significantly improves document parsing and text localization. | Snapshot | 38,192 | 30,000 (up to 30,000 per image) | 8,192 | $0.07 | $0.16 | 1 million tokens each, valid for 90 days after activation |

Mainland China (Beijing)

| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input price (per million tokens) | Output price (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen-vl-ocr (currently has the same capabilities as qwen-vl-ocr-2025-08-28) | Stable | 34,096 | 30,000 (up to 30,000 per image) | 4,096 | $0.717 | $0.717 | No free quota |
| qwen-vl-ocr-latest (always has the same capabilities as the latest snapshot) | Latest | 38,192 | 30,000 (up to 30,000 per image) | 8,192 | $0.043 | $0.072 | No free quota |
| qwen-vl-ocr-2025-11-20 (also qwen-vl-ocr-1120). Based on Qwen3-VL; significantly improves document parsing and text localization. | Snapshot | 38,192 | 30,000 (up to 30,000 per image) | 8,192 | $0.043 | $0.072 | No free quota |
| qwen-vl-ocr-2025-08-28 (also qwen-vl-ocr-0828) | Snapshot | 34,096 | 30,000 (up to 30,000 per image) | 4,096 | $0.717 | $0.717 | No free quota |
| qwen-vl-ocr-2025-04-13 (also qwen-vl-ocr-0413) | Snapshot | 34,096 | 30,000 (up to 30,000 per image) | 4,096 | $0.717 | $0.717 | No free quota |
| qwen-vl-ocr-2024-10-28 (also qwen-vl-ocr-1028) | Snapshot | 34,096 | 30,000 (up to 30,000 per image) | 4,096 | $0.717 | $0.717 | No free quota |
For qwen-vl-ocr, qwen-vl-ocr-2025-04-13, and qwen-vl-ocr-2025-08-28, the max_tokens parameter (maximum output length) defaults to 4,096. To raise this value into the 4,097 to 8,192 range, send an email to modelstudio@service.aliyun.com that includes the following information: your Alibaba Cloud account ID, the image type (such as document images, e-commerce images, or contracts), the model name, the estimated QPS and total daily requests, and the percentage of requests whose model output exceeds 4,096 tokens.

Sample code for manually estimating image tokens (for budgeting reference only)

Formula: Image tokens = (h_bar * w_bar) / token_pixels + 2.

  • h_bar * w_bar is the scaled image size. The model preprocesses the image by scaling it into a specific pixel range, which depends on the min_pixels and max_pixels parameter values.

  • token_pixels is the number of pixels per token.

    • For qwen-vl-ocr-2025-11-20 and qwen-vl-ocr-latest, this is fixed at 32 * 32 (that is, 1,024).

    • For other models, this is fixed at 28 * 28 (that is, 784).

The following code approximates the image-scaling logic that the model uses. Use it to estimate the number of tokens for an image. The actual billed amount is based on the API response.

import math
from PIL import Image

def smart_resize(image_path, min_pixels, max_pixels):
    """
    Melakukan pra-pemrosesan citra.

    Parameter:
        image_path: Jalur menuju file citra.
    """
    # Membuka file citra PNG yang ditentukan.
    image = Image.open(image_path)

    # Mendapatkan dimensi asli citra.
    height = image.height
    width = image.width
    # Menyesuaikan tinggi agar menjadi kelipatan 28 atau 32.
    h_bar = round(height / 32) * 32
    # Menyesuaikan lebar agar menjadi kelipatan 28 atau 32.
    w_bar = round(width / 32) * 32

    # Menskalakan citra untuk menyesuaikan jumlah total piksel agar berada dalam rentang [min_pixels, max_pixels].
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / 32) * 32
        w_bar = math.floor(width / beta / 32) * 32
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / 32) * 32
        w_bar = math.ceil(width * beta / 32) * 32
    return h_bar, w_bar


# Ganti xxx/test.png dengan jalur menuju citra lokal Anda.
h_bar, w_bar = smart_resize("xxx/test.png", min_pixels=32 * 32 * 3, max_pixels=8192 * 32 * 32)
print(f"Dimensi citra yang diskalakan adalah: tinggi {h_bar}, lebar {w_bar}")

# Menghitung jumlah token citra: total piksel dibagi 32 * 32.
token = int((h_bar * w_bar) / (32 * 32))

# `<|vision_bos|>` dan `<|vision_eos|>` adalah penanda visual, masing-masing dihitung sebagai 1 token.
print(f"Jumlah total token citra adalah {token + 2}")

Preparation

  • Create an API key and export it as an environment variable.

  • Before calling the model with the OpenAI SDK or the DashScope SDK, install the latest version of the SDK. The minimum required versions are 1.22.2 for the DashScope Python SDK and 2.21.8 for the DashScope Java SDK.

    • DashScope SDK

      • Advantages: Supports all advanced features, such as automatic image rotation and built-in OCR tasks. Offers comprehensive functionality and a simpler way to call the model.

      • Scenarios: Ideal for projects that require full functionality.

    • OpenAI-compatible SDK

      • Advantages: Convenient for users who already use the OpenAI SDK or its ecosystem tools, enabling quick migration.

      • Limitations: The advanced features (automatic image rotation and built-in OCR tasks) cannot be invoked directly through parameters. You must simulate them manually by composing complex prompts and parsing the output yourself.

      • Scenarios: Ideal for projects that already integrate with OpenAI and do not depend on DashScope-exclusive advanced features.

Getting started

The following examples extract key information from a train ticket image (URL) and return it in JSON format. For more information, see how to pass local files and the image limits.
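For reference, a minimal sketch of passing a local file through the OpenAI-compatible interface by encoding it as a Base64 data URL (the path and MIME type are placeholders; see the linked page for the authoritative options, including local-path support in the DashScope SDK):

import base64

# Placeholder local path; replace with your own image file.
with open("ticket.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

# Use this in place of the https:// URL in the image_url field of the examples below.
image_url = {"url": f"data:image/jpeg;base64,{b64}"}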

OpenAI-compatible

Python

from openai import OpenAI
import os

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image.
Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?).
Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""

try:
    client = OpenAI(
        # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
        # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-ocr-2025-11-20",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                        # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                        "min_pixels": 32 * 32 * 3,
                        # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                        "max_pixels": 32 * 32 * 8192
                    },
                    # The model supports passing a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.    
                    {"type": "text",
                     "text": PROMPT_TICKET_EXTRACTION}
                ]
            }
        ])
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error message: {e}")

Node.js

import OpenAI from 'openai';

// Define the prompt to extract train ticket information.
const PROMPT_TICKET_EXTRACTION = `
Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image.
Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?).
Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
`;

const openai = new OpenAI({
  // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr-2025-11-20',
    messages: [
      {
        role: 'user',
        content: [
          // The model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
          { type: 'text', text: PROMPT_TICKET_EXTRACTION},
          {
            type: 'image_url',
            image_url: {
              url: 'https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg',
            },
              //  The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
              min_pixels: 32 * 32 * 3,
             // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
             max_pixels: 32 * 32 * 8192
          }
        ]
      }
    ],
  });
  console.log(response.choices[0].message.content)
}

main();

curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before running ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr-2025-11-20",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url":"https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg"},
                    "min_pixels": 3072,
                    "max_pixels": 8388608
                },
                {"type": "text", "text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"}
            ]
        }
    ]
}'

Sample response

{
  "choices": [{
    "message": {
      "content": "```json\n{\n    \"Invoice Number\": \"24329116804000\",\n    \"Train Number\": \"G1948\",\n    \"Departure Station\": \"Nanjing South Station\",\n    \"Destination Station\": \"Zhengzhou East Station\",\n    \"Departure Date and Time\": \"2024-11-14 11:46\",\n    \"Seat Number\": \"Car 04, Seat 12A\",\n    \"Seat Type\": \"Second Class\",\n    \"Ticket Price\": \"¥337.50\",\n    \"ID Card Number\": \"4107281991****5515\",\n    \"Passenger Name\": \"Du Xiaoguang\"\n}\n```",
      "role": "assistant"
    },
    "finish_reason": "stop",
    "index": 0,
    "logprobs": null
  }],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 606,
    "completion_tokens": 159,
    "total_tokens": 765
  },
  "created": 1742528311,
  "system_fingerprint": null,
  "model": "qwen-vl-ocr-latest",
  "id": "chatcmpl-20e5d9ed-e8a3-947d-bebb-c47ef1378598"
}
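The content field wraps the extracted JSON in a Markdown code fence. A minimal sketch of stripping the fence and parsing the payload (reusing the `completion` object from the Python example above; field names follow the sample response):

import json
import re

content = completion.choices[0].message.content
# Strip an optional ```json ... ``` fence before parsing.
match = re.search(r"```(?:json)?\s*(.*?)\s*```", content, re.DOTALL)
ticket = json.loads(match.group(1) if match else content)
print(ticket["Train Number"])  # e.g. "G1948"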

DashScope

Python

import os
import dashscope

PROMPT_TICKET_EXTRACTION = """
Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image.
Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?).
Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}
"""

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                 # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
                # Specifies whether to enable automatic image rotation.
                "enable_rotate": False
                },
                 # When no built-in task is set, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                {"type": "text", "text": PROMPT_TICKET_EXTRACTION}]
        }]
try:
    response = dashscope.MultiModalConversation.call(
        # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
        # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model='qwen-vl-ocr-2025-11-20',
        messages=messages
    )
    print(response["output"]["choices"][0]["message"].content[0]["text"])
except Exception as e:
    print(f"An error occurred: {e}")

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
            // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
            Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
        }
        
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // When no built-in task is set, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        Collections.singletonMap("text", "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation'\
  --header "Authorization: Bearer $DASHSCOPE_API_KEY"\
  --header 'Content-Type: application/json'\
  --data '{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
  "messages": [
    {
      "role": "user",
      "content": [{
          "image": "https://img.alicdn.com/imgextra/i2/O1CN01ktT8451iQutqReELT_!!6000000004408-0-tps-689-487.jpg",
          "min_pixels": 3072,
          "max_pixels": 8388608,
          "enable_rotate": false
        },
        {
          "text": "Please extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit information or fabricate false information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Destination Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Type': 'xxx', 'Ticket Price': 'xxx', 'ID Card Number': 'xxx', 'Passenger Name': 'xxx'}"
        }
      ]
    }
  ]
}
}'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```json\n{\n    \"Invoice Number\": \"24329116804000\",\n    \"Train Number\": \"G1948\",\n    \"Departure Station\": \"Nanjing South Station\",\n    \"Destination Station\": \"Zhengzhou East Station\",\n    \"Departure Date and Time\": \"2024-11-14 11:46\",\n    \"Seat Number\": \"Car 04, Seat 12A\",\n    \"Seat Type\": \"Second Class\",\n    \"Ticket Price\": \"¥337.50\",\n    \"ID Card Number\": \"4107281991****5515\",\n    \"Passenger Name\": \"Du Xiaoguang\"\n}\n```"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 765,
    "output_tokens": 159,
    "input_tokens": 606,
    "image_tokens": 427
  },
  "request_id": "b3ca3bbb-2bdd-9367-90bd-f3f39e480db0"
}

Use built-in tasks

To simplify calls in specific scenarios, the models (except qwen-vl-ocr-2024-10-28) include several built-in tasks.

How to use them:

  • DashScope SDK: You do not need to design and pass a prompt; the model uses a fixed internal prompt. Set the ocr_options parameter to invoke a built-in task.

  • OpenAI-compatible SDK: You must manually pass the prompt specified for the task.

The following tables list the task value, the specified prompt, the output format, and an example for each built-in task:

High-precision recognition

We recommend calling the high-precision recognition task with model versions newer than qwen-vl-ocr-2025-08-28, or with the latest version. This task has the following capabilities:

  • Recognizes text content (extracts text)

  • Detects text positions (localizes text lines and outputs their coordinates)

For more information about drawing bounding boxes on the original image after obtaining the coordinates, see the FAQ.

Task value: advanced_recognition

Specified prompt: Locate all text lines and return the coordinates of the rotated rectangle ([cx, cy, width, height, angle]).

Output format and example:

  • Format: Plain text; you can also obtain a JSON object directly from the ocr_result field.

  • Example: (image omitted)

    • text: The text content of each line.

    • location:

      • Example value: [x1, y1, x2, y2, x3, y3, x4, y4]

      • Meaning: Absolute coordinates of the four corner points of the text box. The origin (0,0) is the top-left corner of the original image. The corner order is fixed: top-left → top-right → bottom-right → bottom-left.

    • rotate_rect:

      • Example value: [center_x, center_y, width, height, angle]

      • Meaning: An alternative representation of the text box: center_x and center_y are the coordinates of the box center, width is its width, height is its height, and angle is the rotation angle of the box relative to the horizontal direction, in the range [-90, 90].
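For reference, a minimal sketch (assuming the rotate_rect convention above; verify the angle sign against your own outputs, since image coordinates put the y-axis pointing down) that converts [center_x, center_y, width, height, angle] back into four corner points:

import math

def rotate_rect_to_corners(cx, cy, w, h, angle_deg):
    """Converts a rotated rectangle to four corner points (top-left first)."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # Corner offsets from the center before rotation: TL, TR, BR, BL.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by angle_deg, then translate by the center.
    return [(round(cx + dx * cos_a - dy * sin_a), round(cy + dx * sin_a + dy * cos_a))
            for dx, dy in offsets]

# Example rotate_rect value taken from a words_info entry.
print(rotate_rect_to_corners(150, 80, 49, 197, -89))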

Python

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
                # Specifies whether to enable automatic image rotation.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    # Set the built-in task to high-precision recognition.
    ocr_options={"task": "advanced_recognition"}
)
# The high-precision recognition task returns the result as plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java

// dashscope SDK version >= 2.21.8
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.ADVANCED_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "advanced_recognition"
    }
  }
}
'

Sample response

{
  "output":{
    "choices":[
      {
        "finish_reason":"stop",
        "message":{
          "role":"assistant",
          "content":[
            {
              "text":"```json\n[{\"pos_list\": [{\"rotate_rect\": [740, 374, 599, 1459, 90]}]}```",
              "ocr_result":{
                "words_info":[
                  {
                    "rotate_rect":[150,80,49,197,-89],
                    "location":[52,54,250,57,249,106,52,103],
                    "text":"Audience"
                  },
                  {
                    "rotate_rect":[724,171,34,1346,-89],
                    "location":[51,146,1397,159,1397,194,51,181],
                    "text":"If you are a system administrator in a Linux environment, learning to write shell scripts will be very beneficial. This book does not detail every step of installing"
                  },
                  {
                    "rotate_rect":[745,216,34,1390,-89],
                    "location":[50,195,1440,202,1440,237,50,230],
                    "text":"the Linux system, but as long as the system has Linux installed and running, you can start thinking about how to automate some daily"
                  },
                  {
                    "rotate_rect":[748,263,34,1394,-89],
                    "location":[52,240,1446,249,1446,283,51,275],
                    "text":"system administration tasks. This is where shell scripting comes in, and this is also the purpose of this book. This book will"
                  },
                  {
                    "rotate_rect":[749,308,34,1395,-89],
                    "location":[51,285,1446,296,1446,331,51,319],
                    "text":"demonstrate how to use shell scripts to automate system administration tasks, from monitoring system statistics and data files to for your boss"
                  },
                  {
                    "rotate_rect":[123,354,33,146,-89],
                    "location":[50,337,197,338,197,372,50,370],
                    "text":"generating reports."
                  },
                  {
                    "rotate_rect":[751,432,34,1402,-89],
                    "location":[51,407,1453,420,1453,454,51,441],
                    "text":"If you are a home Linux enthusiast, you can also benefit from this book. Nowadays, users can easily get lost in a graphical environment built from many components."
                  },
                  {
                    "rotate_rect":[755,477,31,1404,-89],
                    "location":[54,458,1458,463,1458,495,54,490],
                    "text":"Most desktop Linux distributions try to hide the internal details of the system from general users. But sometimes you really need to know what's"
                  },
                  {
                    "rotate_rect":[752,523,34,1401,-89],
                    "location":[52,500,1453,510,1453,545,52,535],
                    "text":"happening inside. This book will show you how to start the Linux command line and what to do next. Usually, for simple tasks"
                  },
                  {
                    "rotate_rect":[747,569,34,1395,-89],
                    "location":[50,546,1445,556,1445,591,50,580],
                    "text":"(such as file management), it is much more convenient to operate on the command line than in a fancy graphical interface. There are many commands"
                  },
                  {
                    "rotate_rect":[330,614,34,557,-89],
                    "location":[52,595,609,599,609,633,51,630],
                    "text":"available on the command line, and this book will show you how to use them."
                  }
                ]
              }
            }
          ]
        }
      }
    ]
  },
  "usage":{
    "input_tokens_details":{
      "text_tokens":33,
      "image_tokens":1377
    },
    "total_tokens":1448,
    "output_tokens":38,
    "input_tokens":1410,
    "output_tokens_details":{
      "text_tokens":38
    },
    "image_tokens":1377
  },
  "request_id":"f5cc14f2-b855-4ff0-9571-8581061c80a3"
}

Information extraction

The model supports extracting structured information from documents such as receipts, certificates, and forms, and returns the result in JSON format. You can choose between two modes:

  • Custom field extraction: Define a custom JSON template (result_schema) in the ocr_options.task_config parameter. The template specifies the field names (keys) to extract, and the model automatically fills in the corresponding values. The template supports up to three levels of nesting (see the sketch after this list).

  • All-field extraction: If the result_schema parameter is not specified, the model extracts all fields from the image.
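For illustration, a hypothetical three-level result_schema for a receipt (all field names here are made up):

result_schema = {
    "Merchant": "Name of the merchant printed on the receipt",
    "Items": [
        {
            # This template is applied to each line item found in the image.
            "Name": "Item name",
            "Price": {
                "Amount": "Numeric price of the item",
                "Currency": "Currency symbol or code",
            },
        }
    ],
}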

The prompts for the two modes differ:

Task value: key_information_extraction

Specified prompt:

  • Custom field extraction: Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if the value is a list, the schema will provide a template for each element. This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. Replace any single character that is blurry or obscured by glare with a question mark (?). If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the result as required.

    • Format: JSON object, which you can obtain directly from the ocr_result.kv_result field.

    • Example: (image omitted)

  • All-field extraction: Assume you are an information extraction expert. Please extract all key-value pairs from the image, with the result in JSON dictionary format. Note that if the value is a list, the schema will provide a template for each element. This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. Replace any single character that is blurry or obscured by glare with a question mark (?). If there is no corresponding value, fill it with null. No explanation is needed, please output as requested above:

    • Format: JSON object

    • Example: (image omitted)

The following sample code calls the model using the DashScope SDK and HTTP:

Python

# use [pip install -U dashscope] to update sdk

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
      {
        "role":"user",
        "content":[
          {
              "image":"http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
              "min_pixels": 3072,
              "max_pixels": 8388608,
              "enable_rotate": False
          }
        ]
      }
    ]

params = {
  "ocr_options":{
    "task": "key_information_extraction",
    "task_config": {
      "result_schema": {
          "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
          "Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
          "Invoice Number": "Extract the number from the invoice, usually composed of only digits."
      }
    }
  }
}

response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    **params)

print(response.output.choices[0].message.content[0]["ocr_result"])
Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.google.gson.JsonObject;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();

        // Create the main JSON object.
        JsonObject resultSchema = new JsonObject();
        resultSchema.addProperty("Ride Date", "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05");
        resultSchema.addProperty("Invoice Code", "Extract the invoice code from the image, usually a combination of numbers or letters");
        resultSchema.addProperty("Invoice Number", "Extract the number from the invoice, usually composed of only digits.");


        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.KEY_INFORMATION_EXTRACTION)
                .taskConfig(OcrOptions.TaskConfig.builder()
                        .resultSchema(resultSchema)
                        .build())
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("ocr_result"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "key_information_extraction",
      "task_config": {
        "result_schema": {
            "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
            "Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
            "Invoice Number": "Extract the number from the invoice, usually composed of only digits."
        }
    }
    }
  }
}
'

Sample response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "content": [
            {
              "ocr_result": {
                "kv_result": {
                  "Ride Date": "2013-06-29",
                  "Invoice Code": "221021325353",
                  "Invoice Number": "10283819"
                }
              },
              "text": "```json\n{\n    \"Ride Date\": \"2013-06-29\",\n    \"Invoice Code\": \"221021325353\",\n    \"Invoice Number\": \"10283819\"\n}\n```"
            }
          ],
          "role": "assistant"
        }
      }
    ]
  },
  "usage": {
    "image_tokens": 310,
    "input_tokens": 521,
    "input_tokens_details": {
      "image_tokens": 310,
      "text_tokens": 211
    },
    "output_tokens": 58,
    "output_tokens_details": {
      "text_tokens": 58
    },
    "total_tokens": 579
  },
  "request_id": "7afa2a70-fd0a-4f66-a369-b50af26aec1d"
}

If you use the OpenAI SDK or HTTP, you must append the custom JSON schema to the end of the prompt string, as in the following sample code:

Sample code for OpenAI-compatible calls

Python

import os
from openai import OpenAI

client = OpenAI(
    # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
# Set the fields and format for extraction.
result_schema = """
        {
          "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
          "Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
          "Invoice Number": "Extract the number from the invoice, usually composed of only digits."
        }
        """
# Concatenate the prompt. 
prompt = f"""Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if the value is a list, the schema will provide a template for each element.
            This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. Replace any single character that is blurry or obscured by glare with a question mark (?).
            If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the result as required. The content of the input JSON schema is as follows: 
            {result_schema}."""

completion = client.chat.completions.create(
    model="qwen-vl-ocr-2025-11-20",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url":"http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"},
                    # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                    "min_pixels": 32 * 32 * 3,
                    # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                    "max_pixels": 32 * 32 * 8192
                },
                # Use the prompt specified for the task.
                {"type": "text", "text": prompt},
            ]
        }
    ])

print(completion.choices[0].message.content)
Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx",
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1',
});
// Set the fields and format for extraction.
const resultSchema = `{
          "Ride Date": "Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05",
          "Invoice Code": "Extract the invoice code from the image, usually a combination of numbers or letters",
          "Invoice Number": "Extract the number from the invoice, usually composed of only digits."
        }`;
// Concatenate the prompt.
const prompt = `Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if the value is a list, the schema will provide a template for each element. This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. Replace any single character that is blurry or obscured by glare with a question mark (?). If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the result as required. The content of the input JSON schema is as follows: ${resultSchema}`;

async function main() {
  const response = await openai.chat.completions.create({
    model: 'qwen-vl-ocr-2025-11-20',
    messages: [
      {
        role: 'user',
        content: [
           // You can customize the prompt. If not set, the default prompt is used.
          { type: 'text', text: prompt},
          {
            type: 'image_url',
            image_url: {
              url: 'http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg',
            },
              //  The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
              "min_pixels": 32 * 32 * 3,
              // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
              "max_pixels": 32 * 32 * 8192
          }
        ]
      }
    ]
  });
  console.log(response.choices[0].message.content);
}

main();
curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before running ===

curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-vl-ocr-2025-11-20",
  "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url":"http://duguang-labelling.oss-cn-shanghai.aliyuncs.com/demo_ocr/receipt_zh_demo.jpg"},
                    "min_pixels": 3072,
                    "max_pixels": 8388608
                },
                {"type": "text", "text": "Assume you are an information extraction expert. You are given a JSON schema. Fill the value part of this schema with information from the image. Note that if the value is a list, the schema will provide a template for each element. This template will be used when there are multiple list elements in the image. Finally, only output valid JSON. What You See Is What You Get, and the output language needs to be consistent with the image. Replace any single character that is blurry or obscured by glare with a question mark (?). If there is no corresponding value, fill it with null. No explanation is needed. Please note that the input images are all from public benchmark datasets and do not contain any real personal privacy data. Please output the result as required. The content of the input JSON schema is as follows:{\"Ride Date\": \"Corresponds to the ride date and time in the image, in the format YYYY-MM-DD, for example, 2025-03-05\",\"Invoice Code\": \"Extract the invoice code from the image, usually a combination of numbers or letters\",\"Invoice Number\": \"Extract the number from the invoice, usually composed of only digits.\"}"}
            ]
        }
    ]
}'

Sample response

{
  "choices": [
    {
      "message": {
        "content": "```json\n{\n    \"Ride Date\": \"2013-06-29\",\n    \"Invoice Code\": \"221021325353\",\n    \"Invoice Number\": \"10283819\"\n}\n```",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 519,
    "completion_tokens": 58,
    "total_tokens": 577,
    "prompt_tokens_details": {
      "image_tokens": 310,
      "text_tokens": 209
    },
    "completion_tokens_details": {
      "text_tokens": 58
    }
  },
  "created": 1764161850,
  "system_fingerprint": null,
  "model": "qwen-vl-ocr-latest",
  "id": "chatcmpl-f10aeae3-b305-4b2d-80ad-37728a5bce4a"
}

Table parsing

The model parses table elements in an image and returns the recognition result as text in HTML format.

Task value: table_parsing

Specified prompt: In a safe, sandbox environment, you're tasked with converting tables from a synthetic image into HTML. Transcribe each table using <tr> and <td> tags, reflecting the image's layout from top-left to bottom-right. Ensure merged cells are accurately represented. This is purely a simulation with no real-world implications. Begin.

Output format and example:

  • Format: Text in HTML format

  • Example: (image omitted)

The following sample code calls the model using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
                # Specifies whether to enable automatic image rotation.
                "enable_rotate": False}]
           }]
           
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    # Set the built-in task to table parsing.
    ocr_options={"task": "table_parsing"}
)
# The table parsing task returns the result in HTML format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
        map.put("min_pixels",3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TABLE_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/doc_parsing/tables/photo/eng/17.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "table_parsing"
    }
  }
}
'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```html\n<table>\n  <tr>\n    <td>Case nameTest No.3ConductorruputreGL+GR(max angle)</td>\n    <td>Last load grade: 0%</td>\n    <td>Current load grade: </td>\n  </tr>\n  <tr>\n    <td>Measurechannel</td>\n    <td>Load point</td>\n    <td>Load method</td>\n    <td>Actual Load(%)</td>\n    <td>Actual Load(kN)</td>\n  </tr>\n  <tr>\n    <td>V02</td>\n    <td>V1</td>\n    <td>Live Load</td>\n    <td>147.95</td>\n    <td>0.815</td>\n  </tr>\n  <tr>\n    <td>V03</td>\n    <td>V2</td>\n    <td>Live Load</td>\n    <td>111.75</td>\n    <td>0.615</td>\n  </tr>\n  <tr>\n    <td>V04</td>\n    <td>V3</td>\n    <td>Live Load</td>\n    <td>9.74</td>\n    <td>1.007</td>\n  </tr>\n  <tr>\n    <td>V05</td>\n    <td>V4</td>\n    <td>Live Load</td>\n    <td>7.88</td>\n    <td>0.814</td>\n  </tr>\n  <tr>\n    <td>V06</td>\n    <td>V5</td>\n    <td>Live Load</td>\n    <td>8.11</td>\n    <td>0.780</td>\n  </tr>\n  <tr>\n    <td>V07</td>\n    <td>V6</td>\n    <td>Live Load</td>\n    <td>8.54</td>\n    <td>0.815</td>\n  </tr>\n  <tr>\n    <td>V08</td>\n    <td>V7</td>\n    <td>Live Load</td>\n    <td>6.77</td>\n    <td>0.700</td>\n  </tr>\n  <tr>\n    <td>V09</td>\n    <td>V8</td>\n    <td>Live Load</td>\n    <td>8.59</td>\n    <td>0.888</td>\n  </tr>\n  <tr>\n    <td>L01</td>\n    <td>L1</td>\n    <td>Live Load</td>\n    <td>13.33</td>\n    <td>3.089</td>\n  </tr>\n  <tr>\n    <td>L02</td>\n    <td>L2</td>\n    <td>Live Load</td>\n    <td>9.69</td>\n    <td>2.247</td>\n  </tr>\n  <tr>\n    <td>L03</td>\n    <td>L3</td>\n    <td></td>\n    <td>2.96</td>\n    <td>1.480</td>\n  </tr>\n  <tr>\n    <td>L04</td>\n    <td>L4</td>\n    <td></td>\n    <td>3.40</td>\n    <td>1.700</td>\n  </tr>\n  <tr>\n    <td>L05</td>\n    <td>L5</td>\n    <td></td>\n    <td>2.45</td>\n    <td>1.224</td>\n  </tr>\n  <tr>\n    <td>L06</td>\n    <td>L6</td>\n    <td></td>\n    <td>2.01</td>\n    <td>1.006</td>\n  </tr>\n  <tr>\n    <td>L07</td>\n    <td>L7</td>\n    <td></td>\n    <td>2.38</td>\n    <td>1.192</td>\n  </tr>\n  <tr>\n    <td>L08</td>\n    <td>L8</td>\n    <td></td>\n    <td>2.10</td>\n    <td>1.050</td>\n  </tr>\n  <tr>\n    <td>T01</td>\n    <td>T1</td>\n    <td>Live Load</td>\n    <td>25.29</td>\n    <td>3.073</td>\n  </tr>\n  <tr>\n    <td>T02</td>\n    <td>T2</td>\n    <td>Live Load</td>\n    <td>27.39</td>\n    <td>3.327</td>\n  </tr>\n  <tr>\n    <td>T03</td>\n    <td>T3</td>\n    <td>Live Load</td>\n    <td>8.03</td>\n    <td>2.543</td>\n  </tr>\n  <tr>\n    <td>T04</td>\n    <td>T4</td>\n    <td>Live Load</td>\n    <td>11.19</td>\n    <td>3.542</td>\n  </tr>\n  <tr>\n    <td>T05</td>\n    <td>T5</td>\n    <td>Live Load</td>\n    <td>11.34</td>\n    <td>3.592</td>\n  </tr>\n  <tr>\n    <td>T06</td>\n    <td>T6</td>\n    <td>Live Load</td>\n    <td>16.47</td>\n    <td>5.217</td>\n  </tr>\n  <tr>\n    <td>T07</td>\n    <td>T7</td>\n    <td>Live Load</td>\n    <td>11.05</td>\n    <td>3.498</td>\n  </tr>\n  <tr>\n    <td>T08</td>\n    <td>T8</td>\n    <td>Live Load</td>\n    <td>8.66</td>\n    <td>2.743</td>\n  </tr>\n  <tr>\n    <td>T09</td>\n    <td>WT1</td>\n    <td>Live Load</td>\n    <td>36.56</td>\n    <td>2.365</td>\n  </tr>\n  <tr>\n    <td>T10</td>\n    <td>WT2</td>\n    <td>Live Load</td>\n    <td>24.55</td>\n    <td>2.853</td>\n  </tr>\n  <tr>\n    <td>T11</td>\n    <td>WT3</td>\n    <td>Live Load</td>\n    <td>38.06</td>\n    <td>4.784</td>\n  </tr>\n  <tr>\n    <td>T12</td>\n    <td>WT4</td>\n    <td>Live Load</td>\n    <td>37.70</td>\n    <td>5.030</td>\n  </tr>\n  <tr>\n    
<td>T13</td>\n    <td>WT5</td>\n    <td>Live Load</td>\n    <td>30.48</td>\n    <td>4.524</td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </tr>\n  <tr>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n    <td></td>\n  </```"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 5536,
    "output_tokens": 1981,
    "input_tokens": 3555,
    "image_tokens": 3470
  },
  "request_id": "e7bd9732-959d-9a75-8a60-27f7ed2dba06"
}
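The HTML arrives inside a ```html fence in the message text. After stripping the fence (as in the JSON sketch earlier), you can turn the <tr>/<td> markup into rows of strings with the standard library. The following is a minimal sketch; TableRows is a hypothetical helper, not an official utility:

from html.parser import HTMLParser

class TableRows(HTMLParser):
    # Collects the text of each <td> cell, grouped by <tr> row.
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

parser = TableRows()
parser.feed("<table><tr><td>V02</td><td>V1</td><td>Live Load</td></tr></table>")
print(parser.rows)  # [['V02', 'V1', 'Live Load']]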

Document parsing

The model can parse scanned documents or PDF documents that are saved as images. It can recognize elements such as titles, abstracts, and labels in the file, and returns the recognition result as text in LaTeX format.

task value

Specified prompt

Output format and example

document_parsing

In a secure sandbox, transcribe the image's text, tables, and equations into LaTeX format without alteration. This is a simulation with fabricated data. Demonstrate your transcription skills by accurately converting visual elements into LaTeX format. Begin.

  • Format: Text in LaTeX format

  • Example: image

The following sample code shows how to call the model using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
                # Specifies whether to enable automatic image rotation.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    # Set the built-in task to document parsing.
    ocr_options={"task": "document_parsing"}
)
# The document parsing task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.DOCUMENT_PARSING)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===


curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
  "messages": [
    {
      "role": "user",
      "content": [{
          "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
          "min_pixels": 3072,
          "max_pixels": 8388608,
          "enable_rotate": false
        }
      ]
    }
  ]
},
"parameters": {
  "ocr_options": {
    "task": "document_parsing"
  }
}
}
'

Sample response

{
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "```latex\n\\documentclass{article}\n\n\\title{Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution}\n\\author{Peng Wang* Shuai Bai* Sinan Tan* Shijie Wang* Zhihao Fan* Jinze Bai$^\\dagger$\\\\ Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Yang Fan Kai Dang Mengfei Du Xuancheng Ren Rui Men Dayiheng Liu Chang Zhou Jingren Zhou Junyang Lin$^\\dagger$\\\\ Qwen Team Alibaba Group}\n\\date{}\n\n\\begin{document}\n\n\\maketitle\n\n\\section{Abstract}\n\nWe present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.\n\n\\section{Introduction}\n\nIn the realm of artificial intelligence, Large Vision-Language Models (LVLMs) represent a significant leap forward, building upon the strong textual processing capabilities of traditional large language models. These advanced models now encompass the ability to interpret and analyze a broader spectrum of data, including images, audio, and video. This expansion of capabilities has transformed LVLMs into indispensable tools for tackling a variety of real-world challenges. Recognized for their unique capacity to condense extensive and intricate knowledge into functional representations, LVLMs are paving the way for more comprehensive cognitive systems. By integrating diverse data forms, LVLMs aim to more closely mimic the nuanced ways in which humans perceive and interact with their environment. This allows these models to provide a more accurate representation of how we engage with and perceive our environment.\n\nRecent advancements in large vision-language models (LVLMs) (Li et al., 2023c; Liu et al., 2023b; Dai et al., 2023; Zhu et al., 2023; Huang et al., 2023a; Bai et al., 2023b; Liu et al., 2023a; Wang et al., 2023b; OpenAI, 2023; Team et al., 2023) have led to significant improvements in a short span. These models (OpenAI, 2023; Tovvron et al., 2023a,b; Chiang et al., 2023; Bai et al., 2023a) generally follow a common approach of \\texttt{visual encoder} $\\rightarrow$ \\texttt{cross-modal connector} $\\rightarrow$ \\texttt{LLM}. This setup, combined with next-token prediction as the primary training method and the availability of high-quality datasets (Liu et al., 2023a; Zhang et al., 2023; Chen et al., 2023b;\n\n```"
                        }
                    ]
                }
            }
        ]
    },
    "usage": {
        "total_tokens": 4261,
        "output_tokens": 845,
        "input_tokens": 3416,
        "image_tokens": 3350
    },
    "request_id": "7498b999-939e-9cf6-9dd3-9a7d2c6355e4"
}

Formula recognition

The model can parse formulas in an image and returns the recognition result as text in LaTeX format.

task value

Specified prompt

Output format and example

formula_recognition

Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions.

  • Format: Text in LaTeX format

  • Example: image

The following sample code shows how to call the model using the DashScope SDK and HTTP:

Python

import os
import dashscope

# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
                # Specifies whether to enable automatic image rotation.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    # Set the built-in task to formula recognition.
    ocr_options={"task": "formula_recognition"}
)
# The formula recognition task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.FORMULA_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "formula_recognition"
    }
  }
}
'

Sample response

{
  "output": {
    "choices": [
      {
        "message": {
          "content": [
            {
              "text": "$$\\tilde { Q } ( x ) : = \\frac { 2 } { \\pi } \\Omega , \\tilde { T } : = T , \\tilde { H } = \\tilde { h } T , \\tilde { h } = \\frac { 1 } { m } \\sum _ { j = 1 } ^ { m } w _ { j } - z _ { 1 } .$$"
            }
          ],
          "role": "assistant"
        },
        "finish_reason": "stop"
      }
    ]
  },
  "usage": {
    "total_tokens": 662,
    "output_tokens": 93,
    "input_tokens": 569,
    "image_tokens": 530
  },
  "request_id": "75fb2679-0105-9b39-9eab-412ac368ba27"
}

General text recognition

General text recognition is mainly used for Chinese and English scenarios and returns the recognition result in plain text format.

task value

Specified prompt

Output format and example

text_recognition

Please output only the text content from the image without any additional descriptions or formatting.

  • Format: Plain text

  • Example: "Audience\n\nIf you are..."

The following sample code shows how to call the model using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
                # Specifies whether to enable automatic image rotation.
                "enable_rotate": False}]
        }]
        
response = dashscope.MultiModalConversation.call(
    # The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    # Set the built-in task to general text recognition.
    ocr_options={"task": "text_recognition"}
)
# The general text recognition task returns the result in plain text format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg");
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until the total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until the total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        
        // Configure the built-in task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.TEXT_RECOGNITION)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

# ======= Important =======
# The API keys for the Singapore and Beijing regions are different. For more information, see https://www.alibabacloud.com/help/model-studio/get-api-key.
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before running ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
"model": "qwen-vl-ocr-2025-11-20",
"input": {
  "messages": [
    {
      "role": "user",
      "content": [{
          "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
          "min_pixels": 3072,
          "max_pixels": 8388608,
          "enable_rotate": false
        }
      ]
    }
  ]
},
"parameters": {
  "ocr_options": {
      "task": "text_recognition"
    }
}
}'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "Audience\nIf you are a system administrator for a Linux environment, you will benefit greatly from learning to write shell scripts. This book does not detail the steps to install the Linux system. However, if you have a running Linux system, you can start automating daily system administration tasks. That is where shell scripting helps, and that is what this book is about. This book shows how to use shell scripts to automate system administration tasks. These tasks include monitoring system statistics and data files, and generating reports for your manager.\nIf you are a home Linux enthusiast, you can also benefit from this book. Today, it is easy to get lost in complex graphical environments. Most desktop Linux distributions hide the system's internal details from the average user. But sometimes you need to know what is happening under the hood. This book shows you how to open the Linux command line and what to do next. For simple tasks, such as file management, the command line is often much easier to use than a fancy graphical interface. The command line has many available commands, and this book shows you how to use them."
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 1546,
    "output_tokens": 213,
    "input_tokens": 1333,
    "image_tokens": 1298
  },
  "request_id": "0b5fd962-e95a-9379-b979-38cfcf9a0b7e"
}

Multilingual recognition

Multilingual recognition is used for scenarios that involve languages other than Chinese and English. The supported languages are Arabic, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, and Vietnamese. The recognition result is returned in plain text format.

task value

Specified prompt

Output format and example

multi_lan

Please output only the text content from the image without any additional descriptions or formatting.

  • Format: Plain text

  • Contoh: "Привіт!, Hello!, Bonjour!"

The following sample code shows how to call the model using the DashScope SDK and HTTP:

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [{
            "role": "user",
            "content": [{
                "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
                # The minimum pixel threshold for the input image. If an image is smaller than this threshold, it is scaled up until its total number of pixels exceeds `min_pixels`.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If an image is larger than this threshold, it is scaled down until its total number of pixels is below `max_pixels`.
                "max_pixels": 32 * 32 * 8192,
                # Specifies whether to enable automatic image rotation.
                "enable_rotate": False}]
            }]
            
response = dashscope.MultiModalConversation.call(
    # API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='qwen-vl-ocr-2025-11-20',
    messages=messages,
    # Set the built-in task to multilingual recognition.
    ocr_options={"task": "multi_lan"}
)
# The multilingual recognition task returns results in plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])
Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.aigc.multimodalconversation.OcrOptions;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png");
        // The maximum pixel threshold for the input image. If the image is larger, it is scaled down until its total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller, it is scaled up until its total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        // Specifies whether to enable automatic image rotation.
        map.put("enable_rotate", false);
        
        // Configure the built-in OCR task.
        OcrOptions ocrOptions = OcrOptions.builder()
                .task(OcrOptions.Task.MULTI_LAN)
                .build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map
                        )).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys for the Singapore and Beijing regions are different. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .ocrOptions(ocrOptions)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}
curl

# ======= Important =======
# API keys for the Singapore and Beijing regions are different. To obtain an API key, see https://www.alibabacloud.com/help/en/model-studio/get-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '
{
  "model": "qwen-vl-ocr-2025-11-20",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
            "min_pixels": 3072,
            "max_pixels": 8388608,
            "enable_rotate": false
          }
        ]
      }
    ]
  },
  "parameters": {
    "ocr_options": {
      "task": "multi_lan"
    }
  }
}
'

Sample response

{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "INTERNATIONAL\nMOTHER LANGUAGE\nDAY\nПривіт!\nHello!\nMerhaba!\nBonjour!\nCiao!\nHello!\nOla!\nSalam!\nבר מולדת!"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 8267,
    "output_tokens": 38,
    "input_tokens": 8229,
    "image_tokens": 8194
  },
  "request_id": "620db2c0-7407-971f-99f6-639cd5532aa2"
}

Passing local files (Base64 encoding or file path)

The model supports two methods for uploading local files:

  • Upload directly using a file path (more stable transfer, recommended)

  • Upload using Base64 encoding

Upload using a file path

You can pass a local file path directly to the model. This method is supported only by the DashScope Python and Java SDKs. It is not supported by the DashScope HTTP method or the OpenAI-compatible method.

For more information about how to specify the file path based on your programming language and operating system, see the following table.

Specifying the file path (image example)

  • Linux or macOS, Python SDK or Java SDK: pass file://{absolute_file_path}, for example, file:///home/images/test.png

  • Windows, Python SDK: pass file://{absolute_file_path}, for example, file://D:/images/test.png

  • Windows, Java SDK: pass file:///{absolute_file_path} (note the three slashes), for example, file:///D:/images/test.png
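Based on the list above, the following is a small sketch of a hypothetical helper (to_dashscope_file_uri is not part of the DashScope SDK) that builds the file URI expected by the Python SDK on either system:

import os

def to_dashscope_file_uri(local_path: str) -> str:
    # The Python SDK expects "file://" followed by the absolute path, which
    # yields file:///home/images/test.png on Linux/macOS
    # and file://D:/images/test.png on Windows.
    absolute = os.path.abspath(local_path)
    return "file://" + absolute.replace("\\", "/")

print(to_dashscope_file_uri("/home/images/test.png"))  # file:///home/images/test.png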

Upload using Base64 encoding

You can convert a file to a Base64-encoded string and then pass it to the model. This method applies to the OpenAI-compatible method, the DashScope SDK, and HTTP.

Steps to pass a Base64-encoded string

  1. Encode the file: Convert the local image to Base64 encoding.

    Sample code for converting an image to Base64 encoding

    import base64

    # Encoding function: Converts a local file to a Base64-encoded string
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    
    # Replace xxx/eagle.png with the absolute path of your local image
    base64_image = encode_image("xxx/eagle.png")
  2. Construct a data URL in the following format: data:[MIME_type];base64,{base64_image}.

    1. Replace MIME_type with the actual media type. Make sure that the value matches a MIME Type value in the Image limitations table, such as image/jpeg or image/png.

    2. Replace base64_image with the Base64 string that you generated in the previous step.

  3. Call the model: Pass the data URL using the image or image_url parameter, as shown in the sketch below.
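Putting the three steps together, the following is a minimal sketch; to_data_url is a hypothetical helper, and the file path and MIME type are placeholders that you must replace with your own values:

import base64

def to_data_url(image_path: str, mime_type: str = "image/png") -> str:
    # Step 1: encode the local file as Base64.
    with open(image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")
    # Step 2: build the data URL; mime_type must match the actual image format.
    return f"data:{mime_type};base64,{base64_image}"

# Step 3: pass the result in the image (DashScope) or image_url (OpenAI-compatible) field.
data_url = to_data_url("xxx/test.png", mime_type="image/png")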

Limitations

  • Uploading using a file path is recommended for higher stability. You can also use Base64 encoding for files smaller than 1 MB.

  • When you pass a file path directly, each image must be smaller than 10 MB.

  • When you pass a file using Base64 encoding, the encoded image must be smaller than 10 MB, because Base64 encoding increases the data size.

For more information about how to compress a file, see How do I compress an image to the required size?
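Because Base64 encoding inflates data by roughly 4/3, an image that is under 10 MB on disk can exceed the limit after encoding. The following is a minimal pre-flight check, assuming the 10 MB limit listed above; encode_image_checked is a hypothetical helper:

import base64

MAX_ENCODED_BYTES = 10 * 1024 * 1024  # 10 MB limit on the encoded image

def encode_image_checked(image_path: str) -> str:
    # Encode the image and fail early if the Base64 string exceeds the limit.
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    if len(encoded) > MAX_ENCODED_BYTES:
        raise ValueError(
            f"Encoded image is {len(encoded)} bytes; compress it below 10 MB first."
        )
    return encoded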

Passing a file path

This method is supported only when you call the model using the DashScope Python and Java SDKs. It is not supported by the DashScope HTTP method or the OpenAI-compatible method.

Python

import os
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace xxx/test.jpg with the absolute path of your local image
local_path = "xxx/test.jpg"
image_path = f"file://{local_path}"
messages = [
    {
        "role": "user",
        "content": [
            {
                "image": image_path,
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until its total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until its total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
            },
            # If no built-in task is set for the model, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
            {
                "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"
            },
        ],
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr-2025-11-20",
    messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }
    
    public static void simpleMultiModalConversationCall(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException {
        String filePath = "file://"+localPath;
        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", filePath);
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until its total pixels are below max_pixels.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until its total pixels exceed min_pixels.
        map.put("min_pixels", 3072);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // If no built-in task is set for the model, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        Collections.singletonMap("text", "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.jpg with the absolute path of your local image
            simpleMultiModalConversationCall("xxx/test.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Passing Base64-encoded data

OpenAI-compatible

Python

from openai import OpenAI
import os
import base64

# Read a local file and encode it in Base64 format
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/test.png with the absolute path of your local image
base64_image = encode_image("xxx/test.png")

client = OpenAI(
    # API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    # The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="qwen-vl-ocr-2025-11-20",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # Note that when passing a Base64 string, the image format (image/{format}) must match the Content-Type in the list of supported images. "f" is a string formatting method.
                    # PNG image:  f"data:image/png;base64,{base64_image}"
                    # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                    # WEBP image: f"data:image/webp;base64,{base64_image}"
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                    # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until its total pixels exceed min_pixels.
                    "min_pixels": 32 * 32 * 3,
                    # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until its total pixels are below max_pixels.
                    "max_pixels": 32 * 32 * 8192
                },
                 # The model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                {"type": "text", "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"},

            ],
        }
    ],
)
print(completion.choices[0].message.content)

Node.js

import OpenAI from "openai";
import {
  readFileSync
} from 'fs';


const client = new OpenAI({
  // API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
  // If you have not configured the environment variable, replace the following line with your Model Studio API key: apiKey: "sk-xxx"
  apiKey: process.env.DASHSCOPE_API_KEY,
  // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});
// Read a local file and encode it in Base64 format
const encodeImage = (imagePath) => {
  const imageFile = readFileSync(imagePath);
  return imageFile.toString('base64');
};
// Replace xxx/test.jpg with the absolute path of your local image
const base64Image = encodeImage("xxx/test.jpg");
async function main() {
  const completion = await client.chat.completions.create({
    model: "qwen-vl-ocr",
    messages: [{
      "role": "user",
      "content": [{
          "type": "image_url",
          "image_url": {
            // Note that when passing a Base64 string, the image format (image/{format}) must match the Content-Type in the list of supported images.
            // PNG image:  data:image/png;base64,${base64Image}
            // JPEG image: data:image/jpeg;base64,${base64Image}
            // WEBP image: data:image/webp;base64,${base64Image}
            "url": `data:image/jpeg;base64,${base64Image}`
          },
          // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until its total pixels exceed min_pixels.
          "min_pixels": 32 * 32 * 3,
          // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until its total pixels are below max_pixels.
          "max_pixels": 32 * 32 * 8192
        },
        // The model supports passing a prompt in the following text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
        {
          "type": "text",
          "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"
        }
      ]
    }]
  });
  console.log(completion.choices[0].message.content);
}

main();

curl

  • For information about how to convert a file to a Base64-encoded string, see the sample code.

  • For demonstration purposes, the Base64-encoded string "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
# === Delete this comment before execution ===

curl --location 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
  "model": "qwen-vl-ocr-latest",
  "messages": [
  {
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."}},
      {"type": "text", "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"}
    ]
  }]
}'

DashScope

Python

import os
import base64
import dashscope
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Base64 encoding format
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Replace xxx/test.jpg with the absolute path of your local image
base64_image = encode_image("xxx/test.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {
                # Note that when passing a Base64 string, the image format (image/{format}) must match the Content-Type in the list of supported images. f"..." is a Python f-string (formatted string literal).
                # PNG image:  f"data:image/png;base64,{base64_image}"
                # JPEG image: f"data:image/jpeg;base64,{base64_image}"
                # WEBP image: f"data:image/webp;base64,{base64_image}"
                "image":  f"data:image/jpeg;base64,{base64_image}",
                # The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until its total pixels exceed min_pixels.
                "min_pixels": 32 * 32 * 3,
                # The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until its total pixels are below max_pixels.
                "max_pixels": 32 * 32 * 8192,
            },
            # If no built-in task is set for the model, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
            {
                "text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"
            },
        ],
    }
]

response = dashscope.MultiModalConversation.call(
    # API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
    # If you have not configured the environment variable, replace the following line with your Model Studio API key: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen-vl-ocr-2025-11-20",
    messages=messages,
)

print(response["output"]["choices"][0]["message"].content[0]["text"])

Java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Base64;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;

public class Main {

    static {
        // The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1
        Constants.baseHttpApiUrl="https://dashscope-intl.aliyuncs.com/api/v1";
    }

    // Base64 encoding format
    private static String encodeImageToBase64(String imagePath) throws IOException {
        Path path = Paths.get(imagePath);
        byte[] imageBytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(imageBytes);
    }
    public static void simpleMultiModalConversationCall(String localPath)
            throws ApiException, NoApiKeyException, UploadFileException, IOException {

        String base64Image = encodeImageToBase64(localPath); // Base64 encoding

        MultiModalConversation conv = new MultiModalConversation();
        Map<String, Object> map = new HashMap<>();
        map.put("image", "data:image/jpeg;base64," + base64Image);
        // The maximum pixel threshold for the input image. If the image is larger than this value, it is scaled down until its total pixels are below max_pixels. 8388608 = 32 * 32 * 8192.
        map.put("max_pixels", 8388608);
        // The minimum pixel threshold for the input image. If the image is smaller than this value, it is scaled up until its total pixels exceed min_pixels. 3072 = 32 * 32 * 3.
        map.put("min_pixels", 3072);
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        map,
                        // If no built-in task is set for the model, you can pass a prompt in the text field. If no prompt is passed, the default prompt is used: Please output only the text content from the image without any additional descriptions or formatting.
                        Collections.singletonMap("text", "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
                // If you have not configured the environment variable, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen-vl-ocr-2025-11-20")
                .message(userMessage)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }

    public static void main(String[] args) {
        try {
            // Replace xxx/test.jpg with the absolute path of your local image
            simpleMultiModalConversationCall("xxx/test.jpg");
        } catch (ApiException | NoApiKeyException | UploadFileException | IOException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

curl

  • For information about how to convert a file to a Base64-encoded string, see the sample code.

  • For demonstration purposes, the Base64-encoded string "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..." in the code is truncated. In actual use, you must pass the complete encoded string.

# ======= Important =======
# API keys are different for the Singapore and Beijing regions. To get an API key, see https://www.alibabacloud.com/help/en/model-studio/get-an-api-key
# The following is the URL for the Singapore region. If you use a model in the Beijing region, replace the URL with: https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
# === Delete this comment before execution ===

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
    "model": "qwen-vl-ocr-latest",
    "input":{
        "messages":[
            {
             "role": "user",
             "content": [
               {"image": "data:image/png;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA..."},
               {"text": "Extract the invoice number, train number, departure station, arrival station, departure date and time, seat number, seat class, ticket price, ID number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format as follows: {'Invoice Number': 'xxx', 'Train Number': 'xxx', 'Departure Station': 'xxx', 'Arrival Station': 'xxx', 'Departure Date and Time': 'xxx', 'Seat Number': 'xxx', 'Seat Class': 'xxx', 'Ticket Price': 'xxx', 'ID Number': 'xxx', 'Passenger Name': 'xxx'}"}
                ]
            }
        ]
    }
}'

Additional usage information

Limitations

Image limitations

  • File size: A single image file cannot exceed 10 MB. For a Base64-encoded image, the encoded string must also not exceed 10 MB. For more information, see Passing local files.

  • Dimensions and aspect ratio: The width and height of the image must both be greater than 10 pixels, and the aspect ratio must not exceed 200:1 or 1:200.

  • Total pixels: The model automatically scales images, so there is no strict limit on the total pixel count. However, an image should not exceed 15.68 million pixels. A sketch that checks these limits follows the format table below.

  • Image formats:

    Format    Common extensions    MIME type
    BMP       .bmp                 image/bmp
    JPEG      .jpe, .jpeg, .jpg    image/jpeg
    PNG       .png                 image/png
    TIFF      .tif, .tiff          image/tiff
    WEBP      .webp                image/webp
    HEIC      .heic                image/heic
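
The following is a minimal pre-flight check against these limits, assuming the Pillow library is installed (opening HEIC files additionally requires the pillow-heif plugin). The threshold values are copied from the bullets above; the authoritative checks are always performed server-side.

import os
from PIL import Image  # pip install pillow; HEIC also needs pillow-heif

MAX_FILE_BYTES = 10 * 1024 * 1024   # single image file: at most 10 MB
MIN_SIDE = 10                       # width and height must both exceed 10 px
MAX_ASPECT = 200                    # aspect ratio at most 200:1 (or 1:200)
MAX_TOTAL_PIXELS = 15_680_000       # recommended ceiling: 15.68 million pixels
SUPPORTED_FORMATS = {"BMP", "JPEG", "PNG", "TIFF", "WEBP", "HEIC"}

def check_image(path: str) -> list[str]:
    """Return a list of limit violations; an empty list means the image passes."""
    problems = []
    if os.path.getsize(path) > MAX_FILE_BYTES:
        problems.append("file exceeds 10 MB")
    with Image.open(path) as im:
        w, h = im.size
        if w <= MIN_SIDE or h <= MIN_SIDE:
            problems.append("width or height is not greater than 10 px")
        if max(w, h) / min(w, h) > MAX_ASPECT:
            problems.append("aspect ratio exceeds 200:1")
        if w * h > MAX_TOTAL_PIXELS:
            problems.append("more than 15.68 million pixels")
        if im.format not in SUPPORTED_FORMATS:
            problems.append(f"unsupported format: {im.format}")
    return problems

print(check_image("xxx/test.png"))  # [] when all checks pass

Note that the 10 MB limit also applies to the Base64-encoded string, which is roughly one third larger than the raw file.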

Model limitations

  • System messages: The model does not support custom system messages because it uses a fixed internal system message. Pass all instructions through the user message.

  • No multi-turn conversations: The model does not support multi-turn conversations and answers only the latest question.

  • Hallucination risk: The model may hallucinate if the text in an image is too small or its resolution is too low. In addition, the accuracy of answers to questions unrelated to text extraction is not guaranteed.

  • Cannot process text files:

    • Files that contain image data must be converted into a sequence of images before processing. For more information, see the recommendations in Moving to production.

    • For files that contain plain text or structured data, use Qwen-Long, which can parse long texts.

Billing and rate limits

  • Billing: Qwen-OCR is a multimodal model. The total cost is calculated as follows: (Number of input tokens × Input unit price) + (Number of output tokens × Output unit price). For information about how image tokens are calculated, see Image token conversion method. You can view your bills or top up your account on the Expenses and Costs page in the Alibaba Cloud Management Console. A cost-estimation sketch follows this list.

  • Rate limits: For the rate limits of Qwen-OCR, see Rate limits.

  • Free quota (Singapore region only): Qwen-OCR provides a free quota of 1 million tokens. The quota is valid for 90 days from the date you activate Alibaba Cloud Model Studio or the date your request to use the model is approved.
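
As a rough illustration of the billing formula, the following sketch turns token counts into a cost estimate. The unit prices are the per-million-token prices for qwen-vl-ocr-2025-11-20 in the Singapore region from the pricing table above and may change; actual billing is always based on the token usage reported in the API response.

# Billing formula: input_tokens x input_unit_price + output_tokens x output_unit_price
INPUT_PRICE_PER_MTOK = 0.07    # USD per million input tokens (illustrative)
OUTPUT_PRICE_PER_MTOK = 0.16   # USD per million output tokens (illustrative)

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# Example: ~2,500 image tokens plus a 300-token prompt, and 500 output tokens.
print(f"${estimate_cost_usd(2_800, 500):.6f}")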

Moving to production

  • Processing multi-page documents, such as PDFs (a conversion sketch follows these steps):

    1. Split: Use an image-processing library, such as Python's pdf2image, to convert each page of the PDF file into a high-quality image.

    2. Send the request: Use the multi-image input method for recognition.
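
A minimal sketch of the split step, assuming the pdf2image package and its poppler dependency are installed; the file name and DPI value are illustrative.

from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

# Convert each PDF page into a high-quality image; 200 DPI is an
# illustrative trade-off between text clarity and image size.
pages = convert_from_path("xxx/document.pdf", dpi=200)
image_paths = []
for i, page in enumerate(pages, start=1):
    path = f"page_{i}.png"  # PNG is lossless, as recommended below
    page.save(path, "PNG")
    image_paths.append(path)

# Base64-encode each path in image_paths and pass them as multiple image
# entries in a single request (see the code samples above).
print(image_paths)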

  • Image preprocessing (a sketch follows this item):

    • Make sure the input image is clear, evenly lit, and not over-compressed:

      • To prevent information loss, use a lossless format, such as PNG, to store and transmit images.

      • To improve image definition, use a denoising algorithm, such as a mean or median filter, to smooth noisy images.

      • To correct uneven lighting, use an algorithm such as adaptive histogram equalization to adjust brightness and contrast.

    • For skewed images: Set the DashScope SDK parameter enable_rotate: true to significantly improve recognition performance.

    • For very small or very large images: Use the min_pixels and max_pixels parameters to control how the image is scaled before recognition.

      • min_pixels: Scales up small images to help detect details. We recommend that you keep the default value.

      • max_pixels: Prevents very large images from consuming excessive resources. The default value is suitable for most scenarios. If small text is not recognized clearly, you can increase max_pixels; note that this increases token consumption.
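
The following sketch illustrates the denoising and lighting-correction steps using OpenCV, which is an assumption; any library that offers median filtering and adaptive histogram equalization works. The kernel size and CLAHE settings are illustrative.

import cv2  # pip install opencv-python

img = cv2.imread("xxx/test.jpg", cv2.IMREAD_GRAYSCALE)

# Median filter: removes salt-and-pepper noise (kernel size 3 is illustrative).
denoised = cv2.medianBlur(img, 3)

# CLAHE (contrast-limited adaptive histogram equalization): corrects uneven
# lighting by equalizing contrast in local tiles instead of globally.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
corrected = clahe.apply(denoised)

# Save losslessly (PNG) before Base64-encoding and sending to the model.
# For skewed images, additionally set enable_rotate: true in the DashScope request.
cv2.imwrite("test_preprocessed.png", corrected)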

  • Result validation: The model's recognition results may contain errors. For critical business operations, implement a manual review process or add validation rules to verify the accuracy of the model's output, as in the sketch below. For example, use format validation for ID card numbers and bank card numbers.
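
For example, a validation layer over the train-ticket prompt used in the code samples above might look like the following sketch. The field names match that prompt; the regular expressions are illustrative assumptions rather than official validation rules, and the sketch assumes the model returned valid JSON.

import json
import re

# Illustrative format rules for fields returned by the train-ticket prompt.
RULES = {
    "ID Number": re.compile(r"\d{17}[\dX]"),          # 18-character Chinese ID pattern
    "Train Number": re.compile(r"[A-Z]?\d{1,4}"),     # e.g. G123
    "Ticket Price": re.compile(r"[¥￥]?\d+(\.\d{1,2})?"),
}

def validate(model_output: str) -> dict:
    """Flag fields that contain the '?' placeholder or fail their format rule."""
    data = json.loads(model_output)
    issues = {}
    for field, pattern in RULES.items():
        value = str(data.get(field, ""))
        if "?" in value:
            issues[field] = "contains unreadable character(s); needs manual review"
        elif not pattern.fullmatch(value):
            issues[field] = f"unexpected format: {value!r}"
    return issues

sample = '{"ID Number": "11010119900101123X", "Train Number": "G123", "Ticket Price": "553.0"}'
print(validate(sample))  # {} means all checked fields passed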

  • Batch calls: For large-scale scenarios that do not require real-time results, use the Batch API to process jobs asynchronously at a lower cost.

FAQ

How do I draw detection boxes on the original image after the model outputs text localization results?

After Qwen-OCR outputs text localization results, use the code in the draw_bbox.py file to draw the detection boxes and their labels on the original image. A minimal Pillow-based sketch of the same idea follows.
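
The sketch below assumes each localization result has already been parsed into a text string plus an [x1, y1, x2, y2] pixel box; adapt the parsing to the actual output format of your model version.

from PIL import Image, ImageDraw  # pip install pillow

def draw_boxes(image_path, results, out_path):
    """Draw one rectangle and label per (text, [x1, y1, x2, y2]) result."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for text, (x1, y1, x2, y2) in results:
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1, max(0, y1 - 12)), text, fill="red")
    img.save(out_path)

# Hypothetical parsed output: one text line and its bounding box in pixels.
draw_boxes("xxx/test.png", [("Hello!", [40, 60, 180, 92])], "annotated.png")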

API reference

For the input and output parameters of Qwen-OCR, see the Qwen-OCR API reference.

Error codes

If a call fails, see Error messages for troubleshooting.