Alibaba Cloud Model Studio: Data mining (Qwen-Doc-Turbo)

Last Updated: Jan 06, 2026

The data mining model is designed for information extraction, content moderation, classification, and summary generation. Unlike general-purpose chat models, it quickly and accurately outputs standardized structured data, such as JSON. This avoids two common problems with general-purpose models: non-standard response structures and inaccurate information extraction.

Note

This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.

Implementation guide

Qwen-Doc-Turbo supports extracting information from files in three ways. For more information about file size and type limits, see Limitations.

  • File URL (Recommended)
    • File source: Public URL
    • Input length limit: Up to 10 files. Supports large files (max input 253k tokens).
    • SDK compatibility: DashScope only
    • Key advantages: No upload to Model Studio required. Supports batch calls.

  • File ID
    • File source: Local file (upload required)
    • Input length limit: 1 file. Supports large files (max input 253k tokens).
    • SDK compatibility: Upload with OpenAI. Call with OpenAI or DashScope.
    • Key advantages: Avoids repeated uploads. Ideal for reuse.

  • Plain text
    • File source: Passed as a string
    • Input length limit: Up to 9,000 tokens
    • SDK compatibility: OpenAI and DashScope
    • Key advantages: No file management required.

Pass a file URL

Extract structured data directly using file URLs and process up to 10 files simultaneously. This example shows how to pass the Sample Product Manual A and Sample Product Manual B files and use a prompt to instruct the model to return the extracted information in JSON format.

The file URL method currently supports only the DashScope protocol. You can use the DashScope Python SDK or make an HTTP call, such as using curl.

Python

import os
import dashscope

try:
    # The call itself is inside the try block so that network and SDK errors are caught too
    response = dashscope.Generation.call(
        api_key=os.getenv('DASHSCOPE_API_KEY'),  # If you have not set the environment variable, replace this with your API key
        model='qwen-doc-turbo',
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "From these two product manuals, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."
                    },
                    {
                        "type": "doc_url",
                        "doc_url": [
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/jockge/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CA.docx",
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/ztwxzr/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CB.docx"
                        ],
                        "file_parsing_strategy": "auto"
                    }
                ]
            }
        ]
    )
    if response.status_code == 200:
        print(response.output.choices[0].message.content)
    else:
        print(f"Request failed, status code: {response.status_code}")
        print(f"Error code: {response.code}")
        print(f"Error message: {response.message}")
        print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")
except Exception as e:
    print(f"An error occurred: {e}")
    print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")

curl

curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'X-DashScope-SSE: enable' \
--data '{
    "model": "qwen-doc-turbo",
    "input": {
        "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant."
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "From these two product manuals, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."
                        },
                        {
                            "type": "doc_url",
                            "doc_url": [
                                "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/jockge/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CA.docx",
                                "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/ztwxzr/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CB.docx"
                            ],
                            "file_parsing_strategy": "auto"
                        }
                    ]
                }
            ]
    }
}'

Response example

[
  {
    "model": "PRO-100",
    "name": "Smart Printer",
    "price": "8999"
  },
  {
    "model": "PRO-200",
    "name": "Smart Scanner",
    "price": "12999"
  },
  ...
  {
    "model": "SEC-400",
    "name": "Smart Visitor System",
    "price": "9999"
  },
  {
    "model": "SEC-500",
    "name": "Smart Parking Management",
    "price": "22999"
  }
]
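
The model returns the JSON array as text in the message content, so downstream code usually has to parse it. The following is a minimal sketch, not part of any SDK: the helper name `parse_products` is hypothetical, and it assumes that the reply may contain extra text around the array and that `price` arrives as a digit string, as in the example above.

```python
import json


def parse_products(content: str) -> list[dict]:
    """Extract the outermost JSON array from model output and normalize prices."""
    start = content.find("[")
    end = content.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("No JSON array found in model output")
    products = json.loads(content[start:end + 1])
    for product in products:
        # Prices arrive as strings such as "8999"; convert them to numbers.
        product["price"] = float(product["price"])
    return products


# Hypothetical model reply with text before and after the array.
sample = """Here is the extracted data:
[
  {"model": "PRO-100", "name": "Smart Printer", "price": "8999"},
  {"model": "PRO-200", "name": "Smart Scanner", "price": "12999"}
]
End of result."""

products = parse_products(sample)
for product in products:
    print(product["model"], product["name"], product["price"])
```

The bracket search is deliberately simple. If the surrounding text can itself contain square brackets, a stricter extraction step would be needed.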

Pass a file ID

Upload a file

Before you run the following code, click Sample Product Manual A to download the file and place it in the same directory as your project code. Then, upload the file to the secure bucket in Alibaba Cloud Model Studio through the OpenAI compatible interface to obtain a file-id. For more information about the parameters and call methods for the file upload interface, see the API reference.

Python

import os
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you have not set the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # Enter the DashScope service base_url
)

file_object = client.files.create(file=Path("Sample Product Manual A.docx"), purpose="file-extract")
# Print the file-id for use in subsequent model calls
print(file_object.id)

Java

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.files.*;

import java.nio.file.Path;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        // Create a client and use the API key from the environment variable
        OpenAIClient client = OpenAIOkHttpClient.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .baseUrl("https://dashscope.aliyuncs.com/compatible-mode/v1")
                .build();
        // Set the file path. Modify the path and filename as needed.
        Path filePath = Paths.get("src/main/java/org/example/Sample Product Manual A.docx");
        // Create file upload parameters
        FileCreateParams fileParams = FileCreateParams.builder()
                .file(filePath)
                .purpose(FilePurpose.of("file-extract"))
                .build();

        // Upload the file and print the file-id
        FileObject fileObject = client.files().create(fileParams);
        // Print the file-id for use in subsequent model calls
        System.out.println(fileObject.id());
    }
}

curl

curl --location --request POST 'https://dashscope.aliyuncs.com/compatible-mode/v1/files' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --form 'file=@"Sample Product Manual A.docx"' \
  --form 'purpose="file-extract"'

Run the code to obtain the file-id for the uploaded file.

Pass information and start a conversation using a file ID

Embed the obtained file-id into a system message. The first system message sets the role for the model. The subsequent system message passes the file-id. The user message contains the specific query about the file.

Python

import os
from openai import OpenAI, BadRequestError

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"), # If you have not set the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

try:
    completion = client.chat.completions.create(
        model="qwen-doc-turbo",
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            # Replace '{FILE_ID}' with the file-id from your scenario
            {'role': 'system', 'content': 'fileid://{FILE_ID}'},
            {'role': 'user', 'content': 'From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).'}
        ],
        # This code example uses streaming output to clearly show the model's output process. For non-streaming output examples, see https://www.alibabacloud.com/help/en/model-studio/user-guide/text-generation
        stream=True,
        stream_options={"include_usage": True}
    )

    full_content = ""
    for chunk in completion:
        if chunk.choices and chunk.choices[0].delta.content:
            full_content += chunk.choices[0].delta.content
            print(chunk.model_dump())
    
    print(full_content)

except BadRequestError as e:
    print(f"Error message: {e}")
    print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")

Java

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.StreamResponse;
import com.openai.models.chat.completions.*;

public class Main {
    public static void main(String[] args) {
        // Create a client and use the API key from the environment variable
        OpenAIClient client = OpenAIOkHttpClient.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API key: .apiKey("sk-xxx");
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .baseUrl("https://dashscope.aliyuncs.com/compatible-mode/v1")
                .build();

        ChatCompletionCreateParams chatParams = ChatCompletionCreateParams.builder()
                .addSystemMessage("You are a helpful assistant.")
                // Replace '{FILE_ID}' with the file-id from your scenario
                .addSystemMessage("fileid://{FILE_ID}")
                .addUserMessage("From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).")
                .model("qwen-doc-turbo")
                .build();

        try (StreamResponse<ChatCompletionChunk> streamResponse = client.chat().completions().createStreaming(chatParams)) {
            streamResponse.stream().forEach(chunk -> {
                String content = chunk.choices().get(0).delta().content().orElse("");
                if (!content.isEmpty()) {
                    System.out.print(content);
                }
            });
        } catch (Exception e) {
            System.err.println("Error message: " + e.getMessage());
        }
    }
}

curl

curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--data '{
    "model": "qwen-doc-turbo",
    "messages": [
        {"role": "system","content": "You are a helpful assistant."},
        {"role": "system","content": "fileid://{FILE_ID}"},
        {"role": "user","content": "From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."}
    ],
    "stream": true,
    "stream_options": {
        "include_usage": true
    }
}'

Complete example: Upload a file and call the model

Python

import os
import time
from pathlib import Path
from openai import OpenAI, BadRequestError

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you have not set the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

try:
    # Step 1: Upload the file
    file_object = client.files.create(file=Path("Sample Product Manual A.docx"), purpose="file-extract")
    file_id = file_object.id
    print(f"File uploaded successfully. file-id: {file_id}")
    
    # Step 2: Wait for the file to finish parsing (optional, may be needed for large files)
    # If the file is still parsing, the API returns an error. You must retry.
    max_retries = 10
    retry_count = 0
    
    while retry_count < max_retries:
        try:
            # Step 3: Call the model using the file-id
            completion = client.chat.completions.create(
                model="qwen-doc-turbo",
                messages=[
                    {'role': 'system', 'content': 'You are a helpful assistant.'},
                    {'role': 'system', 'content': f'fileid://{file_id}'},
                    {'role': 'user', 'content': 'From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).'}
                ],
                stream=True,
                stream_options={"include_usage": True}
            )
            
            # Step 4: Process the model output
            full_content = ""
            for chunk in completion:
                if chunk.choices and chunk.choices[0].delta.content:
                    full_content += chunk.choices[0].delta.content
                    print(chunk.choices[0].delta.content, end='', flush=True)
            
            print(f"\n\nFull output:\n{full_content}")
            break
            
        except BadRequestError as e:
            if "File parsing in progress" in str(e):
                retry_count += 1
                print(f"File is parsing. Retrying after a delay ({retry_count}/{max_retries})...")
                time.sleep(2)  # Wait 2 seconds and retry
            else:
                raise e
    
    if retry_count >= max_retries:
        print("File parsing timed out. Try again later.")

except BadRequestError as e:
    print(f"Error message: {e}")
    print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")
except Exception as e:
    print(f"An error occurred: {e}")

Java

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.StreamResponse;
import com.openai.models.chat.completions.*;
import com.openai.models.files.*;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) {
        // Create a client
        OpenAIClient client = OpenAIOkHttpClient.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .baseUrl("https://dashscope.aliyuncs.com/compatible-mode/v1")
                .build();
        
        try {
            // Step 1: Upload the file
            Path filePath = Paths.get("src/main/java/org/example/Sample Product Manual A.docx");
            FileCreateParams fileParams = FileCreateParams.builder()
                    .file(filePath)
                    .purpose(FilePurpose.of("file-extract"))
                    .build();
            
            FileObject fileObject = client.files().create(fileParams);
            String fileId = fileObject.id();
            System.out.println("File uploaded successfully. file-id: " + fileId);
            
            // Step 2: Wait for the file to parse and then call the model (max 10 retries)
            int maxRetries = 10;
            int retryCount = 0;
            boolean success = false;
            
            while (retryCount < maxRetries && !success) {
                try {
                    // Step 3: Call the model using the file-id
                    ChatCompletionCreateParams chatParams = ChatCompletionCreateParams.builder()
                            .addSystemMessage("You are a helpful assistant.")
                            .addSystemMessage("fileid://" + fileId)
                            .addUserMessage("From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).")
                            .model("qwen-doc-turbo")
                            .build();
                    
                    // Step 4: Process the model output
                    try (StreamResponse<ChatCompletionChunk> streamResponse = 
                            client.chat().completions().createStreaming(chatParams)) {
                        streamResponse.stream().forEach(chunk -> {
                            String content = chunk.choices().get(0).delta().content().orElse("");
                            if (!content.isEmpty()) {
                                System.out.print(content);
                            }
                        });
                        System.out.println();
                        success = true;
                    }
                    
                } catch (Exception e) {
                    if (e.getMessage() != null && e.getMessage().contains("File parsing in progress")) {
                        retryCount++;
                        System.out.println("File is parsing. Retrying after a delay (" + retryCount + "/" + maxRetries + ")...");
                        TimeUnit.SECONDS.sleep(2);  // Wait 2 seconds and retry
                    } else {
                        throw e;
                    }
                }
            }
            
            if (!success) {
                System.out.println("File parsing timed out. Try again later.");
            }
            
        } catch (Exception e) {
            System.err.println("Error message: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
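
The retry loop in the complete example can also be factored into a reusable helper. The sketch below is one possible variant, not the documented approach: it swaps the fixed 2-second wait for exponential backoff, and the names `call_with_retry` and `still_parsing` are hypothetical. The model call is replaced by a stand-in function so that the block runs without network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 10,
    base_delay: float = 2.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on a retryable error, wait and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as error:
            if attempt == max_retries or not is_retryable(error):
                raise
            # Double the wait on each retry, capped at max_delay seconds.
            sleep(min(base_delay * (2 ** attempt), max_delay))
    raise RuntimeError("unreachable")  # the loop always returns or raises


def still_parsing(error: Exception) -> bool:
    # Matches the error text the API returns while a file is still parsing.
    return "File parsing in progress" in str(error)


# Stand-in for the real model call: fails twice, then succeeds.
attempts = {"count": 0}
delays: list[float] = []


def fake_model_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("File parsing in progress, please try again later.")
    return "[]"


result = call_with_retry(fake_model_call, still_parsing, sleep=delays.append)
print(result, attempts["count"], delays)
```

In real use, `call` would wrap `client.chat.completions.create(...)` and the default `time.sleep` would be left in place.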

Response example

[
  {
    "model": "PRO-100",
    "name": "Smart Printer",
    "price": "8999"
  },
  {
    "model": "PRO-200",
    "name": "Smart Scanner",
    "price": "12999"
  },
  ...
  {
    "model": "SEC-400",
    "name": "Smart Visitor System",
    "price": "9999"
  },
  {
    "model": "SEC-500",
    "name": "Smart Parking Management",
    "price": "22999"
  }
]

Pass plain text

In addition to passing file information using a file-id, you can also pass the file content directly as a string. When using this method, to prevent the model from confusing the role setting with the file content, ensure that the role-setting information is in the first message of the messages array.

Because of API request body size limits, if your text content exceeds 9,000 tokens, pass the content using a file URL or a file ID.

Python

import os
from openai import OpenAI, BadRequestError

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"), # If you have not set the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

try:
    completion = client.chat.completions.create(
        model="qwen-doc-turbo",
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'system', 'content': 'Smart Office Product Manual Version: V2.0 Release Date: January 2024 Table of Contents 1.1 Product Overview...'},
            {'role': 'user', 'content': 'From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).'}
        ],
        # This code example uses streaming output to clearly show the model's output process. For non-streaming output examples, see https://www.alibabacloud.com/help/en/model-studio/user-guide/text-generation
        stream=True,
        stream_options={"include_usage": True}
    )

    full_content = ""
    for chunk in completion:
        if chunk.choices and chunk.choices[0].delta.content:
            full_content += chunk.choices[0].delta.content
            print(chunk.model_dump())
    
    print(full_content)

except BadRequestError as e:
    print(f"Error message: {e}")
    print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")

Java

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.StreamResponse;
import com.openai.models.chat.completions.*;

public class Main {
    public static void main(String[] args) {
        // Create a client and use the API key from the environment variable
        OpenAIClient client = OpenAIOkHttpClient.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API key: .apiKey("sk-xxx");
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .baseUrl("https://dashscope.aliyuncs.com/compatible-mode/v1")
                .build();

        ChatCompletionCreateParams chatParams = ChatCompletionCreateParams.builder()
                .addSystemMessage("You are a helpful assistant.")
                .addSystemMessage("Smart Office Product Manual Version: V2.0 Release Date: January 2024 Table of Contents 1.1 Product Overview...")
                .addUserMessage("From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).")
                .model("qwen-doc-turbo")
                .build();

        try (StreamResponse<ChatCompletionChunk> streamResponse = client.chat().completions().createStreaming(chatParams)) {
            streamResponse.stream().forEach(chunk -> {
                String content = chunk.choices().get(0).delta().content().orElse("");
                if (!content.isEmpty()) {
                    System.out.print(content);
                }
            });
        } catch (Exception e) {
            System.err.println("Error message: " + e.getMessage());
        }
    }
}

curl

curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--data '{
    "model": "qwen-doc-turbo",
    "messages": [
        {"role": "system","content": "You are a helpful assistant."},
        {"role": "system","content": "Smart Office Product Manual Version: V2.0 Release Date: January 2024 Table of Contents 1.1 Product Overview..."},
        {"role": "user","content": "From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."}
    ],
    "stream": true,
    "stream_options": {
        "include_usage": true
    }
}'

Response example

[
  {
    "model": "PRO-100",
    "name": "Smart Printer",
    "price": "8999"
  },
  {
    "model": "PRO-200",
    "name": "Smart Scanner",
    "price": "12999"
  },
  ...
  {
    "model": "SEC-400",
    "name": "Smart Visitor System",
    "price": "9999"
  },
  {
    "model": "SEC-500",
    "name": "Smart Parking Management",
    "price": "22999"
  }
]

Model pricing

  • Model: qwen-doc-turbo
    • Context window: 262,144 tokens
    • Max input: 253,952 tokens
    • Max output: 32,768 tokens
    • Input cost: $0.087 per million tokens
    • Output cost: $0.144 per million tokens
    • Free quota: No free quota
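
As a rough illustration of how these prices combine, the sketch below estimates the cost of a single request from its token counts. The function name `estimate_cost` is hypothetical. The prices and limits are the ones listed above, and actual token counts should be taken from the usage data returned with each response.

```python
INPUT_PRICE_PER_MILLION = 0.087   # USD per million input tokens, from the pricing above
OUTPUT_PRICE_PER_MILLION = 0.144  # USD per million output tokens, from the pricing above
MAX_INPUT_TOKENS = 253_952
MAX_OUTPUT_TOKENS = 32_768


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one qwen-doc-turbo request."""
    if not 0 <= input_tokens <= MAX_INPUT_TOKENS:
        raise ValueError(f"input_tokens must be between 0 and {MAX_INPUT_TOKENS}")
    if not 0 <= output_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(f"output_tokens must be between 0 and {MAX_OUTPUT_TOKENS}")
    return (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000


cost = estimate_cost(input_tokens=100_000, output_tokens=2_000)
print(f"Estimated cost: ${cost:.6f}")
```

For a request with 100,000 input tokens and 2,000 output tokens, this comes to roughly $0.009.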

FAQ

  1. Where are files stored after being uploaded through the OpenAI compatible file interface?

    All files uploaded through the OpenAI compatible file interface are stored free of charge in the Alibaba Cloud Model Studio bucket under your Alibaba Cloud account. For more information about how to query and manage uploaded files, see OpenAI file interface.

  2. When uploading using the file URL method, what are the differences between the file_parsing_strategy parameter options?

    When the parsing strategy is set to "auto", the system automatically parses the file based on its content. When set to "text_only", the system parses only text content. When set to "text_and_images", the system parses all images and text content, which increases the parsing time.

  3. How can I determine if a file has finished parsing?

    After you obtain a file ID, you can try to start a conversation with the model using that ID. If the file is still parsing, the API returns the error message "File parsing in progress, please try again later." If this happens, wait and then try again. If the model call succeeds and returns a response, the file has finished parsing and is ready to use.

  4. Does the parsing process after file upload incur any extra costs?

    Document parsing is free of charge.
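
The `file_parsing_strategy` options described in FAQ 2 can be validated on the client before a request is sent. The sketch below builds the `doc_url` content item used in the file URL example. The helper name `build_doc_url_item` and the `example.com` URL are placeholders, and the check only mirrors the three option values and the 10-URL limit stated in this document.

```python
ALLOWED_STRATEGIES = {"auto", "text_only", "text_and_images"}
MAX_DOC_URLS = 10  # a single request supports up to 10 file URLs


def build_doc_url_item(urls: list[str], strategy: str = "auto") -> dict:
    """Build the doc_url part of a user message, rejecting invalid input early."""
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"file_parsing_strategy must be one of {sorted(ALLOWED_STRATEGIES)}")
    if not 1 <= len(urls) <= MAX_DOC_URLS:
        raise ValueError(f"Expected 1 to {MAX_DOC_URLS} URLs, got {len(urls)}")
    return {
        "type": "doc_url",
        "doc_url": list(urls),
        "file_parsing_strategy": strategy,
    }


# Placeholder URL for illustration only.
item = build_doc_url_item(["https://example.com/manual-a.docx"], strategy="text_only")
print(item)
```

The returned dictionary can be placed in the `content` list of the user message next to the `text` item.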

API reference

For the input and output parameters of Qwen-Doc-Turbo, see Qwen API reference.

Error codes

If a call fails, see Error messages for troubleshooting.

Limitations

  • SDK dependencies:

    • File URL (doc_url): The file URL method currently supports only the DashScope protocol. You can use the DashScope Python SDK or make an HTTP call, such as using curl.

    • Upload file (file-id): File upload and management operations must use an OpenAI compatible SDK.

  • File upload and reference:

    • File URL (doc_url): A single request supports up to 10 file URLs. The provided URLs must be accessible from the public network.

    • Upload file (file-id): The maximum size of a single file is 150 MB. Each Alibaba Cloud account is limited to 10,000 uploaded files, with a total size of up to 100 GB. Uploaded files do not expire. Each request can reference only one file.

      When you use file IDs, new file upload requests will fail if the file count or total size limit is reached. To continue uploading, delete files that are no longer needed to release your quota. For more information, see OpenAI compatible - File.

    • Supported formats: TXT, DOC, DOCX, PDF, XLS, XLSX, MD, PPT, PPTX, JPG, JPEG, PNG, GIF, and BMP.

  • API input:

    • When passing information using doc_url or a file-id, the maximum context length is 262,144 tokens.

    • When entering plain text directly in a user or system message, the content of a single message is limited to 9,000 tokens.

  • API output:

    • The maximum output length is 32,768 tokens.

  • File sharing:

    • A file-id is valid only within the Alibaba Cloud account that generated it. It cannot be used across accounts or called using the API key of a RAM user.

  • Rate limit: See Rate limits.
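
The format and size limits for uploaded files can be checked locally before calling the upload interface. The following is a minimal sketch under stated assumptions: the helper name `check_upload` is hypothetical, and it assumes that 150 MB means 150 × 1024 × 1024 bytes, which this document does not specify.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {
    "txt", "doc", "docx", "pdf", "xls", "xlsx", "md",
    "ppt", "pptx", "jpg", "jpeg", "png", "gif", "bmp",
}
MAX_FILE_BYTES = 150 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def check_upload(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    extension = Path(name).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is larger than {MAX_FILE_BYTES} bytes")
    return problems


print(check_upload("Sample Product Manual A.docx", 2_000_000))
print(check_upload("archive.zip", 200 * 1024 * 1024))
```

For a file on disk, the arguments could be taken from `path.name` and `path.stat().st_size`.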