The data mining model, Qwen-Doc-Turbo, is designed for information extraction, content moderation, classification, and summary generation. Unlike general-purpose chat models, it quickly and accurately outputs standardized structured data, such as JSON. This addresses the problem of general-purpose models returning non-standard response structures or extracting information inaccurately.
This document applies only to the China (Beijing) region. To use the model, you must use an API key from the China (Beijing) region.
Implementation guide
Qwen-Doc-Turbo supports extracting information from files in three ways. For more information about file size and type limits, see Limitations.
Feature | File URL (Recommended) | File ID | Plain text |
File source | Public URL | Local file (upload required) | Passed as a string |
Input length limit | Up to 10 files | 1 file | Up to 9,000 tokens |
SDK compatibility | DashScope protocol only (DashScope Python SDK or HTTP) | Upload: OpenAI compatible SDK | |
Key advantages | No upload to Model Studio required. Supports batch calls. | Avoids repeated uploads. Ideal for reuse. | No file management required. |
Prerequisites
You have created an API key and exported it as an environment variable.
If you plan to call the model using an SDK, install the OpenAI SDK or the DashScope SDK.
Pass a file URL
Extract structured data directly using file URLs and process up to 10 files simultaneously. This example shows how to pass the Sample Product Manual A and Sample Product Manual B files and use a prompt to instruct the model to return the extracted information in JSON format.
The file URL method currently supports only the DashScope protocol. You can use the DashScope Python SDK or make an HTTP call, such as using curl.
Python
import os
import dashscope

try:
    response = dashscope.Generation.call(
        api_key=os.getenv('DASHSCOPE_API_KEY'),  # If you have not set the environment variable, replace this with your API key
        model='qwen-doc-turbo',
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "From these two product manuals, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."
                    },
                    {
                        "type": "doc_url",
                        "doc_url": [
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/jockge/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CA.docx",
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/ztwxzr/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CB.docx"
                        ],
                        "file_parsing_strategy": "auto"
                    }
                ]
            }
        ]
    )
    # The call sits inside the try block so that network and SDK errors are also caught
    if response.status_code == 200:
        print(response.output.choices[0].message.content)
    else:
        print(f"Request failed, status code: {response.status_code}")
        print(f"Error code: {response.code}")
        print(f"Error message: {response.message}")
        print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")
except Exception as e:
    print(f"An error occurred: {e}")
    print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")
curl
curl --location 'https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'X-DashScope-SSE: enable' \
--data '{
    "model": "qwen-doc-turbo",
    "input": {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "From these two product manuals, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."
                    },
                    {
                        "type": "doc_url",
                        "doc_url": [
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/jockge/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CA.docx",
                            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20251107/ztwxzr/%E7%A4%BA%E4%BE%8B%E4%BA%A7%E5%93%81%E6%89%8B%E5%86%8CB.docx"
                        ],
                        "file_parsing_strategy": "auto"
                    }
                ]
            }
        ]
    }
}'
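The prompt in the examples above asks for a JSON array whose objects carry model, name, and price. The following is a minimal sketch of how the returned text could be turned into Python objects and checked. It assumes the reply may contain extra text around the array, which is an assumption rather than documented behavior. The helper name parse_product_json and the sample reply are hypothetical.

```python
import json


def parse_product_json(reply_text):
    """Parse a model reply into a list of product dicts.

    Assumption: the reply holds one JSON array, possibly surrounded by
    extra text. Everything between the first '[' and the last ']' is parsed.
    """
    start = reply_text.find("[")
    end = reply_text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in the reply")
    products = json.loads(reply_text[start:end + 1])
    required = {"model", "name", "price"}
    for item in products:
        if not isinstance(item, dict):
            raise ValueError("each product entry must be a JSON object")
        missing = required - set(item)
        if missing:
            raise ValueError(f"product entry is missing keys: {sorted(missing)}")
    return products


# Hypothetical reply text, not real model output.
sample_reply = '[{"model": "X-100", "name": "Sample Desk", "price": "1299"}]'
print(parse_product_json(sample_reply))
```

Validating the keys right after parsing makes a malformed reply fail early instead of surfacing later in downstream code.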
Pass a file ID
Upload a file
Before you run the following code, click Sample Product Manual A to download the file and place it in the same directory as your project code. Then, upload the file to the secure bucket in Alibaba Cloud Model Studio through the OpenAI compatible interface to obtain a file-id. For more information about the parameters and call methods for the file upload interface, see the API reference.
Python
import os
from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you have not set the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # Enter the DashScope service base_url
)

file_object = client.files.create(file=Path("Sample Product Manual A.docx"), purpose="file-extract")
# Print the file-id for use in subsequent model calls
print(file_object.id)
Java
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.files.*;

import java.nio.file.Path;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        // Create a client and use the API key from the environment variable
        OpenAIClient client = OpenAIOkHttpClient.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .baseUrl("https://dashscope.aliyuncs.com/compatible-mode/v1")
                .build();
        // Set the file path. Modify the path and filename as needed.
        Path filePath = Paths.get("src/main/java/org/example/Sample Product Manual A.docx");
        // Create file upload parameters
        FileCreateParams fileParams = FileCreateParams.builder()
                .file(filePath)
                .purpose(FilePurpose.of("file-extract"))
                .build();
        // Upload the file
        FileObject fileObject = client.files().create(fileParams);
        // Print the file-id for use in subsequent model calls
        System.out.println(fileObject.id());
    }
}
curl
curl --location --request POST 'https://dashscope.aliyuncs.com/compatible-mode/v1/files' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--form 'file=@"Sample Product Manual A.docx"' \
--form 'purpose="file-extract"'
Run the code to obtain the file-id for the uploaded file.
Pass information and start a conversation using a file ID
Embed the obtained file-id into a system message. The first system message sets the role for the model. The subsequent system message passes the file-id. The user message contains the specific query about the file.
Python
import os
from openai import OpenAI, BadRequestError

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you have not set the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

try:
    completion = client.chat.completions.create(
        model="qwen-doc-turbo",
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            # Replace '{FILE_ID}' with the file-id from your scenario
            {'role': 'system', 'content': 'fileid://{FILE_ID}'},
            {'role': 'user', 'content': 'From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).'}
        ],
        # This code example uses streaming output to clearly show the model's output process. For non-streaming output examples, see https://www.alibabacloud.com/help/en/model-studio/user-guide/text-generation
        stream=True,
        stream_options={"include_usage": True}
    )
    full_content = ""
    for chunk in completion:
        if chunk.choices and chunk.choices[0].delta.content:
            full_content += chunk.choices[0].delta.content
            print(chunk.model_dump())
    print(full_content)
except BadRequestError as e:
    print(f"Error message: {e}")
    print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")
Java
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.StreamResponse;
import com.openai.models.chat.completions.*;

public class Main {
    public static void main(String[] args) {
        // Create a client and use the API key from the environment variable
        OpenAIClient client = OpenAIOkHttpClient.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API key: .apiKey("sk-xxx");
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .baseUrl("https://dashscope.aliyuncs.com/compatible-mode/v1")
                .build();
        ChatCompletionCreateParams chatParams = ChatCompletionCreateParams.builder()
                .addSystemMessage("You are a helpful assistant.")
                // Replace '{FILE_ID}' with the file-id from your scenario
                .addSystemMessage("fileid://{FILE_ID}")
                .addUserMessage("From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).")
                .model("qwen-doc-turbo")
                .build();
        try (StreamResponse<ChatCompletionChunk> streamResponse = client.chat().completions().createStreaming(chatParams)) {
            streamResponse.stream().forEach(chunk -> {
                String content = chunk.choices().get(0).delta().content().orElse("");
                if (!content.isEmpty()) {
                    System.out.print(content);
                }
            });
        } catch (Exception e) {
            System.err.println("Error message: " + e.getMessage());
        }
    }
}
curl
curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--data '{
    "model": "qwen-doc-turbo",
    "messages": [
        {"role": "system","content": "You are a helpful assistant."},
        {"role": "system","content": "fileid://{FILE_ID}"},
        {"role": "user","content": "From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."}
    ],
    "stream": true,
    "stream_options": {
        "include_usage": true
    }
}'
Pass plain text
In addition to passing file information using a file-id, you can also pass the file content directly as a string. When using this method, to prevent the model from confusing the role setting with the file content, ensure that the role-setting information is in the first message of the messages array.
Because of API request body size limits, if your text content exceeds 9,000 tokens, pass the content using a file URL or a file ID.
Python
import os
from openai import OpenAI, BadRequestError

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you have not set the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

try:
    completion = client.chat.completions.create(
        model="qwen-doc-turbo",
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'system', 'content': 'Smart Office Product Manual Version: V2.0 Release Date: January 2024 Table of Contents 1.1 Product Overview...'},
            {'role': 'user', 'content': 'From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).'}
        ],
        # This code example uses streaming output to clearly show the model's output process. For non-streaming output examples, see https://www.alibabacloud.com/help/en/model-studio/user-guide/text-generation
        stream=True,
        stream_options={"include_usage": True}
    )
    full_content = ""
    for chunk in completion:
        if chunk.choices and chunk.choices[0].delta.content:
            full_content += chunk.choices[0].delta.content
            print(chunk.model_dump())
    print(full_content)
except BadRequestError as e:
    print(f"Error message: {e}")
    print("For more information, see https://www.alibabacloud.com/help/en/model-studio/developer-reference/error-codes")
Java
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.StreamResponse;
import com.openai.models.chat.completions.*;

public class Main {
    public static void main(String[] args) {
        // Create a client and use the API key from the environment variable
        OpenAIClient client = OpenAIOkHttpClient.builder()
                // If you have not set the environment variable, replace the next line with your Model Studio API key: .apiKey("sk-xxx");
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .baseUrl("https://dashscope.aliyuncs.com/compatible-mode/v1")
                .build();
        ChatCompletionCreateParams chatParams = ChatCompletionCreateParams.builder()
                .addSystemMessage("You are a helpful assistant.")
                .addSystemMessage("Smart Office Product Manual Version: V2.0 Release Date: January 2024 Table of Contents 1.1 Product Overview...")
                .addUserMessage("From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed).")
                .model("qwen-doc-turbo")
                .build();
        try (StreamResponse<ChatCompletionChunk> streamResponse = client.chat().completions().createStreaming(chatParams)) {
            streamResponse.stream().forEach(chunk -> {
                String content = chunk.choices().get(0).delta().content().orElse("");
                if (!content.isEmpty()) {
                    System.out.print(content);
                }
            });
        } catch (Exception e) {
            System.err.println("Error message: " + e.getMessage());
        }
    }
}
curl
curl --location 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header "Content-Type: application/json" \
--data '{
    "model": "qwen-doc-turbo",
    "messages": [
        {"role": "system","content": "You are a helpful assistant."},
        {"role": "system","content": "Smart Office Product Manual Version: V2.0 Release Date: January 2024 Table of Contents 1.1 Product Overview..."},
        {"role": "user","content": "From this product manual, extract all product information and organize it into a standard JSON array. Each object must include the following: model (the product model), name (the product name), and price (the price, with currency symbols and commas removed)."}
    ],
    "stream": true,
    "stream_options": {
        "include_usage": true
    }
}'
Model pricing
Model | Context window (tokens) | Max input (tokens) | Max output (tokens) | Input cost (per million tokens) | Output cost (per million tokens) | Free quota |
qwen-doc-turbo | 262,144 | 253,952 | 32,768 | $0.087 | $0.144 | No free quota |
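As a rough illustration of how the prices in the table translate into a per-call cost, the sketch below multiplies token counts by the listed rates. It assumes the listed prices are in USD per million tokens. The function name estimate_cost and the token counts are hypothetical.

```python
# Prices from the pricing table, assumed to be USD per million tokens.
INPUT_PRICE_PER_MILLION = 0.087
OUTPUT_PRICE_PER_MILLION = 0.144


def estimate_cost(input_tokens, output_tokens):
    """Return the estimated cost in USD for a single call."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


# Hypothetical call: 200,000 input tokens and 4,000 output tokens.
print(f"${estimate_cost(200_000, 4_000):.6f}")
```

For the hypothetical call, the input part is 0.2 × $0.087 = $0.0174 and the output part is 0.004 × $0.144 = $0.000576, for a total of about $0.017976. In the streaming examples, real token counts can be read from the usage data requested with include_usage.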
FAQ
Where are files stored after being uploaded through the OpenAI compatible file interface?
All files uploaded through the OpenAI compatible file interface are stored free of charge in the Alibaba Cloud Model Studio bucket under your Alibaba Cloud account. For more information about how to query and manage uploaded files, see OpenAI file interface.
When uploading using the file URL method, what are the differences between the file_parsing_strategy parameter options?
When the parsing strategy is set to "auto", the system automatically parses the file based on its content. When set to "text_only", the system parses only text content. When set to "text_and_images", the system parses all images and text content, which increases the parsing time.
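To make the three options concrete, here is a minimal sketch of a helper that builds the doc_url content item for a user message and rejects unknown strategy values before a request is sent. The helper name build_doc_url_part and the example URL are hypothetical. The 10-URL cap follows the limit stated in this document.

```python
ALLOWED_STRATEGIES = {"auto", "text_only", "text_and_images"}
MAX_URLS_PER_REQUEST = 10


def build_doc_url_part(urls, strategy="auto"):
    """Build the doc_url content item placed in the user message."""
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unsupported file_parsing_strategy: {strategy!r}")
    if not 1 <= len(urls) <= MAX_URLS_PER_REQUEST:
        raise ValueError("a request takes between 1 and 10 file URLs")
    return {
        "type": "doc_url",
        "doc_url": list(urls),
        "file_parsing_strategy": strategy,
    }


# Hypothetical public URL; replace it with a real, publicly accessible file.
part = build_doc_url_part(["https://example.com/manual.docx"], strategy="text_only")
print(part)
```

The returned dictionary has the same shape as the doc_url item in the file URL example, so it can be appended to the content list of the user message.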
How can I determine if a file has finished parsing?
After you obtain a file ID, try to start a conversation with the model using that ID. If the file is still parsing, the API returns the error message "File parsing in progress, please try again later." If this happens, try again later. If the model call succeeds and returns a response, the file has finished parsing and is ready to use.
Does the parsing process after file upload incur any extra costs?
Document parsing is free of charge.
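The retry advice in the FAQ entry on file parsing can be sketched as a small loop. This is an illustration under assumptions: it presumes the error text "File parsing in progress" appears in the string form of the raised exception, and the helper name call_when_parsed, the attempt count, and the wait time are hypothetical choices. The fake_request function stands in for a real model call.

```python
import time


def call_when_parsed(make_request, max_attempts=5, wait_seconds=2.0, sleep=time.sleep):
    """Retry make_request while it fails with the parsing-in-progress error.

    make_request is any zero-argument callable that performs the model call
    and raises an exception on failure. Other errors are re-raised at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return make_request()
        except Exception as error:
            still_parsing = "File parsing in progress" in str(error)
            if not still_parsing or attempt == max_attempts:
                raise
            sleep(wait_seconds)


# Hypothetical stand-in for a real model call: fails twice, then succeeds.
attempts = {"count": 0}


def fake_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("File parsing in progress, please try again later.")
    return "parsed"


print(call_when_parsed(fake_request, wait_seconds=0))
```

In real use, make_request would wrap the chat completion call that references the file-id, and a longer wait between attempts is likely appropriate for large files.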
API reference
For the input and output parameters of Qwen-Doc-Turbo, see Qwen API reference.
Error codes
If a call fails, see Error messages for troubleshooting.
Limitations
SDK dependencies:
File URL (doc_url): The file URL method currently supports only the DashScope protocol. You can use the DashScope Python SDK or make an HTTP call, such as using curl.
Upload file (file-id): File upload and management operations must use an OpenAI compatible SDK.
File upload and reference:
File URL (doc_url): A single request supports up to 10 file URLs. The provided URLs must be accessible from the public network.
Upload file (file-id): The maximum size of a single file is 150 MB. Each Alibaba Cloud account is limited to 10,000 uploaded files, with a total size of up to 100 GB. Uploaded files do not expire. Each request can reference only one file.
When you use file IDs, new file upload requests will fail if the file count or total size limit is reached. To continue uploading, delete files that are no longer needed to release your quota. For more information, see OpenAI compatible - File.
Supported formats: TXT, DOC, DOCX, PDF, XLS, XLSX, MD, PPT, PPTX, JPG, JPEG, PNG, GIF, and BMP.
API input:
When passing information using doc_url or a file-id, the maximum context length is 262,144 tokens.
When entering plain text directly in a user or system message, the content of a single message is limited to 9,000 tokens.
API output:
The maximum output length is 32,768 tokens.
File sharing:
A file-id is valid only within the Alibaba Cloud account that generated it. It cannot be used across accounts or called using the API key of a RAM user.
Rate limit: See Rate limits.
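The per-file upload limits listed above (150 MB maximum size and the supported formats) can be checked locally before calling the upload interface. The following is a minimal sketch. The helper name check_upload is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes, which this document does not specify.

```python
from pathlib import Path

# Assumption: 1 MB = 1024 * 1024 bytes.
MAX_FILE_BYTES = 150 * 1024 * 1024
SUPPORTED_EXTENSIONS = {
    ".txt", ".doc", ".docx", ".pdf", ".xls", ".xlsx", ".md",
    ".ppt", ".pptx", ".jpg", ".jpeg", ".png", ".gif", ".bmp",
}


def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 150 MB")
    return problems


# Hypothetical file names and sizes.
print(check_upload("Sample Product Manual A.docx", 2_000_000))
print(check_upload("archive.zip", 200 * 1024 * 1024))
```

For a file on disk, the size can be read with Path(filename).stat().st_size. Account-level quotas (file count and total size) are enforced on the server side and are not covered by this check.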