For inference scenarios that do not require real-time responses, batch inference asynchronously processes large volumes of data requests at 50% of the cost of real-time inference. Its OpenAI-compatible API is ideal for batch jobs such as model evaluation and data labeling.
How it works
-
Submit a task: Upload a JSONL file containing multiple requests to create a batch inference task.
-
Asynchronous processing: The system processes tasks in a background queue. You can monitor task progress and status on the console or by using the API.
-
Download results: After the task is complete, the system generates a result file for successful responses and an error file detailing any failures.
Scope
Singapore
Supported models: qwen-max, qwen-plus, qwen-flash, qwen-turbo.
China (Beijing)
Supported models:
-
Text generation models: The stable versions and some
latestversions of Qwen-Max, Plus, Flash, and Long. The QwQ series (qwq-plus) and some third-party models (deepseek-r1, deepseek-v3.2, deepseek-v3) are also supported. -
Multimodal models: The stable versions and some
latestversions of Qwen-VL-Max, Plus, and Flash. The Qwen-OCR model is also supported. -
Text embedding models: The text-embedding-v4 model.
-
In the batch processing scenario, the maximum context tokens per request is 256 K for
qwen3.7-max,qwen3.7-plus,qwen3.6-plus,qwen3.5-plus, andqwen3.5-flash. -
Some models support thinking mode. Enabling this mode generates thinking
tokensand increases costs. -
The
qwen3.7,qwen3.6, andqwen3.5series models have thinking mode enabled by default. If you use a hybrid thinking model, you must explicitly set theenable_thinkingparameter. Set this parameter totrueto enable the mode orfalseto disable it. -
In the JSONL request body,
enable_thinkingis a top-level parameter ofbodyand must be placed at the same level asmodel. Do not place it insideextra_body.
Use batch inference
Step 1: Prepare the input file
Before creating a task, prepare a JSONL file that meets the following requirements:
-
Format: UTF-8 encoded JSONL (one JSON object per line).
-
Scale limits: Up to 50,000 requests per file, and a maximum file size of 500 MB.
If your dataset exceeds these limits, split it into multiple files and submit them as separate tasks.
-
Per-line limit: Each JSON object can be up to 6 MB and must not exceed the model's context window.
-
Consistency: All requests within the same file must use the same model.
-
Unique identifier: Each request must include a
custom_idfield that is unique within the file for result matching. The custom_id supports a maximum of 256 characters. If this limit is exceeded, the task validation fails. To return a longer identifier, use a custom field in themetadataparameter when creating the task. For more information, see Use metadata to return custom identifiers.
Each JSON object must adhere to the following schema:
|
Parameter |
Type |
Required |
Description |
|
|
string |
Yes |
A unique identifier for the request within the file. |
|
|
string |
Yes |
The supported HTTP method is |
|
|
string |
Yes |
Only the |
|
|
object |
Yes |
The request body is in the same format as the |
Sample file
You can download the sample file test_model.jsonl. The content is as follows:
{"custom_id":"1","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen-max","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Hello!"}]}}
{"custom_id":"2","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen-max","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is 2+2?"}]}}JSONL batch generation tool
Use this tool to quickly generate JSONL files.
Configure thinking mode in batch inference
Some models, such as qwen3.7-plus, qwen3.7-max, and the qwen3.6 and qwen3.5 series, have thinking mode enabled by default, which generates additional thinking tokens. To configure thinking mode for batch inference, set the enable_thinking parameter at the same level as the model parameter within the body of each request. The optional thinking_budget parameter can be used to set an upper limit on the number of thinking tokens.
The enable_thinking and thinking_budget parameters must be placed directly at the top level of the body, at the same level as model. Do not place them in extra_body. The extra_body parameter is a mechanism for passing non-standard parameters with the OpenAI Python SDK; it is effective only for real-time inference calls and does not apply to batch inference files.
Example: Disable thinking mode
{"custom_id":"request-1","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen3.5-plus","enable_thinking":false,"messages":[{"role":"user","content":"Hello"}]}}
Example: Enable thinking mode and limit the thinking token budget
{"custom_id":"request-2","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen3.5-plus","enable_thinking":true,"thinking_budget":50,"messages":[{"role":"user","content":"Please analyze the following question"}]}}
Step 2: Create a batch inference task
-
On the Batch Inference** page, click Create Batch.
-
In the dialog box that appears, enter a Task Name and Description, set the Maximum Waiting Time (from 1 to 14 days), and upload your JSONL file.
You can click Download Sample File to obtain the template.

-
When you are done, click Confirm.
Step 3: Monitor and manage tasks
-
View:
-
On the task list page, view the task's progress (processed requests/total requests) and Status.
-
Search by task name or ID, or filter by workspace to quickly locate a specific task.

-
-
Manage:
-
Cancel: You can cancel tasks that are in the Executing state in the Actions column.
-
Troubleshoot: For a failed task, hover over the status to view an error summary or download the error file for details.

-
Step 4: Download results
When the task is complete, click View Results to download the output file: 
-
Result file: Records all successful requests and their
responseresults. -
Error file (if any): Records all failed requests and their
errordetails.
Both files contain a custom_id field, which is used to match with the original input data, associate results, or locate errors.
Step 5: View usage statistics (optional)
On the Model Monitoring page, you can filter and view usage statistics for batch inference.
-
View data overview: Select a Time Range (up to 30 days), set Inference Type to Batches, and you can view the following:
-
Monitoring data: Summary statistics for all models in the selected period, such as the total number of calls and failures.
-
Model list: Detailed data for each model, such as total calls, failure rate, and average call duration.

To view inference data that is more than 30 days old, go to the Bills page.
-
-
View model details: In the Models, click Monitor in the Actions column of the target model to view Call Statistics, such as the number of calls and call volume.

-
Call data for batch inference is recorded based on the task completion time. For running tasks, call information cannot be queried until the task is complete.
-
Monitoring data may have a delay of one to two hours.
Use metadata to return custom identifiers
custom_id supports up to 256 characters. If you need to return a longer identifier in the result file, you can use a custom field in metadata.
Metadata fields
metadata is an optional parameter for creating a Batch task. It supports the following fields:
-
ds_name: The name of the task. This name is displayed in the Task Name column on the console. -
ds_description: The description of the task. This description is displayed in the Task Description column in the console. -
Custom fields: In addition to the official fields, the
metadataobject also supports any custom fields, and their values are not limited to 256 characters. When you query task details, all custom fields are returned in full.
Code example
The following example shows how to use a custom field in metadata to pass back an identifier that is longer than 256 characters:
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
batch = client.batches.create(
input_file_id="file-batch-xxxxxxxxxxxxxxxxxxxx",
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={
"ds_name": "my_batch_task",
"ds_description": "A description for my batch inference task",
"my_custom_field": "The value of this field can exceed 256 characters and is used to pass back longer identifier information..."
}
)
print(batch)
After a task is successfully created, you can call the GET /v1/batches/{batch_id} operation to retrieve the complete metadata information, which includes all custom fields and their full content.
API reference
In a production environment, use the OpenAI-compatible API to automate the creation and management of batch tasks. The core workflow is as follows:
-
Call
POST /v1/filesto upload a file. Record the file ID that is returned. -
To create a task, pass in the file ID , call
POST /v1/batches, and record the returnedbatch_id. -
Poll status by using
batch_idto pollGET /v1/batches/{batch_id}. When thestatuschanges tocompleted, record theoutput_file_idand stop polling. -
To download the result file, use the
output_file_idto callGET /v1/files/{output_file_id}/content.
For complete Batch API definitions and code examples, see OpenAI-compatible - Batch (file input).
Task lifecycle
|
Status |
Description |
|
validating |
The system is validating the file format (JSONL specification) and the API format of each request. |
|
in_progress |
The system has validated the file and started processing the inference requests. |
|
finalizing |
All requests have been processed, and the system is writing the results to the output files. |
|
completed |
The result and error files have been generated and are available for download. |
|
failed |
The task failed during the |
|
expired |
The task's runtime exceeded the maximum waiting time set at creation and was terminated by the system. When creating a new task, consider setting a longer waiting time. |
|
cancelled |
The task was cancelled by the user. Any unprocessed requests are terminated. |
Billing
-
Pricing: For all successful requests, both input and output tokens are priced at 50% of the real-time inference price for the corresponding model. For details, see Models and Pricing.
-
Billing scope:
-
You are billed only for successfully executed requests within a task.
-
File parsing failures, task execution failures, or line-level request errors do not incur charges.
-
For cancelled tasks, any requests that were successfully completed before cancellation are billed normally.
-
-
Batch inference is a separate billable item and supports the AI Universal Savings Plan. However, it is not eligible for other promotions like prepaid plans (Savings Plans) and new user free quotas, or features like context caching.
-
Some models, such as qwen3.7-plus, qwen3.7-max, and the qwen3.6 and qwen3.5 series, have thinking mode enabled by default. This generates additional thinking tokens, which are billed at the output token price, thereby increasing costs. To control costs, set the
enable_thinkingparameter based on task complexity. For details, see Deep Thinking.
FAQ
-
Do I need to purchase or enable anything extra to use batch inference?
No. The feature is available after you activate Model Studio. Charges are incurred on a pay-as-you-go basis and deducted from your account balance.
-
Why did my task fail immediately after submission (status changed to
failed)?This typically indicates a file-level error, and no inference requests were executed. Check the following in order:
-
File format: Verify that the file uses the strict JSONL format, with one complete JSON object per line.
-
File scale: Ensure the file size and number of lines do not exceed the limits. For details, see Step 1: Prepare the input file.
-
Model consistency: Check that the
body.modelfield is identical for all requests in the file, and that the model used is supported in the current region.
-
-
How long does a task take to process?
Processing time depends on the system load when the task is submitted. During busy periods, tasks may be queued. However, a result (success or failure) is always returned within the specified maximum waiting time.
Error codes
If a call fails and an error message is returned, see Error codes.