All Products
Search
Document Center

Alibaba Cloud Model Studio:Batch inference

Last Updated:Jun 22, 2026

For inference scenarios that do not require real-time responses, batch inference asynchronously processes large volumes of data requests at 50% of the cost of real-time inference. Its OpenAI-compatible API is ideal for batch jobs such as model evaluation and data labeling.

How it works

  1. Submit a task: Upload a JSONL file containing multiple requests to create a batch inference task.

  2. Asynchronous processing: The system processes tasks in a background queue. You can monitor task progress and status on the console or by using the API.

  3. Download results: After the task is complete, the system generates a result file for successful responses and an error file detailing any failures.

Scope

Singapore

Supported models: qwen-max, qwen-plus, qwen-flash, qwen-turbo.

China (Beijing)

Supported models:

  • Text generation models: The stable versions and some latest versions of Qwen-Max, Plus, Flash, and Long. The QwQ series (qwq-plus) and some third-party models (deepseek-r1, deepseek-v3.2, deepseek-v3) are also supported.

  • Multimodal models: The stable versions and some latest versions of Qwen-VL-Max, Plus, and Flash. The Qwen-OCR model is also supported.

  • Text embedding models: The text-embedding-v4 model.

List of supported model names

Important
  • In the batch processing scenario, the maximum context tokens per request is 256 K for qwen3.7-max, qwen3.7-plus, qwen3.6-plus, qwen3.5-plus, and qwen3.5-flash.

  • Some models support thinking mode. Enabling this mode generates thinking tokens and increases costs.

  • The qwen3.7, qwen3.6, and qwen3.5 series models have thinking mode enabled by default. If you use a hybrid thinking model, you must explicitly set the enable_thinking parameter. Set this parameter to true to enable the mode or false to disable it.

  • In the JSONL request body, enable_thinking is a top-level parameter of body and must be placed at the same level as model. Do not place it inside extra_body.

Use batch inference

Step 1: Prepare the input file

Before creating a task, prepare a JSONL file that meets the following requirements:

  • Format: UTF-8 encoded JSONL (one JSON object per line).

  • Scale limits: Up to 50,000 requests per file, and a maximum file size of 500 MB.

    If your dataset exceeds these limits, split it into multiple files and submit them as separate tasks.
  • Per-line limit: Each JSON object can be up to 6 MB and must not exceed the model's context window.

  • Consistency: All requests within the same file must use the same model.

  • Unique identifier: Each request must include a custom_id field that is unique within the file for result matching. The custom_id supports a maximum of 256 characters. If this limit is exceeded, the task validation fails. To return a longer identifier, use a custom field in the metadata parameter when creating the task. For more information, see Use metadata to return custom identifiers.

Each JSON object must adhere to the following schema:

Parameter

Type

Required

Description

custom_id

string

Yes

A unique identifier for the request within the file.

method

string

Yes

The supported HTTP method is POST.

url

string

Yes

Only the /v1/chat/completions request endpoint is supported.

body

object

Yes

The request body is in the same format as the /v1/chat/completions API.

Sample file

You can download the sample file test_model.jsonl. The content is as follows:

{"custom_id":"1","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen-max","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Hello!"}]}}
{"custom_id":"2","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen-max","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is 2+2?"}]}}

JSONL batch generation tool

Use this tool to quickly generate JSONL files.

JSONL batch generation tool
Select a mode:

Configure thinking mode in batch inference

Some models, such as qwen3.7-plus, qwen3.7-max, and the qwen3.6 and qwen3.5 series, have thinking mode enabled by default, which generates additional thinking tokens. To configure thinking mode for batch inference, set the enable_thinking parameter at the same level as the model parameter within the body of each request. The optional thinking_budget parameter can be used to set an upper limit on the number of thinking tokens.

Important

The enable_thinking and thinking_budget parameters must be placed directly at the top level of the body, at the same level as model. Do not place them in extra_body. The extra_body parameter is a mechanism for passing non-standard parameters with the OpenAI Python SDK; it is effective only for real-time inference calls and does not apply to batch inference files.

Example: Disable thinking mode

{"custom_id":"request-1","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen3.5-plus","enable_thinking":false,"messages":[{"role":"user","content":"Hello"}]}}

Example: Enable thinking mode and limit the thinking token budget

{"custom_id":"request-2","method":"POST","url":"/v1/chat/completions","body":{"model":"qwen3.5-plus","enable_thinking":true,"thinking_budget":50,"messages":[{"role":"user","content":"Please analyze the following question"}]}}

Step 2: Create a batch inference task

  1. On the Batch Inference** page, click Create Batch.

  2. In the dialog box that appears, enter a Task Name and Description, set the Maximum Waiting Time (from 1 to 14 days), and upload your JSONL file.

    You can click Download Sample File to obtain the template.

    image

  3. When you are done, click Confirm.

Step 3: Monitor and manage tasks

  • View:

    • On the task list page, view the task's progress (processed requests/total requests) and Status.

    • Search by task name or ID, or filter by workspace to quickly locate a specific task.image

  • Manage:

    • Cancel: You can cancel tasks that are in the Executing state in the Actions column.

    • Troubleshoot: For a failed task, hover over the status to view an error summary or download the error file for details.image

Step 4: Download results

When the task is complete, click View Results to download the output file: image

  • Result file: Records all successful requests and their response results.

  • Error file (if any): Records all failed requests and their error details.

Both files contain a custom_id field, which is used to match with the original input data, associate results, or locate errors.

Step 5: View usage statistics (optional)

On the Model Monitoring page, you can filter and view usage statistics for batch inference.

  • View data overview: Select a Time Range (up to 30 days), set Inference Type to Batches, and you can view the following:

    • Monitoring data: Summary statistics for all models in the selected period, such as the total number of calls and failures.

    • Model list: Detailed data for each model, such as total calls, failure rate, and average call duration.

    image

    To view inference data that is more than 30 days old, go to the Bills page.
  • View model details: In the Models, click Monitor in the Actions column of the target model to view Call Statistics, such as the number of calls and call volume.image

Important
  • Call data for batch inference is recorded based on the task completion time. For running tasks, call information cannot be queried until the task is complete.

  • Monitoring data may have a delay of one to two hours.

Use metadata to return custom identifiers

custom_id supports up to 256 characters. If you need to return a longer identifier in the result file, you can use a custom field in metadata.

Metadata fields

metadata is an optional parameter for creating a Batch task. It supports the following fields:

  • ds_name: The name of the task. This name is displayed in the Task Name column on the console.

  • ds_description: The description of the task. This description is displayed in the Task Description column in the console.

  • Custom fields: In addition to the official fields, the metadata object also supports any custom fields, and their values are not limited to 256 characters. When you query task details, all custom fields are returned in full.

Code example

The following example shows how to use a custom field in metadata to pass back an identifier that is longer than 256 characters:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

batch = client.batches.create(
    input_file_id="file-batch-xxxxxxxxxxxxxxxxxxxx",
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "ds_name": "my_batch_task",
        "ds_description": "A description for my batch inference task",
        "my_custom_field": "The value of this field can exceed 256 characters and is used to pass back longer identifier information..."
    }
)
print(batch)

After a task is successfully created, you can call the GET /v1/batches/{batch_id} operation to retrieve the complete metadata information, which includes all custom fields and their full content.

API reference

In a production environment, use the OpenAI-compatible API to automate the creation and management of batch tasks. The core workflow is as follows:

  1. Upload a file

    Call POST /v1/files to upload a file. Record the file ID that is returned.

  2. To create a task, pass in the file ID , call POST /v1/batches, and record the returned batch_id.

  3. Poll status by using batch_id to poll GET /v1/batches/{batch_id}. When the status changes to completed, record the output_file_id and stop polling.

  4. To download the result file, use the output_file_id to call GET /v1/files/{output_file_id}/content.

For complete Batch API definitions and code examples, see OpenAI-compatible - Batch (file input).

Task lifecycle

Status

Description

validating

The system is validating the file format (JSONL specification) and the API format of each request.

in_progress

The system has validated the file and started processing the inference requests.

finalizing

All requests have been processed, and the system is writing the results to the output files.

completed

The result and error files have been generated and are available for download.

failed

The task failed during the validating stage, typically due to file-level errors such as an incorrect JSONL format or an oversized file. In this state, no inference requests are executed, and no result file is generated.

expired

The task's runtime exceeded the maximum waiting time set at creation and was terminated by the system. When creating a new task, consider setting a longer waiting time.

cancelled

The task was cancelled by the user. Any unprocessed requests are terminated.

Billing

  • Pricing: For all successful requests, both input and output tokens are priced at 50% of the real-time inference price for the corresponding model. For details, see Models and Pricing.

  • Billing scope:

    • You are billed only for successfully executed requests within a task.

    • File parsing failures, task execution failures, or line-level request errors do not incur charges.

    • For cancelled tasks, any requests that were successfully completed before cancellation are billed normally.

Note
  • Batch inference is a separate billable item and supports the AI Universal Savings Plan. However, it is not eligible for other promotions like prepaid plans (Savings Plans) and new user free quotas, or features like context caching.

  • Some models, such as qwen3.7-plus, qwen3.7-max, and the qwen3.6 and qwen3.5 series, have thinking mode enabled by default. This generates additional thinking tokens, which are billed at the output token price, thereby increasing costs. To control costs, set the enable_thinking parameter based on task complexity. For details, see Deep Thinking.

FAQ

  1. Do I need to purchase or enable anything extra to use batch inference?

    No. The feature is available after you activate Model Studio. Charges are incurred on a pay-as-you-go basis and deducted from your account balance.

  2. Why did my task fail immediately after submission (status changed to failed)?

    This typically indicates a file-level error, and no inference requests were executed. Check the following in order:

    • File format: Verify that the file uses the strict JSONL format, with one complete JSON object per line.

    • File scale: Ensure the file size and number of lines do not exceed the limits. For details, see Step 1: Prepare the input file.

    • Model consistency: Check that the body.model field is identical for all requests in the file, and that the model used is supported in the current region.

  3. How long does a task take to process?

    Processing time depends on the system load when the task is submitted. During busy periods, tasks may be queued. However, a result (success or failure) is always returned within the specified maximum waiting time.

Error codes

If a call fails and an error message is returned, see Error codes.