
Alibaba Cloud Model Studio: Model Deployment Overview

Last Updated: Mar 31, 2026

Get independent, dedicated inference services for high concurrency, low latency, and other performance requirements.

Important

This document applies only to the Chinese mainland (Beijing) region.

Billing methods

Before deployment, view the estimated hourly cost for different models on the Model Deployment console (Beijing).
Note

You cannot change the billing method after service creation. To switch, unpublish the deployed model and then redeploy it.

Provisioned Throughput (PTU)

(High throughput; high performance)

Definition: A model deployment method that reserves platform resources to guarantee a specific TPM throughput capacity. Calls within the guaranteed quota are not rate limited.

Advantages:

  1. Provides stable throughput, lower latency, and stronger resource determinism for high-load production environments.

  2. Supports auto-renewal.

Supported models: Some pre-built models.

Scenarios:

  1. Smart customer service for banking apps (stable traffic, requires a guaranteed concurrent experience).

  2. Real-time content moderation for social platforms (requires stable processing of predictable pipeline tasks).

  3. Public cloud translation API (provides baseline service assurance for standard package users).

Billing method: Based on usage duration and provisioned throughput. Pay-as-you-go or daily package.

Scaling method: Self-service adjustment of throughput.

Product constraints:

  1. Upfront payment is billed daily. No early refunds.

  2. If usage exceeds the purchased throughput within a unit of time, excess calls automatically switch to Model Studio's model invocation service.

Model Unit

(Available by whitelist; contact your account manager. Customizable performance metrics, resource isolation, and support for fine-tuned models.)

Definition: A model deployment method that allocates dedicated computing resources based on usage duration and the number of model units.

Advantages:

  1. Customizable performance metrics such as latency and throughput.

  2. Supports auto-renewal.

Supported models: Fine-tuned models.

Scenarios:

  1. E-commerce fine-tuned large language models (deploy private models, manually scale out during sales promotions).

  2. Molecular screening models for pharmaceutical companies (require dedicated resources for long-running tasks).

  3. Autonomous driving simulation (requires long-term continuous computation).

Billing method: Based on usage duration and the number of model units. Pay-as-you-go or monthly package.

Scaling method: Self-service adjustment of the number of model units.

Product constraints: If you cancel an upfront purchase within the first month, the daily unit price is billed at 1.2 times the original rate.

To view per-call token usage and historical call statistics, go to Monitoring (Beijing).

Billing details

Billing by usage duration (Provisioned Throughput)

Cost = Usage Duration × (Input TPM Unit Price × Input TPM + Output TPM Unit Price × Output TPM)
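As a rough illustration, the formula can be evaluated with the hourly pay-as-you-go rates listed for Qwen3-max-2025-09-23 in the pricing table; the purchased TPM values here are hypothetical.

```python
# Worked example of the Provisioned Throughput cost formula above.
# Prices are the hourly pay-as-you-go rates for Qwen3-max-2025-09-23;
# the purchased TPM values are hypothetical.
input_price_per_10k_tpm = 1.11   # USD per hour per 10,000 input TPM
output_price_per_1k_tpm = 0.45   # USD per hour per 1,000 output TPM

purchased_input_tpm = 50_000     # hypothetical provisioned input throughput
purchased_output_tpm = 10_000    # hypothetical provisioned output throughput
hours = 24                       # usage duration

cost = hours * (
    input_price_per_10k_tpm * purchased_input_tpm / 10_000
    + output_price_per_1k_tpm * purchased_output_tpm / 1_000
)
print(f"${cost:.2f}")  # prints "$241.20"
```

Note that the input price is quoted per 10,000 TPM and the output price per 1,000 TPM, so each purchased throughput figure is divided by its own quote unit.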

  • Upfront payment orders take effect immediately upon payment. The validity period ends at 23:59 on day N. If you place an order after 22:00, the expiration date automatically extends by one day.

  • After an upfront payment order expires, the service stops after a 2-hour delay. Resources are retained for 14 hours after stopping and are then released.

  • Upfront payment orders cannot be terminated early.

  • For pay-as-you-go billing, if your account has overdue payments, deployed resources are retained and continue to be billed for 24 hours, and are then automatically released.

If the model input exceeds the maximum input tokens, or usage exceeds the purchased TPM (tokens per minute), calls automatically switch to pay-as-you-go model calls. Inference performance may decrease, the overflow traffic counts against the workspace's shared rate limits, and costs are charged at the standard pay-as-you-go rate.

  • In this case, the API response headers include x-dashscope-ptu-overflow: true.

  • For TPM statistics, go to Model Monitoring (Beijing).
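A minimal sketch of checking the overflow signal described above, assuming only that the response carries the `x-dashscope-ptu-overflow: true` header on overflow; the helper name is ours, not part of any SDK.

```python
# Check whether a call spilled over the purchased PTU quota, based on
# the "x-dashscope-ptu-overflow" response header described above.
# The helper function is illustrative, not an SDK API.
def is_ptu_overflow(headers: dict) -> bool:
    """Return True if the response headers flag a PTU overflow."""
    return headers.get("x-dashscope-ptu-overflow", "").lower() == "true"

# Headers from a hypothetical overflowed call:
print(is_ptu_overflow({"x-dashscope-ptu-overflow": "true"}))  # True
print(is_ptu_overflow({}))                                    # False
```

With the OpenAI Python SDK, raw response headers are available through the `with_raw_response` call variants; with plain HTTP you can read them directly from the response object.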

Hourly prices are pay-as-you-go; daily prices are upfront.

Qwen3-max-2025-09-23 (Instruct)

  Max context window (input + output tokens): 128,000

  Max input tokens: 128,000

  Pay-as-you-go - hourly: $1.11 input (per 10k TPM), $0.45 output (per 1k TPM)

  Upfront - per day: $13.32 input (per 10k TPM), $5.40 output (per 1k TPM)

Qwen-plus-2025-12-01 (Instruct)

  Pay-as-you-go - hourly: $0.28 input (per 10k TPM), $0.07 output (per 1k TPM)

  Upfront - per day: $3.36 input (per 10k TPM), $0.84 output (per 1k TPM)

Qwen-plus-2025-12-01 (Thinking)

  Pay-as-you-go - hourly: $0.28 input (per 10k TPM)

  Upfront - per day: $3.36 input (per 10k TPM)

Qwen-flash-2025-07-28 (Instruct/Thinking)

  Pay-as-you-go - hourly: $0.06 input (per 10k TPM), $0.06 output (per 1k TPM)

  Upfront - per day: $0.72 input (per 10k TPM), $0.72 output (per 1k TPM)

Qwen3-vl-plus-2025-09-23 (Instruct/Thinking)

  Pay-as-you-go - hourly: $0.35 input (per 10k TPM), $0.35 output (per 1k TPM)

  Upfront - per day: $4.20 input (per 10k TPM), $4.20 output (per 1k TPM)

DeepSeek-v3.2 (Instruct/Thinking)

  Max context window (input + output tokens): 64,000

  Pay-as-you-go - hourly: $1.04 input (per 10k TPM), $0.16 output (per 1k TPM)

  Upfront - per day: $12.48 input (per 10k TPM), $1.92 output (per 1k TPM)

Model types:

  • Instruct - The model performs inference in non-thinking mode after deployment.

  • Thinking - The model performs inference in thinking mode after deployment.

Billing by usage duration (Model Unit)

Cost = Usage Duration (hours) × Number of Model Units × Model Unit Price

  • If you cancel an upfront purchase within the first month, the daily unit price is billed at 1.2 times the original rate.

Note

Model unit computing power resources for pay-as-you-go are allocated on a first-come, first-served basis. If a purchase fails, you receive a full refund.
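A worked instance of the formula above, using the $40/hour Type I Model Unit (MU1) price listed for Qwen3-14B; the unit count and duration are hypothetical.

```python
# Worked example of:
# Cost = Usage Duration (hours) x Number of Model Units x Model Unit Price
unit_price_per_hour = 40   # USD per Type I model unit per hour
model_units = 2            # hypothetical number of model units
hours = 72                 # hypothetical usage duration

cost = hours * model_units * unit_price_per_hour
print(cost)  # 5760
```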

Qwen

The hourly unit price is billed per minute for usage under 1 minute. The monthly unit price is billed per day for usage under 1 day; if you cancel an upfront purchase within the first month, the daily unit price is billed at 1.2 times the original rate.

Qwen3-14B (Instruct)

  Rate limiting: Not supported

  Model unit specification: Type I Model Unit (MU1)

  Max context window: Fixed at 131,072 (see qwen-3)

  Hourly unit price: $40/hour

  Monthly unit price: $18,800/month

Qwen3-32B (Instruct)

  Rate limiting: Not supported

  Model unit specification: Type I Model Unit (MU1)

  Max context window: Fixed at 131,072 (see qwen-3)

  Hourly unit price: $40/hour

  Monthly unit price: $18,800/month

Model types:

  • Instruct - The model performs inference in non-thinking mode after deployment.

Qwen-VL

The hourly unit price is billed per minute for usage under 1 minute. The monthly unit price is billed per day for usage under 1 day; if you cancel an upfront purchase within the first month, the daily unit price is billed at 1.2 times the original rate.

Qwen3-VL-8B-Instruct (Instruct)

  Rate limiting: Not supported

  Model unit specification: Type I Model Unit (MU1)

  Max context window: Fixed at 131,072

  Hourly unit price: $20/hour

  Monthly unit price: $9,400/month

Qwen3-VL-8B-Thinking (Thinking)

  Rate limiting: Not supported

Model types:

  • Instruct - The model performs inference in non-thinking mode after deployment.

  • Thinking - The model performs inference in thinking mode after deployment.

To deploy more models, refer to this solution and select the deployment option that best fits your business requirements.

Deployment methods

Deploy models in the console. Follow these steps:

If you encounter a permission error, see What if a permission error occurs during deployment?

  1. Go to the Deployment console (Beijing).

  2. Select a model and a billing method. Retain the default settings. Then set a model name and start the deployment.

  3. When the deployment status shows Running, the model is successfully deployed.

Important

You incur charges after successful model deployment.

Deployment configuration

Model unit

Model inference pattern: Some models let you configure the inference pattern and the maximum context window when deployed as a Model Unit.

  • Instruct: The model performs inference in non-thinking mode after deployment.

  • Thinking: The model performs inference in thinking mode after deployment.

Maximum context window: Some models support this setting in Model Unit deployment mode. The maximum context window depends on the model type.

Service throttling: Some models support this setting in Model Unit deployment mode. You can limit the RPM and TPM of model calls.

Call a deployed model

After you deploy a model, you can call it using the OpenAI-compatible interface, DashScope, or the Assistant SDK.

When you call a successfully deployed model, set the model parameter to the model code that appears after deployment. Go to the Deployment console (Beijing) to get the Model Code.


DashScope

import os
import dashscope

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
response = dashscope.Generation.call(
    # If you have not set the environment variable, replace the next line with: api_key="sk-xxx"
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-max-xxx-xxx",  # Replace with the model code from your successful deployment
    messages=messages,
    result_format="message",
    enable_thinking=False,
)
print(response)

OpenAI-compatible interface

import os
from openai import OpenAI


client = OpenAI(
    # If you have not set the environment variable, replace the next line with: api_key="sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-max-xxx-xxx",  # Replace with the model code from your successful deployment
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    extra_body={"enable_thinking": False},
)
print(completion)

Scale your service

  • Provisioned throughput (by time): Click Scaling to manually adjust the provisioned throughput.

  • Model unit (by time): Click Scaling to manually adjust the number of model units.

Unpublish a deployed service

Go to the Deployment console (Beijing), find the deployed service, click Deactivate, and confirm. Billing stops after unpublishing.


FAQ

Can I upload and deploy my own models?

Currently, you cannot upload and deploy your own models. We recommend that you follow the latest updates from Alibaba Cloud Model Studio.

Platform for AI (PAI) also lets you deploy your own models. For deployment methods, see Deploy large language models.

What if a permission error occurs during deployment?

  1. If the message 'Missing permissions for this module' appears, ensure that your account is granted the ModelDeploy-FullAccess permission on the permission management page for the workspace.

    If you cannot perform the operation, contact your organization or IT administrator to add the required permissions or resolve the issue.

  2. If the error 'Workspace xx does not have permission to deploy model xx' occurs during deployment, go to the Workspaces page in Model Studio and add the deployment permission for the model to the workspace.

    API call error: Workspace xxx does not have deployment privilege for model xxxx.

    If a permission error occurs, contact your organization or IT administrator to add the required permissions or perform the operation for you.