Deploy a model on dedicated inference resources to meet performance requirements such as high concurrency, low latency, and predictable traffic.
This document applies only to the China (Beijing) region.
Supported models and pricing
Deployment uses Provisioned Throughput, billed by usage duration and Provisioned Throughput Unit (PTU).
Before you deploy, check the estimated hourly cost for each model in the Deployment console (Beijing).
| Model | Type | Context window (input + output) | Max input tokens | Pay-as-you-go: Input (per 10k TPM, hourly) | Pay-as-you-go: Output (per 1k TPM, hourly) | Subscription: Input (per 10k TPM, daily) | Subscription: Output (per 1k TPM, daily) |
|---|---|---|---|---|---|---|---|
| Qwen3-Max-2025-09-23 | Instruct | 128,000 | 128,000 | $1.11 | $0.45 | $13.32 | $5.40 |
| Qwen-Plus-2025-12-01 | Instruct | | | $0.28 | $0.07 | $3.36 | $0.84 |
| Qwen-Plus-2025-12-01 | Thinking | | | $0.28 | | $3.36 | |
| Qwen-Flash-2025-07-28 | Instruct/Thinking | | | $0.06 | $0.06 | $0.72 | $0.72 |
| Qwen3-VL-Plus-2025-09-23 | Instruct/Thinking | | | $0.35 | $0.35 | $4.20 | $4.20 |
| DeepSeek-v3.2 | Instruct/Thinking | 64,000 | | $1.04 | $0.16 | $12.48 | $1.92 |
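As a rough illustration of how these unit prices combine with the billing formula (Cost = Usage duration x (Input TPM unit price x Input TPM + Output TPM unit price x Output TPM)), here is a minimal sketch. The function name is illustrative, and the example uses the Qwen3-Max-2025-09-23 pay-as-you-go prices from the table above; always confirm the estimated cost in the Deployment console before deploying.

```python
def provisioned_throughput_cost(hours, input_tpm, output_tpm,
                                input_price_per_10k, output_price_per_1k):
    """Estimate cost: duration x (input unit price x input TPM + output unit price x output TPM)."""
    hourly = (input_price_per_10k * input_tpm / 10_000
              + output_price_per_1k * output_tpm / 1_000)
    return hours * hourly

# Qwen3-Max-2025-09-23, pay-as-you-go: $1.11 per 10k input TPM, $0.45 per 1k output TPM.
# 24 hours at 100k input TPM and 10k output TPM:
cost = provisioned_throughput_cost(24, 100_000, 10_000, 1.11, 0.45)
print(f"${cost:.2f}")  # → $374.40
```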
Model types:
- Instruct: The model runs in non-thinking mode after deployment.
- Thinking: The model runs in thinking mode after deployment.
To deploy models beyond this list, see available options in this deployment solution.
View token usage and call statistics for individual invocations in the Monitoring (Beijing) console.
Deploy a model
If you get a permission error, see What to do if I get a permission error during deployment in the FAQ section.
Go to the Deployment console (Beijing).

Select a model and billing method. Keep other settings at their defaults. Set a model name and start the deployment.
When the deployment status shows Running, the model is ready.
Billing starts as soon as the model is deployed.
Invoke a deployed model
After deployment, invoke the model through one of these APIs:
Set the model parameter to the Model Code shown in the Deployment console (Beijing).

OpenAI compatible

```python
import os
from openai import OpenAI

client = OpenAI(
    # If you haven't configured an environment variable, replace the next line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
    model="<your-deployed-model-code>",  # Replace with your Model Code from the deployment console
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    extra_body={"enable_thinking": False},
)
print(completion)
```

DashScope
```python
import os
import dashscope

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
response = dashscope.Generation.call(
    # If you haven't configured an environment variable, replace the next line with: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="<your-deployed-model-code>",  # Replace with your Model Code from the deployment console
    messages=messages,
    result_format="message",
    enable_thinking=False,
)
print(response)
```

Replace <your-deployed-model-code> with the Model Code from the deployment console.
Scale a deployed service
Click Scaling in the Deployment console (Beijing) to manually adjust the number of instances.
Deactivate a deployed service
Go to the Deployment console (Beijing).
Find the service and click Deactivate, then confirm.
Billing stops after deactivation.

Billing
Billing methods
You cannot change the billing method after you create the service. To switch, deactivate the deployed model and redeploy it.
|  | Pay-as-you-go | Subscription |
|---|---|---|
| Minimum billing unit | Per minute | Per day |
| Scaling | Self-service throughput adjustment | Self-service throughput adjustment |
| Advantages | Stable throughput capacity, lower latency, and stronger resource certainty for high-load production environments | Stable throughput capacity, lower latency, and stronger resource certainty for high-load production environments. Supports auto-renewal. |
| Early termination | N/A | Days already used are charged at 1.5x the standard rate |
| Overdue payment | Resources remain active and billed for 24 hours, then released automatically | N/A |
Billing formula
Cost = Usage duration x (Input TPM unit price x Input TPM + Output TPM unit price x Output TPM)

Subscription lifecycle
Orders take effect immediately after payment and expire at 23:59 on day N. Orders placed after 22:00 have the expiration extended by one day.
After expiration, the service stops after a 2-hour grace period. Resources are retained for 14 hours before release.
Subscription orders cannot be terminated early.
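The expiration rule above can be sketched as follows. This is a minimal illustration, assuming the term starts counting on the purchase day (day 1); the function name is illustrative and not part of any SDK.

```python
from datetime import datetime, time, timedelta

def subscription_expiry(purchase_dt: datetime, term_days: int) -> datetime:
    """Estimate when a subscription order expires.

    Assumes: the purchase day counts as day 1, the order expires at 23:59
    on the last day, and orders placed after 22:00 get one extra day.
    """
    extra = 1 if purchase_dt.time() > time(22, 0) else 0
    last_day = purchase_dt.date() + timedelta(days=term_days - 1 + extra)
    return datetime.combine(last_day, time(23, 59))

# A 7-day order placed at 10:00 on Jan 1 expires at 23:59 on Jan 7;
# the same order placed at 22:30 expires one day later, on Jan 8.
print(subscription_expiry(datetime(2025, 1, 1, 10, 0), 7))
print(subscription_expiry(datetime(2025, 1, 1, 22, 30), 7))
```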
Overflow handling
If input exceeds the maximum input token limit or the purchased TPM quota, calls automatically fall back to Model Studio's standard model invocation service. When this happens:
Inference performance may degrade.
Rate limiting applies to your workspace.
Costs are calculated at standard pay-as-you-go invocation rates.
The API response includes the header `x-dashscope-ptu-overflow: true`.
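One way to detect this fallback in client code is to inspect the response headers. The helper below is a sketch; the usage comment assumes the OpenAI Python SDK's `with_raw_response` accessor, which exposes HTTP headers alongside the parsed response.

```python
OVERFLOW_HEADER = "x-dashscope-ptu-overflow"

def is_ptu_overflow(headers) -> bool:
    """Return True if the call overflowed the PTU quota and fell back
    to the standard (pay-as-you-go) invocation service."""
    return str(headers.get(OVERFLOW_HEADER, "")).lower() == "true"

# Example with the OpenAI SDK (sketch; requires a configured client):
#   raw = client.chat.completions.with_raw_response.create(model=..., messages=[...])
#   if is_ptu_overflow(raw.headers):
#       ...  # e.g. log or alert: this call was billed at standard rates
#   completion = raw.parse()
```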
Monitor TPM statistics in the Monitoring (Beijing) console.
FAQ
Can I deploy my own models?
No. Model Studio does not currently support uploading and deploying custom models. Check the latest announcements for updates.
To deploy your own models, use Platform for AI (PAI).
What to do if I get a permission error during deployment
"Lack permissions for this module"
Grant the ModelDeploy-FullAccess permission to your account in the workspace's Permissions page.

If you cannot proceed, contact your organization or IT administrator.
"Workspace xx does not have deployment privilege for model xx"
Go to the Workspaces page and add deployment permissions for the model to the workspace.
API error message: Workspace xxx does not have deployment privilege for model xxxx.

If you cannot resolve the error, contact your organization or IT administrator.