Create a model deployment task.
Prerequisites
- Read Introduction to model deployment and Deploy a model by using the API to understand how model deployment works and the basic workflow on Alibaba Cloud Model Studio.
- Configure your API key for Model Studio. For details, see Get an API key.
Model deployment
Endpoint
POST https://dashscope-intl.aliyuncs.com/api/v1/deployments
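The curl examples in this section can also be issued from a script. The following is a minimal sketch that uses only the Python standard library and assumes the API key is stored in the `DASHSCOPE_API_KEY` environment variable, as in the curl examples. The helper name `build_deployment_request` is hypothetical and not part of any SDK. The request is built but not sent, because a successful deployment starts billing.

```python
import json
import os
import urllib.request

ENDPOINT = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_deployment_request(body, api_key):
    """Return an unsent POST request for the deployments endpoint (hypothetical helper)."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Body taken from the provisioned throughput (PTU) curl example in this document.
ptu_body = {
    "name": "my_qwen_flash",
    "model_name": "qwen-flash-2025-07-28",
    "plan": "ptu",
    "ptu_capacity": {"input_tpm": 10000, "output_tpm": 1000},
}

request = build_deployment_request(ptu_body, os.environ.get("DASHSCOPE_API_KEY", ""))

# Sending is left commented out on purpose: billing starts once deployment succeeds.
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

To send the request, uncomment the last two lines after reviewing the billing rules.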
Request examples
Provisioned Throughput (PTU)
After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.
The provisioned throughput billing method charges based on usage duration. This method is suitable for scenarios that require stable throughput, high concurrency, low latency, and predictable traffic. In this mode, the platform provisions both throughput/concurrency and generation speed, which you cannot adjust.
curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "my_qwen_flash",
    "model_name": "qwen-flash-2025-07-28",
    "plan": "ptu",
    "ptu_capacity": {
      "input_tpm": 10000,
      "output_tpm": 1000
    }
  }'
Model unit
After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.
Computing resources for the post-paid model unit plan are allocated on a first-come, first-served basis. If the purchase is unsuccessful, a full refund will be issued.
The model unit billing method charges you based on usage duration. This billing method is ideal for large-scale inference tasks after model finetuning, offering dedicated resources with flexible performance and cost adjustments. You can customize both throughput/concurrency and generation speed.
curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "my_qwen_plus",
    "model_name": "qwen-plus-2025-12-01",
    "plan": "mu",
    "deploy_spec": "MU1",
    "enable_thinking": true,
    "capacity": 4,
    "max_context_length": 10000,
    "rpm_limit": 500,
    "tpm_limit": 1000
  }'
The model unit deployment mode supports the following additional settings:
| Configuration | Details |
| --- | --- |
| Configure model inference mode | For some models, you can configure the inference mode, maximum context length, and other settings when deploying them using the Model Unit method. |
| Maximum context length | This setting is supported for the Model Unit deployment mode of some models. The maximum context length depends on the model type. |
| Service throttling | This setting is supported for the Model Unit deployment mode of some models. It lets you limit the RPM and TPM of model calls. |
To learn how to configure these settings by using the API, see Create a model deployment task by using an API.
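As an illustration of how these optional settings fit into a model unit request body, here is a minimal sketch. The function name `build_model_unit_body` is hypothetical. The field names come from the model unit curl example, and whether a given model accepts each setting depends on the model.

```python
# Optional settings that appear in the model unit curl example.
OPTIONAL_SETTINGS = {"enable_thinking", "max_context_length", "rpm_limit", "tpm_limit"}


def build_model_unit_body(name, model_name, deploy_spec, capacity, **optional):
    """Assemble a model unit ("mu") request body, leaving out settings that are unset."""
    unknown = set(optional) - OPTIONAL_SETTINGS
    if unknown:
        raise ValueError(f"unsupported settings: {sorted(unknown)}")
    body = {
        "name": name,
        "model_name": model_name,
        "plan": "mu",
        "deploy_spec": deploy_spec,
        "capacity": capacity,
    }
    # Only include settings the caller actually provided.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


body = build_model_unit_body(
    "my_qwen_plus",
    "qwen-plus-2025-12-01",
    "MU1",
    4,
    enable_thinking=True,
    max_context_length=10000,
    rpm_limit=500,
    tpm_limit=None,  # omitted from the body because it is None
)
```

Omitting unset keys keeps the payload limited to the settings you intend to configure.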
Token usage
With token usage billing, you are charged based on token usage. This method is suitable for cost-sensitive scenarios where concurrency and latency requirements are not critical. This mode offers the best price advantage; the platform provisions throughput/concurrency and generation speed, which you cannot adjust.
curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "model_name": "qwen3-8b-ft-202511132025-0260",
    "plan": "lora",
    "capacity": 1,
    "name": "qwen3-8b-ft"
  }'
The capacity parameter is required but currently has no effect. To request scaling, go to the model deployment console and submit a form.
Request parameters
| Parameter | Type | Location | Required | Description |
| --- | --- | --- | --- | --- |
| model_name | String | body | Yes | The name of the model to deploy. This corresponds to the model ID in My Models. You can also get this ID from the output of the Create Training Job or Create Import Job operations. |
| plan | String | body | Yes | The deployment plan. The following billing methods are supported: You can quickly find the supported deployment plans for a fine-tuned model in My Models. Note: Fine-tuned CosyVoice models currently only support |
| name | String | body | Yes | The display name of the model in the console. |
| capacity | Integer | body | No | Required only when Note: CosyVoice models currently provide the following two deployment templates with corresponding |
| billing_method | String | body | No | Required only when |
| deploy_spec | String | body | No | This setting is applicable only when For details about feature support, see Feature support for model unit deployment. This parameter is required when Note: You can get this value from the |
| enable_thinking | Boolean | body | No | Supported by some models. You can set this to |
| max_context_length | Number | body | No | Supported by some models. Example: |
| rpm_limit | Number | body | No | Supported by some models. Specifies the maximum number of requests per minute (RPM). |
| tpm_limit | Number | body | No | Supported by some models. Specifies the maximum number of tokens per minute (TPM). |
| ptu_capacity | Object | body | No | This setting is applicable only when For details about feature support, see Feature Support for PTU Deployment. If you do not specify this parameter, the system defaults to Example: |
| ptu_capacity.input_tpm | Number | body | No | Supported by all models. Specifies the maximum number of input tokens per minute (TPM). |
| ptu_capacity.output_tpm | Number | body | No | Supported by all models. Specifies the maximum number of output tokens per minute (TPM). |
| ptu_capacity.thinking_output_tpm | Number | body | No | Supported by some models. Specifies the maximum number of provisioned thinking output tokens per minute (TPM). |
| suffix | String | body | No | After a model is deployed, a new model name is generated. The suffix parameter specifies the suffix for this new name. It must be globally unique and have a maximum length of 8 characters. You can omit the suffix for the first deployment of a model. If you deploy the same model multiple times, you must specify a unique suffix for each deployment. See the deployed_model output parameter for more information. |
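Some of the rules in the table above can be checked locally before a request is sent. The following sketch covers only two of them: the three parameters marked as required, and the 8-character limit on `suffix`. The function name `find_problems` is hypothetical, and the server remains the authority on validity (for example, suffix uniqueness cannot be checked locally).

```python
# Parameters marked "Yes" in the Required column of the request parameters table.
REQUIRED_PARAMETERS = ("model_name", "plan", "name")
MAX_SUFFIX_LENGTH = 8  # from the description of the suffix parameter


def find_problems(body):
    """Return a list of human-readable problems found in a request body."""
    problems = [
        f"missing required parameter: {key}"
        for key in REQUIRED_PARAMETERS
        if not body.get(key)
    ]
    suffix = body.get("suffix")
    if suffix is not None and len(suffix) > MAX_SUFFIX_LENGTH:
        problems.append(
            f"suffix is {len(suffix)} characters long; the maximum is {MAX_SUFFIX_LENGTH}"
        )
    return problems


# Body taken from the token usage curl example: no problems expected.
ok_body = {
    "model_name": "qwen3-8b-ft-202511132025-0260",
    "plan": "lora",
    "capacity": 1,
    "name": "qwen3-8b-ft",
}

# A deliberately broken body: no display name and an over-long suffix.
bad_body = {
    "model_name": "qwen3-8b-ft-202511132025-0260",
    "plan": "lora",
    "suffix": "second-copy",
}

print(find_problems(ok_body))
print(find_problems(bad_body))
```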
Supported models
Response example
The command returns the following:
{
  "request_id": "f2ae64f7-83cc-410c-bc0b-840443f7eb86",
  "output": {
    "deployed_model": "emo-35b3f106-sample01",
    "gmt_create": "2025-06-17T11:00:38.68",
    "gmt_modified": "2025-06-17T11:00:38.68",
    "status": "PENDING",
    "model_name": "emo",
    "base_model": "emo",
    "base_capacity": 1,
    "capacity": 1,
    "ready_capacity": 0,
    "workspace_id": "llm-v71tlv3d***",
    "charge_type": "post_paid",
    "creator": "175805416***",
    "modifier": "175805416***"
  }
}
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| request_id | String | The ID of the request. |
| output | Object | Details of the deployment task. |
| deployed_model | String | A unique identifier for the deployed model. This ID is used for API operations, such as querying deployment details, modifying deployment rate limiting, deployment scaling, and deleting deployments, and is also passed as an SDK parameter when you invoke the model. |
| gmt_create | String | The creation time of the deployment task. |
| gmt_modified | String | The last modification time of the deployment task. |
| status | String | The status of the deployment task. |
| model_name | String | The name of the model used in the deployment task. |
| base_model | String | The ID of the base model used in the deployment task. |
| base_capacity | Number | The minimum number of resource units required to run the base model. |
| capacity | Number | The number of resource units used by the deployment task. |
| ready_capacity | Number | The number of resource units that are ready to process requests immediately. Resource initialization speed or hardware status can limit this value. |
| workspace_id | String | The ID of the deployment task's workspace. |
| charge_type | String | The billing method for the deployment task. |
| creator | String | The UID of the user who created the deployment task. |
| modifier | String | The UID of the user who last modified the deployment task. |
| plan | String | The billing model for the deployment task. This parameter is not returned for some billing models. |
| Returned only for Model Unit deployments. | | |
| model_unit_spec | String | The model unit specification. |
| enable_thinking | Boolean | Specifies if Thinking mode is enabled. This feature is only available for certain models. |
| max_context_length | Number | The maximum context length. |
| rpm_limit | String | The maximum number of requests per minute (RPM). |
| tpm_limit | Number | The maximum number of tokens per minute (TPM). |
| Returned only for provisioned throughput (PTU) deployments. | | |
| ptu_capacity | Object | This parameter takes effect only when Example: |
| ptu_capacity.input_tpm | Number | The maximum number of input tokens per minute (TPM) for the deployed model. This feature is supported by all models. |
| ptu_capacity.output_tpm | Number | The maximum number of output tokens per minute (TPM) for the deployed model. This feature is supported by all models. |
| ptu_capacity.thinking_output_tpm | Number | The maximum number of thinking output tokens per minute (TPM) for the deployed model. This feature is only available for certain models. |
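To show how the response fields described above might be consumed, here is a small sketch that parses a trimmed copy of the response example and extracts the fields needed for follow-up calls. The function name `summarize_deployment` is hypothetical. Treating a deployment as fully ready when `ready_capacity` reaches `capacity` is an assumption of this sketch, not a documented rule.

```python
import json

# Trimmed copy of the response example in this document.
RESPONSE_TEXT = """
{
  "request_id": "f2ae64f7-83cc-410c-bc0b-840443f7eb86",
  "output": {
    "deployed_model": "emo-35b3f106-sample01",
    "status": "PENDING",
    "capacity": 1,
    "ready_capacity": 0
  }
}
"""


def summarize_deployment(response):
    """Pick out the fields needed to track a deployment task."""
    output = response["output"]
    capacity = output["capacity"]
    ready = output["ready_capacity"]
    return {
        # Use this ID for later query, scaling, and delete operations.
        "deployed_model": output["deployed_model"],
        "status": output["status"],
        # Assumption: all units are ready once ready_capacity reaches capacity.
        "all_units_ready": capacity > 0 and ready >= capacity,
    }


summary = summarize_deployment(json.loads(RESPONSE_TEXT))
print(summary)
```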
Error response
Response example
{
  "request_id": "ca218d57-b91b-46b2-bd35-c41c6287bcf4",
  "message": "Model: qwen-plus-20230703-cx7f not found!",
  "code": "NotFound"
}
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| request_id | String | The unique ID of the request. |
| code | String | The error code. |
| message | String | The error message. |
The following errors can occur when a request fails:
| Error code | Error message | Reason |
| --- | --- | --- |
| NotFound | Model: xxx not found! | |
| Conflict | Deployed model xxx already exists, please specify a suffix. | You are creating a deployment task with a suffix that is already in use. |
| InvalidParameter | Invalid capacity (xx), capacity must be larger than or equal to 0 and multiples of 1 and less than 1000! | You are creating or updating a deployment task with an invalid number of capacity units. |
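A caller can tell an error response from a successful one by the top-level `code` field and then branch on its value. The sketch below maps the three error codes listed above to a suggested next step. The function name `explain_error` is hypothetical and the hints are paraphrases. The hint for `NotFound` is an assumption, since no reason is given for that code.

```python
# Suggested next steps per error code. The NotFound hint is an assumption of this sketch.
HINTS = {
    "NotFound": "Check that model_name matches a model ID listed in My Models.",
    "Conflict": "Retry with a suffix that is not already in use.",
    "InvalidParameter": "Correct the capacity value and retry.",
}


def explain_error(payload):
    """Return a one-line explanation for an error response, or None if it is not an error."""
    if "code" not in payload:
        return None
    code = payload["code"]
    hint = HINTS.get(code, "No hint available for this error code.")
    return f"{code}: {payload.get('message', '')} ({hint})"


# Error body copied from the response example in this document.
error_payload = {
    "request_id": "ca218d57-b91b-46b2-bd35-c41c6287bcf4",
    "message": "Model: qwen-plus-20230703-cx7f not found!",
    "code": "NotFound",
}

print(explain_error(error_payload))
```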