Create deployment

Create a model deployment task.

Prerequisites

Read Introduction to model deployment and Deploy a model by using the API to understand how model deployment works and the basic workflow on Alibaba Cloud Model Studio.
Configure your API key for Model Studio. For details, see Get an API key.

Model deployment

Endpoint

POST https://dashscope-intl.aliyuncs.com/api/v1/deployments

Request examples

Provisioned Throughput (PTU)

Note

After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.

The provisioned throughput billing method charges based on usage duration. This method is suitable for scenarios that require stable throughput, high concurrency, low latency, and predictable traffic. In this mode, the platform provisions both throughput/concurrency and generation speed, which you cannot adjust.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_flash",
    "model_name": "qwen-flash-2025-07-28",
    "plan": "ptu",
    "ptu_capacity": {
        "input_tpm": 10000,
        "output_tpm": 1000
    }
}'
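For callers who prefer Python over curl, the same request can be sketched with the standard library. This is a minimal sketch rather than an official SDK call: the helper names `build_ptu_request` and `send_request` are made up for illustration, and the endpoint, headers, and body fields are copied from the curl example above.

```python
import json
import urllib.request

DEPLOYMENTS_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_ptu_request(api_key, name, model_name, input_tpm=10000, output_tpm=1000):
    """Build (but do not send) the POST request for a PTU deployment.

    Hypothetical helper: the field names mirror the curl example.
    """
    payload = {
        "name": name,
        "model_name": model_name,
        "plan": "ptu",
        "ptu_capacity": {"input_tpm": input_tpm, "output_tpm": output_tpm},
    }
    return urllib.request.Request(
        DEPLOYMENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(request):
    """Send the request and return the decoded JSON body.

    Calling this creates a billable deployment, so it is not invoked here.
    """
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it makes it possible to inspect the payload before any billable resource is created. A typical call would read the key from the `DASHSCOPE_API_KEY` environment variable and pass the result of `build_ptu_request` to `send_request`.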

Model unit

Note

After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.
Computing resources for the post-paid model unit plan are allocated on a first-come, first-served basis. If the purchase is unsuccessful, a full refund will be issued.

The model unit billing method charges you based on usage duration. This billing method is ideal for large-scale inference tasks after model finetuning, offering dedicated resources with flexible performance and cost adjustments. You can customize both throughput/concurrency and generation speed.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_plus",
    "model_name": "qwen-plus-2025-12-01",
    "plan": "mu",
    "deploy_spec": "MU1",
    "enable_thinking": true,
    "capacity": 4,
    "max_context_length": 10000,
    "rpm_limit": 500,
    "tpm_limit": 1000
}'

The model unit deployment mode supports the following additional settings:

Configuration	Details
Model inference mode	For some models, you can set the inference mode when you deploy them with the model unit method. Instruct: the model is deployed for inference in non-thinking mode. Thinking: the model is deployed for inference in thinking mode.
Maximum context length	This setting is supported for the Model Unit deployment mode of some models. The maximum context length depends on the model type.
Service throttling	This setting is supported for the Model Unit deployment mode of some models. It lets you limit the RPM and TPM of model calls.

To learn how to configure these settings by using the API, see Create a model deployment task by using an API.

Token usage

With token usage billing, you are charged based on token usage. This method is suitable for cost-sensitive scenarios where concurrency and latency requirements are not critical. This mode offers the best price advantage; the platform provisions throughput/concurrency and generation speed, which you cannot adjust.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "qwen3-8b-ft-202511132025-0260",
    "plan": "lora",
    "capacity": 1,
    "name": "qwen3-8b-ft"
}'

The capacity parameter is required but currently has no effect. To request scaling, go to the model deployment console and submit a form.

Request parameters

Parameter	Type	Location	Required	Description

model_name	String	body	Yes	The name of the model to deploy. This corresponds to the model ID in My Models. You can also get this ID from the output of the Create Training Job or Create Import Job operations.

plan	String	body	Yes	The deployment plan. The following billing methods are supported:

Billing method	Plan setting
Billing by model unit	`"plan": "mu"`
Billing by compute unit	`"plan": "cu"`
Provisioned throughput	`"plan": "ptu"`
LoRA shared deployment (billed by token usage)	`"plan": "lora"`

You can quickly find the supported deployment plans for a fine-tuned model in My Models.

Note

Fine-tuned CosyVoice models currently only support "plan": "mu".

name	String	body	Yes	The display name of the model in the console.

capacity	Integer	body		Required only when "plan": "mu" is specified. Specifies the number of resource units for the deployment. The value must be an integer multiple of base_capacity. The constraints vary based on the deploy_spec value. For example, for MU2, the value must be a multiple of 8, while for MU5, it can be 1. Example: "capacity": 1.

Note

CosyVoice models currently provide the following two deployment templates with corresponding capacity constraints:

single-node deployment: capacity must be an integer multiple of 1, such as 1, 2, 3, 4, or 5.
single-node deployment - flagship complex inference edition: capacity must be an integer multiple of 8, such as 8, 16, 24, or 32.
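The multiple-of-base_capacity rule can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own validation: the function names are hypothetical, and the base_capacity value is assumed to come from the deployment template you are using (for example 1 or 8, as in the note above).

```python
def is_valid_capacity(capacity: int, base_capacity: int) -> bool:
    """Return True if capacity is at least base_capacity and a multiple of it."""
    if base_capacity <= 0:
        raise ValueError("base_capacity must be a positive integer")
    return capacity >= base_capacity and capacity % base_capacity == 0


def round_up_capacity(desired: int, base_capacity: int) -> int:
    """Return the smallest valid capacity that is >= desired."""
    if desired <= 0 or base_capacity <= 0:
        raise ValueError("desired and base_capacity must be positive integers")
    # Ceiling division, then scale back up to a multiple of base_capacity.
    multiples = -(-desired // base_capacity)
    return multiples * base_capacity
```

For a template with a base capacity of 8, `is_valid_capacity(12, 8)` is false and `round_up_capacity(12, 8)` gives 16, which matches the "8, 16, 24, or 32" pattern described above.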

billing_method	String	body		Required only when "plan": "mu" is specified. Currently, only "POST_PAY" (Post-paid) is supported. Example: "billing_method": "POST_PAY".

deploy_spec	String	body		Applicable only when "plan": "mu" is specified, in which case it is required. Example: "deploy_spec": "MU1".

For details about feature support, see Feature support for model unit deployment.

Note

You can get this value from the template_id field returned by the Get Deployable Model List operation.

enable_thinking	Boolean	body		Supported by some models. Specifies whether thinking mode is enabled. Valid values: true and false.

max_context_length	Number	body		Supported by some models. Specifies the maximum context length. Example: "max_context_length": 131072.

rpm_limit	Number	body		Supported by some models. Specifies the maximum number of requests per minute (RPM).

tpm_limit	Number	body		Supported by some models. Specifies the maximum number of tokens per minute (TPM).

ptu_capacity	Object	body		This setting is applicable only when "plan": "ptu" is specified.

For details about feature support, see Feature Support for PTU Deployment.

If you do not specify this parameter, the system defaults to 10,000 input_tpm and 1,000 output_tpm.

Example: "ptu_capacity": { "input_tpm": 10000, "output_tpm": 1000 }.

ptu_capacity.input_tpm	Number	body		Supported by all models. Specifies the maximum number of input tokens per minute (TPM).

ptu_capacity.output_tpm	Number	body		Supported by all models. Specifies the maximum number of output tokens per minute (TPM).

ptu_capacity.thinking_output_tpm	Number	body		Supported by some models. Specifies the maximum number of provisioned thinking output tokens per minute (TPM).

suffix	String	body		After a model is deployed, a new model name is generated. The suffix parameter specifies the suffix for this new name. It must be globally unique and have a maximum length of 8 characters. You can omit the suffix for the first deployment of a model. If you deploy the same model multiple times, you must specify a unique suffix for each deployment.

See the deployed_model output parameter for more information.
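The parameter rules above can be collected into a small client-side check that runs before a request is sent. This is only a sketch under stated assumptions: `find_payload_problems` is a hypothetical helper, it covers only the rules spelled out in this section (required fields, plan values, plan-specific fields, and suffix length), and the service remains the authority on what it accepts.

```python
SUPPORTED_PLANS = {"mu", "cu", "ptu", "lora"}
ALWAYS_REQUIRED = ("model_name", "plan", "name")
# Fields this document describes as required for a specific plan.
REQUIRED_BY_PLAN = {
    "mu": ("capacity", "billing_method", "deploy_spec"),
    "lora": ("capacity",),
}
MAX_SUFFIX_LENGTH = 8


def find_payload_problems(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for field in ALWAYS_REQUIRED:
        if not payload.get(field):
            problems.append(f"missing required field: {field}")

    plan = payload.get("plan")
    if plan is not None and plan not in SUPPORTED_PLANS:
        problems.append(f"unsupported plan: {plan}")

    for field in REQUIRED_BY_PLAN.get(plan, ()):
        if field not in payload:
            problems.append(f"missing field required for plan '{plan}': {field}")

    if "ptu_capacity" in payload and plan != "ptu":
        problems.append("ptu_capacity applies only to plan 'ptu'")

    suffix = payload.get("suffix")
    if suffix is not None and len(suffix) > MAX_SUFFIX_LENGTH:
        problems.append(f"suffix is longer than {MAX_SUFFIX_LENGTH} characters")
    return problems
```

Treat the result as advisory. For example, the model unit request example in this document omits billing_method even though the parameter table marks it as required for "plan": "mu", so the service may apply a default where this check reports a problem.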

Supported models

View supported features and billing.

Billed by usage duration (Provisioned Throughput)

Fee = Usage duration × (Unit price for input TPM × Input TPM + Unit price for output TPM × Output TPM)

The pay-as-you-go billing method is charged hourly, and the unit price is listed in the "Sustained for 1 hour" column of the table below. The subscription billing method is charged daily, and the unit price is listed in the "Sustained for 1 day" column of the table below.
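As a worked example of the formula, the sketch below estimates a provisioned throughput fee. It assumes the price scales linearly with the purchased TPM in the units used by the price tables in this section (input priced per 10K TPM, output priced per 1K TPM); the function name `ptu_fee` is hypothetical, and the sample prices are the Singapore rates listed for qwen3.6-plus-2026-04-02.

```python
def ptu_fee(duration, input_tpm, output_tpm, input_price_per_10k, output_price_per_1k):
    """Estimate a PTU fee.

    duration is in hours for pay-as-you-go prices and in days for
    subscription prices, matching the unit of the prices passed in.
    """
    input_cost = input_price_per_10k * (input_tpm / 10_000)
    output_cost = output_price_per_1k * (output_tpm / 1_000)
    return duration * (input_cost + output_cost)


# Pay-as-you-go: 24 hours at 10,000 input TPM and 1,000 output TPM.
hourly_total = ptu_fee(24, 10_000, 1_000, 1.2, 0.72)
# Subscription: 1 day at the same capacity.
daily_total = ptu_fee(1, 10_000, 1_000, 14.4, 8.64)
print(f"pay-as-you-go for 24 hours: ${hourly_total:.2f}")
print(f"subscription for 1 day: ${daily_total:.2f}")
```

Under these assumptions, 24 hours of pay-as-you-go use comes to about $46.08, while a one-day subscription for the same capacity comes to about $23.04.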

Subscription orders take effect immediately after payment. A subscription for N days is valid until 23:59 on the Nth day. If you place an order after 22:00, the expiration date is automatically extended by one day.
After a subscription order expires, the service stops after a 2-hour grace period. The resources are retained for 14 hours after the service stops and are then released.
You cannot terminate a subscription service in advance.
For the pay-as-you-go method, if your account has an overdue payment, the deployed resources are retained and continue to be billed for 24 hours before they are automatically released.

If a request exceeds the model's maximum input tokens or your purchased TPM, the call is automatically served in pay-as-you-go mode for the same model. Inference performance may degrade for these calls, rate limiting follows the public traffic limits of the current snapshot model in the workspace, and fees are charged at the standard pay-as-you-go rates for model calls.

In this case, the API call returns a header that contains x-dashscope-ptu-overflow:true.
For TPM statistics, go to Model Monitoring (Beijing).
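A client can watch for the overflow header to notice when calls are spilling over to pay-as-you-go billing. The following is a minimal sketch: the helper name `is_ptu_overflow` is hypothetical, and it assumes the response headers are available as a mapping of header names to string values.

```python
def is_ptu_overflow(headers) -> bool:
    """Return True if the response headers flag a PTU overflow.

    Header names are compared case-insensitively because HTTP header
    names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() == "x-dashscope-ptu-overflow":
            return str(value).strip().lower() == "true"
    return False
```

Counting how often this returns true over a period gives a rough signal for whether the purchased TPM is too small for the actual traffic.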

For the specific rules on fee reductions and refunds for scale-in scenarios (downgrades), see Refund rules for downgrades.

Singapore

Qwen

Model name	Model code	Maximum input tokens	Pay-as-you-go input Per 10K TPM/hour	Pay-as-you-go output Per 1K TPM/hour	Subscription input Per 10K TPM/day	Subscription output Per 1K TPM/day
Qwen3.6-Plus-2026-04-02	qwen3.6-plus-2026-04-02	128,000	$1.2	$0.72	$14.4	$8.64

Qwen3.5-Plus-2026-04-20	qwen3.5-plus-2026-04-20	128,000	$0.96	$0.576	$11.52	$6.912

Qwen-VL

Model name	Model code	Maximum input tokens	Pay-as-you-go input Per 10K TPM/hour	Pay-as-you-go output Per 1K TPM/hour	Subscription input Per 10K TPM/day	Subscription output Per 1K TPM/day
Qwen3-VL-Plus-2025-09-23	qwen3-vl-plus-2025-09-23	128,000	$0.48	$0.384	$5.76	$4.608

DeepSeek

Model name	Model code	Maximum input tokens	Pay-as-you-go input Per 10K TPM/hour	Pay-as-you-go output Per 1K TPM/hour	Subscription input Per 10K TPM/day	Subscription output Per 1K TPM/day
DeepSeek-v3.2	deepseek-v3.2	64,000	$2.05	$0.616	$24.62	$7.387

China (Beijing)

Qwen

Model name	Model code	Maximum input tokens	Pay-as-you-go input Per 10K TPM/hour	Pay-as-you-go output Per 1K TPM/hour	Subscription input Per 10K TPM/day	Subscription output Per 1K TPM/day
Qwen3.7-Max-2026-05-20	qwen3.7-max-2026-05-20	128,000	$3.96	$1.188	$47.53	$14.258

Qwen3.6-Flash-2026-04-16	qwen3.6-flash-2026-04-16	128,000	$0.4	$0.238	$4.75	$2.852
Qwen3.6-Plus-2026-04-02	qwen3.6-plus-2026-04-02	128,000	$0.67	$0.397	$7.93	$4.753

Qwen3.5-Plus-2026-04-20	qwen3.5-plus-2026-04-20	128,000	$0.26	$0.16	$3.17	$1.9

Qwen3-Max-2025-09-23	qwen3-max-2025-09-23	128,000	$1.11	$0.45	$13.32	$5.4

Qwen-Flash-2025-07-28	qwen-flash-2025-07-28	128,000	$0.06	$0.06	$0.72	$0.72
Qwen-Plus-2025-12-01	qwen-plus-2025-12-01	128,000	$0.28	Non-thinking: $0.07 Thinking: $0.28	$3.36	Non-thinking: $0.84 Thinking: $3.36

DeepSeek

Model name	Model code	Maximum input tokens	Pay-as-you-go input Per 10K TPM/hour	Pay-as-you-go output Per 1K TPM/hour	Subscription input Per 10K TPM/day	Subscription output Per 1K TPM/day
DeepSeek-v4-Pro	deepseek-v4-pro	64,000	$5.94	$1.188	$71.3	$14.26
DeepSeek-v3.2	deepseek-v3.2	64,000	$1.04	$0.16	$12.48	$1.92
DeepSeek-v3	deepseek-v3	64,000	$0.99	$0.396	$11.9	$4.75

Qwen-VL

Model name	Model code	Maximum input tokens	Pay-as-you-go input Per 10K TPM/hour	Pay-as-you-go output Per 1K TPM/hour	Subscription input Per 10K TPM/day	Subscription output Per 1K TPM/day
Qwen3-VL-Plus-2025-09-23	qwen3-vl-plus-2025-09-23	128,000	$0.35		$4.2	

More models

Model name	Model code	Maximum input tokens	Pay-as-you-go input Per 10K TPM/hour	Pay-as-you-go output Per 1K TPM/hour	Subscription input Per 10K TPM/day	Subscription output Per 1K TPM/day
GLM-5.1	glm-5.1	64,000	$2.97	$1.19	$35.65	$14.26

Billed by usage duration (Model Unit)

Fee = Usage duration (hours) × Number of model units × Unit price per model unit

For the pay-as-you-go method, the 'Unit price per model unit' is the value in the 'Hourly price' column in the table below. For monthly subscriptions, the formula is: Number of months × Number of model units × Monthly price.

For a monthly subscription, if you cancel within the first month, the daily price (≈ monthly price / 30) is charged at a rate of 1.2 times the normal price. Usage for less than a day is billed as a full day.
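The model unit formulas can be worked through with a short sketch. The function names are hypothetical and the prices in the usage note are placeholder values, not table entries; prices are passed in as parameters so you can use whichever unit price applies to your specification. The early-cancellation function is one reading of the rule above and assumes the 1.2 multiplier applies to every billed day and scales with the number of model units.

```python
import math


def mu_hourly_fee(hours, units, hourly_unit_price):
    """Pay-as-you-go: usage duration (hours) x model units x unit price."""
    return hours * units * hourly_unit_price


def mu_monthly_fee(months, units, monthly_unit_price):
    """Monthly subscription: months x model units x monthly price."""
    return months * units * monthly_unit_price


def mu_early_cancel_fee(days_used, units, monthly_unit_price):
    """Cancellation within the first month (assumed interpretation).

    A partial day is billed as a full day, and the daily price
    (monthly price / 30) is charged at 1.2 times the normal rate.
    """
    billed_days = math.ceil(days_used)
    daily_price = monthly_unit_price / 30
    return billed_days * units * daily_price * 1.2
```

With a placeholder monthly price of 3,000 per unit, cancelling after 9.5 days on one unit is billed as 10 days at 100 × 1.2 per day, or 1,200, compared with 3,000 for the full month.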

Note

Computing power resources for the pay-as-you-go model unit method are allocated on a first-come, first-served basis. If the purchase is unsuccessful, you will receive a full refund.

Singapore

Text generation

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
Qwen3.6-Plus-2026-04-02	qwen3.6-plus-2026-04-02	MU1 x 8	$88	$41,832

Qwen3.5-397B-A17B	qwen3.5-397b-a17b	MU2 x 8	$112	$52,392
Qwen3.5-35B-A3B	qwen3.5-35b-a3b	MU2 x 8	$112	$52,392

Qwen3-32B	qwen3-32b	MU1 x 4	$44	$20,916
Qwen3-32B	qwen3-32b	MU2 x 8	$112	$52,392
Qwen3-14B	qwen3-14b	MU1 x 4	$44	$20,916

GLM-5.1	glm-5.1	MU2 x 8	$112	$52,392

DeepSeek-V4-Flash	deepseek-v4-flash	MU1 x 8	$88	$41,832

Multimodal

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
Qwen3-VL-32B-Instruct	qwen3-vl-32b-instruct	MU2 x 8	$112	$52,392
Qwen3-VL-8B-Instruct	qwen3-vl-8b-instruct	MU1 x 2	$22	$10,458

Model type:

Instruct - The model is deployed for inference in non-thinking mode.

China (Beijing)

Text generation

Qwen

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
Qwen3.6-35B-A3B	qwen3.6-35b-a3b	MU8 x 1	$6.464	$3,080.477
Qwen3.6-35B-A3B	qwen3.6-35b-a3b	MU9 x 1	$7.014	$3,383.024
Qwen3.6-27B	qwen3.6-27b	MU9 x 1	$7.014	$3,383.024
Qwen3.6-Flash-2026-04-16	qwen3.6-flash-2026-04-16	MU1 x 2	$14.852	$7,183.564
Qwen3.6-Plus-2026-04-02	qwen3.6-plus-2026-04-02	MU1 x 8	$59.408	$28,734.256

Qwen3.5-397B-A17B	qwen3.5-397b-a17b	MU2 x 8	$69.312	$33,044.72
		MU3 x 8	$150.72	$72,577.152
		MU6 x 16	$55.008	$26,599.92
Qwen3.5-122B-A10B	qwen3.5-122b-a10b	MU1 x 4	$29.704	$14,367.128
		MU2 x 8	$69.312	$33,044.72
		MU6 x 16	$55.008	$26,599.92
		MU9 x 2	$14.028	$6,766.048
Qwen3.5-35B-A3B	qwen3.5-35b-a3b	MU1 x 2	$14.852	$7,183.564
		MU2 x 8	$69.312	$33,044.72
		MU8 x 1	$6.464	$3,080.477
		MU9 x 1	$7.014	$3,383.024
Qwen3.5-27B	qwen3.5-27b	MU1 x 2	$14.852	$7,183.564
Qwen3.5-27B	qwen3.5-27b	MU9 x 1	$7.014	$3,383.024
Qwen3.5-9B	qwen3.5-9b	MU1 x 2	$14.852	$7,183.564
		MU8 x 1	$6.464	$3,080.477
		MU9 x 1	$7.014	$3,383.024
Qwen3.5-Flash-2026-02-23	qwen3.5-flash-2026-02-23	MU1 x 2	$14.852	$7,183.564
Qwen3.5-Flash-2026-02-23	qwen3.5-flash-2026-02-23	MU8 x 1 (model compression)	$6.464	$3,080.477
Qwen3.5-Plus-2026-02-15	qwen3.5-plus-2026-02-15	MU1 x 8	$59.408	$28,734.256
Qwen3.5-Plus-2026-02-15	qwen3.5-plus-2026-02-15	MU3 x 8	$150.72	$72,577.152

Qwen3-235B-A22B-Instruct	qwen3-235b-a22b-instruct-2507	MU1 x 4	$29.704	$14,367.128
Qwen3-235B-A22B-Instruct	qwen3-235b-a22b-instruct-2507	MU2 x 8	$69.312	$33,044.72
Qwen3-Next-80B-A3B-Instruct	qwen3-next-80b-a3b-instruct	MU1 x 2	$14.852	$7,183.564
Qwen3-32B	qwen3-32b	MU1 x 4	$29.704	$14,367.128
Qwen3-32B	qwen3-32b	MU6 x 4	$13.752	$6,649.98
Qwen3-30B-A3B	qwen3-30b-a3b	MU9 x 2	$14.028	$6,766.048
Qwen3-30B-A3B-Instruct-2507	qwen3-30b-a3b-instruct-2507	MU1 x 4	$29.704	$14,367.128
Qwen3-30B-A3B-Instruct-2507	qwen3-30b-a3b-instruct-2507	MU2 x 8	$69.312	$33,044.72
Qwen3-8B	qwen3-8b	MU1 x 2	$14.852	$7,183.564
		MU2 x 2	$17.328	$8,261.18
		MU5 x 1	$2.888	$1,394.329
Qwen3-4B	qwen3-4b	MU1 x 2	$14.852	$7,183.564
Qwen3-4B	qwen3-4b	MU5 x 1	$2.888	$1,394.329
Qwen3-1.7B	qwen3-1.7b	MU1 x 2	$14.852	$7,183.564
Qwen3-1.7B	qwen3-1.7b	MU5 x 1	$2.888	$1,394.329
Qwen3-Max-2025-09-23	qwen3-max-2025-09-23	MU2 x 8	$69.312	$33,044.72
Qwen3-Max-2025-09-23	qwen3-max-2025-09-23	MU3 x 8	$150.72	$72,577.152

Qwen2.5-72B	qwen2.5-72b-instruct	MU1 x 4	$29.704	$14,367.128
Qwen2.5-32B	qwen2.5-32b-instruct	MU1 x 4	$29.704	$14,367.128
Qwen2.5-14B	qwen2.5-14b-instruct	MU1 x 2	$14.852	$7,183.564
Qwen2.5-7B	qwen2.5-7b-instruct	MU1 x 2	$14.852	$7,183.564
Qwen2.5-7B	qwen2.5-7b-instruct	MU5 x 1	$2.888	$1,394.329
Qwen2.5-3B-Instruct	qwen2.5-3b-instruct	MU5 x 1	$2.888	$1,394.329

Qwen-Flash-2025-07-28	qwen-flash-2025-07-28	MU1 x 4	$29.704	$14,367.128
Qwen-Plus-2025-07-28	qwen-plus-2025-07-28	MU1 x 4	$29.704	$14,367.128
Qwen-Plus-2025-12-01	qwen-plus-2025-12-01	MU1 x 4	$29.704	$14,367.128

GLM

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
GLM-5	glm-5	MU3 x 8	$150.72	$72,577.152
GLM-4.7	glm-4.7	MU6 x 16	$55.008	$26,599.92

DeepSeek

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
DeepSeek-V4-Flash	deepseek-v4-flash	MU1 x 8	$59.408	$28,734.256
DeepSeek-V3.2	deepseek-v3.2	MU2 x 8	$69.312	$33,044.72

Other models

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
MiniMax-M2.5	MiniMax-M2.5	MU1 x 8	$59.408	$28,734.256

Kimi-K2.5	kimi-k2.5	MU2 x 8	$69.312	$33,044.72

Multimodal

Qwen-VL

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
Qwen3-VL-235B-A22B-Instruct	qwen3-vl-235b-a22b-instruct	MU1 x 4	$29.704	$14,367.128
Qwen3-VL-235B-A22B-Thinking	qwen3-vl-235b-a22b-thinking	MU1 x 4	$29.704	$14,367.128
Qwen3-VL-32B-Instruct	qwen3-vl-32b-instruct	MU2 x 8	$69.312	$33,044.72
Qwen3-VL-8B-Instruct	qwen3-vl-8b-instruct	MU1 x 2	$14.852	$7,183.564
Qwen3-VL-Flash-2025-10-15	qwen3-vl-flash-2025-10-15	MU1 x 4	$29.704	$14,367.128
Qwen3-VL-Plus-2025-09-23	qwen3-vl-plus-2025-09-23	MU1 x 4	$29.704	$14,367.128

Qwen-VL-Max-2025-08-13	qwen-vl-max-2025-08-13	MU6 x 4	$13.752	$6,649.98
Qwen-VL-OCR-2025-11-20	qwen-vl-ocr-2025-11-20	MU6 x 4	$13.752	$6,649.98

Qwen Omni

Model name	Model code	Model unit specification	Hourly price ($)	Monthly price ($)
Qwen3.5-Omni-Flash	qwen3.5-omni-flash	MU8 x 1	$6.464	$3,080.477
Qwen3.5-Omni-Flash	qwen3.5-omni-flash	MU9 x 1	$7.014	$3,383.024
Qwen3.5-Omni-Plus	qwen3.5-omni-plus	MU9 x 8	$56.112	$27,064.192

Model type:

Instruct - The model is deployed for inference in non-thinking mode.
Thinking - The model is deployed for inference in thinking mode.

By model token usage

Fee = Number of input tokens × Unit price for input + Number of output tokens × Unit price for output (Minimum billing unit: 1 token)
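The token usage formula can be sketched as follows. The price table in this section quotes prices per 1,000 tokens, so the sketch divides token counts by 1,000 before multiplying; the function name `token_usage_fee` is hypothetical, and the sample prices are the non-thinking-mode values listed for qwen3-14b, expressed in the currency of that table.

```python
def token_usage_fee(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Fee = input tokens x input unit price + output tokens x output unit price.

    Prices are per 1,000 tokens, and billing is per token, so no
    rounding to whole thousands is applied.
    """
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# 2,000,000 input tokens and 500,000 output tokens in non-thinking mode.
total = token_usage_fee(2_000_000, 500_000, 0.00035, 0.0014)
print(f"estimated fee: {total:.4f}")
```

For thinking mode, pass the thinking-mode output price instead of the non-thinking one.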

Billing by model token usage is supported only after you complete efficient SFT training on the following foundation models and create a custom model.

Singapore

Foundation model	Model code	Input CNY/1,000 tokens	Output CNY/1,000 tokens
Qwen3-14B	qwen3-14b	$0.00035	Non-thinking mode: $0.0014 Thinking mode: $0.0042

Response example

The command returns the following:

{
  "request_id": "f2ae64f7-83cc-410c-bc0b-840443f7eb86",
  "output": {
    "deployed_model": "emo-35b3f106-sample01",
    "gmt_create": "2025-06-17T11:00:38.68",
    "gmt_modified": "2025-06-17T11:00:38.68",
    "status": "PENDING",
    "model_name": "emo",
    "base_model": "emo",
    "base_capacity": 1,
    "capacity": 1,
    "ready_capacity": 0,
    "workspace_id": "llm-v71tlv3d***",
    "charge_type": "post_paid",
    "creator": "175805416***",
    "modifier": "175805416***"
  }
}

Response parameters

Parameter	Type	Description
request_id	String	The ID of the request.
output	Object	Details of the deployment task.
deployed_model	String	A unique identifier for the deployed model. This ID is used for API operations, such as querying deployment details, modifying deployment rate limiting, deployment scaling, and deleting deployments, and is also passed as an SDK parameter when you invoke the model.
gmt_create	String	The creation time of the deployment task.
gmt_modified	String	The last modification time of the deployment task.
status	String	The status of the deployment task. PENDING: The task is being created. UPDATING: The task is being updated. RUNNING: The deployment task is running, and the deployed model can process requests. STOPPED: The deployment task is stopped and is not billed. DELETING: The task is being deleted. FAILED: The creation or update of the task failed.
model_name	String	The name of the model used in the deployment task.
base_model	String	The ID of the base model used in the deployment task.
base_capacity	Number	The minimum number of resource units required to run the base model.
capacity	Number	The number of resource units used by the deployment task.
ready_capacity	Number	The number of resource units that are ready to process requests immediately. Resource initialization speed or hardware status can limit this value.
workspace_id	String	The ID of the deployment task's workspace.
charge_type	String	The billing method for the deployment task. `post_paid`: Post-paid.
creator	String	The UID of the user who created the deployment task.
modifier	String	The UID of the user who last modified the deployment task.
plan	String	The billing model for the deployment task. This parameter is not returned for some billing models.
Returned only for Model Unit deployments.
model_unit_spec	String	The model unit specification.
enable_thinking	Boolean	Specifies if Thinking mode is enabled. This feature is only available for certain models.
max_context_length	Number	The maximum context length.
rpm_limit	String	The maximum number of requests per minute (RPM).
tpm_limit	Number	The maximum number of tokens per minute (TPM).
Returned only for provisioned throughput (PTU) deployments.
ptu_capacity	Object	The provisioned throughput capacity of the deployment. Example: `"ptu_capacity": { "input_tpm": 10000, "output_tpm": 1000 }`.
ptu_capacity.input_tpm	Number	The maximum number of input tokens per minute (TPM) for the deployed model. This feature is supported by all models.
ptu_capacity.output_tpm	Number	The maximum number of output tokens per minute (TPM) for the deployed model. This feature is supported by all models.
ptu_capacity.thinking_output_tpm	Number	The maximum number of thinking output tokens per minute (TPM) for the deployed model. This feature is only available for certain models.
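After creating a deployment, a caller usually needs the deployed_model identifier and a way to tell whether the deployment can serve requests yet. The sketch below extracts both from a response body shaped like the response example in this document; the helper name `summarize_deployment` and the derived `ready_fraction` value are illustrative additions, not fields returned by the API.

```python
import json


def summarize_deployment(response_text):
    """Pull the fields needed for follow-up calls out of a create response."""
    output = json.loads(response_text)["output"]
    capacity = output.get("capacity") or 0
    ready = output.get("ready_capacity") or 0
    return {
        "deployed_model": output["deployed_model"],
        "status": output["status"],
        # RUNNING is the status in which the deployed model can process requests.
        "is_serving": output["status"] == "RUNNING",
        # Share of resource units that are ready (derived, not an API field).
        "ready_fraction": ready / capacity if capacity else 0.0,
    }
```

For the response example in this document, the summary reports the status PENDING with `is_serving` false, so the caller would query the deployment again later rather than invoke the model immediately.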

Error response

Response example

{
    "request_id": "ca218d57-b91b-46b2-bd35-c41c6287bcf4",
    "message": "Model: qwen-plus-20230703-cx7f not found!",
    "code": "NotFound"
}

Response parameters

Parameter	Type	Description
request_id	String	The unique ID of the request.
code	String	The error code.
message	String	The error message.

The following errors can occur when a request fails:

Error code	Error message	Reason
NotFound	Model: xxx not found!	You are creating a deployment task with a model that does not exist. You are querying, updating, or deleting a deployment task with a model that does not exist.
Conflict	Deployed model xxx already exists, please specify a suffix.	You are creating a deployment task with a suffix that is already in use.
InvalidParameter	Invalid capacity (xx), capacity must be larger than or equal to 0 and multiples of 1 and less than 1000!	You are creating or updating a deployment task with an invalid number of capacity units.
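The error codes above can be turned into exceptions so that calling code can react to each case. This is a sketch under one assumption: a failed request is recognized by a body that carries "code" and "message" and has no "output" object, as in the two response examples in this document. The class and function names are hypothetical.

```python
class DeploymentError(Exception):
    """Raised when the response body describes a failed request."""

    def __init__(self, code, message, request_id=None):
        super().__init__(f"{code}: {message} (request_id={request_id})")
        self.code = code
        self.message = message
        self.request_id = request_id


def raise_for_error(body: dict) -> dict:
    """Return the body unchanged on success, or raise DeploymentError."""
    if "code" in body and "output" not in body:
        raise DeploymentError(body["code"], body.get("message", ""), body.get("request_id"))
    return body


def needs_new_suffix(error: DeploymentError) -> bool:
    """A Conflict error means the deployed model name is already taken."""
    return error.code == "Conflict"
```

When `needs_new_suffix` returns true, retrying the request with a different suffix value is the remedy the error message itself suggests.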