This topic shows you how to deploy a Qwen model on Alibaba Cloud Model Studio by using API calls.
Prerequisites
You have read the Model Deployment Overview and are familiar with the supported models and basic deployment steps.
Create an API key and export the API key as an environment variable.
1. Deploy the model
The following command uses the fine-tuned custom model qwen3-8b-ft-202511132025-0260 to create a dedicated service named qwen3-8b-ft-202511132025-0260.
To obtain the custom model ID, go to the model finetuning page in the Model Studio console. Click the Task Name of the model you want to deploy, click Outputs, and then click the blue model name. This opens the My Models page, where the Model ID appears in the basic information section.
Pass the Model ID as the model_name parameter when you deploy the model by using the API.
Provisioned Throughput (PTU)
After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.
The provisioned throughput billing method charges based on usage duration. This method is suitable for scenarios that require stable throughput, high concurrency, low latency, and predictable traffic. In this mode, the platform provisions both throughput/concurrency and generation speed, which you cannot adjust.
curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_flash",
    "model_name": "qwen-flash-2025-07-28",
    "plan": "ptu",
    "ptu_capacity": {
        "input_tpm": 10000,
        "output_tpm": 1000
    }
}'
Model unit
After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.
Computing resources for the post-paid model unit plan are allocated on a first-come, first-served basis. If the purchase is unsuccessful, a full refund will be issued.
The model unit billing method charges you based on usage duration. This billing method is ideal for large-scale inference tasks after model finetuning, offering dedicated resources with flexible performance and cost adjustments. You can customize both throughput/concurrency and generation speed.
curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_plus",
    "model_name": "qwen-plus-2025-12-01",
    "plan": "mu",
    "deploy_spec": "MU1",
    "enable_thinking": true,
    "capacity": 4,
    "max_context_length": 10000,
    "rpm_limit": 500,
    "tpm_limit": 1000
}'
The model unit deployment mode supports the following additional settings:
| Configuration | Details |
| --- | --- |
| Configure model inference mode | For some models, you can configure the inference mode, maximum context length, and other settings when deploying them using the Model Unit method. |
| Maximum context length | This setting is supported for the Model Unit deployment mode of some models. The maximum context length depends on the model type. |
| Service throttling | This setting is supported for the Model Unit deployment mode of some models. It lets you limit the RPM and TPM of model calls. |
To learn how to configure these settings by using the API, see Create a model deployment task by using an API.
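As a rough illustration of how these settings map onto the request body, the following Python sketch assembles a model unit payload from the fields used in the curl example above (enable_thinking, max_context_length, rpm_limit, tpm_limit). The helper name build_mu_payload is hypothetical, and omitting unset optional fields is an assumption rather than documented API behavior; check the API reference for the fields your model accepts.

```python
import json


def build_mu_payload(name, model_name, deploy_spec, capacity,
                     enable_thinking=None, max_context_length=None,
                     rpm_limit=None, tpm_limit=None):
    """Build the JSON body for a model unit ("mu") deployment request.

    Hypothetical helper: field names are copied from the curl example,
    everything else is an illustrative assumption.
    """
    payload = {
        "name": name,
        "model_name": model_name,
        "plan": "mu",
        "deploy_spec": deploy_spec,
        "capacity": capacity,
    }
    optional = {
        "enable_thinking": enable_thinking,        # inference mode
        "max_context_length": max_context_length,  # maximum context length
        "rpm_limit": rpm_limit,                    # service throttling (RPM)
        "tpm_limit": tpm_limit,                    # service throttling (TPM)
    }
    # Assumption: settings left as None are simply not sent.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_mu_payload(
    "my_qwen_plus", "qwen-plus-2025-12-01", "MU1", 4,
    enable_thinking=True, max_context_length=10000,
    rpm_limit=500, tpm_limit=1000,
)
print(json.dumps(payload, indent=4))
```

The printed JSON can be passed as the --data body of the curl command shown earlier.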
Token usage
With token usage billing, you are charged based on token usage. This method is suitable for cost-sensitive scenarios where concurrency and latency requirements are not critical. This mode offers the best price advantage; the platform provisions throughput/concurrency and generation speed, which you cannot adjust.
curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "qwen3-8b-ft-202511132025-0260",
    "plan": "lora",
    "capacity": 1,
    "name": "qwen3-8b-ft"
}'
The capacity parameter is required but currently has no effect. To request scaling, go to the model deployment console and submit a form.
On success, the command returns the following result (using a LoRA deployment as an example):
{
    "request_id": "83b173ab-2b2f-41aa-8c57-b173e8be934e",
    "output": {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:06:46.405",
        "gmt_modified": "2025-11-20T20:06:46.405",
        "status": "PENDING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "workspace_id": "llm-8v*****",
        "charge_type": "post_paid",
        "creator": "16542*****",
        "modifier": "16542*****",
        "plan": "***"
    }
}
In the response, deployed_model is the unique ID of the dedicated service.
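The curl commands in this step can also be issued from Python. The sketch below uses only the standard library; the URL, headers, and the output.deployed_model field come from the examples in this topic, while the function names (build_create_request, extract_deployed_model, create_deployment) are hypothetical. Treat it as a minimal starting point rather than an official client.

```python
import json
import os
import urllib.request

DEPLOYMENTS_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_create_request(payload, api_key):
    """Return a POST request equivalent to the curl deployment commands."""
    return urllib.request.Request(
        DEPLOYMENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_deployed_model(response_body):
    """Read the dedicated service ID from a parsed response body."""
    return response_body["output"]["deployed_model"]


def create_deployment(payload):
    """Send the request. This is a real API call and billing may start."""
    api_key = os.environ["DASHSCOPE_API_KEY"]
    request = build_create_request(payload, api_key)
    with urllib.request.urlopen(request) as resp:
        return extract_deployed_model(json.load(resp))


# Offline demonstration with a trimmed copy of the sample response above.
sample_response = {
    "request_id": "83b173ab-2b2f-41aa-8c57-b173e8be934e",
    "output": {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "status": "PENDING",
    },
}
print(extract_deployed_model(sample_response))
```

To deploy for real, call create_deployment with one of the payloads shown in this step and keep the returned ID for the status, inference, and delete steps.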
2. Query service status
Run the following command to query the details of a specific dedicated service:
curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments/qwen3-8b-ft-202511132025-0260" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'
On success, the command returns the following result:
{
    "request_id": "ca36952d-9136-426e-ab08-68a97ad72719",
    "output": {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:32:08",
        "gmt_modified": "2025-11-20T20:42:25",
        "status": "RUNNING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "base_capacity": 2,
        "capacity": 2,
        "ready_capacity": 2,
        "workspace_id": "llm-8v53etv3hwb8orx1",
        "charge_type": "post_paid",
        "creator": "1654290265984853",
        "modifier": "1654290265984853",
        "plan": "mu",
        "model_unit_spec": "MU1"
    }
}
When the service status is RUNNING, the service deployment is complete.
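Because a new service starts in PENDING and only later reaches RUNNING, a script usually has to poll. The following sketch shows one possible polling loop. The names fetch_status and wait_until_running, the 10-second interval, and the 30-minute timeout are assumptions chosen for illustration; only the PENDING and RUNNING values and the output.status field are taken from the responses above. Failure states are not documented in this topic, so the loop handles only a timeout.

```python
import json
import time
import urllib.request

BASE_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def fetch_status(deployed_model, api_key):
    """Query one dedicated service and return its output.status string."""
    request = urllib.request.Request(
        f"{BASE_URL}/{deployed_model}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["output"]["status"]


def wait_until_running(get_status, interval_s=10.0, timeout_s=1800.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call get_status() repeatedly until it returns "RUNNING".

    Raises TimeoutError if the service is not RUNNING within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "RUNNING":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"service still {status!r} after {timeout_s} s")
        sleep(interval_s)


# Offline demonstration: a fake status source instead of the real API.
fake_statuses = iter(["PENDING", "PENDING", "RUNNING"])
print(wait_until_running(lambda: next(fake_statuses), sleep=lambda s: None))
```

Against the real API, pass a callable such as lambda: fetch_status(service_id, api_key), where service_id is the deployed_model value returned in step 1.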
3. Make inference requests
If this is your first time using the DashScope SDK, see Install the SDK.
Ensure the API Key's workspace matches the model's deployment workspace.
Use the SDK to send a request to the dedicated service:
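import os
from http import HTTPStatus

import dashscope
from dashscope import Generation

# The curl examples use the international endpoint; point the SDK at the same one.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

response = Generation.call(
    # Use the deployed_model value returned in step 1, not the base model name.
    model='qwen3-8b-ft-202511132025-0260',
    prompt='Who are you?',
    enable_thinking=False,
    api_key=os.getenv('DASHSCOPE_API_KEY'),
)
if response.status_code == HTTPStatus.OK:
    # Generated text and token usage for a successful call.
    print(response.output)
    print(response.usage)
else:
    # Error code and message returned by the service.
    print(response.code)
    print(response.message)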
A successful execution returns the following result:
{"text": "I am Qwen, a large language model developed by Alibaba Cloud. I am designed to generate various types of text, such as articles, stories, and poems, and to engage in conversations, answer questions, provide information, and offer help in different scenarios. I'm happy to serve you! If you have any questions or need assistance, please feel free to let me know.", "finish_reason": "stop", "choices": null}
{"input_tokens": 11, "output_tokens": 63, "total_tokens": 74}
4. Delete the dedicated service
Running the delete command immediately takes the deployment service offline. This action is irreversible and has the following effects:
You can no longer call the model.
Billing for the service stops.
If you no longer need a dedicated service, you can delete it by using the following command:
curl --request DELETE 'https://dashscope-intl.aliyuncs.com/api/v1/deployments/qwen3-8b-ft-202511132025-0260' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json'
On success, the command returns the following result:
{
    "request_id": "8f726017-6042-420e-a465-0d366a3aba59",
    "output": {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:32:08",
        "gmt_modified": "2025-11-27T16:35:31.591",
        "status": "DELETING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "base_capacity": 2,
        "capacity": 2,
        "ready_capacity": 2,
        "workspace_id": "llm-8v53etv3hwb8orx1",
        "charge_type": "post_paid",
        "creator": "1654290265984853",
        "modifier": "1654290265984853",
        "plan": "mu",
        "model_unit_spec": "MU1"
    }
}
After deleting the service, you can no longer query its status using the method described in 2. Query service status.
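Since deletion is irreversible, a script that deletes services may benefit from an explicit confirmation step. The sketch below builds the same DELETE request as the curl command in this step, but refuses to do so unless the caller repeats the service ID. The function names and the confirmation argument are hypothetical conveniences, not part of the API.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_delete_request(deployed_model, api_key, confirm):
    """Return a DELETE request for one dedicated service.

    The caller must pass the service ID a second time as `confirm`;
    this guard is an illustrative safety check, not an API requirement.
    """
    if confirm != deployed_model:
        raise ValueError("confirmation does not match the service ID")
    service_path = urllib.parse.quote(deployed_model, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/{service_path}",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


def is_deleting(response_body):
    """Check that a parsed delete response reports the DELETING status."""
    return response_body["output"]["status"] == "DELETING"


service_id = "qwen3-8b-ft-202511132025-0260"
request = build_delete_request(service_id, "example-key", confirm=service_id)
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) takes the service offline immediately, so run that call only with a real API key and a service you intend to remove.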
API reference
For details on the API calls, see API details.
FAQ
Permission errors during deployment
When deploying a model by using an API, ensure the following:
The API Key's Workspace must have permission to manage the model. Go to the Business Space Management page in Model Studio and check the model deployment permission settings for the corresponding workspace.
API call error: Workspace xxx does not have deployment privilege for model xxxx.
In the Actions column for the corresponding workspace, click Model Permission and Flow Control Settings.
In the Model List, find the target model and check the authorization status in the Model Deployment column. If it shows Not Authorized, click Edit in the Actions column to grant permission.
If a "permission denied" message appears, contact your organization or IT administrator to add the required permissions or perform the operation for you.
The Owner Account that owns the API Key has the required permissions in the Workspace. Go to the Model Studio console, click the workspace in the lower-left corner, switch to the correct workspace, and then click to check the model deployment permission settings.
API call error: Workspace access denied.
In the left-side navigation pane, click Permission Management and confirm that the user list includes the API Key's owner account (with type primary account).
If a "permission denied" message appears, contact your organization or IT administrator to add the required permissions or perform the operation for you.
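When deployments are scripted, the two error messages above can be mapped to the matching check automatically. The following sketch does this with plain substring matching. The function name diagnose_deployment_error and the hint wording are hypothetical, and it assumes the error text reaches your code unchanged; the real message format may vary.

```python
def diagnose_deployment_error(message):
    """Map a deployment error message to the check described in this FAQ."""
    if "does not have deployment privilege" in message:
        return ("Check the workspace's model deployment permission under "
                "Model Permission and Flow Control Settings.")
    if "Workspace access denied" in message:
        return ("Check under Permission Management that the user list "
                "includes the API Key's owner account.")
    return "Unrecognized error; see the API reference."


example = "Workspace xxx does not have deployment privilege for model xxxx."
print(diagnose_deployment_error(example))
```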