Deploy a dedicated service

This topic shows you how to deploy a Qwen model on Alibaba Cloud Model Studio by using API calls.

Prerequisites

You have read the Model Deployment Overview and are familiar with the supported models and basic deployment steps.
You have obtained an API key and exported it as the DASHSCOPE_API_KEY environment variable, which the commands in this topic read.
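Every curl command in this topic reads the key from the DASHSCOPE_API_KEY environment variable. As a minimal Python sketch of the same convention, the helper below builds the two request headers those commands send and fails early if the variable is missing. The name auth_headers is hypothetical and is not part of any SDK.

```python
import os


def auth_headers(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Build the Authorization and Content-Type headers used by the curl commands.

    auth_headers is an illustrative helper, not an SDK function.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail before any request is sent, so a missing key is easy to spot.
        raise RuntimeError(f"{env_var} is not set; export your API key first")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Calling auth_headers() after exporting the key returns a dictionary that can be passed to any HTTP client.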

1. Deploy the model

The following command uses the fine-tuned custom model qwen3-8b-ft-202511132025-0260 to create a dedicated service named qwen3-8b-ft-202511132025-0260.

To obtain the custom model ID, go to the Model Studio console - model finetuning page. Click the Task Name of the model you want to deploy, click Outputs, and then click the blue model name. This opens the My Models page, where you can find the Model ID in the basic information section.

Use the Model ID as the input for the model_name parameter to deploy the model by using the API.

Provisioned Throughput (PTU)

Note

After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.

The provisioned throughput billing method charges based on usage duration. This method is suitable for scenarios that require stable throughput, high concurrency, low latency, and predictable traffic. In this mode, the platform provisions both throughput/concurrency and generation speed, which you cannot adjust.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_flash",
    "model_name": "qwen-flash-2025-07-28",
    "plan": "ptu",
    "ptu_capacity": {
        "input_tpm": 10000,
        "output_tpm": 1000
    }
}'
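The same request can be sketched in Python with only the standard library. The example below mirrors the URL, headers, and JSON body of the curl command above. The function name build_ptu_request and the placeholder key are assumptions for illustration. The request is only constructed and printed, not sent, because sending it creates a billable service.

```python
import json
import os
import urllib.request

DEPLOYMENTS_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_ptu_request(name, model_name, input_tpm, output_tpm, api_key):
    """Build (but do not send) the POST request shown in the curl command."""
    payload = {
        "name": name,
        "model_name": model_name,
        "plan": "ptu",
        "ptu_capacity": {"input_tpm": input_tpm, "output_tpm": output_tpm},
    }
    return urllib.request.Request(
        DEPLOYMENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_ptu_request(
    "my_qwen_flash",
    "qwen-flash-2025-07-28",
    10000,
    1000,
    # "sk-placeholder" is a stand-in so the sketch runs without a real key.
    os.getenv("DASHSCOPE_API_KEY", "sk-placeholder"),
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually create the deployment (billing starts once it is deployed):
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```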

Model unit

Note

After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.
Computing resources for the post-paid model unit plan are allocated on a first-come, first-served basis. If the purchase is unsuccessful, a full refund will be issued.

The model unit billing method charges you based on usage duration. This billing method is ideal for large-scale inference tasks after model finetuning, offering dedicated resources with flexible performance and cost adjustments. You can customize both throughput/concurrency and generation speed.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_plus",
    "model_name": "qwen-plus-2025-12-01",
    "plan": "mu",
    "deploy_spec": "MU1",
    "enable_thinking": true,
    "capacity": 4,
    "max_context_length": 10000,
    "rpm_limit": 500,
    "tpm_limit": 1000
}'

The model unit deployment mode supports the following additional settings:

Parameter	Description
Service name	A custom name for the deployed service.
Model	The model to deploy. You can select from preset and fine-tuned models.
Model Unit type	The deployment specification. Different specifications provide different computing power and performance.
Number of replicas	The initial number of replicas. This setting affects the service's concurrency.
Deployment template	Specifies the deployment template, such as "Single-node deployment". Different templates correspond to different resource configurations. This parameter is available only with the Model Unit billing method.
Model inference mode	For some models deployed as a Model Unit, you can configure the inference mode. The options are Instruct (the deployed model runs inference in instruct mode) and Thinking (the deployed model runs inference in thinking mode).
Max context	The maximum context length for the model, which varies by model type. This setting is available only for certain models deployed as a Model Unit.
Service throttling	Configures rate limits for the service, such as requests per minute (RPM) and tokens per minute (TPM). This setting is available only for some models deployed as a Model Unit.

To learn how to configure these settings by using the API, see Create a model deployment task by using an API.
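As a rough illustration of how the settings in the table might line up with the request body, the sketch below groups the fields from the model unit curl example into a small Python dataclass. The pairing of each console setting with a JSON field is an assumption inferred from the field names in that example, not a confirmed mapping. The deployment template setting has no counterpart in the example and is left out. The class name and the capacity check are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelUnitDeployment:
    """Fields taken from the model unit curl example; comments are assumed mappings."""

    name: str                # assumed: Service name
    model_name: str          # assumed: Model
    deploy_spec: str         # assumed: Model Unit type, for example "MU1"
    capacity: int            # assumed: Number of replicas
    enable_thinking: bool    # assumed: Model inference mode (True = thinking)
    max_context_length: int  # assumed: Max context
    rpm_limit: int           # assumed: Service throttling, requests per minute
    tpm_limit: int           # assumed: Service throttling, tokens per minute
    plan: str = "mu"

    def to_payload(self) -> dict:
        # Local sanity check only; the service may enforce different limits.
        if self.capacity < 1:
            raise ValueError("capacity must be at least 1")
        return asdict(self)


deployment = ModelUnitDeployment(
    name="my_qwen_plus",
    model_name="qwen-plus-2025-12-01",
    deploy_spec="MU1",
    capacity=4,
    enable_thinking=True,
    max_context_length=10000,
    rpm_limit=500,
    tpm_limit=1000,
)
payload = deployment.to_payload()
print(json.dumps(payload, indent=4))
```

Keeping the fields in one typed object makes it harder to misspell a key or send a string where the example uses a number.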

Token usage

With token usage billing, you are charged based on the number of tokens consumed. This method is suitable for cost-sensitive scenarios without strict concurrency or latency requirements. It offers the best price; the platform provisions throughput/concurrency and generation speed, and you cannot adjust them.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "qwen3-8b-ft-202511132025-0260",
    "plan": "lora",
    "capacity": 1,
    "name": "qwen3-8b-ft"
}'

The capacity parameter is required but currently has no effect. To request scaling, go to the model deployment console and submit a form.

On success, the command returns the following result (using a LoRA deployment as an example):

{
    "request_id": "83b173ab-2b2f-41aa-8c57-b173e8be934e",
    "output":
    {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:06:46.405",
        "gmt_modified": "2025-11-20T20:06:46.405",
        "status": "PENDING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "workspace_id": "llm-8v*****",
        "charge_type": "post_paid",
        "creator": "16542*****",
        "modifier": "16542*****",
        "plan": "***"
    }
}

In the response, deployed_model is the unique ID of the dedicated service.
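A script that creates the service usually needs to keep this ID for the later steps. The sketch below parses a trimmed copy of the sample response above and extracts deployed_model and status. The function name parse_deployment is hypothetical, and the error handling is an assumption about how a caller might want to treat an unexpected body.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
    "request_id": "83b173ab-2b2f-41aa-8c57-b173e8be934e",
    "output": {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "status": "PENDING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b"
    }
}
"""


def parse_deployment(response_text: str) -> tuple:
    """Return (deployed_model, status) from a deployment response body."""
    body = json.loads(response_text)
    output = body.get("output") or {}
    deployed_model = output.get("deployed_model")
    if not deployed_model:
        # Assumed handling: treat a body without the ID as an error.
        raise ValueError("response does not contain output.deployed_model")
    return deployed_model, output.get("status")


service_id, status = parse_deployment(SAMPLE_RESPONSE)
print(service_id, status)
```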

2. Query service status

Run the following command to query the details of a specific dedicated service:

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments/qwen3-8b-ft-202511132025-0260" \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json'

On success, the command returns the following result:

{
    "request_id": "ca36952d-9136-426e-ab08-68a97ad72719",
    "output":
    {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:32:08",
        "gmt_modified": "2025-11-20T20:42:25",
        "status": "RUNNING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "base_capacity": 2,
        "capacity": 2,
        "ready_capacity": 2,
        "workspace_id": "llm-8v53etv3hwb8orx1",
        "charge_type": "post_paid",
        "creator": "1654290265984853",
        "modifier": "1654290265984853",
        "plan": "mu",
        "model_unit_spec": "MU1"
    }
}

When the service status is RUNNING, the service deployment is complete.
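Because a new service starts in PENDING, a script typically repeats the status query until it sees RUNNING. The sketch below shows one way to structure that loop. The function wait_until_running, its interval and timeout defaults, and the injected fetch_status callable are all assumptions for illustration. This topic shows only the PENDING, RUNNING, and DELETING statuses, so the loop simply waits for RUNNING or gives up after the timeout.

```python
import time
from typing import Callable


def wait_until_running(
    fetch_status: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status repeatedly until it returns "RUNNING" or time runs out.

    fetch_status is expected to run the status query and return output.status.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "RUNNING":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"service still {status!r} after {timeout_s} s")
        sleep(interval_s)


# Demonstration with canned statuses instead of real API calls.
canned = iter(["PENDING", "PENDING", "RUNNING"])
result = wait_until_running(lambda: next(canned), interval_s=0, sleep=lambda s: None)
print(result)
```

In real use, fetch_status would wrap the GET request shown above and return the status field of its response.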

3. Make inference requests

Note

If this is your first time using the DashScope SDK, see Install the SDK.

Ensure that the API key belongs to the same workspace in which the model is deployed.

Use the SDK to send a request to the dedicated service:

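import os
from http import HTTPStatus

import dashscope
from dashscope import Generation

# The curl commands in this topic use the international endpoint;
# point the SDK at the same region.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

response = Generation.call(
    # Use the deployed_model value returned in step 1.
    model='qwen3-8b-ft-202511132025-0260',
    prompt='Who are you?',
    enable_thinking=False,
    api_key=os.getenv('DASHSCOPE_API_KEY'),
)
if response.status_code == HTTPStatus.OK:
    print(response.output)
    print(response.usage)
else:
    # On failure, print the error code and message returned by the service.
    print(response.code)
    print(response.message)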

A successful execution returns the following result:

{"text": "I am Qwen, a large language model developed by Alibaba Cloud. I am designed to generate various types of text, such as articles, stories, and poems, and to engage in conversations, answer questions, provide information, and offer help in different scenarios. I'm happy to serve you! If you have any questions or need assistance, please feel free to let me know.", "finish_reason": "stop", "choices": null}
{"input_tokens": 11, "output_tokens": 63, "total_tokens": 74}

4. Delete the dedicated service

Warning

Running the delete command immediately takes the dedicated service offline. This action is irreversible and has the following effects:

You can no longer call the model.
Billing for the service stops.

If you no longer need a dedicated service, you can delete it by using the following command:

curl --request DELETE 'https://dashscope-intl.aliyuncs.com/api/v1/deployments/qwen3-8b-ft-202511132025-0260' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json'

On success, the command returns the following result:

{
    "request_id": "8f726017-6042-420e-a465-0d366a3aba59",
    "output":
    {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:32:08",
        "gmt_modified": "2025-11-27T16:35:31.591",
        "status": "DELETING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "base_capacity": 2,
        "capacity": 2,
        "ready_capacity": 2,
        "workspace_id": "llm-8v53etv3hwb8orx1",
        "charge_type": "post_paid",
        "creator": "1654290265984853",
        "modifier": "1654290265984853",
        "plan": "mu",
        "model_unit_spec": "MU1"
    }
}

After you delete the service, you can no longer query its status by using the method described in step 2 (Query service status).
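For scripted cleanup, the DELETE call can be sketched in Python with the standard library. The example mirrors the URL and headers of the curl command above. The function name build_delete_request is hypothetical, and the request is built but not sent because deletion is irreversible.

```python
import urllib.parse
import urllib.request

DEPLOYMENTS_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_delete_request(deployed_model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the DELETE request for one dedicated service."""
    if not deployed_model:
        # Guard against accidentally targeting the collection URL.
        raise ValueError("deployed_model must not be empty")
    url = f"{DEPLOYMENTS_URL}/{urllib.parse.quote(deployed_model, safe='')}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


# "sk-placeholder" is a stand-in so the sketch runs without a real key.
delete_request = build_delete_request("qwen3-8b-ft-202511132025-0260", "sk-placeholder")
print(delete_request.get_method(), delete_request.full_url)

# To actually delete the service (this cannot be undone):
# with urllib.request.urlopen(delete_request) as resp:
#     print(resp.read().decode("utf-8"))
```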

API reference

For details on the API calls, see API details.

FAQ

Permission errors during deployment

When deploying a model by using an API, ensure the following:

The API Key's Workspace must have permission to manage the model. Go to the Business Space Management page in Model Studio and check the model deployment permission settings for the corresponding workspace.

API call error: Workspace xxx does not have deployment privilege for model xxxx.

In the Actions column for the corresponding workspace, click Model Permission and Flow Control Settings.

In the Model List, find the target model and check the authorization status in the Model Deployment column. If it shows Not Authorized, click Edit in the Actions column to grant permission.

If the permission error persists, contact your organization or IT administrator to grant the required permission or perform the operation for you.

The account that owns the API key must have the required permissions in the workspace. Go to the Model Studio console, click the workspace in the lower-left corner, switch to the correct workspace, and then check the model deployment permission settings.

API call error: Workspace access denied.

In the left-side navigation pane, click Permission Management and confirm that the user list includes the API Key's owner account (with type primary account).

If the permission error persists, contact your organization or IT administrator to grant the required permission or perform the operation for you.
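The two error messages above can be told apart programmatically. The sketch below matches the message text quoted in this FAQ and returns a short hint about which check applies. The function name diagnose_permission_error and the hint wording are illustrative assumptions, and the exact message format returned by the service may differ.

```python
def diagnose_permission_error(message: str) -> str:
    """Map a deployment error message to the matching FAQ check."""
    text = message.lower()
    if "does not have deployment privilege" in text:
        return (
            "Workspace lacks deployment permission for the model: "
            "check the Model Deployment column for the workspace."
        )
    if "workspace access denied" in text:
        return (
            "API key owner lacks access to the workspace: "
            "check Permission Management for the owner account."
        )
    return "Unrecognized error: see the API reference or contact your administrator."


hint = diagnose_permission_error(
    "Workspace xxx does not have deployment privilege for model xxxx."
)
print(hint)
```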