Alibaba Cloud Model Studio: Deploy a model by using an API or CLI

Last Updated: Jun 06, 2026

This topic shows you how to deploy a Qwen model on Alibaba Cloud Model Studio by using API calls.

Prerequisites

1. Deploy the model

The commands in this step create a dedicated service. The token usage example deploys the fine-tuned custom model qwen3-8b-ft-202511132025-0260, and the resulting dedicated service has the ID qwen3-8b-ft-202511132025-0260. The provisioned throughput and model unit examples deploy other models.

To obtain the custom model ID, go to the model finetuning page in the Model Studio console. Click the Task Name of the model you want to deploy, click Outputs, and then click the blue model name. This opens the My Models page, where you can find the Model ID in the basic information section.

Use the Model ID as the input for the model_name parameter to deploy the model by using the API.

Provisioned Throughput (PTU)

Note

After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.

The provisioned throughput billing method charges based on usage duration. This method is suitable for scenarios that require stable throughput, high concurrency, low latency, and predictable traffic. In this mode, the platform provisions both throughput/concurrency and generation speed, which you cannot adjust.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_flash",
    "model_name": "qwen-flash-2025-07-28",
    "plan": "ptu",
    "ptu_capacity": {
        "input_tpm": 10000,
        "output_tpm": 1000
    }
}'
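
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: build_ptu_request is a hypothetical helper name, and the endpoint, headers, and body fields are copied from the curl command above. The request is built but not sent, because sending it creates a billable service.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl command above.
DEPLOYMENTS_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_ptu_request(name, model_name, input_tpm, output_tpm, api_key):
    """Build (but do not send) the POST request for a PTU deployment.

    build_ptu_request is a hypothetical helper, not part of any SDK.
    """
    payload = {
        "name": name,
        "model_name": model_name,
        "plan": "ptu",
        "ptu_capacity": {"input_tpm": input_tpm, "output_tpm": output_tpm},
    }
    return urllib.request.Request(
        DEPLOYMENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_ptu_request(
    name="my_qwen_flash",
    model_name="qwen-flash-2025-07-28",
    input_tpm=10000,
    output_tpm=1000,
    api_key=os.getenv("DASHSCOPE_API_KEY", ""),
)
# Sending the request starts billing once the service is deployed,
# so the send step is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```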

Model unit

Note
  • After you run the deployment command, billing starts as soon as the service is successfully deployed, even if you do not use it. Before proceeding, we recommend you review the service billing rules.

  • Computing resources for the post-paid model unit plan are allocated on a first-come, first-served basis. If the purchase is unsuccessful, a full refund will be issued.

The model unit billing method charges you based on usage duration. This billing method is ideal for large-scale inference tasks after model finetuning, offering dedicated resources with flexible performance and cost adjustments. You can customize both throughput/concurrency and generation speed.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "name": "my_qwen_plus",
    "model_name": "qwen-plus-2025-12-01",
    "plan": "mu",
    "deploy_spec": "MU1",
    "enable_thinking": true,
    "capacity": 4,
    "max_context_length": 10000,
    "rpm_limit": 500,
    "tpm_limit": 1000
}'

The model unit deployment mode supports the following additional settings:

  • Configure model inference mode: For some models, you can configure the inference mode, maximum context length, and other settings when deploying them using the Model Unit method.

      • Instruct - The model is deployed for inference in non-thinking mode.

      • Thinking - The model is deployed for inference in thinking mode.

  • Maximum context length: This setting is supported for the Model Unit deployment mode of some models. The maximum context length depends on the model type.

  • Service throttling: This setting is supported for the Model Unit deployment mode of some models. It lets you limit the RPM and TPM of model calls.

To learn how to configure these settings by using the API, see Create a model deployment task by using an API.
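
As an illustration of how these optional settings fit into the request body, the following sketch assembles a model unit payload in Python. build_mu_payload is a hypothetical helper, and the field names are taken from the curl example above. It assumes that enable_thinking set to true corresponds to the Thinking mode and false to the Instruct mode, and that an omitted optional field falls back to a service default. Neither assumption is confirmed here.

```python
import json


def build_mu_payload(name, model_name, deploy_spec, capacity,
                     thinking=None, max_context_length=None,
                     rpm_limit=None, tpm_limit=None):
    """Assemble the JSON body for a model unit ("mu") deployment.

    Hypothetical helper. Optional settings left as None are omitted,
    on the assumption that the service then applies its own defaults.
    """
    payload = {
        "name": name,
        "model_name": model_name,
        "plan": "mu",
        "deploy_spec": deploy_spec,
        "capacity": capacity,
    }
    optional = {
        "enable_thinking": thinking,            # assumed: True = Thinking, False = Instruct
        "max_context_length": max_context_length,
        "rpm_limit": rpm_limit,                 # service throttling: requests per minute
        "tpm_limit": tpm_limit,                 # service throttling: tokens per minute
    }
    # Keep explicit False values; drop only settings that were not provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# Reproduces the body of the curl example above.
payload = build_mu_payload(
    name="my_qwen_plus",
    model_name="qwen-plus-2025-12-01",
    deploy_spec="MU1",
    capacity=4,
    thinking=True,
    max_context_length=10000,
    rpm_limit=500,
    tpm_limit=1000,
)
print(json.dumps(payload, indent=4))
```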

Token usage

With token usage billing, you are charged based on token usage. This method is suitable for cost-sensitive scenarios where concurrency and latency requirements are not critical. This mode offers the best price advantage; the platform provisions throughput/concurrency and generation speed, which you cannot adjust.

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments" \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model_name": "qwen3-8b-ft-202511132025-0260",
    "plan": "lora",
    "capacity": 1,
    "name": "qwen3-8b-ft"
}'

The capacity parameter is required but currently has no effect. To request scaling, go to the model deployment console and submit a form.

On success, the command returns the following result (using a LoRA deployment as an example):

{
    "request_id": "83b173ab-2b2f-41aa-8c57-b173e8be934e",
    "output":
    {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:06:46.405",
        "gmt_modified": "2025-11-20T20:06:46.405",
        "status": "PENDING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "workspace_id": "llm-8v*****",
        "charge_type": "post_paid",
        "creator": "16542*****",
        "modifier": "16542*****",
        "plan": "***"
    }
}

In the response, deployed_model is the unique ID of the dedicated service. Use this ID in the following steps to query, call, and delete the service.
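
A script that creates a deployment usually needs to keep this ID for the later steps. The following sketch parses a response and extracts the ID. extract_service_id is a hypothetical helper, and the sample text is an abbreviated copy of the response shown above.

```python
import json

# Abbreviated copy of the sample response shown above.
create_response = """
{
    "request_id": "83b173ab-2b2f-41aa-8c57-b173e8be934e",
    "output": {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "status": "PENDING",
        "base_model": "qwen3-8b"
    }
}
"""


def extract_service_id(response_text):
    """Return (deployed_model, status) from a deployment response body.

    Hypothetical helper. Raises ValueError if the response has no
    output.deployed_model field, for example when the request failed.
    """
    output = json.loads(response_text).get("output") or {}
    deployed_model = output.get("deployed_model")
    if not deployed_model:
        raise ValueError("response does not contain output.deployed_model")
    return deployed_model, output.get("status")


service_id, status = extract_service_id(create_response)
print(service_id, status)
```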

2. Query service status

Run the following command to query the details of a specific dedicated service:

curl "https://dashscope-intl.aliyuncs.com/api/v1/deployments/qwen3-8b-ft-202511132025-0260" \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' 

On success, the command returns the following result:

{
    "request_id": "ca36952d-9136-426e-ab08-68a97ad72719",
    "output":
    {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:32:08",
        "gmt_modified": "2025-11-20T20:42:25",
        "status": "RUNNING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "base_capacity": 2,
        "capacity": 2,
        "ready_capacity": 2,
        "workspace_id": "llm-8v53etv3hwb8orx1",
        "charge_type": "post_paid",
        "creator": "1654290265984853",
        "modifier": "1654290265984853",
        "plan": "mu",
        "model_unit_spec": "MU1"
    }
}

When the service status is RUNNING, the service deployment is complete.
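
Because a new service starts in the PENDING state, a script typically repeats the query until the status becomes RUNNING. The following is a minimal polling sketch under stated assumptions: wait_until_running is a hypothetical helper, fetch_status stands for any function that runs the query above and returns the status string, and the attempt limit and interval are arbitrary illustrative values. Only the PENDING, RUNNING, and DELETING statuses shown in this topic are handled explicitly. Any other status is simply polled again.

```python
import time


def wait_until_running(fetch_status, max_attempts=60, interval_s=30.0,
                       sleep=time.sleep):
    """Poll fetch_status() until it returns "RUNNING".

    Returns the number of attempts used. Raises RuntimeError if the
    service is being deleted, and TimeoutError if max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        status = fetch_status()
        if status == "RUNNING":
            return attempt
        if status == "DELETING":
            raise RuntimeError("service is being deleted")
        if attempt < max_attempts:
            sleep(interval_s)  # wait before the next query
    raise TimeoutError(f"service not RUNNING after {max_attempts} attempts")


# Demonstration with a stubbed status sequence instead of real API calls.
statuses = iter(["PENDING", "PENDING", "RUNNING"])
attempts = wait_until_running(lambda: next(statuses), sleep=lambda _: None)
print(attempts)  # 3
```

Passing fetch_status and sleep as parameters keeps the loop independent of the HTTP client and makes it testable without network access.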

3. Make inference requests

Note

If this is your first time using the DashScope SDK, see Install the SDK.

Ensure the API Key's workspace matches the model's deployment workspace.

Use the SDK to send a request to the dedicated service:

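import os
from http import HTTPStatus

import dashscope
from dashscope import Generation

# The curl commands in this topic use the international endpoint,
# so point the SDK at the same one.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

response = Generation.call(
    # Use the deployed_model value returned by the deployment request.
    model='qwen3-8b-ft-202511132025-0260',
    prompt='Who are you?',
    enable_thinking=False,
    api_key=os.getenv('DASHSCOPE_API_KEY'),
)
if response.status_code == HTTPStatus.OK:
    print(response.output)
    print(response.usage)
else:
    # On failure, print the error code and message returned by the service.
    print(response.code)
    print(response.message)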

A successful execution returns the following result:

{"text": "I am Qwen, a large language model developed by Alibaba Cloud. I am designed to generate various types of text, such as articles, stories, and poems, and to engage in conversations, answer questions, provide information, and offer help in different scenarios. I'm happy to serve you! If you have any questions or need assistance, please feel free to let me know.", "finish_reason": "stop", "choices": null}
{"input_tokens": 11, "output_tokens": 63, "total_tokens": 74}

4. Delete the dedicated service

Warning

Running the delete command immediately takes the deployment service offline. This action is irreversible and has the following effects:

  1. You can no longer call the model.

  2. Billing for the service stops.

If you no longer need a dedicated service, you can delete it by using the following command:

curl --request DELETE 'https://dashscope-intl.aliyuncs.com/api/v1/deployments/qwen3-8b-ft-202511132025-0260' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' 

On success, the command returns the following result:

{
    "request_id": "8f726017-6042-420e-a465-0d366a3aba59",
    "output":
    {
        "deployed_model": "qwen3-8b-ft-202511132025-0260",
        "gmt_create": "2025-11-20T20:32:08",
        "gmt_modified": "2025-11-27T16:35:31.591",
        "status": "DELETING",
        "model_name": "qwen3-8b-ft-202511132025-0260",
        "base_model": "qwen3-8b",
        "base_capacity": 2,
        "capacity": 2,
        "ready_capacity": 2,
        "workspace_id": "llm-8v53etv3hwb8orx1",
        "charge_type": "post_paid",
        "creator": "1654290265984853",
        "modifier": "1654290265984853",
        "plan": "mu",
        "model_unit_spec": "MU1"
    }
}

After deleting the service, you can no longer query its status using the method described in 2. Query service status.
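
The delete call can also be sketched in Python with the standard library. build_delete_request is a hypothetical helper, and the URL and headers are copied from the curl command above. The sketch rejects an empty service ID so that a missing variable cannot produce a request against the collection URL, and it builds the request without sending it because deletion is irreversible.

```python
import os
import urllib.parse
import urllib.request

BASE_URL = "https://dashscope-intl.aliyuncs.com/api/v1/deployments"


def build_delete_request(deployed_model, api_key):
    """Build (but do not send) the DELETE request for a dedicated service.

    build_delete_request is a hypothetical helper, not an SDK function.
    """
    if not deployed_model:
        # An empty ID would turn the URL into the collection endpoint.
        raise ValueError("deployed_model must not be empty")
    service_url = f"{BASE_URL}/{urllib.parse.quote(deployed_model, safe='')}"
    return urllib.request.Request(
        service_url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


delete_request = build_delete_request(
    "qwen3-8b-ft-202511132025-0260", os.getenv("DASHSCOPE_API_KEY", "")
)
print(delete_request.get_method(), delete_request.full_url)
# Deletion is irreversible, so the send step is left commented out:
# with urllib.request.urlopen(delete_request) as response:
#     print(response.read().decode("utf-8"))
```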

API reference

For details on the API calls, see API details.

FAQ

Permission errors during deployment

When deploying a model by using an API, ensure the following:

  1. The workspace that the API key belongs to must have permission to manage the model. If it does not, the API call returns the following error:

    Workspace xxx does not have deployment privilege for model xxxx.

    To check the setting, go to the Business Space Management page in Model Studio. In the Actions column for the corresponding workspace, click Model Permission and Flow Control Settings.

    In the Model List, find the target model and check the authorization status in the Model Deployment column. If it shows Not Authorized, click Edit in the Actions column to grant permission.

    If a "permission denied" message appears, contact your organization or IT administrator to add the required permissions or perform the operation for you.

  2. The owner account of the API key must have the required permissions in the workspace. If it does not, the API call returns the following error:

    Workspace access denied.

    To check the setting, go to the Model Studio console, click the workspace in the lower-left corner, switch to the correct workspace, and then open the model deployment permission settings.

    In the left-side navigation pane, click Permission Management and confirm that the user list includes the API key's owner account (with type primary account).

    If a "permission denied" message appears, contact your organization or IT administrator to add the required permissions or perform the operation for you.
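
In a deployment script, the two error messages above can be mapped to the matching check. The following sketch is illustrative only: diagnose_permission_error is a hypothetical helper, and matching on message text assumes that the service keeps returning the wording quoted in this FAQ.

```python
def diagnose_permission_error(message):
    """Map a deployment error message to the check described in this FAQ.

    Hypothetical helper: it matches on the two message texts quoted above,
    which is an assumption about how stable those texts are.
    """
    text = message.lower()
    if "does not have deployment privilege" in text:
        return ("Check the Model Deployment authorization for the workspace "
                "under Model Permission and Flow Control Settings.")
    if "workspace access denied" in text:
        return ("Check under Permission Management that the API key's owner "
                "account is in the workspace user list.")
    return "Not a permission error covered by this FAQ."


print(diagnose_permission_error(
    "Workspace xxx does not have deployment privilege for model xxxx."
))
print(diagnose_permission_error("Workspace access denied."))
```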