
One-Click Deployment of DeepSeek-V3 and DeepSeek-R1 Models

DeepSeek-V3 is a Mixture-of-Experts (MoE) large language model with 671 billion parameters released by DeepSeek. DeepSeek-R1 is a high-performance reasoning model trained on top of DeepSeek-V3-Base. The Model Gallery offers vLLM and BladeLLM accelerated deployment, enabling you to deploy the DeepSeek-V3 and DeepSeek-R1 series models with a single click.

Supported Model List

Note: The full versions of DeepSeek-R1 and DeepSeek-V3 have a very large parameter count (671B) and require a high-end, costly configuration (8 GPUs with at least 96 GB of memory each). Consider choosing the distilled models instead: machine resources for them are more readily available, and deployment costs are much lower.

Tests indicate that the DeepSeek-R1-Distill-Qwen-32B model offers a strong balance of performance and cost, making it well suited for cloud deployment. Other distilled models (7B, 8B, and 14B) are also available for deployment. The Model Gallery also provides a model evaluation tool for assessing a model's actual performance (the evaluation entry is in the upper-right corner of the model product page).

The table below shows the DeepSeek models supported for deployment on PAI (Platform for AI), along with their corresponding configurations and pricing. (When deploying DeepSeek through the PAI Model Gallery following the official tutorial, the platform will automatically preselect the recommended model configuration.)

[Image: table of supported DeepSeek models, recommended configurations, and pricing]

Deployment Method Description:

  • BladeLLM Accelerated Deployment: BladeLLM is a high-performance inference framework independently developed by Alibaba Cloud PAI.
  • vLLM Accelerated Deployment: vLLM is a widely recognized library in the industry for LLM inference acceleration.
  • Standard Deployment: This is the standard deployment method without any inference acceleration.

For optimal performance and maximum supported token count, accelerated deployment (BladeLLM, vLLM) is recommended.

Accelerated deployment supports only the API call method. Standard deployment supports both the API call method and the WebUI chat interface.

Deploy Models

1.  Navigate to the Model Gallery page.

  • Log on to the PAI console.
  • In the upper-left corner, select a region that meets your business requirements.
  • In the left-side navigation pane, select Workspace List, then click the name of the target workspace to enter it.
  • In the left-side navigation pane, select Getting Started > Model Gallery.

2.  On the Model Gallery page, locate the model card you want to deploy, such as the DeepSeek-R1-Distill-Qwen-32B model, and click to access the model product page.

3.  Click Deploy in the upper right corner, select the deployment method and resources, and deploy with one click to create a PAI-EAS service.

Note: For deploying DeepSeek-R1 and DeepSeek-V3, in addition to the ml.gu8v.c192m1024.8-gu120 and ecs.gn8v-8x.48xlarge instance types in the public resource group (inventory may be limited), the ecs.ebmgn8v.48xlarge instance type is also an option. Be aware that this instance type is not available in the public resource group; to use it, you must purchase EAS dedicated resources.

Use Inference Services

After successful deployment, click View Call Information on the service page to obtain the Endpoint and Token for the call.
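
The Python examples later in this article assume the endpoint and token are stored in two variables. The URL shape below is only illustrative; copy the real values from the call information page:

# Values copied from "View Call Information" on the PAI-EAS service page.
# The endpoint below is a placeholder; use the exact URL shown in the console.
EAS_ENDPOINT = "http://<service-id>.<region>.pai-eas.aliyuncs.com/api/predict/<service-name>"
EAS_TOKEN = "<EAS API Token>"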

The service call methods vary depending on the deployment method. Detailed instructions are available on the model introduction page of the Model Gallery.

WebUI
  • BladeLLM / vLLM accelerated deployment: Not supported directly. You can download the Web UI code and start a Web UI locally. Note: the Web UI code for BladeLLM and vLLM is different (BladeLLM: BladeLLM_github, BladeLLM_oss; vLLM: vLLM_github, vLLM_oss). Start the local Web UI with:

python webui_client.py --eas_endpoint "<EAS API Endpoint>" --eas_token "<EAS API Token>"

  • Standard deployment: Supported.

Online Debugging
  • All deployment methods: Supported. Select the deployment task under Task Management > Deployment Tasks to open the product page, where you will find the online debugging entry.

API Endpoints
  • BladeLLM accelerated deployment:
      completions interface: <EAS_ENDPOINT>/v1/completions
      chat interface: <EAS_ENDPOINT>/v1/chat/completions
  • vLLM accelerated deployment:
      API description file: <EAS_ENDPOINT>/openapi.json
      model list: <EAS_ENDPOINT>/v1/models
      completions interface: <EAS_ENDPOINT>/v1/completions
      chat interface: <EAS_ENDPOINT>/v1/chat/completions
  • Standard deployment: <EAS_ENDPOINT>

OpenAI SDK Compatibility
  • BladeLLM accelerated deployment: Not compatible
  • vLLM accelerated deployment: Compatible
  • Standard deployment: Not compatible

Request Data Format
  • BladeLLM accelerated deployment: The request data formats for the completions and chat interfaces are different.
  • vLLM accelerated deployment: Compared with BladeLLM, a model parameter must be added; its value can be obtained from the model list interface <EAS_ENDPOINT>/v1/models.
  • Standard deployment: Supports both string and JSON request types.

Request Data Examples

BladeLLM accelerated deployment

Completions request data:

{"prompt":"hello world", "stream":"true"}

Chat request data:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Hello World!!"
        }
    ]
}
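
As a concrete illustration, here is a minimal Python sketch that sends the chat request above to a BladeLLM service, assuming the token is passed in the Authorization header as shown on the call information page:

import requests

# Assumption: the EAS token is sent in the Authorization header.
headers = {"Authorization": EAS_TOKEN, "Content-Type": "application/json"}

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello World!!"},
    ]
}

response = requests.post(f"{EAS_ENDPOINT}/v1/chat/completions",
                         headers=headers, json=payload)
response.raise_for_status()
print(response.json())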

vLLM accelerated deployment

In the examples below, substitute <model_name> with a model name retrieved from the model list interface <EAS_ENDPOINT>/v1/models.

Completions request data:

{"model": "<model_name>", "prompt":"hello world"}

Chat request data:

{
    "model": "<model_name>",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Hello!"
        }
    ]
}
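
Because vLLM accelerated deployment is compatible with the OpenAI SDK, you can also call the service through the OpenAI Python client instead of hand-building HTTP requests. A minimal sketch, reusing EAS_ENDPOINT and EAS_TOKEN from above:

from openai import OpenAI

# Point the OpenAI client at the EAS service; the EAS token serves as the API key.
client = OpenAI(base_url=f"{EAS_ENDPOINT}/v1", api_key=EAS_TOKEN)

# Fetch <model_name> from the model list interface rather than hard-coding it.
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)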

Standard deployment

{
    "max_new_tokens": 4,096,
    "use_stream_chat": false,
    "prompt": "What is the capital of Canada?",
    "system_prompt": "Act like you are a knowledgeable assistant who can provide information on geography and related topics.",
    "history": [
        [
            "Can you tell me what's the capital of France?",
            "The capital of France is Paris."
        ]
    ],
    "temperature": 0.8,
    "top_k": 10,
    "top_p": 0.8,
    "do_sample": true,
    "use_cache": true
}
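
A minimal Python sketch of posting this payload to a standard deployment, which accepts requests at the service root <EAS_ENDPOINT> (the Authorization header is the same assumption as in the earlier examples):

import requests

headers = {"Authorization": EAS_TOKEN, "Content-Type": "application/json"}

payload = {
    "max_new_tokens": 4096,
    "use_stream_chat": False,
    "prompt": "What is the capital of Canada?",
    "temperature": 0.8,
    "top_k": 10,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}

# Standard deployment accepts the request body directly at the service root.
response = requests.post(EAS_ENDPOINT, headers=headers, json=payload)
response.raise_for_status()
print(response.text)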

For standard deployments, a Web application is supported. In PAI Model Gallery > Task Management > Deployment Tasks, click the name of the deployed service. On the service product page, click View WEB Application in the upper-right corner to interact with the model in real time through the ChatLLM WebUI.

For API calls, see how to use the API for model inference.

About Costs

  • Due to the large size of the DeepSeek-V3 and DeepSeek-R1 models, deployment costs are substantial, so the full models are best suited to formal production environments.
  • Alternatively, you can deploy lightweight models distilled from the original models. These distilled models have significantly fewer parameters, which greatly reduces deployment costs.
  • For long-term use, consider combining the public resource group with a savings plan or purchasing a subscription EAS resource group to reduce costs.
  • For non-production environments, you can enable the preemptible mode during deployment. Note that successful bidding is subject to certain conditions, and there is a risk of resource instability.