One-click deployment of DeepSeek-V3 and DeepSeek-R1

Last Updated: Mar 19, 2025

DeepSeek-V3 is a large language model (LLM) with 671 billion parameters that uses a mixture of experts (MoE) architecture. DeepSeek-R1 is a high-performance reasoning model trained on top of DeepSeek-V3-Base. The Model Gallery of Platform for AI (PAI) provides standard and accelerated deployment options that enable one-click deployment of the DeepSeek-V3 and DeepSeek-R1 models.

Deploy the model

  1. Go to the Model Gallery page.

    1. Log on to the PAI console. Select a region in the upper left corner.

    2. From the left-side navigation pane, choose Workspaces and click the name of the desired workspace.

    3. In the left-side navigation pane, choose QuickStart > Model Gallery.

  2. Choose a model.

    On the Model Gallery page, find the model that you want to deploy and click it to open the model details page.

    For example, consider DeepSeek-R1-Distill-Qwen-7B, a distilled model that is smaller in size, making it ideal for quick practice. It has low computational resource requirements and can be deployed using free trial resources.

    You can find more DeepSeek models in the model list, where you can also learn about their deployment methods and token limits.

  3. Configure deployment parameters.

    Click Deploy in the upper right corner. The system provides default deployment parameters, which you can modify as needed. After confirming all settings, click Deploy and wait for the deployment to complete.

    Important

    If deploying with public resources, billing starts after the service enters the Running state. Fees will be incurred even if the service is not actually called. Stop unused model services in time to avoid additional expenses.

    Deployment Method: We recommend SGLang or vLLM accelerated deployment (fully compatible with the OpenAI API standard and mainstream AI applications). For more information, see Deployment methods.

    Resource Deployment: The default settings use public resources and recommended specifications.

    • When deploying with public resources, the system automatically filters out the specifications available for the model. If the inventory is insufficient, consider switching to another region. When deploying DeepSeek-R1 or DeepSeek-V3, select resources for full-version models as described in Select resources.

    • When deploying with resource quotas, you must select the deployment method according to the node type. For the GP7V type, select Single-Node-GP7V under SGLang Accelerated Deployment. Otherwise, the deployment will fail.


  4. View more information.

    1. Go to Model Gallery > Job Management > Deployment Jobs.

    2. Click the name of the deployed service.

    3. View the deployment progress and call information.

    4. You can also click More Info in the upper right corner to jump to the service details page in Elastic Algorithm Service (EAS) of PAI.


Call the model

The official usage recommendations for the DeepSeek-R1 series are:

  • Set temperature between 0.5 and 0.7 (0.6 is recommended) to prevent repetitive or incoherent output.

  • Do not use a system prompt. Include all instructions in the user prompt.

  • For mathematical problems, include "Reason step by step and put the final answer in \boxed{}" in the prompt.
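The following is a minimal sketch of these recommendations applied to an OpenAI-compatible chat request against an SGLang or vLLM accelerated deployment. <EAS_ENDPOINT>, <EAS_TOKEN>, and <model_name> are placeholders for your own call information (see Use API below).

    from openai import OpenAI

    # Minimal sketch: DeepSeek-R1 usage recommendations applied to a chat request.
    # Replace <EAS_ENDPOINT>, <EAS_TOKEN>, and <model_name> with your own values.
    client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

    completion = client.chat.completions.create(
        model="<model_name>",
        # No system prompt: all instructions go into the user prompt.
        messages=[
            {
                "role": "user",
                "content": "Solve x^2 - 3x + 2 = 0. "
                "Reason step by step and put the final answer in \\boxed{}.",
            }
        ],
        temperature=0.6,  # recommended range: 0.5 - 0.7
    )
    print(completion.choices[0].message.content)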

Important

The default value of max_tokens for BladeLLM accelerated deployment is 16. Output beyond this limit is truncated, so adjust max_tokens according to your needs.

Use WebUI

Only the Transformers standard deployment method supports the web application. Accelerated deployment methods provide WebUI code instead. You can download the code and start the web application locally.

Transformers standard deployment

  1. Go to Model Gallery > Job Management > Deployment Jobs.

  2. Click the name of the deployed service.

  3. In the upper-right corner of the details page, click View Web App to interact in real-time through the ChatLLM WebUI.


Accelerated deployment

Gradio is a user-friendly interface library based on Python that can quickly create interactive interfaces for machine learning models. We provide the code for building WebUI using Gradio. Follow the steps below to start the web application locally.

  1. Download the WebUI code from GitHub or click the OSS link to download it directly. The code downloaded by both methods is the same.

  2. Run the following command to start the Web application.

    python webui_client.py --eas_endpoint "<EAS API Endpoint>" --eas_token "<EAS API Token>"

    Replace <EAS API Endpoint> with the endpoint of the deployed service, and <EAS API Token> with the service token. To view the endpoint and token:

    1. Go to Model Gallery > Job Management > Deployment Jobs.

    2. Click the name of the deployed service.

    3. Click View Call Information.

Online debugging

  1. Go to Model Gallery > Job Management > Deployment jobs.

  2. Click the name of the deployed service.

  3. In the Online Debugging section, find the entry for EAS online debugging.

  4. Taking SGLang deployment as an example, initiate a POST request to <EAS_ENDPOINT>/v1/chat/completions.

    1. Complete the request path. The prefilled path is <EAS_ENDPOINT>; append v1/chat/completions to the end of it.

    2. Construct the request body.

      Suppose your prompt is: What is 3+5?

      For the chat interface, the request body must include the model parameter, which is the model name retrieved from the model list interface at <EAS_ENDPOINT>/v1/models. Take DeepSeek-R1-Distill-Qwen-7B as an example:

      {
          "model": "DeepSeek-R1-Distill-Qwen-7B",
          "messages": [
              {
                  "role": "system",
                  "content": "You are a helpful assistant."
              },
              {
                  "role": "user",
                  "content": "What is 3+5?"
              }
          ]
      }
    3. Initiate the request.


Here are the request paths and corresponding data samples for other deployment methods:


SGLang or vLLM

In the example below, replace <model_name> with the model name obtained from <EAS_ENDPOINT>/v1/models. You can also obtain the API description file from <EAS_ENDPOINT>/openapi.json.
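For instance, a minimal sketch of querying the model list interface with Python requests, assuming the service token is passed in the Authorization header as in the HTTP examples below:

import requests

# Minimal sketch: query the model list interface to find <model_name>.
# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service call information.
response = requests.get(
    "<EAS_ENDPOINT>/v1/models",
    headers={"Authorization": "<EAS_TOKEN>"},
)
print([model["id"] for model in response.json()["data"]])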

Request data for chat interface <EAS_ENDPOINT>/v1/chat/completions.

Important

The official recommendation for DeepSeek-R1 models is not to use system prompt. All instructions should be included in user prompt.

{
    "model": "<model_name>",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is 3+5?"
        }
    ]
}

BladeLLM

Request data for chat interface <EAS_ENDPOINT>/v1/chat/completions.

Important

The official recommendation for DeepSeek-R1 models is not to use system prompt. All instructions should be included in user prompt.

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is 3+5?"
        }
    ],
    "max_tokens": 2000
}

Transformers standard deployment

  • Request data for chat interface <EAS_ENDPOINT>/v1/chat/completions.

    Important

    The official recommendation for DeepSeek-R1 models is not to use system prompt. All instructions should be included in user prompt.

    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is 3+5?"
            }
        ]
    }
  • Request path: <EAS_ENDPOINT>

    Supports data request formats for the completions and chat interfaces, as well as direct string requests (see the sketch after these samples).

    String type

    What is 3+5?

    completion

    {"prompt":"What is 3+5?"}

    chat

    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is 3+5?"
            }
        ]
    }
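    A minimal sketch of the string-type request with Python requests, assuming the same Authorization header as in the other call examples:

    import requests

    # Minimal sketch: send a raw string request to a Transformers standard
    # deployment. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your call information.
    response = requests.post(
        "<EAS_ENDPOINT>",
        headers={"Authorization": "<EAS_TOKEN>"},
        data="What is 3+5?".encode("utf-8"),
    )
    print(response.text)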

Use API

  1. Obtain the endpoint and token of the service.

    1. Go to Model Gallery > Job Management > Deployment Jobs, click the name of the deployed service.

    2. Click View Call Information to obtain the endpoint and token.

      MG view call information

  2. Call examples of chat interface

    Replace <EAS_ENDPOINT> with the endpoint and <EAS_TOKEN> with the token.

    OpenAI SDK

    Note:

    • Add /v1 at the end of the endpoint.

    • BladeLLM and Transformers standard deployment methods do not support using client.models.list() to obtain the model list. You can directly specify the model value as "" for compatibility.

    SGLang or vLLM

    from openai import OpenAI
    
    ##### API Configuration #####
    # <EAS_ENDPOINT> needs to be replaced with the endpoint of the deployed service, and <EAS_TOKEN> needs to be replaced with the token of the deployed service.
    openai_api_key = "<EAS_TOKEN>"
    openai_api_base = "<EAS_ENDPOINT>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    models = client.models.list()
    model = models.data[0].id
    print(model)
    
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        model=model,
        max_completion_tokens=2048,
        stream=stream,
    )
    
    if stream:
        for chunk in chat_completion:
            print(chunk.choices[0].delta.content, end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)

    BladeLLM or Transformers standard deployment

    from openai import OpenAI
    
    ##### API Configuration #####
    # <EAS_ENDPOINT> needs to be replaced with the endpoint of the deployed service, and <EAS_TOKEN> needs to be replaced with the token of the deployed service.
    openai_api_key = "<EAS_TOKEN>"
    openai_api_base = "<EAS_ENDPOINT>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    # BladeLLM accelerated deployment and Transformers standard deployment currently do not support using client.models.list() to obtain the model name. You can directly specify the model value as "" for compatibility.
    model = ""
    stream = True

    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        model=model,
        max_completion_tokens=2048,
        stream=stream,
    )
    
    if stream:
        for chunk in chat_completion:
            print(chunk.choices[0].delta.content, end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)

    HTTP

    SGLang/vLLM Accelerated Deployment

    In the example below, <model_name> should be replaced with the model name obtained from the Model List Interface <EAS_ENDPOINT>/v1/models.

    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: <EAS_TOKEN>" \
        -d '{
            "model": "<model_name>",
            "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello!"
            }
            ]
        }' \
        <EAS_ENDPOINT>/v1/chat/completions
    
    import json
    import requests
    
    # <EAS_ENDPOINT> needs to be replaced with the endpoint of the deployed service, and <EAS_TOKEN> needs to be replaced with the token of the deployed service.
    EAS_ENDPOINT = "<EAS_ENDPOINT>"
    EAS_TOKEN = "<EAS_TOKEN>"
    
    url = f"{EAS_ENDPOINT}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": EAS_TOKEN,
    }
    
    # <model_name> should be replaced with the model name obtained from the model list interface <EAS_ENDPOINT>/v1/models.
    model = "<model_name>"
    stream = True
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please introduce yourself."},
    ]
    # When using the BladeLLM accelerated deployment method, if the max_tokens parameter is not specified, it will be truncated by default with max_tokens=16. It is recommended to adjust the request parameter max_tokens according to actual needs.
    req = {
        "messages": messages,
        "stream": stream,
        "temperature": 0.0,
        "top_p": 0.5,
        "top_k": 10,
        "max_tokens": 300,
        "model": model,
    }
    response = requests.post(
        url,
        json=req,
        headers=headers,
        stream=stream,
    )
    
    if stream:
        for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
            msg = chunk.decode("utf-8")
            if msg.startswith("data"):
                info = msg[6:]
                if info == "[DONE]":
                    break
                else:
                    resp = json.loads(info)
                    print(resp["choices"][0]["delta"]["content"], end="", flush=True)
    else:
        resp = json.loads(response.text)
        print(resp["choices"][0]["message"]["content"])
    

    BladeLLM accelerated deployment/Transformers standard deployment

    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: <EAS_TOKEN>" \
        -d '{
            "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello!"
            }
            ]
        }' \
        <EAS_ENDPOINT>/v1/chat/completions
    
    import json
    import requests
    
    # <EAS_ENDPOINT> needs to be replaced with the endpoint of the deployed service, and <EAS_TOKEN> needs to be replaced with the token of the deployed service.
    EAS_ENDPOINT = "<EAS_ENDPOINT>"
    EAS_TOKEN = "<EAS_TOKEN>"
    
    url = f"{EAS_ENDPOINT}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": EAS_TOKEN,
    }
    
    
    stream = True
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please introduce yourself."},
    ]
    # When using the BladeLLM accelerated deployment method, if the max_tokens parameter is not specified, it will be truncated by default with max_tokens=16. It is recommended to adjust the request parameter max_tokens according to actual needs.
    req = {
        "messages": messages,
        "stream": stream,
        "temperature": 0.2,
        "top_p": 0.5,
        "top_k": 10,
        "max_tokens": 300,
    }
    response = requests.post(
        url,
        json=req,
        headers=headers,
        stream=stream,
    )
    
    if stream:
        for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
            msg = chunk.decode("utf-8")
            if msg.startswith("data"):
                info = msg[6:]
                if info == "[DONE]":
                    break
                else:
                    resp = json.loads(info)
                    if(resp["choices"][0]["delta"].get("content") is not None):
                          print(resp["choices"][0]["delta"]["content"], end="", flush=True)
    else:
        resp = json.loads(response.text)
        print(resp["choices"][0]["message"]["content"])
    
  3. The inference API may differ across models and deployment frameworks. For a more detailed API reference, see the model details page in Model Gallery.


Resource cleanup

Instances deployed with public resources are billed based on running duration. Durations of less than 1 hour are billed by the minute. To avoid excessive fees, stop or delete the instance in time.


Deployment methods

The deployment methods are introduced as follows:

  • BladeLLM accelerated deployment (recommended): BladeLLM is a high-performance inference framework developed by Alibaba Cloud PAI.

  • SGLang accelerated deployment (recommended): SGLang is a fast serving framework for large language models and vision language models.

  • vLLM accelerated deployment: vLLM is a widely-used library for LLM inference acceleration.

  • Transformers standard deployment: The standard deployment without inference acceleration.

Different models support different deployment methods. Refer to the model list for supported deployment methods, maximum token counts, and minimum deployment requirements.

Usage notes:

  • For performance and maximum token count, we recommend BladeLLM and SGLang accelerated deployment.

  • For compatibility with the OpenAI API, we recommend SGLang and vLLM accelerated deployment. They are fully compatible with the OpenAI API and can be easily integrated into applications.

Important

Transformers standard deployment supports both API and WebUI. Accelerated deployment only supports API.

Model list

Note

The full-version models DeepSeek-R1 and DeepSeek-V3 have a large parameter size of 671B, requiring high specifications at a high cost (8 GPUs, each with 96 GB of video memory). We recommend using distilled models for better resource availability and lower costs.

Tests indicate that DeepSeek-R1-Distill-Qwen-32B offers a good balance of performance and cost, making it suitable for cloud deployment. Distilled models in other sizes, such as 7B, 8B, and 14B, are also available. Model Gallery provides a model evaluation feature to assess performance: click Evaluate in the upper right corner of the model details page to use it.

Full-version models

The following table shows the maximum token count (input + output) supported by each deployment method and the minimum specification required.

| Model | SGLang (recommended) | vLLM | Transformers standard deployment | Minimum specification |
| --- | --- | --- | --- | --- |
| DeepSeek-R1 | 163,840 | Single-machine standard type: 4,096; Single-machine GP7V type: 16,384; Distributed GU7X type: 163,840; Distributed Lingjun resources: 163,840 | Not supported | Single-machine 8 × GU120 (8 × 96 GB video memory) |
| DeepSeek-V3 | 163,840 | Single-machine standard type: 4,096; Single-machine GP7V type: 16,384; Distributed GU7X type: 163,840; Distributed Lingjun AI Computing Service resources: 163,840 | 2,000 | Single-machine 8 × GU120 (8 × 96 GB video memory) |

Select resources

When deploying DeepSeek-R1 or DeepSeek-V3, the available instance types include:

  • Single-machine standard type:

  • Single-machine GP7V type:

    ml.gp7vf.16.40xlarge: public resources, available for bidding (preemptible mode) only. If standard types are out of stock, you can switch to the China (Ulanqab) region and use GP7V types. Make sure to configure a VPC during deployment.

For higher performance, consider distributed deployment:

  • Distributed GU7X type:

    4 × ml.gu7xf.8xlarge-gu108: public resources, available for bidding (preemptible mode) only. You must switch to the China (Ulanqab) region to use them. Make sure to configure a VPC during deployment.

  • Distributed Lingjun resources:

    Whitelist access is required. For inquiries, contact your sales manager or submit a ticket. You must switch to the China (Ulanqab) region to use these resources. Make sure to configure a VPC during deployment.

    PAI-Lingjun AI Computing Service offers high-performance, flexible heterogeneous computing services that can triple resource utilization.

Distill models

The following table shows the maximum token count (input + output) supported by each deployment method and the minimum specification required.

| Model | BladeLLM (recommended) | SGLang (recommended) | vLLM | Transformers standard deployment | Minimum specification |
| --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 131,072 | 131,072 | 131,072 | 131,072 | 1 × A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Qwen-7B | 131,072 | 131,072 | 32,768 | 131,072 | 1 × A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Llama-8B | 131,072 | 131,072 | 32,768 | 131,072 | 1 × A10 (24 GB video memory) |
| DeepSeek-R1-Distill-Qwen-14B | 131,072 | 131,072 | 32,768 | 131,072 | 1 × GPU L (48 GB video memory) |
| DeepSeek-R1-Distill-Qwen-32B | 131,072 | 131,072 | 32,768 | 131,072 | 2 × GPU L (2 × 48 GB video memory) |
| DeepSeek-R1-Distill-Llama-70B | 131,072 | 131,072 | 32,768 | 131,072 | 2 × GU120 (2 × 96 GB video memory) |

Billing details

  • Due to their size, the DeepSeek-V3 and DeepSeek-R1 models incur high deployment costs. We recommend that you use them only in production environments.

  • Alternatively, you can deploy lightweight models distilled based on them, which have significantly fewer parameters and thus lower deployment costs.

  • For long-term use, consider using a public resource group with a savings plan or purchasing an EAS resource group to reduce costs.

  • In a non-production environment, you can activate the preemptible mode when deploying. However, it requires specific conditions and carries the risk of resource instability.

  • If you deploy using public resources, stopping the service will stop the billing. For more information, see Billing of EAS.

FAQ about deployment

The service remains in a waiting state for a long time after I click Deploy

Possible causes:

  • Insufficient resources in the current region.

  • The model requires a long time to load due to its size (for models like DeepSeek-R1 and DeepSeek-V3, it can take 20-30 minutes).

If the service still won't start after a long wait, consider the following steps:

  1. Go to Job Management > Deployment Jobs.

    1. Click the name of the job to go to the details page.

    2. Click More > More Info in the upper-right corner.

    3. Check the Instance Status in the Service Instance section.

    EAS instance status

  2. Terminate the current service. Switch to another region in the upper-left corner of the page and deploy again.

    Note

    Models with large parameter sizes, such as DeepSeek-R1 and DeepSeek-V3, require at least 8 GPUs, and the resource inventory is limited. Consider smaller distilled models such as DeepSeek-R1-Distill-Qwen-7B.

FAQ about calling

API request returns 404

Check whether the URL includes the OpenAI API suffix, such as v1/chat/completions. Refer to the overview page of the model for details.

The request is too long and causes EAS gateway timeout

The default timeout of the EAS gateway is 180 seconds. To extend it, configure an EAS dedicated gateway and submit a ticket to increase the dedicated gateway's timeout to up to 600 seconds.

How to debug the model online

See How do I debug a deployed model service online?

Why is there no "web search" feature

The web search feature requires building an AI agent, not just deploying the model service.

You can use PAI's LangStudio to build such an AI agent, see Chat With Web Search.

What to do if the model skips thinking

If the deployed DeepSeek-R1 model occasionally skips the thinking process, use the force-thinking chat template updated by DeepSeek:

  1. Modify the startup command.

    In the Service Configuration section, edit the JSON script and add --chat-template /model_dir/template_force_thinking.jinja to the containers-script field. It can be appended after --served-model-name DeepSeek-R1.

    If the service is already deployed, click the service name in Model Gallery > Job Management > Deployment jobs. Click Update service in the upper-right corner.


  2. Modify the request body. Append {"role": "assistant", "content": "<think>\n"} to the end of the messages array in each request.
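    A minimal sketch of such a request, based on the chat samples above (replace the placeholders with your own call information):

    import requests

    # Minimal sketch: force the thinking process by appending an assistant turn
    # that opens the <think> block. Replace <EAS_ENDPOINT>, <EAS_TOKEN>, and
    # <model_name> with your own values.
    req = {
        "model": "<model_name>",
        "messages": [
            {"role": "user", "content": "What is 3+5?"},
            {"role": "assistant", "content": "<think>\n"},  # appended last
        ],
    }
    response = requests.post(
        "<EAS_ENDPOINT>/v1/chat/completions",
        headers={"Authorization": "<EAS_TOKEN>"},
        json=req,
    )
    print(response.json()["choices"][0]["message"]["content"])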

How to connect to Chatbox or Dify

Take DeepSeek-R1-Distill-Qwen-7B as an example. SGLang or vLLM accelerated deployment is recommended.

Chatbox

  1. In Settings, select Model Provider and choose Add Custom Provider.


  2. Perform API settings.

    1. Name: Enter "PAI_DeepSeek-R1-Distill-Qwen-7B" or customize your own name.

    2. API Host: The endpoint of the EAS service.

    3. API Path: v1/chat/completions.

    4. API Key: The EAS service token.

    5. Model: If you are using Transformers standard deployment, do not enter the model name. For other deployment methods, enter the specific model name, such as DeepSeek-R1-Distill-Qwen-7B.

    6. Click Save.


  3. Chat with the model to test it.


Dify

  1. In Dify, click Model Provider and add "OpenAI-API-compatible" in ADD MORE MODEL PROVIDER.


  2. Enter "DeepSeek-R1-Distill-Qwen-7B" as Model Name, the EAS service token as the API Key, and the EAS endpoint (add /v1 at the end) as the API endpoint URL.


How to implement multi-round conversation?

The model service itself does not save conversation history. Instead, the client must save the conversation history and then add it to the request. Take SGLang as an example.

curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: ZWUyMzU5MmU2NGFiZWU0ZDRhNWVjMWNhMzI2NTM1ZDllMzZkYTAyYQ==" \
    -d '{
        "model": "<model_name>",
        "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
         {
            "role": "user", 
            "content": "Hello"
        },
        {
            "role": "assistant",
            "content": "Hello! Nice to meet you. How can I help you?"
        },
        {
            "role": "user",
            "content": "What was my previous question?"
        }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions
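
The same idea in Python, as a minimal sketch of a client-side history loop (assuming an SGLang or vLLM deployment and the OpenAI SDK configuration from Use API):

from openai import OpenAI

# Minimal sketch: the client keeps the conversation history and resends it
# with every request. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your values.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id

messages = [{"role": "system", "content": "You are a helpful assistant."}]
for question in ["Hello", "What was my previous question?"]:
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model=model, messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep history
    print(answer)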

References