DeepSeek-V3 is a large language model (LLM) with 671 billion parameters that uses a mixture of experts (MoE) architecture. DeepSeek-R1 is a high-performance reasoning model trained from DeepSeek-V3-Base. The Model Gallery of Platform for AI (PAI) provides standard and accelerated deployment options that enable one-click deployment of the DeepSeek-V3 and DeepSeek-R1 models.
Deploy the model
Go to the Model Gallery page.
Log on to the PAI console. Select a region in the upper left corner.
From the left-side navigation pane, choose Workspaces and click the name of the desired workspace.
In the left-side navigation pane, choose QuickStart > Model Gallery.
Choose a model.
On the Model Gallery page, find the model you want to deploy and click to enter the model details page.
For example, consider DeepSeek-R1-Distill-Qwen-7B, a distilled model that is smaller in size, making it ideal for quick practice. It has low computational resource requirements and can be deployed using free trial resources.
You can find more DeepSeek models in the model list, where you can also learn about their deployment methods and token limits.
Configure deployment parameters.
Click Deploy in the upper right corner. The system provides default deployment parameters, which you can modify as needed. After confirming all settings, click Deploy and wait for the deployment to complete.
Important: If you deploy with public resources, billing starts once the service enters the Running state. Fees are incurred even if the service is not called. Stop unused model services promptly to avoid additional expenses.
Deployment Method: We recommend SGLang or vLLM accelerated deployment (fully compatible with the OpenAI API standard and mainstream AI applications). For more information, see Deployment methods.
Resource Deployment: The default settings use public resources and recommended specifications.
When deploying with public resources, the system automatically filters out the specifications available for the model. If stock is insufficient, consider switching to another region. When deploying DeepSeek-R1 or DeepSeek-V3, select instance types for the full-version models as described in Select resources below.
When deploying with resource quotas, you must select the deployment method according to the node type. For the GP7V type, select Single-Node-GP7V under SGLang Accelerated Deployment. Otherwise, the deployment fails.
View more information.
Go to Model Gallery > Job Management > Deployment Jobs.
Click the name of the deployed service.
View the deployment progress and call information.
You can also click More Info in the upper right corner to go to the service details page in Elastic Algorithm Service (EAS) of PAI.
Call the model
The official usage recommendations for the DeepSeek-R1 series are as follows (a request example is shown after this list):
Set temperature between 0.5 and 0.7 (0.6 is recommended) to prevent repetitive or incoherent output.
Do not use a system prompt. Include all instructions in the user prompt.
For mathematical problems, include "Reason step by step and put the final answer in \boxed{}." in the prompt.
The default value of max_tokens for BladeLLM accelerated deployment is 16, and output beyond this limit is truncated. Adjust max_tokens according to your needs.
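The following is a minimal sketch of a request that follows these recommendations. It assumes an SGLang or vLLM accelerated deployment and uses the <EAS_ENDPOINT> and <EAS_TOKEN> placeholders described in Use API below.
from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of the deployed service.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

# Query the model list (not supported by BladeLLM or Transformers standard deployment).
model = client.models.list().data[0].id

completion = client.chat.completions.create(
    model=model,
    # No system prompt: all instructions go into the user message.
    messages=[
        {
            "role": "user",
            "content": "What is 3+5? Reason step by step and put the final answer in \\boxed{}.",
        }
    ],
    temperature=0.6,              # recommended range: 0.5-0.7
    max_completion_tokens=2048,   # raise the output limit (BladeLLM defaults to 16)
)
print(completion.choices[0].message.content)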
Use WebUI
Only the Transformers standard deployment method supports the web application. Accelerated deployment methods provide WebUI code instead. You can download the code and start the web application locally.
Transformers standard deployment
Go to Model Gallery > Job Management > Deployment Jobs.
Click the name of the deployed service.
In the upper-right corner of the details page, click View Web App to interact in real-time through the ChatLLM WebUI.
Accelerated deployment
Gradio is a user-friendly, Python-based interface library for quickly creating interactive interfaces for machine learning models. We provide code for building a WebUI with Gradio. Follow the steps below to start the web application locally.
Download the WebUI code from GitHub or click the OSS link to download it directly. The code downloaded by both methods is the same.
BladeLLM: BladeLLM_github, BladeLLM_oss
vLLM, SGLang: vLLM/SGLang_github, vLLM/SGLang_oss
Run the following command to start the Web application.
python webui_client.py --eas_endpoint "<EAS API Endpoint>" --eas_token "<EAS API Token>"
Replace <EAS API Endpoint> with the endpoint of the deployed service, and <EAS API Token> with the service token. To view the endpoint and token:
Go to Model Gallery > Job Management > Deployment Jobs.
Click the name of the deployed service.
Click View Call Information.
Online debugging
Go to Model Gallery > Job Management > Deployment jobs.
Click the name of the deployed service.
In the Online Debugging section, find the entry for EAS online debugging.
Take SGLang deployment as an example. Initiate a POST request to <EAS_ENDPOINT>/v1/chat/completions.
Complete the request path. The prefilled path is <EAS_ENDPOINT>. Append v1/chat/completions to the end of it.
Construct the request body.
Suppose your prompt is: What is 3+5?
For the conversation interface, the request body must include the model parameter, which is the model name retrieved from the model list interface at <EAS_ENDPOINT>/v1/models. Take DeepSeek-R1-Distill-Qwen-7B as an example:
{
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is 3+5?"
        }
    ]
}
Initiate the request.
Here are the request paths and corresponding data samples for other deployment methods:
Use API
Obtain the endpoint and token of the service.
Go to Model Gallery > Job Management > Deployment Jobs, click the name of the deployed service.
Click View Call Information to obtain the endpoint and token.
Call examples of chat interface
Replace <EAS_ENDPOINT> with the endpoint and <EAS_TOKEN> with the token.
OpenAI SDK
Note:
Add /v1 at the end of the endpoint.
BladeLLM and Transformers standard deployment methods do not support using client.models.list() to obtain the model list. You can directly specify the model value as "" for compatibility.
SGLang or vLLM
from openai import OpenAI

##### API Configuration #####
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)

stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_completion_tokens=2048,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
BladeLLM or Transformers standard deployment
from openai import OpenAI

##### API Configuration #####
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# BladeLLM accelerated deployment and Transformers standard deployment currently do not support using client.models.list() to obtain the model name. You can directly specify the model value as "" for compatibility.
model = ""

stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_completion_tokens=2048,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
HTTP
SGLang/vLLM Accelerated Deployment
In the example below, replace <model_name> with the model name obtained from the model list interface <EAS_ENDPOINT>/v1/models.
curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello!"
            }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions
import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

# Replace <model_name> with the model name obtained from the model list interface <EAS_ENDPOINT>/v1/models.
model = "<model_name>"
stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, please introduce yourself."},
]

# When using the BladeLLM accelerated deployment method, if the max_tokens parameter is not specified, output is truncated at the default max_tokens=16. Adjust max_tokens according to your needs.
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.0,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
    "model": model,
}

response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        if msg.startswith("data"):
            info = msg[6:]
            if info == "[DONE]":
                break
            resp = json.loads(info)
            print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
BladeLLM accelerated deployment/Transformers standard deployment
curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello!"
            }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions
import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, please introduce yourself."},
]

# When using the BladeLLM accelerated deployment method, if the max_tokens parameter is not specified, output is truncated at the default max_tokens=16. Adjust max_tokens according to your needs.
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.2,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
}

response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        if msg.startswith("data"):
            info = msg[6:]
            if info == "[DONE]":
                break
            resp = json.loads(info)
            if resp["choices"][0]["delta"].get("content") is not None:
                print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
The inference APIs may differ across models and deployment frameworks. For a more detailed API reference, see the model details page in Model Gallery.
Resource cleanup
Instances deployed with public resources are billed by running duration. If the duration is less than one hour, the fee is calculated by the minute. To avoid unnecessary fees, stop or delete the instance promptly.
Deployment methods
The deployment methods are described as follows:
BladeLLM accelerated deployment (recommended): BladeLLM is a high-performance inference framework developed by Alibaba Cloud PAI.
SGLang accelerated deployment (recommended): SGLang is a fast serving framework for large language models and vision language models.
vLLM accelerated deployment: vLLM is a widely-used library for LLM inference acceleration.
Transformers standard deployment: The standard deployment without inference acceleration.
Different models support different deployment methods. Refer to the model list for supported deployment methods, maximum token counts, and minimum deployment requirements.
Usage notes:
For performance and maximum token count, we recommend BladeLLM and SGLang accelerated deployment.
For compatibility with the OpenAI API, we recommend SGLang and vLLM accelerated deployment. They are fully compatible with the OpenAI API and can be easily integrated into applications.
Transformers standard deployment supports both API and WebUI. Accelerated deployment only supports API.
Model list
The full-version models DeepSeek-R1 and DeepSeek-V3 have a large parameter size of 671B and require high specifications at a high cost (8 GPUs with 96 GB of video memory each). We recommend using the distilled models, which have more available machine resources and lower costs.
Tests indicate that DeepSeek-R1-Distill-Qwen-32B offers a good balance of performance and cost, making it suitable for cloud deployment. Distilled models such as 7B, 8B, or 14B are also available. Model Gallery provides a model evaluation feature to assess performance. Click Evaluate in the upper-right corner of the model details page to use it.
Full-version models
The following table shows the maximum token count supported by each deployment method and the minimum specification required.
Model | Maximum token count (input + output) | | | Minimum specification
 | SGLang (recommended) | vLLM | Transformers standard deployment |
DeepSeek-R1 | 163,840 | | Not supported | Single-machine 8 × GU120 (8 × 96 GB video memory)
DeepSeek-V3 | 163,840 | | 2,000 | Single-machine 8 × GU120 (8 × 96 GB video memory)
Select resources
When deploying DeepSeek-R1 or DeepSeek-V3, the available instance types include:
Single-machine standard type:
ml.gu8v.c192m1024.8-gu120 and ecs.gn8v-8x.48xlarge: public resources; stock may be limited.
ecs.ebmgn8v.48xlarge: not available as public resources; purchase EAS dedicated resources first.
Single-machine GP7V type:
ml.gp7vf.16.40xlarge: public resources, available as preemptible (bidding) instances only. When standard types are out of stock, you can switch to the China (Ulanqab) region to use GP7V types. Make sure to configure a VPC during deployment.
For higher performance, consider distributed deployment:
Distributed GU7X type:
4 × ml.gu7xf.8xlarge-gu108: public resources, available as preemptible (bidding) instances only. You must switch to the China (Ulanqab) region to use them. Make sure to configure a VPC during deployment.
Distributed Lingjun resources:
Whitelist access is required. For inquiries, contact your sales manager or submit a ticket. You must switch to the China (Ulanqab) region to use these resources. Make sure to configure a VPC during deployment.
PAI-Lingjun AI Computing Service offers high-performance, flexible heterogeneous computing services that can triple resource utilization.
Distill models
Model | Maximum token count (input + output) | | | | Minimum specification
 | BladeLLM (recommended) | SGLang (recommended) | vLLM | Transformers standard deployment |
DeepSeek-R1-Distill-Qwen-1.5B | 131,072 | 131,072 | 131,072 | 131,072 | 1 × A10 (24 GB video memory) |
DeepSeek-R1-Distill-Qwen-7B | 131,072 | 131,072 | 32,768 | 131,072 | 1 × A10 (24 GB video memory) |
DeepSeek-R1-Distill-Llama-8B | 131,072 | 131,072 | 32,768 | 131,072 | 1 × A10 (24 GB video memory) |
DeepSeek-R1-Distill-Qwen-14B | 131,072 | 131,072 | 32,768 | 131,072 | 1 × GPU L (48 GB video memory) |
DeepSeek-R1-Distill-Qwen-32B | 131,072 | 131,072 | 32,768 | 131,072 | 2 × GPU L (2 × 48 GB video memory) |
DeepSeek-R1-Distill-Llama-70B | 131,072 | 131,072 | 32,768 | 131,072 | 2 × GU120 (2 × 96 GB video memory) |
Billing details
Due to their size, the DeepSeek-V3 and DeepSeek-R1 models incur high deployment costs. We recommend that you use them only in production environments.
Alternatively, you can deploy lightweight models distilled based on them, which have significantly fewer parameters and thus lower deployment costs.
For long-term use, consider using a public resource group with a savings plan or purchasing an EAS resource group to reduce costs.
In a non-production environment, you can activate the preemptible mode when deploying. However, it requires specific conditions and carries the risk of resource instability.
If you deploy using public resources, stopping the service will stop the billing. For more information, see Billing of EAS.
FAQ about deployment
The service waits for a long time after I click Deploy
Possible causes:
Insufficient resources in the current region.
The model requires a long time to load due to its size (for models like DeepSeek-R1 and DeepSeek-V3, it can take 20-30 minutes).
If the service still won't start after a long wait, consider the following steps:
Go to Job Management > Deployment Jobs.
Click the name of the job to go to the details page.
Click the icon in the upper-right corner.
Check the Instance Status in the Service Instance section.
Terminate the current service, switch to another region in the upper-left corner of the page, and deploy again.
Note: Models with large parameter sizes, such as DeepSeek-R1 and DeepSeek-V3, require at least 8 GPUs, and resource stock is limited. Consider smaller distilled models such as DeepSeek-R1-Distill-Qwen-7B.
FAQ about calling
API request returns 404
Check whether the URL includes the OpenAI API suffix, such as v1/chat/completions. Refer to the overview page of the model for details.
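For a quick check, the following sketch (assuming an SGLang or vLLM accelerated deployment, with <EAS_ENDPOINT> and <EAS_TOKEN> as placeholders) queries the model list interface to verify that the endpoint and request path are correct:
import requests

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of the deployed service.
# A 200 response with a model list confirms that the base URL and token are correct;
# a 404 usually means the request path is missing the OpenAI API suffix.
response = requests.get(
    "<EAS_ENDPOINT>/v1/models",
    headers={"Authorization": "<EAS_TOKEN>"},
)
print(response.status_code)
print(response.text)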
The request is too long and causes EAS gateway timeout
The default timeout of the EAS gateway is 180 seconds. To extend the timeout, configure an EAS dedicated gateway and submit a ticket to increase the dedicated gateway's timeout, up to 600 seconds.
How to debug the model online
See the Online debugging section above.
Why is there no "web search"
The web search feature requires building an AI agent, not just deploying the model service.
You can use PAI's LangStudio to build such an AI agent. For more information, see Chat With Web Search.
What to do if the model skips thinking
If the deployed DeepSeek-R1 model occasionally skips the thinking process, use the force thinking chat template updated by DeepSeek:
Modify the startup command.
In the Service Configuration section, edit the JSON script and add --chat-template /model_dir/template_force_thinking.jinja to the containers-script field. It can be appended after --served-model-name DeepSeek-R1.
If the service is already deployed, click the service name in Model Gallery > Job Management > Deployment Jobs, then click Update Service in the upper-right corner.
Modify the request body. Append {"role": "assistant", "content": "<think>\n"} to the end of the messages in each request.
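The following is a minimal sketch of such a request, assuming an SGLang or vLLM accelerated deployment. <EAS_ENDPOINT>, <EAS_TOKEN>, and <model_name> are placeholders.
import json
import requests

# Replace <EAS_ENDPOINT>, <EAS_TOKEN>, and <model_name> with your service endpoint, token, and model name.
messages = [
    {"role": "user", "content": "What is 3+5? Reason step by step and put the final answer in \\boxed{}."},
    # Appending this assistant turn forces the model to start its reply inside the <think> block.
    {"role": "assistant", "content": "<think>\n"},
]

response = requests.post(
    "<EAS_ENDPOINT>/v1/chat/completions",
    headers={"Content-Type": "application/json", "Authorization": "<EAS_TOKEN>"},
    json={"model": "<model_name>", "messages": messages, "max_tokens": 2048},
)
print(json.loads(response.text)["choices"][0]["message"]["content"])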
How to connect to Chatbox or Dify
Take DeepSeek-R1-Distill-Qwen-7B as an example. SGLang or vLLM accelerated deployment is recommended.
Chatbox
In Settings, select Model Provider and choose Add Custom Provider.
Configure the API settings.
Name: Enter "PAI_DeepSeek-R1-Distill-Qwen-7B" or customize your own name.
API Host: The endpoint of the EAS service.
API Path: v1/chat/completions.
API Key: The EAS service token.
Model: If you are using Transformers standard deployment, do not enter the model name. For other deployment methods, enter the specific model name, such as DeepSeek-R1-Distill-Qwen-7B.
Click Save.
Chat with the model to test it.
Dify
In Dify, click Model Provider and add "OpenAI-API-compatible" in ADD MORE MODEL PROVIDER.
Enter "DeepSeek-R1-Distill-Qwen-7B" as Model Name, the EAS service token as the API Key, and the EAS endpoint (add
/v1
at the end) as the API endpoint URL.
How to implement multi-round conversations
The model service itself does not save conversation history. Instead, the client must save the conversation history and add it to each request. Take SGLang deployment as an example; a Python sketch that maintains the history across turns follows the curl example.
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: ZWUyMzU5MmU2NGFiZWU0ZDRhNWVjMWNhMzI2NTM1ZDllMzZkYTAyYQ==" \
-d '{
"model": "<model_name>",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! Nice to meet you. How can I help you?"
},
{
"role": "user",
"content": "What was my previous question?"
}
]
}' \
<EAS_ENDPOINT>/v1/chat/completions
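The following Python sketch shows the same idea with the OpenAI SDK: the client keeps the full messages list and re-sends it with every turn. It assumes an SGLang or vLLM accelerated deployment; <EAS_ENDPOINT> and <EAS_TOKEN> are placeholders.
from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of the deployed service.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id

# The client keeps the conversation history and sends it with every request.
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["Hello", "What was my previous question?"]:
    messages.append({"role": "user", "content": user_input})
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_completion_tokens=2048,
    )
    answer = completion.choices[0].message.content
    # Append the assistant reply so the next turn includes the full history.
    messages.append({"role": "assistant", "content": answer})
    print(answer)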