This topic describes the best practices for deploying the DeepSeek-R1-671B model across multiple nodes in Container Service for Kubernetes (ACK). Because a single GPU does not provide enough memory to hold the 671B model, a hybrid parallelism strategy (pipeline parallelism = 2 combined with tensor parallelism = 8) is used. Together with Arena, this strategy enables efficient distributed deployment on two ecs.ebmgn8v.48xlarge nodes (8 GPUs × 96 GB each). This topic also describes how to seamlessly integrate the DeepSeek-R1 model deployed in ACK into the Dify platform to build an enterprise-level intelligent Q&A system that supports long-text comprehension.
Background information
DeepSeek
DeepSeek-R1 is the first-generation reasoning model released by DeepSeek. It uses large-scale reinforcement learning to improve the reasoning performance of large language models (LLMs). Benchmark results show that DeepSeek-R1 delivers strong performance in mathematical reasoning and programming, reaching or surpassing the OpenAI-o1 series in specific areas. DeepSeek-R1 also excels in knowledge-related tasks such as creative writing and Q&A. DeepSeek distills the reasoning capabilities of DeepSeek-R1 into smaller models, such as Qwen and Llama, to improve the reasoning performance of these models. The distilled 14B model surpasses the open source QwQ-32B model, and the distilled 32B and 70B models also set new records. For more information about DeepSeek, see the DeepSeek AI GitHub repository.
vLLM
vLLM is a high-performance and easy-to-use LLM inference serving framework. vLLM supports most commonly used LLMs, including Qwen models. Powered by technologies such as PagedAttention, continuous batching, and model quantization, vLLM significantly improves the inference efficiency of LLMs. For more information about the vLLM framework, see the vLLM GitHub repository.
Arena
Arena is a lightweight command-line tool for running machine learning workloads on Kubernetes. It allows you to submit, list, and monitor training and inference jobs without directly operating the underlying Kubernetes resources. For more information, see the Arena GitHub repository.
Prerequisites
An ACK cluster that contains GPU-accelerated nodes is created. For more information, see Add GPU-accelerated nodes or ASIC-accelerated nodes to an ACK cluster. Recommended instance type: ecs.ebmgn8v.48xlarge (8 GPUs × 96 GB each).

Important: Use NVIDIA driver version 550.x or later. You can specify driver versions by adding labels to the node pool. For example, add the following label to the GPU node pool to specify driver version 550.144.03:

ack.aliyun.com/nvidia-driver-version: 550.144.03

The ACK cluster runs Kubernetes 1.28 or later.
(Optional) The cloud-native AI suite is installed. For more information, see Deploy the cloud-native AI suite.
(Optional) The Arena client of version 0.14.0 or later is installed. For more information, see Configure the Arena client.
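After the GPU-accelerated nodes are added, you can optionally confirm that the driver-version label and GPU resources are visible from kubectl. The following is a minimal check, assuming kubectl is already configured for the cluster; replace <GPU-node-name> with the name of one of your GPU-accelerated nodes.

# List nodes together with the NVIDIA driver version label configured on the node pool.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version

# Confirm that the node reports 8 schedulable GPUs (replace <GPU-node-name>).
kubectl describe node <GPU-node-name> | grep -i "nvidia.com/gpu"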
1. Deployment across multiple nodes
1.1 Split the model
DeepSeek-R1 has 671 billion parameters. Each GPU provides at most 96 GB of memory, which is insufficient to load the entire model. To resolve this issue, you must split the model. In this topic, a combination of tensor parallelism (TP=8) and pipeline parallelism (PP=2) is used. The following figure shows the splitting methods. Pipeline parallelism (PP=2) splits the model into two stages, and each stage runs on one GPU-accelerated node. For example, model M is split into M1 and M2. M1 runs on the first GPU-accelerated node and passes its results to M2, which runs on the second GPU-accelerated node. Tensor parallelism (TP=8) splits the computation within each stage (M1 or M2) across eight GPUs. In each stage, the weights of each layer are partitioned across the eight GPUs, each GPU computes with its own partition, and the system merges the computing results from the eight GPUs.

In this topic, vLLM and Ray are used to deploy the DeepSeek-R1 model in a distributed manner. The following figure shows the overall deployment architecture. Two vLLM pods are deployed on two ECS instances, and each vLLM pod uses eight GPUs. One pod functions as the Ray head node and the other pod functions as a Ray worker node.

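The parallelism strategy maps directly to two vLLM launch flags. The following leader-side launch command is the one used in the deployment steps later in this topic, shown here with comments to highlight how TP=8 and PP=2 are expressed.

# Started on the Ray head (leader) pod after the Ray cluster is up.
python3 -m vllm.entrypoints.openai.api_server \
    --model /models/DeepSeek-R1 \
    --port 8080 \
    --trust-remote-code \
    --served-model-name deepseek-r1 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --enforce-eager
# --tensor-parallel-size 8: shards each layer across the 8 GPUs of one node (TP=8).
# --pipeline-parallel-size 2: splits the model into 2 stages across the 2 nodes (PP=2).
# 8 GPUs x 2 stages = 16 GPUs in total, matching the two ecs.ebmgn8v.48xlarge nodes.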
1.2 Download the model
This section uses DeepSeek-R1 as an example to describe how to download the model, upload it to Object Storage Service (OSS), and create persistent volumes (PVs) and persistent volume claims (PVCs) in ACK clusters.
For more information about how to upload a model to Apsara File Storage NAS (NAS), see Mount a statically provisioned NAS volume.
To avoid slow downloads and uploads, you can directly copy the model files to your OSS bucket if they are already available in OSS.
Download the model file.
Run the following command to install Git:
# Run yum install git or apt install git.
yum install git

Run the following command to install the Git Large File Support (LFS) plug-in:

# Run yum install git-lfs or apt install git-lfs.
yum install git-lfs

Run the following command to clone the DeepSeek-R1 repository on ModelScope to your on-premises machine:

GIT_LFS_SKIP_SMUDGE=1 git clone https://modelscope.cn/models/deepseek-ai/DeepSeek-R1

Run the following commands to access the DeepSeek-R1 directory and pull the large files managed by LFS:

cd DeepSeek-R1
git lfs pull
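Optionally, verify that the weight files are fully pulled before you upload them. A minimal check; the exact file list and total size depend on the repository contents.

# Still inside the DeepSeek-R1 directory: check the total size and list a few of the files.
du -sh .
ls -lh | head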
Upload the DeepSeek-R1 files to OSS.
Log on to the OSS console to view and copy the name of the OSS bucket that you created.
For more information about how to create an OSS bucket, see Create buckets.
Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.
Run the following command to create a directory named DeepSeek-R1 in OSS:
ossutil mkdir oss://<Your-Bucket-Name>/models/DeepSeek-R1

Run the following command to upload the model files to OSS:
ossutil cp -r ./DeepSeek-R1 oss://<Your-Bucket-Name>/models/DeepSeek-R1
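Optionally, confirm that all files were uploaded. The following check uses ossutil with the same bucket placeholder as above.

# List the uploaded model files and their sizes in the OSS directory.
ossutil ls oss://<Your-Bucket-Name>/models/DeepSeek-R1/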
Configure PVs and PVCs for the destination cluster. For more information, see Mount a statically provisioned ossfs 1.0 volume.
Create a PV
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Volumes > Persistent Volumes.
In the upper-right corner of the Persistent Volumes page, click Create.
In the Create PV dialog box, configure the parameters that are described in the following table.
Parameter
Description
PV Type
In this example, select OSS.
Volume Name
In this example, enter llm-model.
Access Certificate
The AccessKey pair used to access the OSS bucket. The AccessKey pair consists of an AccessKey ID and an AccessKey secret.
Bucket ID
Select the OSS bucket you created in the preceding step.
OSS Path
Enter the path of the model, for example, /models/DeepSeek-R1.
Create a PVC
On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Volumes > Persistent Volume Claims.
In the upper-right corner of the Persistent Volume Claims page, click Create.
In the Create PVC dialog box, configure the parameters that are described in the following table.
Parameter
Description
PVC Type
In this example, select OSS.
Name
In this example, enter llm-model.
Allocation Mode
In this example, select Existing Volumes.
Existing Volumes
Click the Select PV hyperlink and select the PV that you created.
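If you prefer kubectl over the console, the following manifest is a minimal sketch of an equivalent statically provisioned OSS volume. The Secret name (oss-secret), AccessKey placeholders, OSS endpoint placeholder, 30Gi storage size, ReadOnlyMany access mode, and ossfs mount options are assumptions that you must adjust to your environment; the PVC name llm-model and the path /models/DeepSeek-R1 match the values used in the rest of this topic.

kubectl apply -f - <<'EOF'
# AccessKey pair used by ossfs to mount the bucket (assumed Secret name: oss-secret).
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
# Statically provisioned OSS PV that exposes the model directory.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: <Your-OSS-Endpoint>   # for example, the internal endpoint of the bucket's region
      path: /models/DeepSeek-R1
      otherOpts: "-o umask=022 -o allow_other"
---
# PVC named llm-model, bound to the PV above and referenced by the vLLM pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF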
1.3 Deploy the model
Install LeaderWorkerSet.
Log on to the ACK console.
In the navigation pane on the left, click Clusters, then click the name of the cluster you created.
In the navigation pane on the left, choose Applications > Helm. On the Helm page, click Deploy.
In the Basic Information step, enter the Application Name and Namespace, find lws in the Chart section, and click Next. In this example, the application name (lws) and namespace (lws-system) are used.
In the Parameters step, select the latest Chart Version, and click OK to install lws.

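After the chart is installed, you can optionally verify that the LeaderWorkerSet CRD and controller are ready before you deploy the model. A quick check, assuming the lws-system namespace used in the previous step:

# Confirm that the LeaderWorkerSet CRD is registered in the cluster.
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

# Confirm that the lws controller pods are running.
kubectl get pods -n lws-system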
Deploy the model.
The following figure shows a diagram of the vLLM distributed deployment architecture.

Deploy models by using Arena
Run the following command to deploy the service:
arena serve distributed \
    --name=vllm-dist \
    --version=v1 \
    --restful-port=8080 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8080" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --share-memory=30Gi \
    --data=llm-model:/models/DeepSeek-R1 \
    --leader-num=1 \
    --leader-gpus=8 \
    --leader-command="bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=\$(LWS_GROUP_SIZE); python3 -m vllm.entrypoints.openai.api_server --model /models/DeepSeek-R1 --port 8080 --trust-remote-code --served-model-name deepseek-r1 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager" \
    --worker-num=1 \
    --worker-gpus=8 \
    --worker-command="bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=\$(LWS_LEADER_ADDRESS)"

Expected output:

configmap/vllm-dist-v1-cm created
service/vllm-dist-v1 created
leaderworkerset.leaderworkerset.x-k8s.io/vllm-dist-v1-distributed-serving created
INFO[0002] The Job vllm-dist has been submitted successfully
INFO[0002] You can run `arena serve get vllm-dist --type distributed-serving -n default` to check the job status

Run the following command to view the deployment progress of the inference service:

arena serve get vllm-dist

Expected output:

Name:       vllm-dist
Namespace:  default
Type:       Distributed
Version:    v1
Desired:    1
Available:  1
Age:        3m
Address:    192.168.138.65
Port:       RESTFUL:8080
GPU:        16

Instances:
  NAME                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                  ------   ---  -----  --------  ---  ----
  vllm-dist-v1-distributed-serving-0    Running  3m   1/1    0         8    cn-beijing.10.x.x.x
  vllm-dist-v1-distributed-serving-0-1  Running  3m   1/1    0         8    cn-beijing.10.x.x.x
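Loading the 671B weights can take a long time after the pods enter the Running state. To follow the loading progress, you can check the logs of the leader pod; the pod name below is taken from the preceding output.

# Stream the vLLM leader logs until the API server reports that it is listening on port 8080.
kubectl logs -f vllm-dist-v1-distributed-serving-0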
Deploy models by using kubectl
Create a file named DeepSeek_R1.yaml with the following content to deploy the model service:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-dist
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        volumes:
          - name: model
            persistentVolumeClaim:
              claimName: llm-model
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
        containers:
          - name: vllm-leader
            image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
            command:
              - sh
              - -c
              - >-
                bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --model /models/DeepSeek-R1 --port 8080 --trust-remote-code --served-model-name deepseek-r1 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              initialDelaySeconds: 30
              periodSeconds: 30
              tcpSocket:
                port: 8080
            volumeMounts:
              - mountPath: /models/DeepSeek-R1
                name: model
              - mountPath: /dev/shm
                name: dshm
    workerTemplate:
      spec:
        volumes:
          - name: model
            persistentVolumeClaim:
              claimName: llm-model
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
        containers:
          - name: vllm-worker
            image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            volumeMounts:
              - mountPath: /models/DeepSeek-R1
                name: model
              - mountPath: /dev/shm
                name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-dist-v1
spec:
  type: ClusterIP
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm-dist
    role: leader

Run the following command to deploy the model service:

kubectl create -f DeepSeek_R1.yaml

Run the following command to view the deployment progress of the inference service:

kubectl get po | grep vllm-dist

Expected output:

NAME            READY   STATUS    RESTARTS   AGE
vllm-dist-0     1/1     Running   0          20m
vllm-dist-0-1   1/1     Running   0          20m
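Before you configure port forwarding, you can optionally confirm that the Service created by the manifest is available and follow the model loading progress. The Service and pod names below come from the manifest and the output above.

# The ClusterIP Service that fronts the leader pod on port 8080.
kubectl get svc vllm-dist-v1

# Optionally stream the leader pod logs while the model weights are loading.
kubectl logs -f vllm-dist-0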
Set up port forwarding to send inference requests from your on-premises environment to the model service.

Run the kubectl port-forward command to configure port forwarding between the on-premises environment and the inference service.

Note: Port forwarding set up by kubectl port-forward is not reliable, secure, or scalable in production environments. It is suitable only for development and debugging. Do not use this command to configure port forwarding in production environments. For more information about the networking solutions used for production in ACK clusters, see Ingress management.

kubectl port-forward svc/vllm-dist-v1 8080:8080

Send requests to the inference service:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-r1",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0.6
}'

Expected output:
{"id":"cmpl-15977abb0adc44d9aa03628abe9fcc81","object":"text_completion","created":1739346042,"model":"ds","choices":[{"index":0,"text":" city that needs no introduction. Known for its iconic","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":15,"completion_tokens":10,"prompt_tokens_details":null}}
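Because vLLM exposes an OpenAI-compatible API, you can also call the chat completions endpoint. The following example assumes that the port forwarding from the previous step is still active; the prompt is arbitrary.

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-r1",
    "messages": [
        {"role": "user", "content": "Briefly introduce Kubernetes."}
    ],
    "max_tokens": 256,
    "temperature": 0.6
}'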
2. Use Dify to build a DeepSeek Q&A assistant
You can install and configure Dify in a Container Service for Kubernetes (ACK) cluster. For more information, see Install ack-dify.
2.1. Configure the DeepSeek model
Log on to the Dify platform. Click your profile picture and then click Settings. In the left-side navigation pane, click Model Provider. Find OpenAI-API-compatible and click Add.
The following table describes the parameters.
Parameter
Setting
Note
Model Name
deepseek-r1
You cannot modify this value. It must match the name specified by --served-model-name (deepseek-r1) when the model was deployed.
API Key
Example: api-deepseek-r1
You can configure this parameter based on your business requirements.
API endpoint URL
http://vllm-dist-v1.default:8080/v1
You cannot modify this value. It is composed of the name (vllm-dist-v1) and namespace (default) of the DeepSeek service deployed in the preceding section.

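Optionally, verify from inside the cluster that Dify can reach this endpoint before you add the model. The following one-off pod is a minimal check; the pod name curl-check and the curlimages/curl image are arbitrary choices, and /v1/models is the OpenAI-compatible endpoint that lists the served model.

# Run a temporary pod that queries the model list through the in-cluster Service address.
kubectl run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
    curl -s http://vllm-dist-v1.default:8080/v1/models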
2.2. Create a Q&A assistant
Create a general-purpose AI-powered Q&A assistant. Choose Studio > Create from Blank. Specify a name and a description for the assistant. Use the default settings for other parameters.

2.3 Test the AI-powered Q&A assistant
On the right side of the page, you can initiate a conversation with DeepSeek.

You can integrate the configured DeepSeek Q&A assistant into your personal production environment. For more information, see Apply the AI-powered Q&A assistant to the production environment.
