
Container Service for Kubernetes: Best practices for deploying the full DeepSeek model across multiple nodes in ACK

Last Updated: Aug 21, 2025

This topic describes the best practices for deploying the DeepSeek-R1-671B model across multiple nodes in Container Service for Kubernetes (ACK). Because the 671B model does not fit into the memory of a single GPU, a hybrid parallelism strategy (pipeline parallelism = 2 combined with tensor parallelism = 8) is used. Together with Arena, this enables efficient distributed deployment on two ecs.ebmgn8v.48xlarge nodes (8 × 96 GB GPUs each). This topic also describes how to seamlessly integrate the DeepSeek-R1 model deployed in ACK into the Dify platform to build an enterprise-level intelligent Q&A system that supports long text comprehension.

Background information

  • DeepSeek

    DeepSeek-R1 is the first-generation reasoning model released by DeepSeek. It uses large-scale reinforcement learning to improve the reasoning performance of large language models (LLMs). Benchmark results show that DeepSeek-R1 outperforms closed-source models in mathematical reasoning and programming, and reaches or surpasses the OpenAI-o1 series in specific domains. DeepSeek-R1 also excels in knowledge-related tasks such as creative writing and Q&A. DeepSeek distills its reasoning capabilities into smaller models, such as Qwen and Llama, to improve the reasoning performance of these models. The 14B model distilled from DeepSeek-R1 surpasses the open source QwQ-32B model, and the 32B and 70B distilled models also set new records. For more information about DeepSeek, see the DeepSeek AI GitHub repository.

  • vLLM

    vLLM is a high-performance and easy-to-use LLM inference serving framework. vLLM supports most commonly used LLMs, including Qwen models. vLLM uses technologies such as PagedAttention, continuous batching, and model quantization to significantly improve the inference efficiency of LLMs. For more information about the vLLM framework, see the vLLM GitHub repository.

  • Arena

    Arena is a command-line tool for managing Kubernetes-based machine learning tasks. It streamlines the entire machine learning lifecycle, including data preparation, model development, model training, and prediction, which improves the efficiency of data scientists. Arena is deeply integrated with Alibaba Cloud services. It supports GPU sharing and Cloud Parallel File System (CPFS), and can run with deep learning frameworks optimized for Alibaba Cloud to maximize the performance of heterogeneous computing resources. For more information about Arena, see the Arena GitHub repository.

Prerequisites

1. Deployment across multiple nodes

1.1 Split the model

DeepSeek-R1 has 671 billion parameters, and each GPU provides at most 96 GB of memory, which is insufficient to load the entire model on a single GPU. To resolve this issue, you must split the model. In this topic, the model is split with TP=8 and PP=2. The following figure shows how the model is split. Pipeline parallelism (PP=2) splits the model into two stages, and each stage runs on one GPU-accelerated node. For example, model M is split into M1 and M2. M1 runs on the first GPU-accelerated node and passes its results to M2, which runs on the second GPU-accelerated node. Tensor parallelism (TP=8) spreads the computation of each stage (M1 or M2) across the eight GPUs of a node. In the M1 stage, the computation is split into eight portions that are processed in parallel on eight GPUs, and the results from the eight GPUs are then merged.

[Figure: model splitting with TP=8 and PP=2]
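
The following back-of-the-envelope estimate illustrates why a single 8-GPU node is not enough and why the model is spread across 16 GPUs. It assumes roughly 1 byte per parameter for the FP8 weights and ignores KV cache, activations, and runtime overhead, so treat it as a rough sketch rather than an exact sizing.

  # Rough GPU memory arithmetic (illustrative only):
  #   671B parameters x ~1 byte (FP8)   ->  roughly 671 GB of weights
  #   One node:  8 GPUs x 96 GB  = 768 GB   -> too tight once KV cache and runtime overhead are added
  #   Two nodes: 16 GPUs x 96 GB = 1536 GB  -> enough headroom with TP=8 within a node and PP=2 across nodes
  echo "one node:  $((8 * 96)) GB"
  echo "two nodes: $((16 * 96)) GB"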

In this topic, vLLM + Ray is used to deploy the DeepSeek-R1 model in a distributed manner. The following figure shows the overall deployment architecture. Two vLLM pods are deployed on two ECS instances, and each vLLM pod uses eight GPUs. One pod functions as the Ray head node and the other pod functions as a Ray worker node.

[Figure: vLLM + Ray distributed deployment architecture]

1.2 Download the model

This section uses the DeepSeek-R1 model as an example to describe how to download a model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and a persistent volume claim (PVC) in an ACK cluster.

For more information about how to upload a model to Apsara File Storage NAS (NAS), see Mount a statically provisioned NAS volume.

Note

To accelerate file downloads and uploads, you can directly copy the files to your OSS bucket.

  1. Download the model file.

    1. Run the following command to install Git:

      # Run yum install git or apt install git. 
      yum install git
    2. Run the following command to install the Git Large File Support (LFS) plug-in:

      # Run yum install git-lfs or apt install git-lfs. 
      yum install git-lfs
    3. Run the following command to clone the DeepSeek-R1 repository on ModelScope to your on-premises machine:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://modelscope.cn/models/deepseek-ai/DeepSeek-R1
    4. Run the following command to access the DeepSeek-R1 directory and pull large files managed by LFS:

      cd DeepSeek-R1
      git lfs pull
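
      Optionally, run a quick sanity check to confirm that the large weight files were fully downloaded rather than left as small Git LFS pointer files. The exact file names and total size depend on the repository contents:

      # The model directory should total several hundred GB, and the .safetensors shards
      # should be multi-GB binaries rather than KB-sized pointer files.
      du -sh .
      ls -lh *.safetensors | head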
  2. Upload the DeepSeek-R1 files to OSS.

    1. Log on to the OSS console to view and copy the name of the OSS bucket that you created.

      For more information about how to create an OSS bucket, see Create buckets.

    2. Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.

    3. Run the following command to create a directory named DeepSeek-R1 in OSS:

      ossutil mkdir oss://<Your-Bucket-Name>/models/DeepSeek-R1
    4. Run the following command to upload the model files to OSS:

      ossutil cp -r ./DeepSeek-R1 oss://<Your-Bucket-Name>/models/DeepSeek-R1
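
      Optionally, verify that the upload completed by listing the objects in the destination directory:

      # List the uploaded model files and confirm that the object count and sizes look correct.
      ossutil ls oss://<Your-Bucket-Name>/models/DeepSeek-R1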
  3. Configure PVs and PVCs for the destination cluster. For more information, see Mount a statically provisioned ossfs 1.0 volume.

    1. Create a PV

      • Log on to the ACK console. In the navigation pane on the left, click Clusters.

      • On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Volumes > Persistent Volumes.

      • In the upper-right corner of the Persistent Volumes page, click Create.

      • In the Create PV dialog box, configure the following parameters:

        • PV Type: In this example, select OSS.

        • Volume Name: In this example, enter llm-model.

        • Access Certificate: The AccessKey pair used to access the OSS bucket. The AccessKey pair consists of an AccessKey ID and an AccessKey secret.

        • Bucket ID: Select the OSS bucket that you created in the preceding step.

        • OSS Path: Enter the path of the model, such as /models/DeepSeek-R1.

    2. Create a PVC

      • On the Clusters page, find the cluster you want and click its name. In the left-side pane, choose Volumes > Persistent Volume Claims.

      • In the upper-right corner of the Persistent Volume Claims page, click Create.

      • In the Create PVC dialog box, configure the following parameters:

        • PVC Type: In this example, select OSS.

        • Name: In this example, enter llm-model.

        • Allocation Mode: In this example, select Existing Volumes.

        • Existing Storage Class: Click Select PV and select the PV that you created.
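
If you prefer to create the storage resources with kubectl instead of the console, the following is a minimal sketch of an equivalent statically provisioned ossfs volume. The Secret name oss-secret, the capacity, and the endpoint placeholder are assumptions for illustration; verify the parameter names and values against Mount a statically provisioned ossfs 1.0 volume before you use it.

# Sketch: create a Secret that stores the AccessKey pair, a PV that mounts the OSS path,
# and a PVC that binds to the PV. Replace the placeholders with your own values.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model        # must match the PV name
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "<Your-Bucket-Name>"
      url: "oss-<your-region>-internal.aliyuncs.com"
      path: "/models/DeepSeek-R1"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF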

1.3 Deploy the model

  1. Install LeaderWorkerSet.

    1. Log on to the ACK console.

    2. In the navigation pane on the left, click Clusters, then click the name of the cluster you created.

    3. In the navigation pane on the left, choose Applications > Helm. On the Helm page, click Deploy.

    4. In the Basic Information step, enter the Application Name and Namespace, find lws in the Chart section, and click Next. In this example, the application name (lws) and namespace (lws-system) are used.

    5. In the Parameters step, select the latest Chart Version and click OK to install lws.
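
    After the installation is complete, you can optionally confirm that the LeaderWorkerSet controller is running and that its CustomResourceDefinition is registered. The following commands assume that lws was installed into the lws-system namespace, as in this example:

      kubectl get pods -n lws-system
      kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io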

  2. Deploy the model.

    The following figure shows the vLLM distributed deployment architecture.

    [Figure: vLLM distributed deployment architecture]

    Deploy models by using Arena

    1. Run the following command to deploy the service:

      arena serve distributed \
              --name=vllm-dist \
              --version=v1 \
              --restful-port=8080 \
              --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0 \
              --readiness-probe-action="tcpSocket" \
              --readiness-probe-action-option="port: 8080" \
              --readiness-probe-option="initialDelaySeconds: 30" \
              --readiness-probe-option="periodSeconds: 30" \
              --share-memory=30Gi \
              --data=llm-model:/models/DeepSeek-R1 \
              --leader-num=1 \
              --leader-gpus=8 \
              --leader-command="bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=\$(LWS_GROUP_SIZE); python3 -m vllm.entrypoints.openai.api_server --model /models/DeepSeek-R1 --port 8080 --trust-remote-code --served-model-name deepseek-r1 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager" \
              --worker-num=1 \
              --worker-gpus=8 \
              --worker-command="bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=\$(LWS_LEADER_ADDRESS)" 

      The following list describes the key parameters:

      • --name (required): The name of the inference service. The name must be globally unique.

      • --image (required): The address of the inference service image.

      • --restful-port (required): The port of the service.

      • --version (optional): The version of the service. The default value is the current version.

      • --readiness-probe-* (optional): Configures readiness probes for the service. Traffic is routed to the service only after it is ready.

      • --share-memory (optional): The size of the shared memory.

      • --leader-num (required): The number of leader pods. The value can only be 1.

      • --leader-gpus (optional): The number of GPUs used by each leader pod.

      • --leader-command (required): The command that starts the leader pods.

      • --data (optional): The model storage of the service, in the format <pvc-name>:<pod-path>. For example, llm-model:/mnt/models mounts the llm-model PVC to the /mnt/models directory of the container.

      Expected output:

      configmap/vllm-dist-v1-cm created
      service/vllm-dist-v1 created
      leaderworkerset.leaderworkerset.x-k8s.io/vllm-dist-v1-distributed-serving created
      INFO[0002] The Job vllm-dist has been submitted successfully
      INFO[0002] You can run `arena serve get vllm-dist --type distributed-serving -n default` to check the job status
    2. Run the following command to view the deployment progress of the inference service:

      arena serve get vllm-dist

      Expected output:

      Name:       vllm-dist
      Namespace:  default
      Type:       Distributed
      Version:    v1
      Desired:    1
      Available:  1
      Age:        3m
      Address:    192.168.138.65
      Port:       RESTFUL:8080
      GPU:        16
      
      Instances:
        NAME                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
        ----                                  ------   ---  -----  --------  ---  ----
        vllm-dist-v1-distributed-serving-0    Running  3m   1/1    0         8    cn-beijing.10.x.x.x
        vllm-dist-v1-distributed-serving-0-1  Running  3m   1/1    0         8    cn-beijing.10.x.x.x
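
      Loading the 671B model weights can take several minutes, so the instances may not become ready immediately. To follow the startup progress, you can optionally tail the logs of the leader pod listed in the Instances section above:

      # Stream the vLLM API server logs from the leader pod. Press Ctrl+C to stop.
      kubectl logs -f vllm-dist-v1-distributed-serving-0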

    Deploy models by using kubectl

    1. Create a file named DeepSeek_R1.yaml that contains the following content:

      apiVersion: leaderworkerset.x-k8s.io/v1
      kind: LeaderWorkerSet
      metadata:
        name: vllm-dist
      spec:
        replicas: 1
        leaderWorkerTemplate:
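          # size: the number of pods in each replica group (1 leader + 1 worker).
          # Each pod requests 8 GPUs, so one group spans two 8-GPU nodes (16 GPUs in total).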
          size: 2
          restartPolicy: RecreateGroupOnPodRestart
          leaderTemplate:
            metadata:
              labels: 
                role: leader
            spec:
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - name: dshm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 15Gi
              containers:
                - name: vllm-leader
                  image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
                  command:
                    - sh
                    - -c
                    - >-
                      bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                      python3 -m vllm.entrypoints.openai.api_server --model /models/DeepSeek-R1 --port 8080 --trust-remote-code --served-model-name deepseek-r1 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --pipeline-parallel-size 2 --enforce-eager
                  resources:
                    limits:
                      nvidia.com/gpu: "8"
                    requests:
                      nvidia.com/gpu: "8"
                  ports:
                    - containerPort: 8080
                  readinessProbe:
                    initialDelaySeconds: 30
                    periodSeconds: 30
                    tcpSocket:
                      port: 8080
                  volumeMounts:
                    - mountPath: /models/DeepSeek-R1
                      name: model
                    - mountPath: /dev/shm
                      name: dshm
          workerTemplate:
            spec:
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - name: dshm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 15Gi
              containers:
                - name: vllm-worker
                  image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
                  command:
                    - sh
                    - -c
                    - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
                  resources:
                    limits:
                      nvidia.com/gpu: "8"
                    requests:
                      nvidia.com/gpu: "8"
                  ports:
                    - containerPort: 8080
                  volumeMounts:
                    - mountPath: /models/DeepSeek-R1
                      name: model
                    - mountPath: /dev/shm
                      name: dshm
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: vllm-dist-v1
      spec:
        type: ClusterIP
        ports:
        - port: 8080
          protocol: TCP
          targetPort: 8080
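        # The selector below matches only the leader pod of the vllm-dist LeaderWorkerSet,
        # which runs the OpenAI-compatible API server, so the Service routes all traffic to it.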
        selector:
          leaderworkerset.sigs.k8s.io/name: vllm-dist
          role: leader

      Run the following command to deploy the model service:

      kubectl create -f DeepSeek_R1.yaml
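
      You can optionally confirm that the Service has registered the leader pod as its endpoint. This assumes that the resources were created in the default namespace:

      kubectl get endpoints vllm-dist-v1 -n default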
    2. Run the following command to view the deployment progress of the inference service:

      kubectl get po |grep vllm-dist

      Expected output:

      NAME            READY   STATUS    RESTARTS   AGE
      vllm-dist-0     1/1     Running   0          20m
      vllm-dist-0-1   1/1     Running   0          20m
  3. Configure local port forwarding and send inference requests to the model service.

    1. Run the kubectl port-forward command to configure port forwarding between your on-premises environment and the inference service.

      Note

      Port forwarding that is set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to expose services in production environments. For more information about production-ready networking solutions for ACK clusters, see Ingress management.

      kubectl port-forward svc/vllm-dist-v1 8080:8080
    2. Send requests to the inference service.

      curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
          "model": "deepseek-r1",
          "prompt": "San Francisco is a",
          "max_tokens": 10,
          "temperature": 0.6
      }'

      Expected output:

      {"id":"cmpl-15977abb0adc44d9aa03628abe9fcc81","object":"text_completion","created":1739346042,"model":"ds","choices":[{"index":0,"text":" city that needs no introduction. Known for its iconic","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":15,"completion_tokens":10,"prompt_tokens_details":null}}

2. Use Dify to build a DeepSeek Q&A assistant

You can install and configure Dify in an ACK cluster. For more information, see Install ack-dify.

2.1. Configure the DeepSeek model

  1. Log on to the Dify platform. Click your profile picture and then click Settings. In the left-side navigation pane, click Model Provider. Find OpenAI-API-compatible and click Add.

  2. Configure the following parameters:

    • Model Name: deepseek-r1. You cannot modify this value because it must match the name specified by --served-model-name when the model was deployed.

    • API Key: for example, api-deepseek-r1. You can configure this parameter based on your business requirements.

    • API endpoint URL: http://vllm-dist-v1.default:8080/v1. You cannot modify this value because it is derived from the name and namespace of the vllm-dist-v1 Service deployed in section 1.3.
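
    You can optionally verify that the endpoint is reachable from inside the cluster before you save the configuration, because the vllm-dist-v1.default address resolves only within the cluster. The temporary pod name and image below are illustrative:

      # Start a temporary pod, query the model list from the vLLM service, then delete the pod.
      kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
        curl -s http://vllm-dist-v1.default:8080/v1/models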

2.2. Create a Q&A assistant

Create a general-purpose AI-powered Q&A assistant. Choose Studio > Create from Blank. Specify a name and a description for the assistant. Use the default settings for other parameters.


2.3 Test the AI-powered Q&A assistant

  1. On the right side of the page, you can initiate a conversation with DeepSeek.


  2. You can integrate the configured DeepSeek Q&A assistant into your production environment. For more information, see Apply the AI-powered Q&A assistant to the production environment.
