ACK One Registered Clusters Help Solve GPU Resource Shortage in Data Centers

With the help of ACK One registered clusters, we can make full use of ACS GPU computing power of Alibaba Cloud to efficiently deploy the DeepSeek inference model.

By Xiaobing Meng (Zhishi)

As an intelligent inference model based on deep learning, DeepSeek quickly gained popularity due to its outstanding performance in natural language processing (NLP), image recognition, and other fields. Whether in enterprise-level applications or academic research, DeepSeek has shown strong potential. However, as its application scenarios continue to expand, the computing power of on-premises data centers is increasingly becoming a bottleneck.

Faced with insufficient computing power, the ACK One registered cluster provided by Alibaba Cloud offers a flexible and effective solution for enterprises. By connecting the Kubernetes cluster in the on-premises data center to the ACK One registered cluster, enterprises can seamlessly expand their computing resources and fully utilize the powerful ACS GPU computing power, efficiently deploying the DeepSeek inference model.

Background

ACK One Registered Clusters

You can register clusters deployed in data centers or on a third-party cloud to Alibaba Cloud Distributed Cloud Container Platform for Kubernetes (ACK One). This way, you can build hybrid clusters and manage clusters in a centralized manner.

ACS Computing Power

Container Compute Service (ACS) is an upgrade of ACK Serverless clusters (formerly known as ASK). It is more cost-effective, easier to use, and more elastic. ACS is intended for a variety of business scenarios. It defines cost-effective serverless compute classes and compute QoS classes, so you can request resources on demand and pay for them on a per-second basis without having to handle cluster and node O&M.

DeepSeek-R1

DeepSeek-R1 is the first-generation inference model provided by DeepSeek. It aims to improve the inference performance of LLMs through large-scale reinforcement learning. Benchmark results show that DeepSeek-R1 outperforms other closed-source models in mathematical reasoning and programming competitions. Its performance even matches or surpasses the OpenAI o1 series in certain areas. DeepSeek-R1 also excels in knowledge-related areas, such as creation, writing, and Q&A. DeepSeek also distills its inference capabilities into smaller models, such as Qwen and Llama, to improve their inference performance. The 14B model distilled from DeepSeek surpasses the open source QwQ-32B model. The 32B and 70B models distilled from DeepSeek also set new records. For more information about DeepSeek, see DeepSeek AI GitHub repository.

vLLM

vLLM is a high-performance and easy-to-use LLM inference service framework. vLLM supports multiple commonly used LLMs, including Qwen models. vLLM uses technologies such as PagedAttention optimization, continuous batching, and model quantization to greatly improve the inference efficiency of LLMs. For more information about the vLLM framework, see vLLM GitHub repository.

Quick Integration

• Log on to the Container Service for Kubernetes (ACK) console and activate Container Service as prompted.

• Log on to the ACS console. Follow the on-screen instructions to activate ACS.

• Create an ACK One registered cluster and connect it to a Kubernetes cluster in a data center or on another cloud service provider. We recommend Kubernetes 1.24 or later for the cluster. For more information, see Create an ACK One registered cluster and connect Cluster B to the registered cluster.

• Install and configure the Arena client. For more information, see Configure the Arena client.

• Install the ack-virtual-node component. For more information, see Use ACS computing power in ACK One registered clusters.

Procedure

Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files

1. Run the following command to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.

Note: Check whether the git-lfs plug-in is installed. If it is not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull
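
Because the clone runs with GIT_LFS_SKIP_SMUDGE=1, the working tree holds only small Git LFS pointer files until git lfs pull replaces them with the real weights. The following is a minimal sketch, not part of the original procedure, that lists files that are still pointers so that an incomplete download can be caught before the upload to OSS. The helper names is_lfs_pointer and find_lfs_pointers are hypothetical. The check assumes the standard pointer format, whose first line is version https://git-lfs.github.com/spec/v1.

```shell
# Succeeds if the file at $1 still looks like a Git LFS pointer (a small text stub).
is_lfs_pointer() {
  [ -f "$1" ] || return 1
  # Only the first bytes are inspected, so large weight files are not read in full.
  head -c 200 "$1" | LC_ALL=C grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# Prints every file under directory $1 (excluding .git) that is still a pointer.
find_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, if find_lfs_pointers ./DeepSeek-R1-Distill-Qwen-7B prints nothing, no file in the directory is still a pointer. Any path it does print would need another git lfs pull.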

2. Create an OSS directory and upload the model files to the directory.

Note: To install and use ossutil, see Install ossutil.

ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B

3. Create a PV and a PVC. Configure a persistent volume (PV) named llm-model and a persistent volume claim (PVC) named llm-model for the cluster. For more information, see Mount a statically provisioned OSS volume.

The following table describes the parameters of the PV.

Parameter or setting	Description
PV Type	OSS
Name	llm-model
Access Certificate	Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.
Bucket ID	Select the OSS bucket that you created in the previous step.
OSS Path	Select the path of the model, such as /models/DeepSeek-R1-Distill-Qwen-7B.

The following table describes the parameters of the PVC.

Parameter or setting	Description
PVC Type	OSS
Name	llm-model
Allocation Mode	In this example, Existing Volumes is selected.
Existing Volumes	Click Existing Volumes and select the PV that you created.

The following code block shows the YAML template:

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The name of the OSS bucket.
      url: <your-bucket-endpoint> # The endpoint. We recommend internal endpoints, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # The model path, such as /models/DeepSeek-R1-Distill-Qwen-7B/ in this example.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
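
After the manifest is applied, the PVC should bind to the PV before the model is deployed. The following is a minimal sketch of that check. It assumes that the YAML template was saved to a file of your choice and applied with kubectl apply -f, and that kubectl is configured for the registered cluster. The helper name pvc_is_bound is hypothetical.

```shell
# Succeeds when the PVC named $1 in namespace $2 (default: "default") reports phase Bound.
pvc_is_bound() {
  phase=$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.status.phase}') || return 1
  [ "$phase" = "Bound" ]
}
```

A possible use is pvc_is_bound llm-model && echo "llm-model is bound". If the check fails, the selector label alicloud-pvname on the PVC and the label on the PV are a reasonable first place to look.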

Step 2: Deploy the model

1. Run the following command to view the status of nodes in the cluster.

kubectl get no -owide

Expected output:

NAME                            STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                              KERNEL-VERSION           CONTAINER-RUNTIME
idc-master-0210-001             Ready    control-plane   26h   v1.28.2   192.168.8.XXX   <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.32
idc-worker-0210-001             Ready    <none>          26h   v1.28.2   192.168.8.XXX   <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.32
idc-worker-0210-002             Ready    <none>          26h   v1.28.2   192.168.8.XXX   <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.32
virtual-kubelet-cn-hangzhou-b   Ready    agent           20h   v1.28.2   10.244.11.XXX   <none>        <unknown>                                             <unknown>                <unknown>
virtual-kubelet-cn-hangzhou-h   Ready    agent           25h   v1.28.2   10.244.11.XXX   <none>        <unknown>                                             <unknown>                <unknown>

The output shows that the cluster contains virtual nodes whose names start with virtual-kubelet-cn-hangzhou-.
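
In a script, the same check can be done without reading the table by eye. The following sketch filters node names by the virtual-kubelet- prefix that appears in the output above. The helper name list_virtual_nodes is hypothetical, and the prefix is an assumption taken from this example output, so it may differ in other environments.

```shell
# Prints the names of nodes whose name starts with "virtual-kubelet-", one per line.
# `kubectl get nodes -o name` prints entries of the form node/<name>.
list_virtual_nodes() {
  kubectl get nodes -o name | sed -n 's|^node/\(virtual-kubelet-.*\)$|\1|p'
}
```

If list_virtual_nodes prints nothing, the ack-virtual-node component is probably not installed or not ready yet.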

2. Run the following command to deploy the DeepSeek-R1-Distill-Qwen-7B model as an inference service by using vLLM.

Note: Suggested GPU resources: 1 GPU, 8 vCPUs, and 32 GiB of memory.

# The three --label flags request ACS GPU computing power.
# Replace <example-model> with the GPU model series that you want to use.
arena serve custom \
--name=deepseek-r1 \
--version=v1 \
--gpus=1 \
--cpu=8 \
--memory=32Gi \
--replicas=1 \
--env-from-secret=akId=oss-secret \
--env-from-secret=akSecret=oss-secret \
--label=alibabacloud.com/acs="true" \
--label=alibabacloud.com/compute-class=gpu \
--label=alibabacloud.com/gpu-model-series=<example-model> \
--restful-port=8000 \
--readiness-probe-action="tcpSocket" \
--readiness-probe-action-option="port: 8000" \
--readiness-probe-option="initialDelaySeconds: 30" \
--readiness-probe-option="periodSeconds: 30" \
--image=registry-cn-hangzhou-vpc.ack.aliyuncs.com/ack-demo/vllm:v0.6.6 \
--data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
"vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8000 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

Note: The following labels are required to run the inference service on ACS GPU computing power.

--label=alibabacloud.com/acs="true"
--label=alibabacloud.com/compute-class=gpu
--label=alibabacloud.com/gpu-model-series=<example-model>

Expected output:

service/deepseek-r1-v1 created
deployment.apps/deepseek-r1-v1-custom-serving created
INFO[0001] The Job deepseek-r1 has been submitted successfully 
INFO[0001] You can run `arena serve get deepseek-r1 --type custom-serving -n default` to check the job status

The following table describes the parameters.

Parameters	Description
--name	The name of the inference service.
--version	The version of the inference service.
--gpus	The number of GPUs used by each inference service replica.
--cpu	The number of CPUs used by each inference service replica.
--memory	The amount of memory used by each inference service replica.
--replicas	The number of inference service replicas.
--label	Add the following labels to request ACS GPU computing power: `--label=alibabacloud.com/acs="true"` `--label=alibabacloud.com/compute-class=gpu` `--label=alibabacloud.com/gpu-model-series=<example-model>` Note: To view the supported GPU model series, submit a ticket.
--restful-port	The port of the inference service to be exposed.
--readiness-probe-action	The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
--readiness-probe-action-option	The connection method of readiness probes.
--readiness-probe-option	The readiness probe configuration.
--image	The image address of the inference service.
--data	Mount a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:). On the left side of the colon, specify the name of the PVC. To list the PVCs that are available in the cluster, run the arena data list command. On the right side of the colon, specify the path in the container to which the PVC is mounted. The inference service reads the model files from this path, which is how it accesses the data stored in the PV bound to the PVC.
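
Because the three ACS labels must always be passed together, a script can generate them from a single variable. The following is a small sketch under that assumption. The helper name acs_gpu_labels is hypothetical, and the label keys are copied from the command above. Note that the shell removes the quotes in acs="true", so the value that Arena receives is true.

```shell
# Prints the three --label flags for GPU model series $1, one per line.
# Fails if the series is empty or still an unreplaced <placeholder>.
acs_gpu_labels() {
  case "$1" in
    ""|"<"*">")
      echo "acs_gpu_labels: set a real GPU model series" >&2
      return 1
      ;;
  esac
  printf '%s\n' \
    "--label=alibabacloud.com/acs=true" \
    "--label=alibabacloud.com/compute-class=gpu" \
    "--label=alibabacloud.com/gpu-model-series=$1"
}
```

One way to use it is to place an unquoted $(acs_gpu_labels "$GPU_SERIES") among the arena serve custom arguments, where GPU_SERIES is a variable that you define. The shell then splits the output into three separate flags.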

3. Run the following command to query the details of the inference service. Wait until the service is ready.

arena serve get deepseek-r1

Expected output:

Name:       deepseek-r1
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        17m
Address:    10.100.136.39
Port:       RESTFUL:8000
GPU:        1
Instances:
  NAME                                            STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                            ------   ---  -----  --------  ---  ----
  deepseek-r1-v1-custom-serving-5f59745cbd-bsrdq  Running  17m  1/1    0         1    virtual-kubelet-cn-hangzhou-b
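
Waiting for readiness can also be scripted. The following sketch reads the text printed by arena serve get and succeeds once the Available count has reached the Desired count. The helper name serve_is_ready is hypothetical, and the parsing assumes the Desired: and Available: lines shown in the output above, which may change between Arena versions.

```shell
# Reads `arena serve get` output on stdin.
# Succeeds when Desired is greater than 0 and Available is at least Desired.
serve_is_ready() {
  awk '
    $1 == "Desired:"   { desired = $2 }
    $1 == "Available:" { available = $2 }
    END { exit !(desired > 0 && available >= desired) }
  '
}
```

A simple polling loop would be: until arena serve get deepseek-r1 | serve_is_ready; do sleep 10; done. This loop has no timeout, so a real script would probably cap the number of attempts.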

4. Run the following command to check whether the inference service has been deployed to the virtual node.

kubectl get po -owide  |grep deepseek-r1-v1

Expected output:

NAME                                             READY   STATUS    RESTARTS   AGE     IP              NODE                            NOMINATED NODE   READINESS GATES
deepseek-r1-v1-custom-serving-5f59745cbd-r8drs   1/1     Running   0          3m16s   192.168.2.XXX   virtual-kubelet-cn-hangzhou-b   <none>           <none>

The output shows that the pod of the inference service is scheduled to a virtual node.

Step 3: Verify the inference services

1. Run kubectl port-forward to configure port forwarding between the inference service and the local environment.

kubectl port-forward svc/deepseek-r1-v1 8000:8000
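
kubectl port-forward keeps running in the foreground, so in a script it is usually started in the background, and requests are sent only after the forwarded port responds. The following sketch retries an HTTP request until it succeeds. The helper name wait_for_http is hypothetical. Using /v1/models as the probe URL is an assumption based on the OpenAI-compatible API that vLLM serves.

```shell
# Retries an HTTP GET on $1 up to $2 times (default 30), sleeping $3 seconds
# (default 2) between attempts. Succeeds as soon as one request succeeds.
wait_for_http() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors; -sS hides progress but keeps error messages.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, after starting the port forwarding with a trailing &, run wait_for_http http://localhost:8000/v1/models before you send the first chat request.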

2. Send requests to the inference service.

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json"  -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Hello, DeepSeek."}], "max_tokens": 100, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"chatcmpl-cef570252f324ed2b34953b8062f793f","object":"chat.completion","created":1739245450,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I am DeepSeek-R1, an intelligent assistant independently developed by China's DeepSeek company. I am delighted to serve you! \n</think>\n\n Hello! I am DeepSeek-R1, an intelligent assistant independently developed by China's DeepSeek company. I am delighted to serve you!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":68,"completion_tokens":58,"prompt_tokens_details":null},"prompt_logprobs":null}
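
In the expected output, the content field contains the model's reasoning text, then a </think> marker, then the final answer. If only the final answer is needed, the reasoning part can be removed on the client side. The following sketch does that for content that has already been extracted from the JSON response as plain text. JSON parsing is outside the scope of the sketch, and the helper name strip_think is hypothetical.

```shell
# Reads message content on stdin and prints only the text after the first
# "</think>" marker, skipping leading blank lines. If the marker is absent,
# the input is printed unchanged apart from leading blank lines.
strip_think() {
  awk '
    { lines[NR] = $0 }
    !found {
      i = index($0, "</think>")
      if (i) { found = NR; tail = substr($0, i + length("</think>")) }
    }
    END {
      start = found ? found + 1 : 1
      if (found && tail !~ /^[ \t]*$/) { print tail; started = 1 }
      for (n = start; n <= NR; n++) {
        if (!started && lines[n] ~ /^[ \t]*$/) continue
        started = 1
        print lines[n]
      }
    }
  '
}
```

The input is buffered in memory, which is acceptable for single chat responses but is not intended for streaming output.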

Conclusion

DeepSeek-R1 outperforms other closed-source models in mathematical reasoning and programming competitions. Its performance even matches or surpasses the OpenAI o1 series in certain areas. It attracted many users as soon as it was released. This article describes how to use ACS GPU computing power to deploy a production-ready DeepSeek inference service on an ACK One registered cluster. This addresses the shortage of computing power in the data center and makes it easier for enterprises to handle complex and changing business demands while taking full advantage of the cloud.

Visit the ACK One official website to learn more details and start your journey of intelligent scaling.
