
Solving GPU Shortages in IDC with Alibaba Cloud ACK Edge and Virtual Nodes for DeepSeek Deployment

This article describes how to use ACK Edge and virtual nodes to meet the elasticity requirements of DeepSeek deployment.

By Yu Zhuang and Bingchang Tang

ACK Edge clusters adopt a cloud-edge integrated architecture: the Kubernetes control plane is managed on the cloud, while servers in the data center are connected as data plane nodes of the cluster. This brings containerized, Kubernetes-native management to data center servers, reusing existing resources and improving the efficiency of application deployment and O&M.

AI large model services are developing rapidly. ACK Edge has helped a large number of customers manage GPU-accelerated nodes in their data centers and quickly deploy AI model inference services in containers. However, the release of the DeepSeek-R1 model has sharply increased the demand for GPUs. DeepSeek-R1 is an MoE model that requires at least 8 GPUs for deployment. In addition, because DeepSeek-R1 is natively trained with FP8, newer GPUs are needed to run it cost-effectively. All of this puts pressure on GPU resources in the data center. With the virtual nodes of ACK Edge, you can quickly access ACS serverless GPU computing power on the cloud to deploy and run the DeepSeek inference service.

This article describes how to use ACK Edge to manage GPU-accelerated nodes in the data center and deploy the DeepSeek inference service with the ACK AI suite, running inference pods preferentially on on-premises GPU-accelerated nodes. When those nodes are insufficient, virtual nodes on ACK Edge provision ACS serverless GPU computing power on the cloud to run additional DeepSeek inference pods, meeting business requirements while optimizing costs.

Elastic ACS Serverless GPU Solution Based on ACK Edge and Virtual Nodes

(Solution architecture diagram)

The solution works as follows:

• Connect the resources in the data center to the VPC.

• Connect on-premises resources to ACK Edge to manage and schedule services in the data center from the cloud.

• Configure a custom scheduling policy for the service to preferentially schedule pods to resources in the data center. If on-premises resources are insufficient, pods are then scheduled to virtual nodes on the cloud.

• Configure HPA for services so that scale-out is automatically triggered when the resource threshold is reached (a minimal sketch follows this list).
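
In Step 3 below, Arena creates the HPA automatically. As a minimal standalone sketch of what such a policy looks like (the HPA name, the predictor Deployment name, and the use of the DCGM_CUSTOM_PROCESS_SM_UTIL external metric are assumptions based on the deployment later in this article), a GPU-utilization-based HPA might be written as:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-hpa            # Assumed name; arena generates an equivalent HPA in Step 3.
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-predictor    # Assumed: the predictor Deployment created by KServe.
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: External              # GPU SM utilization exposed through the GPU metrics adapter.
    external:
      metric:
        name: DCGM_CUSTOM_PROCESS_SM_UTIL
      target:
        type: Value
        value: "50"             # Scale out when utilization exceeds 50.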

Benefits

Extreme elasticity: Large-scale auto scaling in seconds enables quick responses to traffic peaks.
Refined cost control: You do not need to purchase servers. With the pay-as-you-go billing method, costs are transparent and controllable.
Rich elastic resources: The solution supports a wide range of CPU and GPU models.

Examples

Prepare the Environment

• Select a region as the central region and create an ACK Edge cluster in the region.

• Install the virtual-node component. For more information, see Component management.

• Install KServe. For more information, see Manage ack-kserve components.

• Install Arena. For more information, see Configure the Arena client.

• Deploy the monitoring component and configure GPU monitoring metrics. For more information, see Enable auto scaling based on GPU metrics.

• Create an edge node pool and add the resources in the data center to the edge node pool. For more information, see Create and manage an edge node pool and Add an edge node.

Procedure

Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files

1) Run the following command to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope:

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull
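
Optionally, run a quick sanity check that the weight files were fully downloaded (file names and the total size are approximate and may vary by model revision):

ls -lh *.safetensors   # The model weight shards pulled by git lfs
du -sh .               # Total size should be roughly 15 GB for the 7B model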

2) Create an OSS directory and upload the model files to the directory.

To install and use ossutil, see Install ossutil.

ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
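
You can then list the uploaded objects to confirm that the copy succeeded:

ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B/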

3) Create a persistent volume (PV) and a persistent volume claim (PVC). Configure a PV named llm-model and a PVC named llm-model for the cluster where you want to deploy the inference services. For more information, see Mount a statically provisioned OSS volume. The following code block shows a sample YAML template:

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The name of the OSS bucket.
      url: <your-bucket-endpoint> # The endpoint. We recommend internal endpoints, such as oss-cn-hangzhou-internal.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # The model path. In this example, the model path is set to /models/DeepSeek-R1-Distill-Qwen-7B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
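
After applying the YAML template, confirm that the PVC is bound before deploying the inference service:

kubectl get pv llm-model    # STATUS should be Bound
kubectl get pvc llm-model   # STATUS should be Bound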

Step 2: Create custom scheduling policies

Configure the scheduling priority so that pods are preferentially scheduled to the edge node pool. If resources in the edge node pool are insufficient, pods are scheduled to virtual nodes instead.

• Save the following content as deepseek-resourcepolicy.yaml and run the kubectl create -f deepseek-resourcepolicy.yaml command.

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: deepseek
  namespace: default
spec:
  selector:
    app: isvc.deepseek-predictor # You must specify the label of the pods to which you want to apply the ResourcePolicy. 
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np*********  # The ID of the edge node pool.
  - resource: eci
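
You can verify that the policy exists (this assumes the ResourcePolicy CRD is queryable under this resource name in your cluster):

kubectl get resourcepolicy deepseek -n default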

Step 3: Deploy a model

1) Run the following command to query the nodes in the cluster:

kubectl get nodes -owide

Expected output:

NAME                            STATUS   ROLES    AGE     VERSION            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                              KERNEL-VERSION           CONTAINER-RUNTIME
cn-hangzhou.10.4.0.25           Ready    <none>   10d     v1.30.7-aliyun.1   10.4.0.25     <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
cn-hangzhou.10.4.0.26           Ready    <none>   10d     v1.30.7-aliyun.1   10.4.0.26     <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
idc001                          Ready    <none>   31s     v1.30.7-aliyun.1   10.4.0.185    <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
virtual-kubelet-cn-hangzhou-b   Ready    agent    7d21h   v1.30.7-aliyun.1   10.4.0.180    <none>        <unknown>                                             <unknown>                <unknown>

The output shows an on-premises node (idc001) and a virtual node (virtual-kubelet-cn-hangzhou-b). The on-premises node has a V100 GPU.

2) Run the following command to deploy the DeepSeek model inference service that uses the vLLM framework:

arena serve kserve \
    --name=deepseek \
    --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6e-c12g1.3xlarge \
    --annotation=k8s.aliyun.com/eci-vswitch=vsw-*********,vsw-********* \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=50 \
    --min-replicas=1  \
    --max-replicas=3  \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager --dtype=half"

Expected output:

WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/bingchang/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/bingchang/.kube/config
horizontalpodautoscaler.autoscaling/deepseek-hpa created
inferenceservice.serving.kserve.io/deepseek created
INFO[0002] The Job deepseek has been submitted successfully
INFO[0002] You can run `arena serve get deepseek --type kserve -n default` to check the job status
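
Note that the k8s.aliyun.com/eci-use-specs and k8s.aliyun.com/eci-vswitch annotations take effect only when a pod is scheduled to a virtual node: they specify the serverless GPU instance type to create and the vSwitches in which to create it. Pods that land on the on-premises node ignore them.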

3) Run the following command to query the details of the inference service:

arena serve get deepseek

Expected output:

Name:       deepseek
Namespace:  default
Type:       KServe
Version:    1
Desired:    1
Available:  1
Age:        1m
Address:    http://deepseek-default.example.com
Port:       :80
GPU:        1


Instances:
  NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                 ------   ---  -----  --------  ---  ----
  deepseek-predictor-6b9455f8c5-wl5lc  Running  1m   1/1    0         1    idc001

The result shows that the pods of the inference service are scheduled to the on-premises node.

4) After the deployment is complete, you can send requests directly to the service to verify that it is working. The request address can be found in the details of the Ingress resource that KServe automatically creates.
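
For example, you might look up the Ingress host and the ingress controller's NodePort with generic commands like the following (the exact service name and namespace vary by cluster):

kubectl get ingress -n default        # The HOSTS column shows the request host
kubectl get svc -A | grep -i ingress  # Find the NodePort of the ingress controller service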

curl -H "Host: deepseek-default.example.com" -H "Content-Type: application/json" http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"chatcmpl-efc1225ad2f33cc39a8ddbc4039a41b9","object":"chat.completion","created":1739861087,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, so I need to figure out how to say \"This is a test!\" in Spanish. Hmm, I'm not super fluent in Spanish, but I know some basic phrases. Let me think about how to approach this.\n\nFirst, I remember that \"test\" is \"prueba\" in Spanish. So maybe I can start with \"Esto es una prueba.\" But I'm not sure if that's the best way to say it. Maybe there's a more common expression or a different structure.\n\nWait, I think there's a phrase that's commonly used in tests. Isn't it something like \"This is a test.\" or \"This is a quiz.\"? I think the Spanish equivalent would be \"Este es un test.\" That sounds more natural. Let me check if that makes sense.\n\nI can also think about how people use phrases in tests. Maybe they use \"This is the test\" or \"This is an exam.\" So perhaps \"Este es el test.\" or \"Este es el examen.\" I'm not sure which one is more appropriate.\n\nI should also consider the grammar. \"This is a test\" is a simple statement, so the subject is \"this\" (using \"este\"), the verb is \"is\" (using \"es\"), and the object is \"a test\" (using \"un test\"). So putting it together, it would be \"Este es un test.\"\n\nWait, but sometimes people use \"This is the test\" when referring to an important one, so maybe \"Este es el test.\" But I'm not entirely sure if that's the correct structure. Let me think about other similar phrases.\n\nI also recall that in some contexts, people might say \"This is a practice test\" or \"This is a sample test.\" But since the user just said \"This is a test,\" the most straightforward translation would be \"Este es un test.\"\n\nI should also consider if there are any idiomatic expressions or common phrases that are used in this context. For example, \"This is the test\" is often used to mean a significant exam or evaluation, so \"Este es el test\" might be more appropriate in that context.\n\nBut I'm a bit confused because I'm not 100% sure about the correct structure. Maybe I should look up some examples. Oh, wait, I can't look things up right now, so I'll have to rely on my memory.\n\nI think the basic structure is subject + verb + object. So \"this\" (this is \"este","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":11,"total_tokens":523,"completion_tokens":512,"prompt_tokens_details":null},"prompt_logprobs":null}

Step 4: Simulate peak service requests to trigger cloud elasticity

Use the stress testing tool hey to simulate peak traffic by sending a large number of concurrent requests to the service:

hey -z 5m -c 5 \
-m POST -host deepseek-default.example.com \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions
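
While the load test runs, you can watch the HPA and the pods react. The HPA name deepseek-hpa comes from the arena output in Step 3; the pod label is an assumption based on the labels KServe typically applies:

kubectl get hpa deepseek-hpa -w
kubectl get pods -l serving.kserve.io/inferenceservice=deepseek -o wide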

Under this load, GPU utilization exceeds the threshold, and a scale-out is triggered. Run the following command to view the details of the inference service:

arena serve get deepseek

Expected output:

Name:       deepseek
Namespace:  default
Type:       KServe
Version:    1
Desired:    3
Available:  2
Age:        18m
Address:    http://deepseek-default.example.com
Port:       :80
GPU:        3


Instances:
  NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                 ------   ---  -----  --------  ---  ----
  deepseek-predictor-6b9455f8c5-dtzdv  Running  1m   0/1    0         1    virtual-kubelet-cn-hangzhou-h
  deepseek-predictor-6b9455f8c5-wl5lc  Running  18m  1/1    0         1    idc001
  deepseek-predictor-6b9455f8c5-zmpg8  Running  5m   1/1    0         1    virtual-kubelet-cn-hangzhou-h

As the output shows, two additional replicas have been scaled out onto the virtual node.

Conclusion

ACK Edge uses a cloud-edge integrated architecture in which the Kubernetes control plane is managed on the cloud. It allows you to manage resources in the data center, ENS resources, and cross-region ECS resources. While reducing the complexity of managing distributed resources and services, it integrates seamlessly with the existing elasticity capabilities on the cloud to meet the elasticity requirements of local services. Combined with virtual nodes, ACK Edge can better handle traffic bursts and control resource costs in a more refined manner, ensuring stable operation of the business.

References

[1] Create an ACK Edge cluster
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/create-an-ack-edge-cluster-1

[2] Add an edge node
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/add-an-edge-node

[3] Virtual node management
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/virtual-node-management

[4] Configure priority-based resource scheduling
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/configure-priority-based-resource-scheduling

[5] Component management
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/component-overview

[6] Deploy and manage the ack-kserve component in an ACK cluster
https://www.alibabacloud.com/help/en/ack/cloud-native-ai-suite/user-guide/installation-ack-kserve

[7] Configure the Arena client
https://www.alibabacloud.com/help/en/ack/cloud-native-ai-suite/user-guide/install-arena#task-1917487

[8] Enable auto scaling based on GPU metrics
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/enable-auto-scaling-based-on-gpu-metrics#section-hh4-ss2-qbu

[9] Create and manage an edge node pool
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/edge-node-pool-management

[10] Add an edge node
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/add-an-edge-node

[11] Install ossutil
https://www.alibabacloud.com/help/en/oss/developer-reference/install-ossutil

[12] Mount a statically provisioned OSS volume
https://www.alibabacloud.com/help/en/cs/user-guide/oss-child-node-1
