By Yu Zhuang and Bingchang Tang
ACK Edge clusters adopt a cloud-edge integrated architecture: the Kubernetes control plane is managed on the cloud, and servers in the data center connect to the cluster as data plane nodes. This brings data center servers under containerized Kubernetes management, which reuses existing resources and improves the efficiency of application deployment and O&M.
AI large model services are developing rapidly, and ACK Edge has helped a large number of customers manage GPU-accelerated nodes in the data center and use containers to quickly deploy large model inference services. However, the release of the DeepSeek-R1 model raises the bar for GPU resources. DeepSeek-R1 is a Mixture-of-Experts (MoE) model that requires at least 8 GPUs for deployment. In addition, because DeepSeek-R1 is natively trained in FP8, newer GPUs are required for cost-effective serving. All of this puts pressure on GPU resources in the data center. With the virtual nodes of ACK Edge, you can quickly access ACS serverless GPU computing power on the cloud to deploy and run the DeepSeek inference service.
This article describes how to use ACK Edge to manage GPU-accelerated nodes in the data center and deploy the DeepSeek inference service by using the ACK AI suite. Inference pods preferentially run on the GPU-accelerated nodes in the data center. When those nodes are insufficient, virtual nodes in the ACK Edge cluster provision ACS serverless GPU computing power on the cloud to run the DeepSeek inference pods, meeting business requirements while optimizing costs. The solution works as follows:
• Connect the resources in the data center to the VPC.
• Connect on-premises resources to ACK Edge to manage and schedule services in the data center from the cloud.
• Configure a custom scheduling policy for the service to preferentially schedule pods to resources in the data center. If on-premises resources are insufficient, pods are scheduled to virtual nodes on the cloud.
• Configure HPA for the service so that a scale-out is automatically triggered when the resource threshold is reached.
This solution has the following advantages:
• Extreme elasticity: Large-scale auto scaling in seconds enables quick responses to traffic peaks.
• Refined cost control: You do not need to purchase servers. With pay-as-you-go billing, costs are transparent and controllable.
• Rich elastic resources: A wide range of CPU and GPU models is available.
Before you begin, make the following preparations:
• Select a region as the central region and create an ACK Edge cluster in the region.
• Install the virtual-node component. For more information, see Component management.
• Install Kserve. For more information, see Manage ack-kserve components.
• Install Arena. For more information, see Configure the Arena client.
• Deploy the monitoring component and configure GPU monitoring metrics. For more information, see Enable auto scaling based on GPU metrics.
• Create an edge node pool and add the resources in the data center to the edge node pool.
1) Run the following command to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope:
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull
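Optionally, verify that the weight files were fully downloaded rather than left as Git LFS pointer files (a quick sanity check, not part of the original procedure; pointer files are only a few hundred bytes):
# Lists the LFS-tracked files and their object IDs.
git lfs ls-files
# Real weight shards should be several GiB in size.
ls -lh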
2) Create an OSS directory and upload the model files to it. For information about installing and using ossutil, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
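Optionally, list the uploaded objects to confirm that the model files are in place (using the same bucket and path placeholders as above):
ossutil ls oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B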
3) Create a persistent volume (PV) and a persistent volume claim (PVC). Configure a PV named llm-model and a PVC named llm-model for the cluster where you want to deploy the inference services. For more information, see Mount a statically provisioned OSS volume. The following code block shows a sample YAML template:
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # The AccessKey ID used to access the OSS bucket.
  akSecret: <your-oss-sk> # The AccessKey secret used to access the OSS bucket.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # The name of the OSS bucket.
      url: <your-bucket-endpoint> # The endpoint. We recommend internal endpoints, such as oss-cn-hangzhou-internal.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # The model path. In this example, the model path is set to /models/DeepSeek-R1-Distill-Qwen-7B/.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
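Assuming you saved the template above as llm-model.yaml (the file name is illustrative), apply it and confirm that the PVC is bound before deploying the service:
kubectl apply -f llm-model.yaml
# The PVC should report STATUS Bound once the OSS volume is recognized.
kubectl get pv llm-model
kubectl get pvc llm-model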
Configure the scheduling priority so that inference pods are preferentially scheduled to the edge node pool and fall back to virtual nodes only when resources in the edge node pool are insufficient.
• Save the following ResourcePolicy as deepseek-resourcepolicy.yaml (the file name is illustrative) and run the kubectl create -f deepseek-resourcepolicy.yaml command.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: deepseek
  namespace: default
spec:
  selector:
    app: isvc.deepseek-predictor # You must specify the label of the pods to which you want to apply the ResourcePolicy.
  strategy: prefer
  units:
    - resource: ecs
      nodeSelector:
        alibabacloud.com/nodepool-id: np********* # The ID of the edge node pool.
    - resource: eci
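After the ResourcePolicy is created, you can check that it was accepted (ResourcePolicy is a custom resource provided by the ACK scheduler, so it can be queried with kubectl like any built-in object):
kubectl get resourcepolicy deepseek -n default -o yaml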
1) Run the following command to query the nodes in the cluster:
kubectl get nodes -o wide
Expected output:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cn-hangzhou.10.4.0.25 Ready <none> 10d v1.30.7-aliyun.1 10.4.0.25 <none> Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition) 5.10.134-18.al8.x86_64 containerd://1.6.36
cn-hangzhou.10.4.0.26 Ready <none> 10d v1.30.7-aliyun.1 10.4.0.26 <none> Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition) 5.10.134-18.al8.x86_64 containerd://1.6.36
idc001 Ready <none> 31s v1.30.7-aliyun.1 10.4.0.185 <none> Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition) 5.10.134-18.al8.x86_64 containerd://1.6.36
virtual-kubelet-cn-hangzhou-b Ready agent 7d21h v1.30.7-aliyun.1 10.4.0.180 <none> <unknown> <unknown> <unknown>
The output shows an on-premises node (idc001) and a virtual node (virtual-kubelet-cn-hangzhou-b). The on-premises node has a V100 GPU.
2) Run the following command to deploy the DeepSeek model inference service based on the vLLM framework:
arena serve kserve \
--name=deepseek \
--annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6e-c12g1.3xlarge \
--annotation=k8s.aliyun.com/eci-vswitch=vsw-*********,vsw-********* \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
--gpus=1 \
--cpu=4 \
--memory=12Gi \
--scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
--scale-target=50 \
--min-replicas=1 \
--max-replicas=3 \
--data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
"vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager --dtype=half"
Expected output:
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/bingchang/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/bingchang/.kube/config
horizontalpodautoscaler.autoscaling/deepseek-hpa created
inferenceservice.serving.kserve.io/deepseek created
INFO[0002] The Job deepseek has been submitted successfully
INFO[0002] You can run `arena serve get deepseek --type kserve -n default` to check the job status
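You can also inspect the KServe and HPA objects that Arena created. The resource names below are taken from the output above:
kubectl get inferenceservice deepseek -n default
kubectl get hpa deepseek-hpa -n default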
3) Run the following command to query the details of the inference service:
arena serve get deepseek
Expected output:
Name: deepseek
Namespace: default
Type: KServe
Version: 1
Desired: 1
Available: 1
Age: 1m
Address: http://deepseek-default.example.com
Port: :80
GPU: 1
Instances:
NAME STATUS AGE READY RESTARTS GPU NODE
---- ------ --- ----- -------- --- ----
deepseek-predictor-6b9455f8c5-wl5lc Running 1m 1/1 0 1 idc001
The result shows that the pods of the inference service are scheduled to the on-premises node.
4) After the deployment is complete, you can send a request directly to the service to verify that the deployment is successful. The request address can be found in the details of the Ingress resource that KServe automatically creates.
curl -H "Host: deepseek-default.example.com" -H "Content-Type: application/json" http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"chatcmpl-efc1225ad2f33cc39a8ddbc4039a41b9","object":"chat.completion","created":1739861087,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, so I need to figure out how to say \"This is a test!\" in Spanish. Hmm, I'm not super fluent in Spanish, but I know some basic phrases. Let me think about how to approach this.\n\nFirst, I remember that \"test\" is \"prueba\" in Spanish. So maybe I can start with \"Esto es una prueba.\" But I'm not sure if that's the best way to say it. Maybe there's a more common expression or a different structure.\n\nWait, I think there's a phrase that's commonly used in tests. Isn't it something like \"This is a test.\" or \"This is a quiz.\"? I think the Spanish equivalent would be \"Este es un test.\" That sounds more natural. Let me check if that makes sense.\n\nI can also think about how people use phrases in tests. Maybe they use \"This is the test\" or \"This is an exam.\" So perhaps \"Este es el test.\" or \"Este es el examen.\" I'm not sure which one is more appropriate.\n\nI should also consider the grammar. \"This is a test\" is a simple statement, so the subject is \"this\" (using \"este\"), the verb is \"is\" (using \"es\"), and the object is \"a test\" (using \"un test\"). So putting it together, it would be \"Este es un test.\"\n\nWait, but sometimes people use \"This is the test\" when referring to an important one, so maybe \"Este es el test.\" But I'm not entirely sure if that's the correct structure. Let me think about other similar phrases.\n\nI also recall that in some contexts, people might say \"This is a practice test\" or \"This is a sample test.\" But since the user just said \"This is a test,\" the most straightforward translation would be \"Este es un test.\"\n\nI should also consider if there are any idiomatic expressions or common phrases that are used in this context. For example, \"This is the test\" is often used to mean a significant exam or evaluation, so \"Este es el test\" might be more appropriate in that context.\n\nBut I'm a bit confused because I'm not 100% sure about the correct structure. Maybe I should look up some examples. Oh, wait, I can't look things up right now, so I'll have to rely on my memory.\n\nI think the basic structure is subject + verb + object. So \"this\" (this is \"este","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":11,"total_tokens":523,"completion_tokens":512,"prompt_tokens_details":null},"prompt_logprobs":null}
Use the stress testing tool hey to simulate a large number of requests to the service:
hey -z 5m -c 5 \
-m POST -host deepseek-default.example.com \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions
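While hey is running, you can watch the autoscaler and the pods react. The HPA name and the pod label below come from the earlier steps:
kubectl get hpa deepseek-hpa -n default --watch
kubectl get pods -l app=isvc.deepseek-predictor -o wide --watch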
As requests pile up, GPU utilization exceeds the scaling threshold and pods are scaled out. Run the following command to view the details of the inference service:
arena serve get deepseek
Expected output:
Name: deepseek
Namespace: default
Type: KServe
Version: 1
Desired: 3
Available: 2
Age: 18m
Address: http://deepseek-default.example.com
Port: :80
GPU: 3
Instances:
NAME STATUS AGE READY RESTARTS GPU NODE
---- ------ --- ----- -------- --- ----
deepseek-predictor-6b9455f8c5-dtzdv Running 1m 0/1 0 1 virtual-kubelet-cn-hangzhou-h
deepseek-predictor-6b9455f8c5-wl5lc Running 18m 1/1 0 1 idc001
deepseek-predictor-6b9455f8c5-zmpg8 Running 5m 1/1 0 1 virtual-kubelet-cn-hangzhou-h
The output shows that two replicas have been scaled out to the virtual node.
ACK Edge uses a cloud-edge integrated architecture in which the Kubernetes control plane is managed on the cloud, allowing you to manage data center resources, ENS resources, and cross-region ECS resources from a single cluster. It reduces the complexity of managing distributed resources and services while seamlessly integrating with the existing elasticity capabilities on the cloud to meet the elasticity requirements of local services. Combined with virtual nodes, ACK Edge handles traffic bursts better and controls resource costs in a more refined manner, ensuring stable business operations.
[1] Create an ACK Edge cluster
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/create-an-ack-edge-cluster-1
[2] Add an edge node
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/add-an-edge-node
[3] Virtual node management
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/virtual-node-management
[4] Configure priority-based resource scheduling
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/configure-priority-based-resource-scheduling
[5] Component management
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/component-overview
[6] Deploy and manage the ack-kserve component in an ACK cluster
https://www.alibabacloud.com/help/en/ack/cloud-native-ai-suite/user-guide/installation-ack-kserve
[7] Configure the Arena client
https://www.alibabacloud.com/help/en/ack/cloud-native-ai-suite/user-guide/install-arena#task-1917487
[8] Enable auto scaling based on GPU metrics
https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/enable-auto-scaling-based-on-gpu-metrics#section-hh4-ss2-qbu
[9] Create and manage an edge node pool
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/edge-node-pool-management
[10] Add an edge node
https://www.alibabacloud.com/help/en/ack/ack-edge/user-guide/add-an-edge-node
[11] Install ossutil
https://www.alibabacloud.com/help/en/oss/developer-reference/install-ossutil
[12] Mount a statically provisioned OSS volume
https://www.alibabacloud.com/help/en/cs/user-guide/oss-child-node-1