This topic describes how to use a DataCache to deploy a DeepSeek-R1 series model in Elastic Container Instance. This topic also describes how to configure Horizontal Pod Autoscaler (HPA) to implement scaling of pods based on custom metrics. In this example, DeepSeek-R1-Distill-Qwen-7B is used.
Why is Elastic Container Instance used to deploy DeepSeek?
Elastic Container Instance requires no O&M, can be flexibly deployed, and helps you build elastic, cost-effective businesses. For more information, see Benefits.
Elastic Container Instance uses DataCaches and ImageCaches to reduce the time required to pull images and download models, reduce network resource consumption, and improve system efficiency.
Note: The deployment of a containerized large model inference service involves the following stages: create and start a container, pull the image, download the model file, and load and start the model. Because the image and model of a large model inference service are large, pulling them requires an extended period of time and a large amount of network traffic. For example, the vLLM image is about 16.5 GB in size and the DeepSeek-R1-Distill-Qwen-7B model is about 14 GB in size. Elastic Container Instance uses DataCaches and ImageCaches to reduce the time required to pull images and download models.
Prerequisites
A DataCache custom resource definition (CRD) is deployed in the cluster. For more information, see Deploy a DataCache CRD.
The virtual private cloud (VPC) in which the cluster resides is associated with an Internet NAT gateway. An SNAT entry is configured for the Internet NAT gateway to allow resources in the VPC or resources connected to vSwitches in the VPC to access the Internet.
Note: If the VPC is not associated with an Internet NAT gateway, you must associate an elastic IP address (EIP) when you create the DataCache and deploy the application. This way, data can be pulled from the Internet.
The ARMS Prometheus component (ack-arms-prometheus) is installed in the cluster. For more information, see Use Managed Service for Prometheus.
The ack-alibaba-cloud-metrics-adapter component is deployed in the cluster.
To deploy ack-alibaba-cloud-metrics-adapter, log on to the Container Service for Kubernetes (ACK) console, go to the Marketplace page, and then find and deploy ack-alibaba-cloud-metrics-adapter.
KServe is deployed in the cluster.
KServe is a Kubernetes-based machine learning model serving framework. KServe allows you to deploy one or more trained models to a model serving runtime by using Kubernetes CRDs. For more information, see Install ack-kserve.
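Before you continue, you can check that these components are installed. The following commands are a quick sanity check only; the namespaces are the typical default installation namespaces and may differ in your cluster, and the DataCache CRD name is derived from the apiVersion that is used later in this topic.
kubectl get crd datacaches.eci.aliyun.com                      # DataCache CRD
kubectl get pods -n arms-prom                                  # Managed Service for Prometheus components
kubectl get deployment -n kube-system | grep metrics-adapter   # ack-alibaba-cloud-metrics-adapter
kubectl get pods -n kserve                                     # KServe controller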
Prepare a runtime environment
Different DeepSeek models have different requirements for runtime environments. In this topic, the DeepSeek-R1-Distill-Qwen-7B model is used.
Recommended specifications
The GPU-accelerated Elastic Compute Service (ECS) instance family that is used to create the Elastic Container Instance-based pod must meet the following conditions. For information about the GPU-accelerated ECS instance families that can be used to create pods, see Supported instance families.
CPU: no strict limits
Memory size: greater than 16 GiB
Number of GPUs: 1 or more
GPU memory size: 20 GB or more. For example, the A10 GPU meets this requirement. If the GPU memory is less than 20 GB, an out-of-memory (OOM) error may occur.
Software requirements
The deployment of a large model depends on a large number of libraries and configurations. vLLM is a mainstream large model inference engine and is used to deploy the inference service in this topic. Elastic Container Instance provides a public vLLM container image that you can use directly or extend through secondary development. The image address is registry.cn-hangzhou.aliyuncs.com/eci_open/vllm-openai:v0.7.2, and the image size is about 16.5 GB.
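If you plan to perform secondary development based on the public image, you can first pull it to a local build environment. This step is optional and assumes that Docker is installed on your machine.
docker pull registry.cn-hangzhou.aliyuncs.com/eci_open/vllm-openai:v0.7.2   # The image is about 16.5 GB. The pull may take a long time.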
Step 1: Create a DataCache
When you deploy DeepSeek for the first time, create a DataCache in advance to pre-download the model data. This eliminates the need to pull the model data when pods start and accelerates the deployment of DeepSeek.
Visit Hugging Face and obtain the ID of the model.
In this topic, the main version of DeepSeek-R1-Distill-Qwen-7B is used. Find the model on Hugging Face and copy the model ID from the upper part of the model details page.
Write a YAML configuration file for the DataCache. Then, use the YAML file to create the DataCache, which pulls the DeepSeek-R1-Distill-Qwen-7B model data and stores it in the DataCache.
kubectl create -f datacache-test.yaml
The following example shows the DataCache YAML configuration file, which is named datacache-test.yaml:
apiVersion: eci.aliyun.com/v1alpha1
kind: DataCache
metadata:
  name: deepseek-r1-distill-qwen-7b
spec:
  bucket: test
  path: /model/deepseek-r1-distill-qwen-7b
  dataSource:
    type: URL
    options:
      repoSource: HuggingFace/Model                      # Specify the model whose data source is Hugging Face.
      repoId: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B    # Specify the ID of the model.
      revision: main                                     # Specify the version of the model.
  netConfig:
    securityGroupId: sg-bp1***********
    vSwitchId: vsw-bp1uo************                     # Specify a vSwitch for which SNAT entries are configured.
    # If no SNAT entries are configured for the vSwitch to enable Internet access for the model, an elastic IP address (EIP) must be created and associated.
    eipCreateParam:
      bandwidth: 5                                       # EIP bandwidth
Query the status of the DataCache.
kubectl get edc
After the model data is downloaded and the status of the DataCache becomes Available, the DataCache is ready for use. Alibaba Cloud provides the hot load capability for DeepSeek-R1 series models, which allows a DataCache for these models to be created within seconds.
NAME                          AGE   DATACACHEID                STATUS      PROGRESS   BUCKET   PATH
deepseek-r1-distill-qwen-7b   40s   edc-uf6btsb4q5j4b9ue****   Available   100%       test     /model/deepseek-r1-distill-qwen-7b
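If the status does not become Available, you can inspect the events of the DataCache to troubleshoot issues such as failed Internet access to Hugging Face or incorrect vSwitch and EIP settings. For example:
kubectl describe edc deepseek-r1-distill-qwen-7b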
Step 2: Configure rules for ack-alibaba-cloud-metrics-adapter
Custom metric-based auto scaling is implemented based on the ack-alibaba-cloud-metrics-adapter component and the Kubernetes HPA mechanism provided by ACK. The following table describes the GPU metrics that are supported by HPA. For more information, see Enable auto scaling based on GPU metrics.
| Metric | Description | Unit |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | The utilization of the GPU card. This metric is available only for GPUs that are scheduled in exclusive mode. | % |
| DCGM_FI_DEV_FB_USED | The used memory of the GPU card. This metric is available only for GPUs that are scheduled in exclusive mode. | MiB |
| DCGM_CUSTOM_PROCESS_SM_UTIL | The GPU utilization of pods. | % |
| DCGM_CUSTOM_PROCESS_MEM_USED | The amount of GPU memory that is used by pods. | MiB |
Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster.
In the left-side navigation pane of the cluster management page, go to the Helm page. On the Helm page, click Update in the Actions column of ack-alibaba-cloud-metrics-adapter.
Add the following rules in the custom field.
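The exact rules are provided in Enable auto scaling based on GPU metrics. For reference only, a rule for DCGM_CUSTOM_PROCESS_SM_UTIL typically follows the Prometheus adapter rule format shown in the following sketch. The seriesQuery and the label names (NamespaceName and PodName) are assumptions and must match the labels that the GPU exporter in your cluster actually reports.
custom:
- seriesQuery: 'DCGM_CUSTOM_PROCESS_SM_UTIL{}'            # The Prometheus series exposed by the GPU exporter.
  resources:
    overrides:
      NamespaceName: {resource: "namespace"}              # Label names are assumptions. Check the labels reported by your exporter.
      PodName: {resource: "pod"}
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
After the component is updated, you can verify that the metric is exposed through the custom metrics API:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_CUSTOM_PROCESS_SM_UTIL"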
Step 3: Deploy the scalable DeepSeek model inference service
Write a YAML configuration file for the DeepSeek application and then deploy the application based on the YAML file.
kubectl create -f deepseek-r1-7b-kserve.yaml
The following sample code provides the content of the deepseek-r1-7b-kserve.yaml file. Description of the file:
The pod uses a GPU-accelerated ECS instance type and has the DeepSeek-R1-Distill-Qwen-7B model mounted.
The predictor of the InferenceService uses an image that contains vLLM. After the container starts, it runs vllm serve /deepseek-r1-7b --port 8080 --tensor-parallel-size 1 --max-model-len 24384 --enforce-eager to start an OpenAI-compatible server.
Scaling is triggered based on the DCGM_CUSTOM_PROCESS_SM_UTIL metric, which indicates the GPU utilization of pods. When the average GPU utilization reaches 50%, HPA automatically scales out pods. The total number of pods cannot exceed the value of maxReplicas.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: deepseek-r1-7b-kserve
  labels:
    alibabacloud.com/eci: "true"
  annotations:
    serving.kserve.io/autoscalerClass: external
    k8s.aliyun.com/eci-use-specs: ecs.gn7i-c16g1.4xlarge,ecs.gn7i-c32g1.8xlarge  # Specify GPU-accelerated ECS instance types. You can specify multiple ECS instance types to increase the creation success rate of the pod.
    k8s.aliyun.com/eci-extra-ephemeral-storage: "20Gi"  # Specify an additional temporary storage space because the startup of the pod depends on a large framework. You are charged for the additional temporary storage space.
    k8s.aliyun.com/eci-data-cache-bucket: "test"        # Specify a bucket to store the DataCache.
    # If you require a higher loading speed, you can use an AutoPL disk.
    k8s.aliyun.com/eci-data-cache-provisionedIops: "15000"   # Specify the IOPS that is provisioned for the enhanced SSD (ESSD) AutoPL disk.
    k8s.aliyun.com/eci-data-cache-burstingEnabled: "true"    # Enable the performance burst feature for the ESSD AutoPL disk to accelerate the startup of the application.
spec:
  predictor:
    containers:
      - name: vllm
        command:
          - /bin/sh
        args:
          - -c
          - vllm serve /deepseek-r1-7b --port 8080 --tensor-parallel-size 1 --max-model-len 24384 --enforce-eager
        image: registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/vllm-openai:v0.7.2
        resources:
          limits:
            cpu: "16"
            memory: "60Gi"
            nvidia.com/gpu: "1"
          requests:
            cpu: "16"
            memory: "60Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
          - mountPath: /deepseek-r1-7b  # Specify the path of the model.
            name: llm-model
    volumes:
      - name: llm-model
        hostPath:
          path: /model/deepseek-r1-distill-qwen-7b  # Specify the mount path of the DataCache.
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-r1-7b-kserve-predictor
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-r1-7b-kserve-predictor
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_CUSTOM_PROCESS_SM_UTIL  # Specify the name of the metric to be monitored.
        target:
          type: Utilization   # Define the type of target value, such as utilization or raw value.
          averageValue: '50'  # Set the target average value of the metric.
  minReplicas: 1    # Specify the minimum number of pods that are in the Running state.
  maxReplicas: 100  # Specify the maximum allowed number of pods.
Expected output:
inferenceservice.serving.kserve.io/deepseek-r1-7b-kserve created
horizontalpodautoscaler.autoscaling/deepseek-r1-7b-kserve-predictor created
Check whether the application is deployed.
kubectl get pods --selector=app=isvc.deepseek-r1-7b-kserve-predictor
Expected output:
NAME                                               READY   STATUS    RESTARTS   AGE
deepseek-r1-7b-kserve-predictor-6785df7b7f-r7kjx   1/1     Running   0          116s
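You can also check that the InferenceService is ready and has been assigned a URL:
kubectl get inferenceservice deepseek-r1-7b-kserve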
Step 4: Test the inference effect of the model
Obtain the IP address of the Server Load Balancer (SLB) instance that is used by the NGINX Ingress and the hostname of the DeepSeek inference service, and save them as variables.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice deepseek-r1-7b-kserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Test the DeepSeek model inference service.
Send a request to the DeepSeek model inference service. Example:
curl -X POST http://$NGINX_INGRESS_IP:80/v1/chat/completions \
-H "Host: $SERVICE_HOSTNAME" \
-H "Content-Type: application/json" \
-d '{
    "model": "/deepseek-r1-7b",
    "messages": [
        {
            "role": "user",
            "content": "Briefly describe containers in one sentence"
        }
    ],
    "temperature": 0.6,
    "max_tokens": 3000
}' \
--verbose
Expected output:
{"id":"chatcmpl-56e6ff393d999571ce6ead1b72f9302d","object":"chat.completion","created":1739340308,"model":"/deepseek-r1-7b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\n Ok, I need to briefly describe containers. What is a container? I remember that in programming, especially in Docker, a container seems to be an isolated environment that allows applications to run independently. Containerization makes development and deployment easier, right? Therefore, a container should be a lightweight runtime that can isolate applications from dependencies, making the development and deployment process more efficient. Right? Therefore, a one-sentence introduction to containers should cover the features of isolation, lightweight, and independent operation of containers. Application scenarios of containerization technology, such as cloud native services and the microservices architecture, may also be mentioned. Well, now I need to organize these ideas into one sentence. \n</think>\n\n A container is an isolated runtime environment that allows applications to run independently, provides lightweight and efficient management of resources, and supports cloud native services and the microservices architecture.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":168,"completion_tokens":160,"prompt_tokens_details":null},"prompt_logprobs":null}
The content in <think></think> represents the thinking process or inference steps before the model generates the final answer. These markers are not part of the final answer, but a record of the self-prompting or logical inference that the model performs before it generates the answer.
Extracted final answer:
A container is an isolated runtime environment that allows applications to run independently, provides lightweight and efficient management of resources, and supports cloud native services and the microservices architecture.
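If you want to extract only the final answer in a script, the following is a minimal sketch that assumes jq and perl are available on the client. It reads the assistant message from the response and strips the <think></think> block.
curl -s -X POST http://$NGINX_INGRESS_IP:80/v1/chat/completions \
-H "Host: $SERVICE_HOSTNAME" \
-H "Content-Type: application/json" \
-d '{"model": "/deepseek-r1-7b", "messages": [{"role": "user", "content": "Briefly describe containers in one sentence"}], "temperature": 0.6, "max_tokens": 3000}' \
| jq -r '.choices[0].message.content' \
| perl -0pe 's/<think>.*?<\/think>\s*//s'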
Test whether HPA scales out pods when the metric value exceeds the scaling threshold.
Run the following command to query the number of pods in the cluster:
kubectl get pods --selector=app=isvc.deepseek-r1-7b-kserve-predictor
The following output shows that the cluster contains one pod.
NAME                                               READY   STATUS    RESTARTS   AGE
deepseek-r1-7b-kserve-predictor-6785df7b7f-r7kjx   1/1     Running   0          6m54s
Run the following command to query the status of HPA:
kubectl get hpa
Expected output:
NAME                              REFERENCE                                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
deepseek-r1-7b-kserve-predictor   Deployment/deepseek-r1-7b-kserve-predictor   9/50      1         100       1          8m
Use hey to perform a stress test.
For more information about hey, see hey.
hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME \
-H "Content-Type: application/json" \
-d '{"model": "/deepseek-r1-7b", "messages": [{"role": "user", "content": "hello world!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
http://$NGINX_INGRESS_IP:80/v1/chat/completions
Query the number of pods in the cluster to check whether HPA scaled out pods.
kubectl get pods --selector=app=isvc.deepseek-r1-7b-kserve-predictor
The following output shows that HPA scaled out pods:
NAME                                               READY   STATUS    RESTARTS   AGE
deepseek-r1-7b-kserve-predictor-6785df7b7f-r7kjx   1/1     Running   0          8m5s
deepseek-r1-7b-kserve-predictor-6785df7b7f-6l2kj   1/1     Running   0          104s
deepseek-r1-7b-kserve-predictor-6785df7b7f-3q5dz   1/1     Running   0          104s
Run the following command to query the status of HPA:
kubectl get hpa
Expected output:
NAME                              REFERENCE                                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
deepseek-r1-7b-kserve-predictor   Deployment/deepseek-r1-7b-kserve-predictor   5/50      1         100       3          10m
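To understand when and why HPA adjusted the number of pods, you can also inspect the HPA events and current metric values:
kubectl describe hpa deepseek-r1-7b-kserve-predictor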