
Elastic Container Instance: Deploy a DeepSeek-R1 series model and scale the pods of the model

Last Updated: Mar 21, 2025

This topic describes how to use a DataCache to deploy a DeepSeek-R1 series model in Elastic Container Instance. This topic also describes how to configure Horizontal Pod Autoscaler (HPA) to scale the pods of the model based on custom metrics. In this example, DeepSeek-R1-Distill-Qwen-7B is used.

Why is Elastic Container Instance used to deploy DeepSeek?

  • Elastic Container Instance requires no O&M, can be flexibly deployed, and helps you build elastic and cost-effective businesses. For more information, see Benefits.

  • Elastic Container Instance uses DataCaches and ImageCaches to reduce the time required for image pulls and model downloads, reduce network resource consumption, and improve system efficiency.

    Note

    The deployment of a containerized large model inference service involves the following stages: creating and starting a container, pulling the image, downloading the model file, and loading and starting the model. Pulling the image and the model of a large model inference service takes a long time and consumes a large amount of network traffic because both the image and the model are large. For example, the vLLM image is about 16.5 GB in size, and the DeepSeek-R1-Distill-Qwen-7B model is about 14 GB in size. Elastic Container Instance uses DataCaches and ImageCaches to reduce the time required for image pulls and model downloads.

Prerequisites

  • A DataCache custom resource definition (CRD) is deployed in the cluster. For more information, see Deploy a DataCache CRD.

  • The virtual private cloud (VPC) in which the cluster resides is associated with an Internet NAT gateway. An SNAT entry is configured for the Internet NAT gateway to allow resources in the VPC or resources connected to vSwitches in the VPC to access the Internet.

    Note

    If the VPC is not associated with an Internet NAT gateway, you must create and associate an elastic IP address (EIP) when you create the DataCache and when you deploy the application, so that data can be pulled over the Internet.

  • The ARMS Prometheus component (ack-arms-prometheus) is installed in the cluster. For more information, see Use Managed Service for Prometheus.

  • The ack-alibaba-cloud-metrics-adapter component is deployed in the cluster.

    To deploy ack-alibaba-cloud-metrics-adapter, log on to the Container Service for Kubernetes (ACK) console. In the left-side navigation pane, choose Marketplace > Marketplace. On the Marketplace page, find and deploy ack-alibaba-cloud-metrics-adapter. You can verify the deployment by using the commands shown after this list.

  • KServe is deployed in the cluster.

    KServe is a Kubernetes-based machine learning model serving framework. KServe allows you to deploy one or more trained models to a model serving runtime by using Kubernetes CRDs. For more information, see Install ack-kserve.
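
Before you continue, you can optionally verify that the prerequisite components are deployed. The following commands are a minimal sketch; the namespace of the KServe controller and the name of the Helm release may differ depending on how the components were installed in your cluster.

    # Check that the metrics adapter Helm release is deployed. The release name may vary.
    helm list -A | grep metrics-adapter

    # Check that the KServe controller is running and that the InferenceService CRD exists.
    kubectl get pods -n kserve
    kubectl get crd inferenceservices.serving.kserve.io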

Prepare a runtime environment

Different DeepSeek models have different requirements for runtime environments. In this topic, the DeepSeek-R1-Distill-Qwen-7B model is used.

  • Recommended specifications

    The GPU-accelerated Elastic Compute Service (ECS) instance family that is used to create the Elastic Container Instance-based pod must meet the following conditions. For information about the GPU-accelerated ECS instance families that can be used to create pods, see Supported instance families.

    • CPU: no strict limits

    • Memory size: greater than 16 GiB

    • Number of GPUs: 1 or more

    • Size of the GPU memory: 20 GB or more. For example, the A10 GPU can meet the requirements. If the GPU memory size is less than 20 GB, an out-of-memory (OOM) error may occur.

  • Software requirements

    The deployment of a large model depends on a large number of libraries and configurations. vLLM is a mainstream large model inference engine and is used to deploy the inference service in this topic. Elastic Container Instance provides a public container image that you can use directly or as a base for secondary development. The image address is registry.cn-hangzhou.aliyuncs.com/eci_open/vllm-openai:v0.7.2, and the image size is about 16.5 GB. You can optionally inspect the image in advance as shown after this list.
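
If you want to inspect the vLLM image before you deploy the inference service, you can pull it on a machine that has Docker installed. This is an optional check and a sketch only; it assumes a local Docker environment with enough disk space for the approximately 16.5 GB image.

    # Pull the public vLLM image and view its size.
    docker pull registry.cn-hangzhou.aliyuncs.com/eci_open/vllm-openai:v0.7.2
    docker images registry.cn-hangzhou.aliyuncs.com/eci_open/vllm-openai:v0.7.2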

Step 1: Create a DataCache

When you deploy DeepSeek for the first time, create a DataCache in advance to cache the model data so that the model does not need to be downloaded again during deployment. This accelerates the deployment of DeepSeek.

  1. Visit Hugging Face and obtain the ID of the model.

    In this topic, the main version of DeepSeek-R1-Distill-Qwen-7B is used. Find the model on Hugging Face and copy the model ID from the upper part of the model details page.

  2. Write a YAML configuration file for the DataCache. Then, use the YAML file to create the DataCache, which pulls the DeepSeek-R1-Distill-Qwen-7B model data and stores it in the DataCache.

    kubectl create -f datacache-test.yaml

    The following sample code shows the content of the DataCache configuration file, which is named datacache-test.yaml:

    apiVersion: eci.aliyun.com/v1alpha1
    kind: DataCache
    metadata:
      name: deepseek-r1-distill-qwen-7b
    spec:
      bucket: test
      path: /model/deepseek-r1-distill-qwen-7b
      dataSource:
        type: URL
        options:
          repoSource: HuggingFace/Model                          # Specify the model whose data source is Hugging Face.
          repoId: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B        # Specify the ID of the model.
          revision: main                                         # Specify the version of the model.
      netConfig:
        securityGroupId: sg-bp1***********
        vSwitchId: vsw-bp1uo************                 # Specify a vSwitch for which SNAT entries are configured.
        # If no SNAT entries are configured for the vSwitch to enable Internet access for the model, an elastic IP address (EIP) must be created and associated with the vSwitch.
        eipCreateParam:                                  
          bandwidth: 5                                   # EIP bandwidth
  3. Query the status of the DataCache.

    kubectl get edc

    After the model data is downloaded, the status of the DataCache changes to Available and the DataCache is ready for use. Alibaba Cloud provides the hot load capability for DeepSeek-R1 series models, which allows a DataCache to be created within seconds. If the status does not become Available, see the troubleshooting command that follows the output.

    NAME                          AGE   DATACACHEID                STATUS      PROGRESS   BUCKET    PATH
    deepseek-r1-distill-qwen-7b   40s   edc-uf6btsb4q5j4b9ue****   Available   100%       test      /model/deepseek-r1-distill-qwen-7b
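
    If the status of the DataCache does not become Available, you can describe the DataCache resource to view its events and identify the cause, for example, a network failure that prevents the model download. This is an optional troubleshooting step.

    # View the details and events of the DataCache. edc is the short name of the DataCache resource.
    kubectl describe edc deepseek-r1-distill-qwen-7b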

Step 2: Configure rules for ack-alibaba-cloud-metrics-adapter

Custom metric-based auto scaling is implemented based on the ack-alibaba-cloud-metrics-adapter component and the Kubernetes HPA mechanism provided by ACK. The following list describes the GPU metrics that HPA supports. For more information, see Enable auto scaling based on GPU metrics.

  • DCGM_FI_DEV_GPU_UTIL (unit: %): The utilization of the GPU card. This metric is available only for GPUs that are scheduled in exclusive mode.

  • DCGM_FI_DEV_FB_USED (unit: MiB): The amount of used GPU memory on the GPU card. This metric is available only for GPUs that are scheduled in exclusive mode.

  • DCGM_CUSTOM_PROCESS_SM_UTIL (unit: %): The GPU utilization of pods.

  • DCGM_CUSTOM_PROCESS_MEM_USED (unit: MiB): The amount of GPU memory that is used by pods.
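
The preceding metrics are collected by Managed Service for Prometheus from GPU-accelerated pods. Before you configure the adapter rules, you can optionally check whether a metric is being reported by querying the Prometheus HTTP API. The following command is a sketch: replace <prometheus-endpoint> with the HTTP API endpoint of your Prometheus instance, note that authentication may be required depending on your setup, and note that the metric is reported only after a GPU-accelerated pod is running.

    # Query the pod-level GPU utilization metric from the Prometheus HTTP API. jq is used only to format the output.
    curl -s "http://<prometheus-endpoint>/api/v1/query?query=DCGM_CUSTOM_PROCESS_SM_UTIL" | jq '.data.result'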

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster.

  2. In the left-side navigation pane of the cluster management page, choose Applications > Helm.

  3. On the Helm page, click Update in the Actions column of ack-alibaba-cloud-metrics-adapter.

    Add the following rules in the custom field. After the component is updated, you can verify that the metrics are exposed by using the command shown after this step.

    - metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NodeName:
            resource: node
      seriesQuery: DCGM_FI_DEV_GPU_UTIL{} # This metric indicates the GPU utilization.
    - metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          NodeName:
            resource: node
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_SM_UTIL{} # This metric indicates the GPU utilization of pods. 
    - metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NodeName:
            resource: node
      seriesQuery: DCGM_FI_DEV_FB_USED{} # This metric indicates the amount of GPU memory that is used. 
    - metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          NodeName:
            resource: node
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{} # This metric indicates the GPU memory usage of pods. 
    - metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>) / sum(DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED{}) by (<<.GroupBy>>)
      name:
        as: ${1}_GPU_MEM_USED_RATIO
        matches: ^(.*)_MEM_USED
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{NamespaceName!="",PodName!=""}  # This metric indicates the GPU memory utilization.
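
    After the component is updated, you can verify that the custom GPU metrics are exposed through the Kubernetes custom metrics API. The following command is a sketch; the metric appears in the API only after a GPU-accelerated pod that reports it is running in the cluster.

    # List the metrics that the adapter exposes and filter for the pod-level GPU utilization metric.
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | grep DCGM_CUSTOM_PROCESS_SM_UTIL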

Step 3: Deploy the scalable DeepSeek model inference service

  1. Write a YAML configuration file for the DeepSeek application and then deploy the application based on the YAML file.

    kubectl create -f deepseek-r1-7b-kserve.yaml

    The following sample code provides the content of the deepseek-r1-7b-kserve.yaml file. Description of the file:

    • The pod is based on a GPU-accelerated ECS instance type and is mounted with the DeepSeek-R1-Distill-Qwen-7B model.

    • The predictor of the InferenceService uses an image that contains vLLM. After the container is started, it runs vllm serve /deepseek-r1-7b --port 8080 --tensor-parallel-size 1 --max-model-len 24384 --enforce-eager to start an OpenAI-compatible server.

    • Scaling is triggered based on the DCGM_CUSTOM_PROCESS_SM_UTIL metric, which indicates the GPU utilization of pods. When the average GPU utilization exceeds 50%, HPA automatically scales out pods. The total number of pods cannot exceed the value of maxReplicas.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: deepseek-r1-7b-kserve
      labels:
        alibabacloud.com/eci: "true"
      annotations:
        serving.kserve.io/autoscalerClass: external
        k8s.aliyun.com/eci-use-specs: ecs.gn7i-c16g1.4xlarge,ecs.gn7i-c32g1.8xlarge  # Specify GPU-accelerated ECS instance types. You can specify multiple ECS instance types to increase the creation success rate of the pod.
        k8s.aliyun.com/eci-extra-ephemeral-storage: "20Gi"   # Specify an additional temporary storage space because the startup of the pod depends on a large framework. You are charged for the additional temporary storage space.
        k8s.aliyun.com/eci-data-cache-bucket: "test"      # Specify a bucket to store the DataCache.
        # If you require a higher loading speed, you can use an AutoPL disk.
        k8s.aliyun.com/eci-data-cache-provisionedIops: "15000"   # Specify the IOPS that is provisioned for the enhanced SSD (ESSD) AutoPL disk.
        k8s.aliyun.com/eci-data-cache-burstingEnabled: "true"    # Enable the performance burst feature for the ESSD AutoPL disk to accelerate the startup of the application.
    spec:
      predictor:
        containers:
          - name: vllm
            command:
              - /bin/sh
            args:
              - -c
              - vllm serve /deepseek-r1-7b --port 8080 --tensor-parallel-size 1 --max-model-len 24384 --enforce-eager
            image: registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/vllm-openai:v0.7.2
            resources:
              limits:
                cpu: "16"
                memory: "60Gi"
                nvidia.com/gpu: "1"
              requests:
                cpu: "16"
                memory: "60Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
              - mountPath: /deepseek-r1-7b # Specify the path of the model.
                name: llm-model
        volumes:
          - name: llm-model
            hostPath:
              path: /model/deepseek-r1-distill-qwen-7b  # Specify the mount path of the DataCache.
    ---
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: deepseek-r1-7b-kserve-predictor
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: deepseek-r1-7b-kserve-predictor
      metrics:
        - pods:
            metric:
              name: DCGM_CUSTOM_PROCESS_SM_UTIL       # Specify the name of the metric to be monitored.
            target:
              type: Utilization                      # Define the type of target value, such as utilization or raw value.
              averageValue: '50'                     # Set the target average value of the metric.
          type: Pods
      minReplicas: 1                                # Specify the minimum number of pods that are in the Running state.
      maxReplicas: 100                              # Specify the maximum allowed number of pods. 

    Expected output:

    inferenceservice.serving.kserve.io/deepseek-r1-7b-kserve created
    horizontalpodautoscaler.autoscaling/deepseek-r1-7b-kserve-predictor created
  2. Check whether the application is deployed.

    kubectl get pods --selector=app=isvc.deepseek-r1-7b-kserve-predictor

    Expected output:

    NAME                                               READY   STATUS    RESTARTS   AGE
    deepseek-r1-7b-kserve-predictor-6785df7b7f-r7kjx   1/1     Running   0          116s
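
    You can also check the status of the InferenceService itself. The following command is a sketch; in the output, the READY column should display True, and the URL depends on the domain name that is configured for the ingress of the cluster.

    kubectl get inferenceservice deepseek-r1-7b-kserve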

Step 4: Test the inference effect of the model

  1. Obtain the IP address of the Server Load Balancer (SLB) instance that is used by the NGINX Ingress and the hostname of the DeepSeek inference service, and set them as variables.

    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    SERVICE_HOSTNAME=$(kubectl get inferenceservice deepseek-r1-7b-kserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
  2. Test the DeepSeek model inference service.

    Send a request to the DeepSeek model inference service. Example:

     curl -X POST http://$NGINX_INGRESS_IP:80/v1/chat/completions \
         -H "Host: $SERVICE_HOSTNAME" \
         -H "Content-Type: application/json" \
         -d '{
               "model": "/deepseek-r1-7b",
               "messages": [
                   {
                       "role": "user",
                       "content": "Briefly describe containers in one sentence"
                   }
               ],
               "temperature": 0.6,
               "max_tokens": 3000
             }' \
         --verbose

    Expected output:

    {"id":"chatcmpl-56e6ff393d999571ce6ead1b72f9302d","object":"chat.completion","created":1739340308,"model":"/deepseek-r1-7b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\n Ok, I need to briefly describe containers. What is a container? I remember that in programming, especially in Docker, a container seems to be an isolated environment that allows applications to run independently. Containerization makes development and deployment easier, right? Therefore, a container should be a lightweight runtime that can isolate applications from dependencies, making the development and deployment process more efficient. Right? Therefore, a one-sentence introduction to containers should cover the features of isolation, lightweight, and independent operation of containers. Application scenarios of containerization technology, such as cloud native services and the microservices architecture, may also be mentioned. Well, now I need to organize these ideas into one sentence. \n</think>\n\n A container is an isolated runtime environment that allows applications to run independently, provides lightweight and efficient management of resources, and supports cloud native services and the microservices architecture.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":168,"completion_tokens":160,"prompt_tokens_details":null},"prompt_logprobs":null}

    The content between <think> and </think> represents the thinking process, or the inference steps, that the model performs before it generates the final answer. This content is not part of the final answer, but a record of the model's self-prompting and logical inference.

    Extracted final answer:

    A container is an isolated runtime environment that allows applications to run independently, provides lightweight and efficient management of resources, and supports cloud native services and the microservices architecture.

  3. Test whether HPA scales out pods when the metric value exceeds the scaling threshold.

    1. Run the following command to query the number of pods in the cluster:

      kubectl get pods --selector=app=isvc.deepseek-r1-7b-kserve-predictor

      The following output shows that the cluster contains one pod.

      NAME                                               READY   STATUS    RESTARTS   AGE
      deepseek-r1-7b-kserve-predictor-6785df7b7f-r7kjx   1/1     Running   0          6m54s
    2. Run the following command to query the status of HPA:

      kubectl get hpa

      Expected output:

      NAME                              REFERENCE                                    TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
      deepseek-r1-7b-kserve-predictor   Deployment/deepseek-r1-7b-kserve-predictor   9/50       1         100       1          8m
    3. Use hey to perform a stress test.

      For more information about hey, see hey.

      hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "/deepseek-r1-7b", "messages": [{"role": "user", "content": "hello world!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions
    4. Query the number of pods in the cluster to check whether HPA scaled out pods.

      kubectl get pods --selector=app=isvc.deepseek-r1-7b-kserve-predictor

      The following output shows that HPA scaled out pods:

      NAME                                               READY   STATUS    RESTARTS   AGE
      deepseek-r1-7b-kserve-predictor-6785df7b7f-r7kjx   1/1     Running   0          8m5s
      deepseek-r1-7b-kserve-predictor-6785df7b7f-6l2kj   1/1     Running   0          104s
      deepseek-r1-7b-kserve-predictor-6785df7b7f-3q5dz   1/1     Running   0          104s
    5. Run the following command to query the status of HPA:

      kubectl get hpa

      Expected output:

      NAME                              REFERENCE                                    TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
      deepseek-r1-7b-kserve-predictor   Deployment/deepseek-r1-7b-kserve-predictor   5/50       1         100       3          10m
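
After the stress test ends and the average GPU utilization of the pods drops below the threshold, HPA gradually scales in the pods back to the minimum number of replicas, by default after a stabilization window of about 5 minutes. The following commands are an optional check that you can use to watch this behavior and view the scaling events.

    # Watch the HPA status until the number of replicas decreases. Press Ctrl+C to exit.
    kubectl get hpa deepseek-r1-7b-kserve-predictor -w

    # View the scaling events that are recorded for the HPA.
    kubectl describe hpa deepseek-r1-7b-kserve-predictor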