Container Service for Kubernetes:Enable auto scaling based on GPU metrics

Last Updated:Dec 14, 2023

Kubernetes supports auto scaling based on custom metrics. Combined with Managed Service for Prometheus, Kubernetes can scale pods based on GPU metrics. This topic describes how to deploy Managed Service for Prometheus to monitor applications, how to view the GPU metrics that it collects, and how to enable auto scaling of pods based on those metrics.

Prerequisites

An ACK cluster or an ACK dedicated cluster that contains GPU-accelerated nodes is created.

Introduction

GPU-accelerated computing is widely used in high-performance computing scenarios, such as the training of deep learning models and inference. To reduce resource costs, you can enable cluster auto scaling based on GPU metrics, such as GPU utilization and GPU memory usage.

By default, Kubernetes enables horizontal pod autoscaling based on CPU and memory metrics. If you have higher requirements, you can use the Prometheus adapter to support the GPU metrics that are collected by Prometheus and use the custom metrics API to define custom metrics. This allows you to enable horizontal pod autoscaling based on GPU utilization and GPU memory usage. The following figure shows how auto scaling based on GPU metrics works.

[Figure: how auto scaling based on GPU metrics works]

Step 1: Deploy Managed Service for Prometheus and ack-alibaba-cloud-metrics-adapter

  1. Enable Managed Service for Prometheus.

    Note

    You can select Enable Managed Service for Prometheus when you create a cluster. If you do so, you do not need to install Managed Service for Prometheus after the cluster is created.

  2. Install and configure ack-alibaba-cloud-metrics-adapter.

    a. Obtain the HTTP API endpoint

    1. Log on to the ARMS console.

    2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.

    3. In the upper-left corner of the Managed Service for Prometheus page, select the region where your ACK cluster is deployed. Then, click the name of a Prometheus instance whose Instance Type is Prometheus for Container Service. The details page of the Prometheus instance appears.

    4. In the left-side navigation pane of the instance details page, click Settings and copy the internal endpoint in the HTTP API Address section.

    b. Configure the Prometheus URL

    1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

    2. On the Marketplace page, click the App Catalog tab. Find and click ack-alibaba-cloud-metrics-adapter.

    3. On the ack-alibaba-cloud-metrics-adapter page, click Deploy.

    4. On the Basic Information wizard page, select a cluster and a namespace, and then click Next.

    5. On the Parameters wizard page, select a chart version from the Chart Version drop-down list, set the Prometheus URL in the Parameters section to the HTTP API endpoint that you obtained, and then click OK.

Step 2: Configure rules for ack-alibaba-cloud-metrics-adapter

a. Query GPU metrics

Query the GPU metrics that are collected by Managed Service for Prometheus. For more information, see Introduction to metrics.
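One way to run such a query from a terminal is the Prometheus-compatible HTTP API. The following sketch assumes that the HTTP API address copied in Step 1 serves the standard /api/v1/query path and is reachable from the machine on which you run the command. The helper name prom_query_url and the address in the commented usage line are placeholders for illustration and are not part of any Alibaba Cloud tool.

```shell
# prom_query_url BASE_URL: print the instant-query URL of a Prometheus-compatible
# HTTP API. BASE_URL stands for the HTTP API address copied in Step 1.
prom_query_url() {
  # Strip one trailing slash so that the path separator is not doubled.
  printf '%s/api/v1/query\n' "${1%/}"
}

# Example (needs network access to your instance, so it is left commented out):
# curl -sS -G "$(prom_query_url "http://<your-http-api-address>")" \
#      --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
```

If the endpoint requires authentication, the curl call needs the matching credentials, which this sketch does not cover.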

b. Configure rules for ack-alibaba-cloud-metrics-adapter

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Helm in the left-side navigation pane.

  3. On the Helm page, click Update in the Actions column of ack-alibaba-cloud-metrics-adapter. Add the following rules below custom.

    - metricsQuery: <<.Series>>{<<.LabelMatchers>>}
      resources:
        overrides:
          NodeName:
            resource: node
      seriesQuery: DCGM_FI_DEV_GPU_UTIL{} # This metric indicates the GPU utilization.
    - metricsQuery: <<.Series>>{<<.LabelMatchers>>}
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          NodeName:
            resource: node
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_SM_UTIL{} # This metric indicates the GPU utilization of pods. 
    - metricsQuery: <<.Series>>{<<.LabelMatchers>>}
      resources:
        overrides:
          NodeName:
            resource: node
      seriesQuery: DCGM_FI_DEV_FB_USED{} # This metric indicates the amount of GPU memory that is used. 
    - metricsQuery: <<.Series>>{<<.LabelMatchers>>}
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          NodeName:
            resource: node
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{} # This metric indicates the GPU memory usage of pods. 
    - metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})by(<<.GroupBy>>) / sum(DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED{})by(<<.GroupBy>>)
      name:
        as: ${1}_GPU_MEM_USED_RATIO
        matches: ^(.*)_MEM_USED
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{NamespaceName!="",PodName!=""}  # This metric indicates the GPU memory utilization.

    The following figure provides an example.

    [Figure: example of the rules added below custom]

    Run the following command. If the output includes DCGM_FI_DEV_GPU_UTIL, DCGM_CUSTOM_PROCESS_SM_UTIL, DCGM_FI_DEV_FB_USED, and DCGM_CUSTOM_PROCESS_MEM_USED, the rules are configured. In the following example, DCGM_CUSTOM_PROCESS_SM_UTIL is returned in the output.

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

    Expected output:

    {
      ...
      "resources": [
        ...
        {
          "name": "nodes/DCGM_CUSTOM_PROCESS_SM_UTIL",
          "singularName": "",
          "namespaced": false,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        },
        ...
        {
          "name": "pods/DCGM_CUSTOM_PROCESS_SM_UTIL",
          "singularName": "",
          "namespaced": true,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        },
        ...
        {
          "name": "namespaces/DCGM_CUSTOM_PROCESS_SM_UTIL",
          "singularName": "",
          "namespaced": false,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        }
        ...
        {
          "name": "DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO",
          "singularName": "",
          "namespaced": false,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        }
        ...
      ]
    }
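The check described above can be scripted. The following sketch first reproduces the rename that the last rule performs (matches: ^(.*)_MEM_USED, as: ${1}_GPU_MEM_USED_RATIO) by using sed, and then looks for each metric name in the discovery output. The function names are hypothetical helpers, and sed only mimics the regular-expression rename that the adapter applies.

```shell
# ratio_metric_name SERIES: mimic the rename in the last rule
# (matches: ^(.*)_MEM_USED, as: ${1}_GPU_MEM_USED_RATIO) with sed.
ratio_metric_name() {
  printf '%s\n' "$1" | sed -E 's/^(.*)_MEM_USED/\1_GPU_MEM_USED_RATIO/'
}

# check_gpu_metrics: read the discovery JSON on stdin and report, for each
# metric name used in this topic, whether it appears. Returns 1 if any is missing.
check_gpu_metrics() {
  discovery=$(cat)
  missing=0
  for metric in DCGM_FI_DEV_GPU_UTIL DCGM_CUSTOM_PROCESS_SM_UTIL \
                DCGM_FI_DEV_FB_USED DCGM_CUSTOM_PROCESS_MEM_USED \
                "$(ratio_metric_name DCGM_CUSTOM_PROCESS_MEM_USED)"; do
    # Names usually appear as "<resource>/<metric>", so match up to the closing quote.
    if printf '%s\n' "$discovery" | grep -q "[/\"]${metric}\""; then
      echo "found: ${metric}"
    else
      echo "missing: ${metric}"
      missing=1
    fi
  done
  return "$missing"
}
```

To use the sketch against a cluster, pipe the discovery output into the function: kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | check_gpu_metrics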

Step 3: Enable auto scaling based on GPU metrics

The following example shows how to deploy a model inference service on a GPU-accelerated node and perform stress tests on the node to check whether auto scaling can be performed based on GPU metrics.

1. Deploy an inference service.

  1. Run the following command to deploy the inference service:

    cat <<EOF | kubectl create -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: bert-intent-detection
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: bert-intent-detection
      template:
        metadata:
          labels:
            app: bert-intent-detection
        spec:
          containers:
          - name: bert-container
            image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
            ports:
            - containerPort: 80
            resources:
              limits:
                nvidia.com/gpu: 1
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: bert-intent-detection-svc
      labels:
        app: bert-intent-detection
    spec:
      selector:
        app: bert-intent-detection
      ports:
      - protocol: TCP
        name: http
        port: 80
        targetPort: 80
      type: LoadBalancer
    EOF

  2. Query the status of the pod and Service.

    • Run the following command to query the status of the pod:

      kubectl get pods -o wide

      Expected output:

      NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
      bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>

      The output indicates that only one pod is deployed on the GPU-accelerated node 192.168.94.107.

    • Run the following command to query the status of the Service:

      kubectl get svc bert-intent-detection-svc

      Expected output:

      NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
      bert-intent-detection-svc   LoadBalancer   172.16.186.159   47.95.XX.XX   80:30118/TCP   5m1s

      If the output displays the name of the Service, the Service is deployed.

  3. Log on to the node 192.168.94.107 by using SSH and run the following command to query GPU utilization:

    nvidia-smi

    Expected output:

    Wed Feb 16 11:48:07 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
    | N/A   32C    P0    55W / 300W |  15345MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A   2305118      C   python                          15343MiB |
    +-----------------------------------------------------------------------------+

    The output indicates that the inference service is running on the GPU-accelerated node. The GPU utilization is 0% because no requests have been sent to the service yet.

  4. Run the following command to send requests to the inference service and check whether the service is deployed:

    curl -v  "http://47.95.XX.XX/predict?query=Music"

    Expected output:

    *   Trying 47.95.XX.XX...
    * TCP_NODELAY set
    * Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
    > GET /predict?query=Music HTTP/1.1
    > Host: 47.95.XX.XX
    > User-Agent: curl/7.64.1
    > Accept: */*
    >
    * HTTP 1.0, assume close after body
    < HTTP/1.0 200 OK
    < Content-Type: text/html; charset=utf-8
    < Content-Length: 9
    < Server: Werkzeug/1.0.1 Python/3.6.9
    < Date: Wed, 16 Feb 2022 03:52:11 GMT
    <
    * Closing connection 0
    PlayMusic # The query result.

    If the HTTP status code 200 and the query result are returned, the inference service is deployed.

2. Configure the HPA

The following example describes how to trigger auto scaling when the GPU utilization of a pod exceeds 20%. The following table describes the metrics that are supported by the Horizontal Pod Autoscaler (HPA).

Metric                                  Unit  Description
DCGM_FI_DEV_GPU_UTIL                    %     The GPU utilization. This metric is available only for GPUs that are scheduled in exclusive mode.
DCGM_FI_DEV_FB_USED                     MiB   The amount of GPU memory that is used. This metric is available only for GPUs that are scheduled in exclusive mode.
DCGM_CUSTOM_PROCESS_SM_UTIL             %     The GPU utilization of pods.
DCGM_CUSTOM_PROCESS_MEM_USED            MiB   The amount of GPU memory that is used by pods.
DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO  %     The GPU memory utilization of pods. GPU memory utilization of a pod = Current GPU memory used by the pod (Used)/Current GPU memory allocated to the pod (Allocated)

Important

For DCGM_FI_DEV_GPU_UTIL: if a GPU is shared among multiple pods, running the nvidia-smi command in one of the pods returns only the utilization of the whole GPU. This is because NVIDIA does not provide information about the GPU utilization of individual pods.
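The ratio formula can be checked by hand. The following sketch divides the memory used by the inference process in the earlier nvidia-smi output (15343 MiB) by an allocation of 16160 MiB. The allocation figure is an assumption that the pod was given the whole GPU, and gpu_mem_used_ratio is a hypothetical helper. The result is printed as a plain fraction; multiply it by 100 to read it as a percentage.

```shell
# gpu_mem_used_ratio USED_MIB ALLOCATED_MIB: print used / allocated as a fraction.
gpu_mem_used_ratio() {
  awk -v used="$1" -v alloc="$2" 'BEGIN { printf "%.3f\n", used / alloc }'
}

# 15343 MiB used comes from the nvidia-smi output; 16160 MiB allocated is assumed.
gpu_mem_used_ratio 15343 16160   # prints 0.949
```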

  1. Run the following command to deploy the HPA:

    Clusters that run Kubernetes 1.23 or later

    cat <<EOF | kubectl create -f -
    apiVersion: autoscaling/v2  # Use the HPA configuration for API version autoscaling/v2. 
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: bert-intent-detection
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_CUSTOM_PROCESS_SM_UTIL
          target:
            type: AverageValue # Metrics of the Pods type support only the AverageValue target type.
            averageValue: 20 # If the GPU utilization exceeds 20%, pods are scaled out. 
    EOF

    Clusters that run Kubernetes versions earlier than 1.23

    cat <<EOF | kubectl create -f -
    apiVersion: autoscaling/v2beta1  # Use the HPA configuration for API version autoscaling/v2beta1. 
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: bert-intent-detection
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metricName: DCGM_CUSTOM_PROCESS_SM_UTIL # The GPU utilization of pods. 
          targetAverageValue: 20 # If the GPU utilization exceeds 20%, pods are scaled out. 
    EOF

  2. Run the following command to query the status of the HPA:

    kubectl get hpa

    Expected output:

    NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          74s

    In the output, TARGETS displays 0/20, which indicates that the current GPU utilization is 0. When the GPU utilization exceeds 20%, pods are scaled out.
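To see the per-pod values behind the TARGETS column, you can read the metric from the custom metrics API directly. The following sketch only builds the request path. It assumes that the Deployment runs in the default namespace, and pod_metric_path is a hypothetical helper.

```shell
# pod_metric_path NAMESPACE METRIC: print the custom metrics API path that
# returns METRIC for every pod in NAMESPACE.
pod_metric_path() {
  printf '/apis/custom.metrics.k8s.io/v1beta1/namespaces/%s/pods/*/%s\n' "$1" "$2"
}

# Example (needs cluster access, so it is left commented out):
# kubectl get --raw "$(pod_metric_path default DCGM_CUSTOM_PROCESS_SM_UTIL)"
```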

3. Test auto scaling on the inference service.

Test scale-out activities

  1. Run the following command to perform the stress test:

    hey -n 10000 -c 200 "http://47.95.XX.XX/predict?query=music"

    Note

    The expected number of pods after auto scaling is calculated by using the following formula: Expected number of pods = ceil[Current number of pods × (Current GPU utilization/Expected GPU utilization)]. For example, if the current number of pods is 1, the current GPU utilization is 23, and the expected GPU utilization is 20, the expected number of pods after auto scaling is ceil(1 × 23/20) = 2.

  2. During the stress test, run the following commands to query the status of the HPA and the pods:

    1. Run the following command to query the status of the HPA:

      kubectl get hpa

      Expected output:

      NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
      gpu-hpa   Deployment/bert-intent-detection   23/20     1         10        2          7m56s

      The output indicates that the value in the TARGETS column is 23/20. The current GPU utilization exceeds the threshold of 20%, so auto scaling is triggered and the ACK cluster starts to scale out pods.

    2. Run the following command to query the status of the pods:

      kubectl get pods

      Expected output:

      NAME                                    READY   STATUS    RESTARTS   AGE
      bert-intent-detection-7b486f6bf-f****   1/1     Running   0          44m
      bert-intent-detection-7b486f6bf-m****   1/1     Running   0          14s

      The output indicates that two pods are running. This value is the same as the expected number of pods calculated based on the preceding formula.

    The output returned by the HPA and the pods indicates that the pods are scaled out.
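The formula from the note in this test can be reproduced with a few lines of shell. The following sketch uses a hypothetical helper named expected_replicas and covers only the ceil calculation. It ignores the tolerance and stabilization behavior that the HPA controller also applies.

```shell
# expected_replicas CURRENT_PODS CURRENT_VALUE TARGET_VALUE
# Prints ceil(CURRENT_PODS * CURRENT_VALUE / TARGET_VALUE).
expected_replicas() {
  awk -v n="$1" -v cur="$2" -v tgt="$3" 'BEGIN {
    x = n * cur / tgt
    r = int(x)
    if (r < x) r++   # round up; valid for the non-negative values used here
    print r
  }'
}

# Values from the stress test: 1 pod, utilization 23, target 20.
expected_replicas 1 23 20   # prints 2
```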

Test scale-in activities

When the stress test stops and the GPU utilization drops below 20%, the ACK cluster starts to scale in pods.

  1. Run the following command to query the status of the HPA:

    kubectl get hpa

    Expected output:

    NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          15m

    The output indicates that the value in the TARGETS column is 0/20. The current GPU utilization drops to 0. The ACK cluster starts to scale in pods after about 5 minutes.

  2. Run the following command to query the status of the pods:

    kubectl get pods

    Expected output:

    NAME                                    READY   STATUS    RESTARTS   AGE
    bert-intent-detection-7b486f6bf-f****   1/1     Running   0          52m

    The output indicates that the number of pods is 1, which means that the pods are scaled in.

FAQ

How do I confirm whether a GPU is used?

You can check whether there are changes in the GPU utilization on the GPU Monitoring tab. If the GPU utilization increases, a GPU is used. If no changes are found in the GPU utilization, no GPU is used. To do this, perform the following steps:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the GPU Monitoring tab and view changes in the GPU utilization.