Configure auto scaling for pods based on GPU metrics from Prometheus - Container Service for Kubernetes

How it works

Managed Service for Prometheus collects NVIDIA Data Center GPU Manager (DCGM) metrics from GPU nodes. The Prometheus Adapter translates those raw metrics into the Kubernetes Custom Metrics API format. HPA reads the custom metrics and adjusts pod replicas when utilization crosses your defined threshold.

Prerequisites

Before you begin, ensure that you have:

A GPU node added to your cluster, or a dedicated GPU cluster created. See Add a GPU node or Create a dedicated GPU cluster

Step 1: Deploy Managed Service for Prometheus and the metrics adapter

Enable Prometheus monitoring

Enable Prometheus monitoring for your cluster.

Skip this step if you selected to install Prometheus when you created the cluster.

Install ack-alibaba-cloud-metrics-adapter

Get the Grafana read address

Log on to the ACK consoleACK consoleACK consoleARMS console.
In the left navigation pane, choose Managed Service for Prometheus > Instances.
At the top of the page, select the region where your Container Service for Kubernetes (ACK) cluster is located, then click the name of the target instance.
On the Settings page, click the Settings tab. In the HTTP API URL (Grafana Read Address) section, copy the internal network address.

Configure the Prometheus URL

Log on to the ACK console. In the left navigation pane, click Marketplace > Marketplace.
On the App Marketplace page, click the App Catalog tab. Search for and click ack-alibaba-cloud-metrics-adapter.
On the ack-alibaba-cloud-metrics-adapter page, click Deploy.
In the Basic Information wizard, select a cluster and namespace, then click Next.
In the Parameters wizard, select a Chart Version. In the Parameters section, set the Prometheus url parameter to the HTTP API address you copied, then click OK.

Step 2: Configure adapter rules for GPU metrics

The Prometheus Adapter maps raw DCGM metric series to Kubernetes custom metric names using three key fields:

seriesQuery: selects which Prometheus metric series to expose
resources.overrides: maps Prometheus label names to Kubernetes resource dimensions (pod, namespace, node)
metricsQuery: the PromQL expression the adapter runs when HPA requests the metric value

Available GPU metrics

Metric	Description	Unit
`DCGM_FI_DEV_GPU_UTIL`	GPU card utilization	%
`DCGM_FI_DEV_FB_USED`	GPU card memory usage	MiB
`DCGM_CUSTOM_PROCESS_SM_UTIL`	GPU utilization of the container	%
`DCGM_CUSTOM_PROCESS_MEM_USED`	GPU memory usage of the container	MiB
`DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO`	GPU memory utilization of the container	%

Important

DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED are card-level metrics valid only for dedicated GPU scheduling. For shared GPU scheduling — where multiple pods share one GPU card — NVIDIA reports only card-level utilization. The value shown by nvidia-smi inside a pod is the utilization of the entire card, not the individual container.

DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO is calculated as:

GPU memory utilization = Actual GPU memory usage of the current pod container (Used) / GPU memory allocated to the current pod container (Allocated)

For the full list of available metrics, see Monitoring metric descriptions.

Add adapter rules

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Helm.

In the Actions column for the ack-alibaba-cloud-metrics-adapter Helm release, click Update. Add the following rules to the custom field.

- metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  resources:
    overrides:
      NodeName:
        resource: node
  seriesQuery: DCGM_FI_DEV_GPU_UTIL{} # GPU utilization
- metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  resources:
    overrides:
      NamespaceName:
        resource: namespace
      NodeName:
        resource: node
      PodName:
        resource: pod
  seriesQuery: DCGM_CUSTOM_PROCESS_SM_UTIL{} # Container GPU utilization.
- metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  resources:
    overrides:
      NodeName:
        resource: node
  seriesQuery: DCGM_FI_DEV_FB_USED{} # GPU memory usage.
- metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  resources:
    overrides:
      NamespaceName:
        resource: namespace
      NodeName:
        resource: node
      PodName:
        resource: pod
  seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{} # Container GPU memory usage.
- metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>) / sum(DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED{}) by (<<.GroupBy>>)
  name:
    as: ${1}_GPU_MEM_USED_RATIO
    matches: ^(.*)_MEM_USED
  resources:
    overrides:
      NamespaceName:
        resource: namespace
      PodName:
        resource: pod
  seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{NamespaceName!="",PodName!=""}  # Container GPU memory utilization.

After saving, the configuration looks like the following:

Verify that custom metrics are registered

Run the following command to confirm HPA can see the GPU metrics:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

The output lists all registered custom metrics. A successful configuration includes entries for DCGM_FI_DEV_GPU_UTIL, DCGM_CUSTOM_PROCESS_SM_UTIL, DCGM_FI_DEV_FB_USED, and DCGM_CUSTOM_PROCESS_MEM_USED. The following shows an example for DCGM_CUSTOM_PROCESS_SM_UTIL:

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "nodes/DCGM_CUSTOM_PROCESS_SM_UTIL",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "pods/DCGM_CUSTOM_PROCESS_SM_UTIL",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/DCGM_CUSTOM_PROCESS_SM_UTIL",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

Step 3: Test GPU-based auto scaling

This example deploys a BERT intent detection inference service on a GPU node, then runs a stress test to trigger scale out and scale in based on DCGM_CUSTOM_PROCESS_SM_UTIL (container GPU utilization).

Deploy the inference service

Create the Deployment and Service:

cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bert-intent-detection
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bert-intent-detection
  template:
    metadata:
      labels:
        app: bert-intent-detection
    spec:
      containers:
      - name: bert-container
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: bert-intent-detection-svc
  labels:
    app: bert-intent-detection
spec:
  selector:
    app: bert-intent-detection
  ports:
  - protocol: TCP
    name: http
    port: 80
    targetPort: 80
  type: LoadBalancer
EOF

Confirm the pod is running and scheduled to a GPU node:

kubectl get pods -o wide

Expected output:

NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>

Confirm the LoadBalancer service is assigned an external IP:

kubectl get svc bert-intent-detection-svc

Expected output:

NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
bert-intent-detection-svc   LoadBalancer   172.16.186.159   47.95.XX.XX   80:30118/TCP   5m1s

SSH into GPU node 192.168.94.107 and run the following command to confirm the inference process is loaded:

nvidia-smi

Expected output:

Wed Feb 16 11:48:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    55W / 300W |  15345MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2305118      C   python                          15343MiB |
+-----------------------------------------------------------------------------+

GPU utilization is 0% because no inference requests have been sent yet. The process entry confirms the model is loaded.

Send a test request to confirm the service responds:

curl -v  "http://47.95.XX.XX/predict?query=Music"

Expected output:

*   Trying 47.95.XX.XX...
* TCP_NODELAY set
* Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
> GET /predict?query=Music HTTP/1.1
> Host: 47.95.XX.XX
> User-Agent: curl/7.64.1
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 9
< Server: Werkzeug/1.0.1 Python/3.6.9
< Date: Wed, 16 Feb 2022 03:52:11 GMT
<
* Closing connection 0
PlayMusic # Intent recognition result.

An HTTP 200 response with an intent recognition result confirms the service is working.

Configure HPA

HPA calculates the desired replica count using the following formula:

Desired Replicas = ceil[Current Replicas × (Current Metric / Desired Metric)]

For example, with 1 current replica, a current utilization of 23, and a target of 20: ceil(1 x 23/20) = 2.

This example triggers scale out when container GPU utilization (DCGM_CUSTOM_PROCESS_SM_UTIL) exceeds 20%. Create the HPA resource. The API version depends on your Kubernetes version.

v1.23 or later

cat <<EOF | kubectl create -f -
apiVersion: autoscaling/v2  # Use the autoscaling/v2 version for HPA configuration.
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_CUSTOM_PROCESS_SM_UTIL
      target:
        type: Utilization
        averageValue: 20 # A scale-out is triggered when the container's GPU utilization exceeds 20%.
EOF

Earlier than v1.23

cat <<EOF | kubectl create -f -
apiVersion: autoscaling/v2beta1  # Use the autoscaling/v2beta1 version for HPA configuration.
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: DCGM_CUSTOM_PROCESS_SM_UTIL # GPU utilization of the pod.
      targetAverageValue: 20 # A scale-out is triggered when the container's GPU utilization exceeds 20%.
EOF

Confirm the HPA is active and reading the metric:

kubectl get hpa

Expected output:

NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          74s

TARGETS shows 0/20, meaning current GPU utilization is 0% against a 20% threshold.

Test scale out

Run a stress test against the inference service:

hey -n 10000 -c 200 "http://47.95.XX.XX/predict?query=music"

While the test runs, watch the HPA status:

kubectl get hpa --watch

Expected output when GPU utilization exceeds the threshold:

NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   23/20     1         10        2          7m56s

TARGETS rising to 23/20 triggers HPA to scale the Deployment to 2 replicas (following the formula: ceil(1 x 23/20) = 2). Confirm the second pod started:

kubectl get pods

Expected output:

NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          44m
bert-intent-detection-7b486f6bf-m****   1/1     Running   0          14s

Test scale in

When the stress test stops, GPU utilization drops back to 0. HPA applies a default 5-minute scale-in cooldown before terminating the extra pod. Use --watch to observe the transition in real time rather than polling manually:

kubectl get hpa --watch

Expected output after the cooldown:

NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          15m

Confirm only one pod remains:

kubectl get pods

Expected output:

NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          52m

FAQ

How can I tell if a GPU card is being used?

Check the GPU Monitoring tab in the ACK console. A GPU card under load shows increasing utilization; an idle card stays flat.

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, click the GPU Monitoring tab. Observe the GPU card utilization graph.