Enable auto scaling based on GPU metrics - Container Service for Kubernetes

Kubernetes supports auto scaling based on custom metrics, and it can work with Managed Service for Prometheus to scale workloads based on GPU metrics. This topic describes how to deploy Managed Service for Prometheus, view the GPU metrics that it collects, and enable auto scaling of pods based on those metrics.

Prerequisites

An ACK cluster with GPU-accelerated nodes or an ACK dedicated cluster with GPU-accelerated nodes is created.

Introduction

GPU-accelerated computing is widely used in high-performance computing scenarios, such as the training of deep learning models and inference. To reduce resource costs, you can enable cluster auto scaling based on GPU metrics, such as GPU utilization and GPU memory usage.

By default, Kubernetes supports horizontal pod autoscaling based on CPU and memory metrics. To scale based on GPU metrics, you can use the Prometheus adapter to expose the GPU metrics that are collected by Prometheus through the custom metrics API. This allows you to enable horizontal pod autoscaling based on GPU utilization and GPU memory usage. The following figure shows how auto scaling based on GPU metrics works.

Step 1: Deploy Managed Service for Prometheus and ack-alibaba-cloud-metrics-adapter

Enable Managed Service for Prometheus.
Note
You can select Enable Managed Service for Prometheus when you create a cluster. In that case, you do not need to install Managed Service for Prometheus after the cluster is created.
Install and configure ack-alibaba-cloud-metrics-adapter.
a. Obtain the HTTP API endpoint
1. Log on to the ARMS console.
2. In the left-side navigation pane, choose Prometheus Service > Prometheus Instances.
3. In the upper-left corner of the Managed Service for Prometheus page, select the region where your ACK cluster is deployed. Then, click the name of a Prometheus instance whose Instance Type is Prometheus for Container Service. The details page of the Prometheus instance appears.
4. In the left-side navigation pane of the instance details page, click Settings and copy the internal endpoint in the HTTP API Address section.
b. Configure the Prometheus URL
1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
2. On the Marketplace page, click the App Catalog tab. Find and click ack-alibaba-cloud-metrics-adapter.
3. On the ack-alibaba-cloud-metrics-adapter page, click Deploy.
4. On the Basic Information wizard page, select a cluster and a namespace, and then click Next.
5. On the Parameters wizard page, select a chart version from the Chart Version drop-down list, set the Prometheus URL in the Parameters section to the HTTP API endpoint that you obtained, and then click OK.

Step 2: Configure rules for ack-alibaba-cloud-metrics-adapter

a. Query GPU metrics

Query GPU metrics. For more information, see Introduction to metrics.
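As a rough sketch, a GPU metric can also be queried directly through the Prometheus HTTP API before the adapter rules are configured. The example assumes that the HTTP API endpoint copied in Step 1 is stored in a variable named `PROM_URL` (a placeholder, not a value from this topic), that the endpoint is reachable from where you run the command, and that it accepts the standard Prometheus `/api/v1/query` path without extra authentication. `prom_query_endpoint` is a hypothetical helper introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the standard Prometheus instant-query path
# to a base endpoint, dropping one trailing slash if present.
prom_query_endpoint() {
  local base=${1%/}
  printf '%s/api/v1/query\n' "$base"
}

# Offline demonstration with a placeholder endpoint:
prom_query_endpoint "http://prometheus.example.internal:9090/"

# Against a real endpoint, the query itself could look like this:
# curl -sG "$(prom_query_endpoint "$PROM_URL")" \
#   --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL'
```

If the query returns an empty result, the GPU exporter is probably not reporting yet, and the adapter rules in the next substep would have nothing to expose.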

b. Configure rules for ack-alibaba-cloud-metrics-adapter

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Helm in the left-side navigation pane.

On the Helm page, click Update in the Actions column of ack-alibaba-cloud-metrics-adapter. Add the following rules under the `custom` field.

- metricsQuery: <<.Series>>{<<.LabelMatchers>>}
  resources:
    overrides:
      NodeName:
        resource: node
  seriesQuery: DCGM_FI_DEV_GPU_UTIL{} # This metric indicates the GPU utilization.
- metricsQuery: <<.Series>>{<<.LabelMatchers>>}
  resources:
    overrides:
      NamespaceName:
        resource: namespace
      NodeName:
        resource: node
      PodName:
        resource: pod
  seriesQuery: DCGM_CUSTOM_PROCESS_SM_UTIL{} # This metric indicates the GPU utilization of pods. 
- metricsQuery: <<.Series>>{<<.LabelMatchers>>}
  resources:
    overrides:
      NodeName:
        resource: node
  seriesQuery: DCGM_FI_DEV_FB_USED{} # This metric indicates the amount of GPU memory that is used. 
- metricsQuery: <<.Series>>{<<.LabelMatchers>>}
  resources:
    overrides:
      NamespaceName:
        resource: namespace
      NodeName:
        resource: node
      PodName:
        resource: pod
  seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{} # This metric indicates the GPU memory usage of pods. 
- metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>})by(<<.GroupBy>>) / sum(DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED{})by(<<.GroupBy>>)
  name:
    as: ${1}_GPU_MEM_USED_RATIO
    matches: ^(.*)_MEM_USED
  resources:
    overrides:
      NamespaceName:
        resource: namespace
      PodName:
        resource: pod
  seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{NamespaceName!="",PodName!=""}  # This metric indicates the GPU memory utilization.
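The `name` block in the last rule renames the series before it is exposed through the custom metrics API. The following sketch approximates that rewrite with `sed` to show which metric name results. `sed` uses a different regular expression engine than the adapter, so treat this as an illustration of the pattern rather than the adapter's exact behavior. `rename_metric` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Approximates the rule's name rewrite with sed:
#   matches: ^(.*)_MEM_USED
#   as:      ${1}_GPU_MEM_USED_RATIO
rename_metric() {
  printf '%s\n' "$1" | sed -E 's/^(.*)_MEM_USED/\1_GPU_MEM_USED_RATIO/'
}

rename_metric DCGM_CUSTOM_PROCESS_MEM_USED
# prints DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO
```

The resulting name is the one that an HPA would reference to scale on GPU memory utilization.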

The following figure provides an example.

Run the following command. If the output includes DCGM_FI_DEV_GPU_UTIL, DCGM_CUSTOM_PROCESS_SM_UTIL, DCGM_FI_DEV_FB_USED, and DCGM_CUSTOM_PROCESS_MEM_USED, the rules are configured. In the following example, DCGM_CUSTOM_PROCESS_SM_UTIL is returned in the output.

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
{
  [
    ...
    {
      "name": "nodes/DCGM_CUSTOM_PROCESS_SM_UTIL",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    ...
    {
      "name": "pods/DCGM_CUSTOM_PROCESS_SM_UTIL",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    ...
    {
      "name": "namespaces/DCGM_CUSTOM_PROCESS_SM_UTIL",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    ...
    {
      "name": "DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
    ...
  ]
}
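The full response can be long, so it may help to filter it down to the DCGM metric names. The sketch below is one way to do that with `grep` and `sed` only. `list_dcgm_metrics` is a hypothetical helper, and the sample input in the demonstration is made up to keep the example runnable without a cluster.

```shell
#!/usr/bin/env bash
# Reads the custom metrics API response on stdin and prints the unique
# DCGM metric names, one per line.
list_dcgm_metrics() {
  grep -oE '"name": *"[^"]*DCGM_[A-Za-z0-9_]+"' \
    | sed -E 's/.*"([^"]+)"$/\1/' \
    | sort -u
}

# Offline demonstration with a made-up two-entry sample:
printf '%s\n' '{"name": "pods/DCGM_CUSTOM_PROCESS_SM_UTIL"}, {"name":"nodes/DCGM_FI_DEV_GPU_UTIL"}' \
  | list_dcgm_metrics

# Against a live cluster:
# kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | list_dcgm_metrics
```

If the list is empty after the rules are saved, the adapter may still be reloading its configuration or the series may not exist in Prometheus yet.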

Step 3: Enable auto scaling based on GPU metrics

The following example shows how to deploy a model inference service on a GPU-accelerated node and perform stress tests on the node to check whether auto scaling can be performed based on GPU metrics.

1. Deploy an inference service.

Run the following command to deploy the inference service:

cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bert-intent-detection
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bert-intent-detection
  template:
    metadata:
      labels:
        app: bert-intent-detection
    spec:
      containers:
      - name: bert-container
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: bert-intent-detection-svc
  labels:
    app: bert-intent-detection
spec:
  selector:
    app: bert-intent-detection
  ports:
  - protocol: TCP
    name: http
    port: 80
    targetPort: 80
  type: LoadBalancer
EOF

Query the status of the pod and Service.

Run the following command to query the status of the pod:

kubectl get pods -o wide

Expected output:

NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>

The output indicates that a single pod is running on the GPU-accelerated node cn-beijing.192.168.94.107.

Run the following command to query the status of the Service:

kubectl get svc bert-intent-detection-svc

Expected output:

NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
bert-intent-detection-svc   LoadBalancer   172.16.186.159   47.95.XX.XX   80:30118/TCP   5m1s

If the output lists the Service and an IP address is displayed in the EXTERNAL-IP column, the Service is deployed.

Log on to the node 192.168.94.107 by using SSH and run the following command to query GPU utilization:

nvidia-smi

Expected output:

Wed Feb 16 11:48:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    55W / 300W |  15345MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2305118      C   python                          15343MiB |
+-----------------------------------------------------------------------------+

The output indicates that the inference service is running on the GPU-accelerated node. The GPU utilization is 0 because no requests have been sent to the service.
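For scripted checks, the same figures can be read in CSV form instead of the table layout. The sketch below assumes the `--query-gpu` and `--format` options of `nvidia-smi` and a hypothetical helper named `summarize_gpu`. The demonstration line reuses the values from the output above so that it runs without a GPU.

```shell
#!/usr/bin/env bash
# Reads "index, utilization, memory used, memory total" CSV rows on stdin
# and prints one summary line per GPU.
summarize_gpu() {
  awk -F', *' '{ printf "gpu=%s util=%s%% mem=%s/%sMiB\n", $1, $2, $3, $4 }'
}

# Offline demonstration using the values from the output above:
echo '0, 0, 15345, 16160' | summarize_gpu

# On the GPU-accelerated node:
# nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#   --format=csv,noheader,nounits | summarize_gpu
```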

Run the following command to send a request to the inference service and check whether the service responds:

curl -v  "http://47.95.XX.XX/predict?query=Music"

Expected output:

*   Trying 47.95.XX.XX...
* TCP_NODELAY set
* Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
> GET /predict?query=Music HTTP/1.1
> Host: 47.95.XX.XX
> User-Agent: curl/7.64.1
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 9
< Server: Werkzeug/1.0.1 Python/3.6.9
< Date: Wed, 16 Feb 2022 03:52:11 GMT
<
* Closing connection 0
PlayMusic # The query result.

If the HTTP status code 200 and the query result are returned, the inference service is deployed.
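Instead of copying the external IP address by hand, the request URL can be assembled from the Service status. This is a minimal sketch: `predict_url` is a hypothetical helper, and the IP address in the demonstration is a documentation placeholder rather than a real endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the prediction URL for a given IP address.
# The query term defaults to "Music", as in the request above.
predict_url() {
  local ip=$1 query=${2:-Music}
  printf 'http://%s/predict?query=%s\n' "$ip" "$query"
}

# Offline demonstration with a placeholder address:
predict_url 203.0.113.10

# Against a live cluster:
# ip=$(kubectl get svc bert-intent-detection-svc \
#   -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# curl -v "$(predict_url "$ip")"
```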

2. Configure the HPA.

The following example describes how to trigger auto scaling when the GPU utilization of a pod exceeds 20%. The following table describes the metrics that are supported by the Horizontal Pod Autoscaler (HPA).

| Metric | Description | Unit |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | The GPU utilization. This metric is available only for GPUs that are scheduled in exclusive mode. Important: If a GPU is shared among multiple pods, running the `nvidia-smi` command in one of the pods returns only the utilization of the whole GPU, because NVIDIA does not provide information about the GPU utilization of individual pods. | % |
| DCGM_FI_DEV_FB_USED | The amount of GPU memory that is used. This metric is available only for GPUs that are scheduled in exclusive mode. | MiB |
| DCGM_CUSTOM_PROCESS_SM_UTIL | The GPU utilization of pods. | % |
| DCGM_CUSTOM_PROCESS_MEM_USED | The amount of GPU memory that is used by pods. | MiB |
| DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO | The GPU memory utilization of pods. `GPU memory utilization of a pod = Current GPU memory used by the pod (Used) / Current GPU memory allocated to the pod (Allocated)` | % |

Run the following command to deploy the HPA:

Clusters that run Kubernetes 1.23 or later

cat <<EOF | kubectl create -f -
apiVersion: autoscaling/v2  # Use the HPA configuration for API version autoscaling/v2. 
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_CUSTOM_PROCESS_SM_UTIL
      target:
        type: AverageValue # Pods metrics support only the AverageValue target type.
        averageValue: 20 # If the GPU utilization exceeds 20%, pods are scaled out. 
EOF

Clusters that run Kubernetes versions earlier than 1.23

cat <<EOF | kubectl create -f -
apiVersion: autoscaling/v2beta1  # Use the HPA configuration for API version autoscaling/v2beta1. 
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: DCGM_CUSTOM_PROCESS_SM_UTIL # The GPU utilization of pods. 
      targetAverageValue: 20 # If the GPU utilization exceeds 20%, pods are scaled out. 
EOF
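The same HPA layout can target any other pod-level metric from the preceding table. The following sketch generates an `autoscaling/v2` manifest for a chosen metric and threshold. `render_gpu_hpa` is a hypothetical helper, and the threshold of 8000 (MiB) is an arbitrary illustration, not a recommended value. The target type is `AverageValue`, which is the type that Kubernetes documents for Pods metrics.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints an autoscaling/v2 HPA manifest that scales
# the bert-intent-detection Deployment on a pod-level GPU metric.
render_gpu_hpa() {
  local metric=$1 threshold=$2
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: ${metric}
      target:
        type: AverageValue
        averageValue: ${threshold}
EOF
}

# Print a manifest that scales on the GPU memory used by each pod:
render_gpu_hpa DCGM_CUSTOM_PROCESS_MEM_USED 8000
```

To try it, pipe the output into `kubectl apply -f -`. Because the manifest reuses the name gpu-hpa, doing so would replace the HPA created in this step.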

Run the following command to query the status of the HPA:
```
kubectl get hpa
```
Expected output:
```
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          74s
```
The output shows 0/20 in the TARGETS column, which indicates that the current GPU utilization is 0. When the GPU utilization exceeds 20%, pods are scaled out.

3. Test auto scaling on the inference service.

Test scale-out activities

Run the following command to perform the stress test:
```
hey -n 10000 -c 200 "http://47.95.XX.XX/predict?query=music"
```
Note
The following formula is used to calculate the expected number of pods after auto scaling: `Expected number of pods = ceil[Current number of pods × (Current GPU utilization / Expected GPU utilization)]`. For example, if the current number of pods is 1, the current GPU utilization is 23, and the expected GPU utilization is 20, the expected number of pods after auto scaling is ceil(1 × 23/20) = ceil(1.15) = 2.
During the stress test, run the following commands to query the status of the HPA and the pods:
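The formula in the note can be checked with integer arithmetic. The sketch below is a hypothetical helper, `expected_pods`, that applies the ceiling and then clamps the result to the minReplicas and maxReplicas values used in this topic (1 and 10). It ignores other HPA behavior such as tolerance and stabilization, so it is an estimate, not a reimplementation of the controller.

```shell
#!/usr/bin/env bash
# expected_pods CURRENT_PODS CURRENT_UTIL TARGET_UTIL [MIN] [MAX]
# Computes ceil(CURRENT_PODS * CURRENT_UTIL / TARGET_UTIL) with integer
# arithmetic, then clamps the result to [MIN, MAX] (defaults: 1 and 10).
expected_pods() {
  local pods=$1 current=$2 target=$3 min=${4:-1} max=${5:-10}
  local want=$(( (pods * current + target - 1) / target ))
  if (( want < min )); then want=$min; fi
  if (( want > max )); then want=$max; fi
  echo "$want"
}

expected_pods 1 23 20   # prints 2, the example from the note
expected_pods 2 0 20    # prints 1, clamped to the minimum
```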
1. Run the following command to query the status of the HPA:
```
kubectl get hpa
```
  Expected output:
```
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   23/20     1         10        2          7m56s
```
  The output indicates that the value in the TARGETS column is 23/20. The current GPU utilization exceeds the threshold 20%. In this case, auto scaling is triggered and the ACK cluster starts to scale out pods.
2. Run the following command to query the status of the pods:
```
kubectl get pods
```
  Expected output:
```
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          44m
bert-intent-detection-7b486f6bf-m****   1/1     Running   0          14s
```
  The output indicates that two pods are running. This value is the same as the expected number of pods calculated based on the preceding formula.
The output returned by the HPA and the pods indicates that the pods are scaled out.

Test scale-in activities

When the stress test stops and the GPU utilization drops below 20%, the ACK cluster starts to scale in pods.

Run the following command to query the status of the HPA:
```
kubectl get hpa
```
Expected output:
```
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          15m
```
The output indicates that the value in the TARGETS column is 0/20. The current GPU utilization drops to 0. The ACK cluster starts to scale in pods after about 5 minutes.

Run the following command to query the status of the pods:

```
kubectl get pods
```

Expected output:

```
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          52m
```

The output indicates that the number of pods is 1. This means that the pods are scaled in.
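The delay of about 5 minutes before scale-in matches the default HPA scale-down stabilization window in Kubernetes, which is 300 seconds. On an `autoscaling/v2` HPA, the window can be changed through the `behavior` field, which is not available in `autoscaling/v2beta1`. The sketch below only builds the JSON merge patch. `scale_down_patch` is a hypothetical helper, and 60 seconds is an arbitrary example value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds a JSON merge patch that sets the HPA
# scale-down stabilization window, in seconds (default: 300).
scale_down_patch() {
  local seconds=${1:-300}
  printf '{"spec":{"behavior":{"scaleDown":{"stabilizationWindowSeconds":%d}}}}\n' "$seconds"
}

# Print the patch for a 60-second window:
scale_down_patch 60

# Against a live cluster with an autoscaling/v2 HPA:
# kubectl patch hpa gpu-hpa --type merge -p "$(scale_down_patch 60)"
```

A shorter window makes scale-in react faster, at the cost of more replica churn when load fluctuates.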

FAQ

How do I confirm whether a GPU is used?

You can check whether the GPU utilization changes on the GPU Monitoring tab. If the GPU utilization increases, a GPU is in use. If the GPU utilization does not change, no GPU is in use. To do this, perform the following steps:

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, click the GPU Monitoring tab and view changes in the GPU utilization.