GPU inference workloads with cyclical traffic or slow pod initialization create a scaling problem: standard HPA reacts only after utilization thresholds are crossed, which can be too late to avoid resource exhaustion or latency spikes. The Advanced Horizontal Pod Autoscaler (AHPA) solves this by combining real-time GPU metrics from Prometheus, historical load trends, and prediction algorithms to scale pods proactively—before demand arrives.
Use this guide if your workload has one or more of these characteristics:
Cyclical or time-of-day traffic patterns (for example, high GPU utilization during business hours, idle overnight)
Pods that take significant time to initialize, causing latency spikes during reactive scale-out
Cost sensitivity that requires accurate scale-in when GPUs are idle
Prerequisites
Before you begin, ensure that you have:
An ACK managed cluster with GPU-accelerated nodes. See Create an ACK cluster with GPU-accelerated nodes
AHPA installed and data sources configured. See AHPA overview
Managed Service for Prometheus enabled with at least 7 days of application statistics collected, including GPU resource usage details
AHPA's prediction quality depends directly on historical data. Do not skip or abbreviate the 7-day data collection period—insufficient history produces inaccurate predictions and may cause AHPA to under-provision or over-provision replicas.
How it works
Managed Service for Prometheus collects real-time GPU utilization and memory usage from your workloads. Prometheus Adapter converts these metrics into the format that Kubernetes recognizes and exposes them to AHPA. AHPA combines current metrics with historical load trends and prediction algorithms to forecast future GPU demand, scaling pods before utilization spikes rather than after.
Unlike standard HPA—which scales only after a threshold is crossed—AHPA completes scale-out before resources run out and scales in automatically when GPUs are idle.
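For a concrete sense of the first stage of this pipeline, you can query the Prometheus HTTP API directly once metrics are flowing. A minimal sketch, assuming the standard Prometheus query path on the ARMS endpoint you record in Step 1, and assuming the NVIDIA DCGM exporter's DCGM_FI_DEV_GPU_UTIL utilization gauge (your cluster's GPU exporter may expose a different metric name):
# PROMETHEUS_HTTP_API: the HTTP API URL from the ARMS console (recorded in Step 1).
# DCGM_FI_DEV_GPU_UTIL is an assumed metric name; substitute whatever GPU
# utilization metric your exporter actually reports.
curl -sG "${PROMETHEUS_HTTP_API}/api/v1/query" \
  --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL)'
A non-empty result confirms that GPU utilization data is reaching Prometheus and is available for AHPA to consume.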
Step 1: Deploy the metrics adapter
Get the internal HTTP API endpoint of your cluster's Prometheus instance.
Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
At the top of the Instances page, select the region where your ACK cluster is located.
Click the name of the Managed Service for Prometheus instance. In the left-side navigation pane of the instance details page, click Settings and record the endpoint shown in the HTTP API URL section.
Deploy ack-alibaba-cloud-metrics-adapter:
Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
On the Marketplace page, click the App Catalog tab, find ack-alibaba-cloud-metrics-adapter, and click it.
In the upper-right corner of the ack-alibaba-cloud-metrics-adapter page, click Deploy.
On the Basic Information page, specify Cluster and Namespace, then click Next.
On the Parameters page, set Chart Version and enter the internal HTTP API endpoint you recorded in the previous step as the value of the prometheus.url parameter. Click OK.
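Before moving to Step 2, it is worth confirming the adapter is healthy. A quick check, assuming you deployed the chart into the kube-system namespace (adjust -n if you chose another):
# The adapter pod should be Running.
kubectl get pods -n kube-system | grep metrics-adapter
# Depending on the chart version, the adapter registers custom and/or external metrics APIServices.
kubectl get apiservices | grep metrics.k8s.io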
Step 2: Configure AHPA for GPU-based predictive scaling
This example deploys a BERT intent detection inference service, then configures AHPA to scale pods when GPU utilization exceeds 20%.
Deploy the inference service
Deploy the inference service:
cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bert-intent-detection
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bert-intent-detection
  template:
    metadata:
      labels:
        app: bert-intent-detection
    spec:
      containers:
      - name: bert-container
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: "1"
            memory: 2G
            nvidia.com/gpu: "1"
          requests:
            cpu: 200m
            memory: 500M
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: bert-intent-detection-svc
  labels:
    app: bert-intent-detection
spec:
  selector:
    app: bert-intent-detection
  ports:
  - protocol: TCP
    name: http
    port: 80
    targetPort: 80
  type: LoadBalancer
EOF
Verify the pod is running:
kubectl get pods -o wide
Expected output:
NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>
Test the inference service. Get the external IP address first:
kubectl get svc bert-intent-detection-svc
Replace 47.95.XX.XX with the IP address returned, then send a test request:
curl -v "http://47.95.XX.XX/predict?query=Music"
Expected output:
* Trying 47.95.XX.XX...
* TCP_NODELAY set
* Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
> GET /predict?query=Music HTTP/1.1
> Host: 47.95.XX.XX
> User-Agent: curl/7.64.1
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 9
< Server: Werkzeug/1.0.1 Python/3.6.9
< Date: Wed, 16 Feb 2022 03:52:11 GMT
<
* Closing connection 0
PlayMusic
HTTP 200 and a query result confirm the inference service is running.
Configure data sources for AHPA
Create a file named application-intelligence.yaml with the following content. Set prometheusUrl to the internal endpoint you recorded in Step 1.
apiVersion: v1
kind: ConfigMap
metadata:
  name: application-intelligence
  namespace: kube-system
data:
  prometheusUrl: "http://cn-shanghai-intranet.arms.aliyuncs.com:9090/api/v1/prometheus/da9d7dece901db4c9fc7f5b*******/1581204543170*****/c54417d182c6d430fb062ec364e****/cn-shanghai"
Apply the ConfigMap:
kubectl apply -f application-intelligence.yaml
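A quick check that the ConfigMap landed in the namespace AHPA reads it from:
kubectl get configmap application-intelligence -n kube-system -o yaml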
Deploy AHPA
Create a file named fib-gpu.yaml with the following content. This example uses observer mode, which lets AHPA collect prediction data without acting on it. Use observer mode to validate predictions before enabling automatic scaling. For the full parameter reference, see Parameters.
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: AdvancedHorizontalPodAutoscaler
metadata:
  name: fib-gpu
  namespace: default
spec:
  metrics:
  - resource:
      name: gpu
      target:
        averageUtilization: 20  # Scale when average GPU utilization across pods exceeds 20%
        type: Utilization
    type: Resource
  minReplicas: 0
  maxReplicas: 100
  prediction:
    quantile: 95        # Use the 95th percentile of historical values (conservative estimate)
    scaleUpForward: 180 # Start scaling 180 seconds before predicted demand peaks
  scaleStrategy: observer # observer: collect predictions without scaling; change to auto to enable scaling
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  instanceBounds: # Time-based replica bounds that override minReplicas/maxReplicas during each window
  - startTime: "2021-12-16 00:00:00"
    endTime: "2022-12-16 00:00:00"
    bounds:
    - cron: "* 0-8 ? * MON-FRI"   # Off-peak hours: allow scale-in to 4 replicas
      maxReplicas: 50
      minReplicas: 4
    - cron: "* 9-15 ? * MON-FRI"  # Business hours: keep at least 10 replicas
      maxReplicas: 50
      minReplicas: 10
    - cron: "* 16-23 ? * MON-FRI" # Evening peak: keep at least 12 replicas
      maxReplicas: 50
      minReplicas: 12
Key parameters:
averageUtilization: GPU utilization threshold (%) that triggers scaling across all pods targeted by AHPA.
quantile: Percentile of historical data used to generate predictions. Higher values are more conservative and reduce the risk of under-provisioning.
scaleUpForward: How many seconds before predicted demand AHPA starts scaling out.
scaleStrategy: observer means predict and report only; auto means predict and act.
instanceBounds: Cron-based replica floor and ceiling overrides for specific time windows.
Deploy AHPA:
kubectl apply -f fib-gpu.yaml
Verify AHPA status:
kubectl get ahpa
Expected output:
NAME      STRATEGY   REFERENCE               METRIC   TARGET(%)   CURRENT(%)   DESIREDPODS   REPLICAS   MINPODS   MAXPODS   AGE
fib-gpu   observer   bert-intent-detection   gpu      20          0            0             1          10        50        6d19h
CURRENT(%) is 0 and TARGET(%) is 20, indicating GPU utilization is currently 0% and scaling triggers when it exceeds 20%.
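If CURRENT(%) never populates, the data source is usually the problem. kubectl describe works on AHPA as on any custom resource and surfaces the controller's events (the exact messages depend on the AHPA version):
kubectl describe ahpa fib-gpu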
Step 3: Test autoscaling and evaluate predictions
Generate load
Deploy fib-loader to send continuous requests to the inference service:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: fib-loader
namespace: default
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: fib-loader
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
app: fib-loader
spec:
containers:
- args:
- -c
- |
/ko-app/fib-loader --service-url="http://bert-intent-detection-svc.${NAMESPACE}/predict?query=Music" --save-path=/tmp/fib-loader-chart.html
command:
- sh
env:
- name: NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
image: registry.cn-huhehaote.aliyuncs.com/kubeway/knative-sample-fib-loader:20201126-110434
imagePullPolicy: IfNotPresent
name: loader
ports:
- containerPort: 8090
name: chart
protocol: TCP
resources:
limits:
cpu: "8"
memory: 16000Mi
requests:
cpu: "2"
memory: 4000Mi
EOF
Verify scaling behavior
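Before reading AHPA's numbers, confirm the load generator itself is running, using the app=fib-loader label from the manifest above:
kubectl get pods -l app=fib-loader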
Check AHPA status under load:
kubectl get ahpa
Expected output:
NAME      STRATEGY   REFERENCE               METRIC   TARGET(%)   CURRENT(%)   DESIREDPODS   REPLICAS   MINPODS   MAXPODS   AGE
fib-gpu   observer   bert-intent-detection   gpu      20          189          10            4          10        50        6d19h
CURRENT(%) is 189, which exceeds TARGET(%) of 20. AHPA recommends scaling to 10 pods (DESIREDPODS). Because scaleStrategy is observer, no actual scaling occurs yet.
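DESIREDPODS is recomputed as predictions refresh. To watch the recommendation evolve while fib-loader runs, add the watch flag:
kubectl get ahpa fib-gpu -w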
Review prediction results
Retrieve the prediction chart to compare predicted versus actual GPU utilization:
kubectl get --raw '/apis/metrics.alibabacloud.com/v1beta1/namespaces/default/predictionsobserver/fib-gpu' | jq -r '.content' | base64 -d > observer.html
Open observer.html to see the prediction results based on the last 7 days of GPU data.

The chart contains two views:
Predict GPU Resource Observer: The blue line shows actual GPU utilization; the green line shows AHPA's prediction. A well-calibrated prediction runs slightly above actual utilization to ensure enough capacity is reserved.
Predict POD Observer: The blue line shows actual replica counts from scaling events; the green line shows AHPA's predicted replica counts. When predicted pod counts are lower than actual counts, switching to auto mode lets AHPA pre-scale more efficiently and reduce over-provisioning.
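If you generated observer.html on a remote host without a browser, one minimal way to view it is to serve the file over HTTP and open it from your workstation (assumes Python 3 on the host and that port 8000 is reachable):
# Serve the current directory, then browse to http://<host-ip>:8000/observer.html
python3 -m http.server 8000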
Switch to automatic scaling
Review the prediction chart before switching to auto mode. The prediction is ready when both conditions are true:
The predicted GPU utilization (green line) consistently tracks above—but close to—actual utilization (blue line), without large gaps that would indicate over-provisioning
The predicted pod count (green line) approximates the actual scaling events (blue line) over several business cycles
When the predictions are stable, update scaleStrategy in fib-gpu.yaml:
scaleStrategy: auto
Apply the change:
kubectl apply -f fib-gpu.yaml
With auto mode active, AHPA scales pods based on predictions rather than waiting for GPU utilization thresholds to be crossed.
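Editing fib-gpu.yaml keeps the manifest as the source of truth. If you prefer to flip the strategy on the live resource instead, a merge patch also works, and a follow-up kubectl get should show STRATEGY as auto:
# Alternative to editing the file: patch the running AHPA object directly.
kubectl patch ahpa fib-gpu --type merge -p '{"spec":{"scaleStrategy":"auto"}}'
# Confirm the switch, then watch replicas follow predictions.
kubectl get ahpa fib-gpu
kubectl get pods -l app=bert-intent-detection -w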
What's next
Serverless GPU inference with Knative: AHPA works in serverless Kubernetes (ASK) clusters. For workloads with periodic traffic patterns, AHPA predicts resource changes and prefetches capacity to reduce cold start impact. See Use AHPA to enable predictive scaling in Knative.
Custom metrics-based scaling: To scale based on HTTP QPS, message queue length, or other application metrics, use the External Metrics mechanism with alibaba-cloud-metrics-adapter. See Use AHPA to configure custom metrics for application scaling.