Container Service for Kubernetes: Use AHPA to perform predictive scaling based on GPU metrics

Last Updated: Apr 12, 2024

The Advanced Horizontal Pod Autoscaler (AHPA) component can predict GPU resource requests based on the GPU utilization data obtained from Prometheus Adapter, historical load trends, and prediction algorithms. AHPA can automatically adjust the number of pod replicas or the allocation of GPU resources to ensure that scale-out operations are completed before GPU resources are exhausted. It also performs scale-in operations when idle resources exist to reduce costs and improve cluster resource utilization.

Prerequisites

  • An ACK managed cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes.

  • AHPA is installed and data sources are configured to collect metrics. For more information, see AHPA overview.

  • Managed Service for Prometheus is enabled, and application statistics within at least seven days are collected by Managed Service for Prometheus. The statistics include the details of the GPU resources that are used by an application. For more information, see Managed Service for Prometheus.

How it works

In high-performance computing, particularly in scenarios that strongly rely on GPU resources, such as model training and model inference in deep learning, fine-grained management and dynamic adjustment of GPU resource allocation can greatly improve resource utilization and reduce costs. Container Service for Kubernetes (ACK) supports auto scaling based on GPU metrics. You can use Managed Service for Prometheus to collect key metrics such as real-time GPU utilization and memory usage, and then use Prometheus Adapter to convert these metrics into metrics that Kubernetes can recognize and integrate them with AHPA. AHPA predicts GPU resource requests based on the GPU utilization data obtained from Prometheus Adapter, historical load trends, and prediction algorithms. It automatically adjusts the number of pod replicas or the allocation of GPU resources to ensure that scale-out operations are completed before GPU resources are exhausted, and it performs scale-in operations when idle resources exist to reduce costs and improve cluster efficiency.
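
As background for the conversion step, the following is a minimal sketch of a custom-metric rule in the prometheus-adapter rule format, which maps a GPU utilization series collected by Prometheus to a metric that Kubernetes autoscalers can query. The series name DCGM_FI_DEV_GPU_UTIL (reported by the NVIDIA DCGM exporter) and the exposed name gpu_utilization are illustrative assumptions; the rules that ack-alibaba-cloud-metrics-adapter ships with may differ.

  # Illustrative prometheus-adapter rule (assumed series and metric names).
  rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      # Map the namespace and pod labels of the series to Kubernetes objects.
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "gpu_utilization"   # Name exposed through the custom metrics API.
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'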

Step 1: Deploy Metrics Adapter

  1. Obtain the internal HTTP API endpoint of the Prometheus instance that is used by your cluster.

    1. Log on to the ARMS console. In the left-side navigation pane, choose Prometheus Monitoring > Prometheus Instances.

    2. In the upper part of the Prometheus Monitoring page, select the region in which your cluster is deployed and click the Prometheus instance that is used by your cluster. You are redirected to the instance details page.

    3. In the left-side navigation pane of the instance details page, click Settings and record the internal endpoint in the HTTP API Address section.

  2. Deploy ack-alibaba-cloud-metrics-adapter.

    1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

    2. On the Marketplace page, click the App Catalog tab. Find and click ack-alibaba-cloud-metrics-adapter.

    3. In the upper-right corner of the ack-alibaba-cloud-metrics-adapter page, click Deploy.

    4. On the Basic Information wizard page, specify Cluster and Namespace, and click Next.

    5. On the Parameters wizard page, set the Chart Version parameter and specify the internal HTTP API endpoint that you obtained in Step 1 as the value of the prometheus.url parameter in the Parameters section. Then, click OK.

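      (Optional) After the deployment is complete, you can check whether the adapter has registered the Kubernetes metrics APIs. This verification step is not part of the original procedure; the metrics that are listed depend on the rules configured in the adapter, and a request fails if the adapter does not register the corresponding API group.

      # List the metrics served by the adapter through the custom and external metrics APIs.
      kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
      kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .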

Step 2: Use AHPA to perform predictive scaling based on GPU metrics

In this example, an inference service is deployed. Then, requests are sent to the inference service to check whether AHPA can perform predictive scaling based on GPU metrics.

  1. Deploy an inference service.

    1. Run the following command to deploy the inference service:

      cat <<EOF | kubectl create -f -
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: bert-intent-detection
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: bert-intent-detection
        template:
          metadata:
            labels:
              app: bert-intent-detection
          spec:
            containers:
            - name: bert-container
              image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
              ports:
              - containerPort: 80
              resources:
                limits:
                  cpu: "1"
                  memory: 2G
                  nvidia.com/gpu: "1"
                requests:
                  cpu: 200m
                  memory: 500M
                  nvidia.com/gpu: "1"
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: bert-intent-detection-svc
        labels:
          app: bert-intent-detection
      spec:
        selector:
          app: bert-intent-detection
        ports:
        - protocol: TCP
          name: http
          port: 80
          targetPort: 80
        type: LoadBalancer
      EOF
    2. Run the following command to query the status of the pods:

      kubectl get pods -o wide

      Expected output:

      NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
      bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>
    3. Run the following command to send requests to the inference service and check whether the service is deployed.

      You can run the kubectl get svc bert-intent-detection-svc command to query the external IP address of the bert-intent-detection-svc Service. Then, replace 47.95.XX.XX in the following command with the IP address that you obtained:

      curl -v  "http://47.95.XX.XX/predict?query=Music"

      Expected output:

      *   Trying 47.95.XX.XX...
      * TCP_NODELAY set
      * Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
      > GET /predict?query=Music HTTP/1.1
      > Host: 47.95.XX.XX
      > User-Agent: curl/7.64.1
      > Accept: */*
      >
      * HTTP 1.0, assume close after body
      < HTTP/1.0 200 OK
      < Content-Type: text/html; charset=utf-8
      < Content-Length: 9
      < Server: Werkzeug/1.0.1 Python/3.6.9
      < Date: Wed, 16 Feb 2022 03:52:11 GMT
      <
      * Closing connection 0
      PlayMusic # The query result.

      If the HTTP status code 200 and the query result are returned, the inference service is deployed.
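
      As an alternative to reading the EXTERNAL-IP column of the kubectl get svc output, the following one-liner is a sketch that extracts the external IP address directly from the Service object. It assumes that the load balancer reports an IP address instead of a hostname.

      kubectl get svc bert-intent-detection-svc -o jsonpath='{.status.loadBalancer.ingress[0].ip}'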

  2. Configure AHPA.

    In this example, AHPA is configured to scale pods when the GPU utilization of the pod exceeds 20%.

    1. Configure data sources to collect metrics for AHPA.

      1. Create a file named application-intelligence.yaml and copy the following content to the file.

        Set the prometheusUrl parameter to the internal endpoint of the Prometheus instance that you obtained in Step 1.

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: application-intelligence
          namespace: kube-system
        data:
          prometheusUrl: "http://cn-shanghai-intranet.arms.aliyuncs.com:9090/api/v1/prometheus/da9d7dece901db4c9fc7f5b*******/1581204543170*****/c54417d182c6d430fb062ec364e****/cn-shanghai"
      2. Run the following command to deploy application-intelligence:

        kubectl apply -f application-intelligence.yaml
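
        (Optional) You can confirm that the ConfigMap was created in the kube-system namespace. This check is not part of the original procedure.

        kubectl -n kube-system get configmap application-intelligence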
    2. Deploy AHPA.

      1. Create a file named fib-gpu.yaml and copy the following content to the file.

        In this example, the observer mode is used. For more information about the parameters related to AHPA, see Parameters.

        apiVersion: autoscaling.alibabacloud.com/v1beta1
        kind: AdvancedHorizontalPodAutoscaler
        metadata:
          name: fib-gpu
          namespace: default
        spec:
          metrics:
          - resource:
              name: gpu
              target:
                averageUtilization: 20
                type: Utilization
            type: Resource
          minReplicas: 0
          maxReplicas: 100
          prediction:
            quantile: 95
            scaleUpForward: 180
          scaleStrategy: observer
          scaleTargetRef:
            apiVersion: apps/v1
            kind: Deployment
            name: bert-intent-detection
          instanceBounds:
          - startTime: "2021-12-16 00:00:00"
            endTime: "2022-12-16 00:00:00"
            bounds:
            - cron: "* 0-8 ? * MON-FRI"
              maxReplicas: 50
              minReplicas: 4
            - cron: "* 9-15 ? * MON-FRI"
              maxReplicas: 50
              minReplicas: 10
            - cron: "* 16-23 ? * MON-FRI"
              maxReplicas: 50
              minReplicas: 12
      2. Run the following command to deploy AHPA:

        kubectl apply -f fib-gpu.yaml
      3. Run the following command to query the status of AHPA:

        kubectl get ahpa

        Expected output:

        NAME             STRATEGY   REFERENCE               METRIC   TARGET(%)   CURRENT(%)   DESIREDPODS   REPLICAS   MINPODS   MAXPODS   AGE
        fib-gpu          observer   bert-intent-detection   gpu      20          0            0             1          10        50        6d19h

        In the output, 0 is returned in the CURRENT(%) column and 20 is returned in the TARGET(%) column. This indicates that the current GPU utilization is 0% and pod scaling will be triggered if the GPU utilization exceeds 20%.
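
        (Optional) To view more details about the AHPA object, such as its events and the full prediction configuration, you can also describe it. This check is not part of the original procedure.

        kubectl describe ahpa fib-gpu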

  3. Test auto scaling on the inference service.

    1. Deploy the following fib-loader application, which sends a continuous stream of requests to the inference service (the command used to apply this manifest is shown after it):

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: fib-loader
        namespace: default
      spec:
        progressDeadlineSeconds: 600
        replicas: 1
        revisionHistoryLimit: 10
        selector:
          matchLabels:
            app: fib-loader
        strategy:
          rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
          type: RollingUpdate
        template:
          metadata:
            creationTimestamp: null
            labels:
              app: fib-loader
          spec:
            containers:
            - args:
              - -c
              - |
                /ko-app/fib-loader --service-url="http://bert-intent-detection-svc.${NAMESPACE}/predict?query=Music" --save-path=/tmp/fib-loader-chart.html
              command:
              - sh
              env:
              - name: NAMESPACE
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
              image: registry.cn-huhehaote.aliyuncs.com/kubeway/knative-sample-fib-loader:20201126-110434
              imagePullPolicy: IfNotPresent
              name: loader
              ports:
              - containerPort: 8090
                name: chart
                protocol: TCP
              resources:
                limits:
                  cpu: "8"
                  memory: 16000Mi
                requests:
                  cpu: "2"
                  memory: 4000Mi
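
      If you save the manifest above to a file, for example fib-loader.yaml (a file name chosen for this example), you can deploy the load generator with the following command:

      kubectl apply -f fib-loader.yaml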
    2. Run the following command to query the status of AHPA:

      kubectl get ahpa

      Expected output:

      NAME             STRATEGY   REFERENCE               METRIC   TARGET(%)   CURRENT(%)   DESIREDPODS   REPLICAS   MINPODS   MAXPODS   AGE
      fib-gpu          observer   bert-intent-detection   gpu      20          189          10            4          10        50        6d19h

      The output shows that the current GPU utilization (CURRENT(%)) is higher than the scaling threshold (TARGET(%)). Therefore, pod scaling is triggered and the expected number of pods is 10, which is the value returned in the DESIREDPODS column.

    3. Run the following command to query the prediction results:

      kubectl get --raw '/apis/metrics.alibabacloud.com/v1beta1/namespaces/default/predictionsobserver/fib-gpu'|jq -r '.content' |base64 -d > observer.html

      The following figures show the prediction results of GPU utilization based on historical data within the last seven days.

      • Predict GPU Resource Observer: The actual GPU utilization is represented by a blue line, and the GPU utilization predicted by AHPA is represented by a green line. The predicted GPU utilization is higher than the actual GPU utilization.

      • Predict POD Observer: The actual number of pods that are added or removed in scaling events is represented by a blue line, and the number of pods that AHPA predicts need to be added or removed is represented by a green line. The predicted number of pods is smaller than the actual number of pods. If you set the scaling mode to auto and configure scaling based on the predicted number of pods, AHPA can save pod resources.

      The results show that AHPA can use predictive scaling to handle fluctuating workloads as expected. After you confirm the prediction results, you can set the scaling mode to auto, which allows AHPA to automatically scale pods.
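
      As a sketch of that change, you can switch the existing AHPA object from the observer mode to the auto mode by patching the scaleStrategy field, or by editing fib-gpu.yaml and reapplying it:

      kubectl patch ahpa fib-gpu --type merge -p '{"spec":{"scaleStrategy":"auto"}}'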

References

  • AHPA can also be used with Knative in serverless Kubernetes (ASK) clusters. If your application requests resources in a periodic pattern, you can use AHPA to predict changes in resource requests and prefetch resources for scaling activities. This reduces the impact of cold starts when your application is scaled. For more information, see Use AHPA to enable predictive scaling in Knative.

  • In some scenarios, you may need to scale applications based on custom metrics, such as the QPS of HTTP requests or message queue length. AHPA provides the External Metrics mechanism that can work with the alibaba-cloud-metrics-adapter component to allow you to scale applications based on custom metrics. For more information, see Use AHPA to configure custom metrics for application scaling.