Container Service for Kubernetes: Implement AHPA elastic prediction based on GPU metrics

Last Updated: Jun 20, 2026

Advanced Horizontal Pod Autoscaler (AHPA) uses GPU utilization data exposed through the Prometheus Adapter, together with historical load trends and prediction algorithms, to forecast GPU resource demand. It then automatically adjusts the number of pod replicas or the GPU resource allocation, scaling out before GPU resources become constrained and scaling in when they are idle. This reduces costs and improves cluster efficiency.

Prerequisites

How it works

In high-performance computing (HPC) scenarios, especially deep learning model training and inference, fine-grained management and dynamic adjustment of GPU resources improve utilization and reduce costs. Container Service for Kubernetes (ACK) supports elastic scaling based on GPU metrics. Prometheus collects key GPU metrics such as real-time utilization and GPU memory usage. The Prometheus Adapter converts these metrics into a Kubernetes-compatible format and makes them available to AHPA. AHPA combines this data with historical load trends and prediction algorithms to forecast GPU resource demand, and then automatically adjusts the number of pod replicas or the GPU resource allocation. As a result, scale-out occurs before resources become constrained and scale-in happens promptly when resources are idle.

Step 1: Deploy Metrics Adapter

  1. Obtain the internal HTTP API endpoint.

    1. Log on to the ARMS console.

    2. In the left navigation pane, choose Managed Service for Prometheus > Instances.

    3. On the Instances page, select the region where your Container Service for Kubernetes cluster is located.

    4. Click the name of your target Prometheus instance. In the left navigation pane, click Settings to obtain the internal endpoint under HTTP API address.

  2. Deploy ack-alibaba-cloud-metrics-adapter.

    1. Log on to the ACK console. In the left navigation pane, click Marketplace > Marketplace.

    2. On the Marketplace page, click the App Catalog tab, search for ack-alibaba-cloud-metrics-adapter, and click it.

    3. On the ack-alibaba-cloud-metrics-adapter page, click Quick Deployment in the upper-right corner.

    4. In the Basic Information wizard, select your Cluster and Namespace, then click Next.

    5. In the Parameters wizard, select a Chart Version. In the Parameter section, set prometheus.url to the internal HTTP API endpoint obtained in substep 1, then click OK.

      prometheus:
        enabled: true
        url: http://
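Before moving on, it can help to confirm that the internal endpoint answers Prometheus HTTP API queries. The following is a minimal sketch, not an official procedure. It assumes that the HTTP API address accepts the standard Prometheus instant-query path /api/v1/query appended to it, and that the command runs from a machine that can reach the internal endpoint (for example, a node in the same VPC). The helper names prom_query_url and prom_query are hypothetical and are not part of any Alibaba Cloud tooling. If your instance requires an authentication token, the request also needs the corresponding header.

```shell
# Build the instant-query URL from the HTTP API address.
# A single trailing slash on the address is tolerated.
prom_query_url() {
  printf '%s/api/v1/query\n' "${1%/}"
}

# Send an instant query (PromQL expression in $2) to the endpoint in $1.
# -G makes curl send --data-urlencode values as URL query parameters.
prom_query() {
  curl -sG "$(prom_query_url "$1")" --data-urlencode "query=$2"
}

# Usage (PROM_URL is the HTTP API address copied from the console):
#   prom_query "$PROM_URL" 'up'
```

A JSON response whose status field is "success" suggests that the endpoint is reachable. To check GPU data specifically, replace up with the GPU metric name that your exporter publishes.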

Step 2: Implement AHPA elastic prediction based on GPU metrics

This example deploys a model inference service on a GPU node, sends requests to it continuously, and uses AHPA to perform elastic prediction based on GPU utilization.

  1. Deploy the inference service.

    1. Run the following command to deploy the inference service.

      cat <<EOF | kubectl create -f -
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: bert-intent-detection
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: bert-intent-detection
        template:
          metadata:
            labels:
              app: bert-intent-detection
          spec:
            containers:
            - name: bert-container
              image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
              ports:
              - containerPort: 80
              resources:
                limits:
                  cpu: "1"
                  memory: 2G
                  nvidia.com/gpu: "1"
                requests:
                  cpu: 200m
                  memory: 500M
                  nvidia.com/gpu: "1"
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: bert-intent-detection-svc
        labels:
          app: bert-intent-detection
      spec:
        selector:
          app: bert-intent-detection
        ports:
        - protocol: TCP
          name: http
          port: 80
          targetPort: 80
        type: LoadBalancer
      EOF
    2. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
      bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>
    3. Run the following command to call the inference service and verify the deployment.

      Run kubectl get svc bert-intent-detection-svc to obtain the external IP address of the Service, and use it to replace 47.95.XX.XX in the following command.

      curl -v  "http://47.95.XX.XX/predict?query=Music"

      Expected output:

      *   Trying 47.95.XX.XX...
      * TCP_NODELAY set
      * Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
      > GET /predict?query=Music HTTP/1.1
      > Host: 47.95.XX.XX
      > User-Agent: curl/7.64.1
      > Accept: */*
      >
      * HTTP 1.0, assume close after body
      < HTTP/1.0 200 OK
      < Content-Type: text/html; charset=utf-8
      < Content-Length: 9
      < Server: Werkzeug/1.0.1 Python/3.6.9
      < Date: Wed, 16 Feb 2022 03:52:11 GMT
      <
      * Closing connection 0
      PlayMusic #Intent recognition result.

      If the HTTP request returns status code 200 and an intent recognition result, the inference service is deployed successfully.
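The lookup and the test call above can be combined in a small script. This is only a sketch: svc_external_ip and predict_url are hypothetical helper names, and the jsonpath expression assumes that the load balancer reports an IP address (not a hostname) in .status.loadBalancer.ingress.

```shell
# Print the external IP of a LoadBalancer Service.
# The output is empty until the load balancer has been provisioned.
svc_external_ip() {
  kubectl get svc "$1" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Build the /predict URL used in this topic from an IP address and a query.
predict_url() {
  printf 'http://%s/predict?query=%s\n' "$1" "$2"
}

# Usage:
#   ip="$(svc_external_ip bert-intent-detection-svc)"
#   code="$(curl -s -o /dev/null -w '%{http_code}' "$(predict_url "$ip" Music)")"
#   [ "$code" = "200" ] && echo "inference service is reachable"
```

Checking only the status code keeps the script short. Use curl -v as shown above when you also want to inspect the response body.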

  2. Configure AHPA.

    This example scales on GPU utilization. Scale-out is triggered when the average GPU utilization of the pods exceeds 20%.

    1. Configure the AHPA metric source.

      1. Create a file named application-intelligence.yaml with the following content.

        prometheusUrl specifies the endpoint of Managed Service for Prometheus. Set it to the internal HTTP API endpoint obtained in Step 1.

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: application-intelligence
          namespace: kube-system
        data:
          prometheusUrl: "http://cn-shanghai-intranet.arms.aliyuncs.com:9090/api/v1/prometheus/da9d7dece901db4c9fc7f5b*******/1581204543170*****/c54417d182c6d430fb062ec364e****/cn-shanghai"
      2. Run the following command to deploy application-intelligence.

        kubectl apply -f application-intelligence.yaml
    2. Deploy AHPA.

      1. Create a file named fib-gpu.yaml with the following content.

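        In this example, scaleStrategy is set to observer, so that you can review the predictions of AHPA before you let it handle scaling. For more information about parameters, see metric description.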

        apiVersion: autoscaling.alibabacloud.com/v1beta1
        kind: AdvancedHorizontalPodAutoscaler
        metadata:
          name: fib-gpu
          namespace: default
        spec:
          metrics:
          - resource:
              name: gpu
              target:
                averageUtilization: 20
                type: Utilization
            type: Resource
          minReplicas: 0
          maxReplicas: 100
          prediction:
            quantile: 95
            scaleUpForward: 180
          scaleStrategy: observer
          scaleTargetRef:
            apiVersion: apps/v1
            kind: Deployment
            name: bert-intent-detection
          instanceBounds:
          - startTime: "2021-12-16 00:00:00"
            endTime: "2022-12-16 00:00:00"
            bounds:
            - cron: "* 0-8 ? * MON-FRI"
              maxReplicas: 50
              minReplicas: 4
            - cron: "* 9-15 ? * MON-FRI"
              maxReplicas: 50
              minReplicas: 10
            - cron: "* 16-23 ? * MON-FRI"
              maxReplicas: 50
              minReplicas: 12
      2. Run the following command to deploy AHPA.

        kubectl apply -f fib-gpu.yaml
      3. Run the following command to check the AHPA status.

        kubectl get ahpa

        Expected output:

        NAME             STRATEGY   REFERENCE               METRIC   TARGET(%)   CURRENT(%)   DESIREDPODS   REPLICAS   MINPODS   MAXPODS   AGE
        fib-gpu          observer   bert-intent-detection   gpu      20          0            0             1          10        50        6d19h

        The expected output shows that CURRENT(%) is 0 and TARGET(%) is 20. This means the current GPU utilization is 0%. Elastic scale-out is triggered when GPU utilization exceeds 20%.
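The MINPODS and MAXPODS values in this output (10 and 50) match the second instanceBounds entry in fib-gpu.yaml rather than the top-level minReplicas and maxReplicas. The sketch below shows one way to read those entries. It rests on two assumptions that this topic does not confirm: that the cron fields are minute, hour, day of month, month, and day of week, and that the top-level bounds apply outside every window. The function names bounds_for and clamp are illustrative only.

```shell
# bounds_for HOUR DOW -> prints "MIN MAX" for the instanceBounds in fib-gpu.yaml.
# HOUR is 0-23. DOW is 1-7 with 1 = Monday (the numbering used by `date +%u`).
bounds_for() {
  hour=$1
  dow=$2
  if [ "$dow" -ge 1 ] && [ "$dow" -le 5 ]; then
    if [ "$hour" -le 8 ]; then echo "4 50"; return; fi    # "* 0-8 ? * MON-FRI"
    if [ "$hour" -le 15 ]; then echo "10 50"; return; fi  # "* 9-15 ? * MON-FRI"
    echo "12 50"; return                                  # "* 16-23 ? * MON-FRI"
  fi
  # Assumption: outside every window, the top-level minReplicas/maxReplicas apply.
  echo "0 100"
}

# clamp VALUE MIN MAX -> prints VALUE limited to the range [MIN, MAX].
clamp() {
  v=$1
  if [ "$v" -lt "$2" ]; then v=$2; fi
  if [ "$v" -gt "$3" ]; then v=$3; fi
  echo "$v"
}

# Usage: bounds for Wednesday at 10:00, then clamp a candidate replica count of 2.
#   set -- $(bounds_for 10 3)
#   clamp 2 "$1" "$2"
```

Under these assumptions, a candidate replica count of 2 on a weekday morning would be raised to the window minimum of 10.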

  3. Test elastic scaling for the inference service.

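    1. Deploy the following load generator, which continuously sends requests to the inference service. Save the manifest to a file and apply it with kubectl apply -f.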

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: fib-loader
        namespace: default
      spec:
        progressDeadlineSeconds: 600
        replicas: 1
        revisionHistoryLimit: 10
        selector:
          matchLabels:
            app: fib-loader
        strategy:
          rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
          type: RollingUpdate
        template:
          metadata:
            creationTimestamp: null
            labels:
              app: fib-loader
          spec:
            containers:
            - args:
              - -c
              - |
                /ko-app/fib-loader --service-url="http://bert-intent-detection-svc.${NAMESPACE}/predict?query=Music" --save-path=/tmp/fib-loader-chart.html
              command:
              - sh
              env:
              - name: NAMESPACE
                valueFrom:
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
              image: registry.cn-huhehaote.aliyuncs.com/kubeway/knative-sample-fib-loader:20201126-110434
              imagePullPolicy: IfNotPresent
              name: loader
              ports:
              - containerPort: 8090
                name: chart
                protocol: TCP
              resources:
                limits:
                  cpu: "8"
                  memory: 16000Mi
                requests:
                  cpu: "2"
                  memory: 4000Mi
    2. While accessing the service, run the following command to check the AHPA status.

      kubectl get ahpa

      Expected output:

      NAME             STRATEGY   REFERENCE               METRIC   TARGET(%)   CURRENT(%)   DESIREDPODS   REPLICAS   MINPODS   MAXPODS   AGE
      fib-gpu          observer   bert-intent-detection   gpu      20          189          10            4          10        50        6d19h

      The expected output shows that the current GPU utilization, CURRENT(%), exceeds the TARGET(%) value. Elastic scaling is triggered, and the desired number of pods, DESIREDPODS, is 10.
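While the load generator is running, the same comparison can be scripted instead of read by eye. The sketch below filters the output of kubectl get ahpa and reports objects whose CURRENT(%) is above TARGET(%). It assumes the column order shown in the expected output above. The function name over_target is hypothetical.

```shell
# Read `kubectl get ahpa` output on stdin and print one line for every
# AHPA object whose CURRENT(%) (column 6) exceeds TARGET(%) (column 5).
# NR > 1 skips the header row; "+ 0" forces a numeric comparison.
over_target() {
  awk 'NR > 1 && ($6 + 0) > ($5 + 0) {
    printf "%s current=%s target=%s desired=%s\n", $1, $6, $5, $7
  }'
}

# Usage:
#   kubectl get ahpa | over_target
```

An empty result means that no AHPA object in the current namespace is above its target at that moment.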

    3. Run the following command to view the prediction trend.

      kubectl get --raw '/apis/metrics.alibabacloud.com/v1beta1/namespaces/default/predictionsobserver/fib-gpu'|jq -r '.content' |base64 -d > observer.html

      The following figure shows an example prediction trend based on seven days of historical GPU metrics.

      [Figure: GPU prediction trend]

      • Predict GPU Resource Observer: The blue line shows actual GPU usage. The green line shows GPU usage predicted by AHPA. The green curve is mostly above the blue curve, indicating that the predicted GPU capacity is sufficient.

      • Predict POD Observer: The blue line shows the actual number of scaled pods. The green line shows the number of pods predicted by AHPA. The green curve is mostly below the blue curve, indicating that AHPA predicts fewer pods than are actually running. You can set the elastic scaling mode to auto to use the predicted pod count, which saves pod resources and avoids waste.

      The prediction results meet expectations. After observation, if the results remain satisfactory, set the elastic scaling mode to auto so AHPA handles scaling.
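One way to switch the mode is to change scaleStrategy in fib-gpu.yaml from observer to auto and run kubectl apply -f fib-gpu.yaml again. As an alternative, the following sketch patches the existing object in place. The helper name strategy_patch is hypothetical, and it accepts only the two mode names that appear in this topic.

```shell
# Print a JSON merge patch that sets spec.scaleStrategy.
# Only the modes named in this topic are accepted.
strategy_patch() {
  case "$1" in
    auto|observer)
      printf '{"spec":{"scaleStrategy":"%s"}}\n' "$1"
      ;;
    *)
      echo "unsupported strategy: $1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   kubectl patch ahpa fib-gpu -n default --type merge -p "$(strategy_patch auto)"
#   kubectl get ahpa fib-gpu    # the STRATEGY column should now show auto
```

A merge patch is used because AHPA is a custom resource. Keeping fib-gpu.yaml in sync with the patched value avoids reverting the mode the next time the file is applied.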

References

  • Knative Serverless supports AHPA (Advanced Horizontal Pod Autoscaler) elastic capabilities. When application resource requirements are periodic, elastic prediction preheats resources and solves cold start issues in Knative. For more information, see Use AHPA elastic prediction in Knative.

  • In many scenarios, applications must scale based on custom metrics such as HTTP request QPS or message queue length. AHPA provides an External Metrics mechanism. Combined with the alibaba-cloud-metrics-adapter component, it offers richer scaling options. For more information, see Configure custom metrics with AHPA to scale applications.