
Container Service for Kubernetes: Configure auto scaling for standalone or distributed LLM inference services

Last Updated: Mar 26, 2026

Configure the Kubernetes Horizontal Pod Autoscaler (HPA) to scale LLM inference pods automatically based on inference-specific metrics, such as request queue depth and KV cache usage, exposed by your inference framework.

How it works

LLM inference services are bottlenecked by GPU compute and GPU memory, not CPU or system memory. Scaling on GPU utilization or memory usage is therefore misleading: a GPU reporting 90% utilization may be working through a single long decode sequence rather than handling more requests. Inference frameworks such as vLLM, SGLang, and Dynamo expose metrics that directly reflect service load, such as the number of waiting requests (num_requests_waiting) and KV cache usage (kv_cache_usage_perc). These are the right signals for scaling decisions.
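You can inspect these signals directly on a pod's metrics endpoint. A minimal check, assuming a vLLM pod named vllm-inference-0 that serves on port 8000 (both names are placeholders for your deployment):

    # Forward the serving port of one inference pod to your local machine.
    kubectl port-forward pod/vllm-inference-0 8000:8000 &

    # vLLM exposes Prometheus-format metrics on the /metrics path of the serving port.
    curl -s http://localhost:8000/metrics | grep -E 'num_requests_waiting|num_requests_running|kv_cache_usage_perc'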

The scaling pipeline works as follows:

  1. A PodMonitor instructs Managed Service for Prometheus to scrape metrics from inference pods.

  2. The ack-alibaba-cloud-metrics-adapter component bridges Prometheus metrics to the Kubernetes Custom Metrics API.

  3. The HPA reads custom metrics from that API and scales your StatefulSet up or down.
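You can verify that the adapter has registered the Custom Metrics API with your cluster before wiring up the rest of the pipeline:

    kubectl get apiservice v1beta1.custom.metrics.k8s.io

The AVAILABLE column should read True. If the APIService is missing, install or update the ack-alibaba-cloud-metrics-adapter component first.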

Prerequisites

Before you begin, make sure you have:

  • An ACK cluster with a standalone or distributed LLM inference service (vLLM, SGLang, or Dynamo) deployed as a StatefulSet that exposes framework metrics.

  • Managed Service for Prometheus enabled for the cluster.

  • The ack-alibaba-cloud-metrics-adapter component installed in the cluster.

Billing

Collecting custom metrics through Managed Service for Prometheus may incur additional fees. Fees vary based on your cluster size, the number of applications, and the data volume. To monitor your usage, see Query usage data.

Step 1: Configure metric collection

If you have already configured monitoring for LLM inference services, skip this step.

Create a PodMonitor resource to instruct Prometheus to scrape metrics from your inference pods.

  1. Create a file named podmonitor.yaml with the following content:

    YAML template

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: llm-serving-podmonitor
      namespace: default
      annotations:
        arms.prometheus.io/discovery: "true"
        arms.prometheus.io/resource: "arms"
    spec:
      selector:
        matchExpressions:
        - key: alibabacloud.com/inference-workload
          operator: Exists
      namespaceSelector:
        any: true
      podMetricsEndpoints:
      - interval: 15s
        path: /metrics
        port: "http"
        relabelings:
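        # The rules below copy pod metadata into stable label names (pod_name, pod_namespace, rbg_role) used in adapter queries.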
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_name
          targetLabel: pod_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_namespace
          targetLabel: pod_namespace
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_rolebasedgroup_workloads_x_k8s_io_role
          regex: (.+)
          targetLabel: rbg_role
        # Allow overriding the workload name with a dedicated label.
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_workload
          regex: (.+)
          targetLabel: workload_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_backend
          regex: (.+)
          targetLabel: backend
  2. Apply the configuration:

    kubectl apply -f ./podmonitor.yaml
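    Optionally, confirm that the resource exists:

    kubectl get podmonitor llm-serving-podmonitor -n default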

Step 2: Configure ack-alibaba-cloud-metrics-adapter

Configure the metrics adapter to expose inference framework metrics through the Kubernetes Custom Metrics API. The adapter uses two fields per metric rule:

  • seriesQuery: selects which Prometheus time series to include (filtered by label selectors such as namespace and pod)

  • metricsQuery: defines how to aggregate those series (typically sum ... by (<<.GroupBy>>))
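As an illustration of how these templates expand: when the HPA requests vllm:num_requests_waiting (configured below) for pods in the default namespace, the adapter issues a PromQL query along these lines (the pod names are illustrative):

    sum(vllm:num_requests_waiting{namespace="default",pod=~"vllm-inference-0|vllm-inference-1"}) by (pod)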

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Applications > Helm.

  3. On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.

  4. In the Update Release panel, update the YAML configuration as shown below, then click OK. The following example covers metrics for vLLM, SGLang, and Dynamo. Include only the metrics for your inference framework. For a complete list of available metrics, refer to the official documentation: vLLM metrics, SGLang production metrics, and Dynamo metrics.

    The metrics in this example are for demonstration. Modify them based on your inference framework and scaling requirements.

    YAML template

    AlibabaCloudMetricsAdapter:
    
      prometheus:
        enabled: true    # Set to true to enable the Prometheus adapter feature.
        # Enter the URL of your Managed Service for Prometheus.
        url: http://cn-beijing.arms.aliyuncs.com:9090/api/v1/prometheus/xxxx/xxxx/xxx/cn-beijing
        # If token-based authentication is enabled, configure the Authorization header.
    #    prometheusHeader:
    #    - Authorization: xxxxxxx
    
        adapter:
          rules:
            default: false  # Keep this false to disable default metric collection.
            custom:
    
            # vLLM metrics
            # vllm:num_requests_waiting — number of requests queued, waiting for a GPU slot.
            # Recommended for scale-up triggers: a growing queue means the service is overloaded.
            - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:num_requests_running — number of requests actively being processed.
            - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:kv_cache_usage_perc — KV cache utilization.
            # High values (close to 1.0) indicate memory pressure and potential request dropping.
            - seriesQuery: 'vllm:kv_cache_usage_perc{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # SGLang metrics
            # sglang:num_queue_reqs — number of requests waiting in queue.
            - seriesQuery: 'sglang:num_queue_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # sglang:num_running_reqs — number of requests being actively processed.
            - seriesQuery: 'sglang:num_running_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # sglang:token_usage — token usage, which reflects KV cache utilization.
            - seriesQuery: 'sglang:token_usage{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # Dynamo metrics
            # nv_llm_http_service_inflight_requests — number of requests being processed.
            - seriesQuery: 'nv_llm_http_service_inflight_requests{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
  5. Verify that the metrics adapter is exposing the metrics correctly. Run the following command for each metric you configured:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"

    A successful response returns a JSON object with a value field for each pod. If the command returns an error, check that the PodMonitor is deployed and that Prometheus is scraping your inference pods.
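    An abridged, illustrative response (the pod name and value will differ in your cluster):

    {
      "kind": "MetricValueList",
      "apiVersion": "custom.metrics.k8s.io/v1beta1",
      "items": [
        {
          "describedObject": { "kind": "Pod", "namespace": "default", "name": "vllm-inference-0" },
          "metricName": "vllm:num_requests_waiting",
          "value": "3"
        }
      ]
    }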

Step 3: Configure HPA

Create an HPA resource that targets your inference StatefulSet and triggers scaling based on a custom metric.

Important

The parameter values in the following examples are for demonstration only. Set thresholds based on your own performance testing, resource costs, and service-level objectives (SLOs). LLM inference pods take time to start, and individual requests can run for minutes. To prevent the HPA from scaling down while long-running requests are still in progress, configure behavior.scaleDown.stabilizationWindowSeconds. The default value is 300 seconds. Increase this value if your workload has requests that exceed five minutes.
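For example, a workload whose longest requests run about ten minutes could extend the window and limit how quickly replicas are removed. A sketch of the spec.behavior block (the values are illustrative, not recommendations):

    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600  # Wait 10 minutes of low load before scaling in.
        policies:
        - type: Pods
          value: 1           # Remove at most one pod...
          periodSeconds: 60  # ...per minute.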

  1. Create a file named hpa.yaml. Use the example that matches your inference framework.

    vLLM

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: vllm-inference # Replace with your vLLM inference service name.
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: vllm:num_requests_waiting
          target:
            type: AverageValue
            averageValue: 5
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300 # Adjust based on your longest expected request duration.

    SGLang

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: sgl-inference
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: sglang:num_queue_reqs
          target:
            type: AverageValue
            averageValue: 5
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300 # Adjust based on your longest expected request duration.
  2. Apply the HPA configuration:

    kubectl apply -f hpa.yaml
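    Confirm that the HPA can read its metric. The TARGETS column should show a numeric current value rather than <unknown>:

    kubectl get hpa llm-inference-hpa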

Step 4: Test the auto scaling configuration

Apply a load to your inference service to trigger the HPA.

For benchmark tool details and usage, see the vLLM Benchmark guide and the SGLang Benchmark guide.

  1. Create a file named benchmark.yaml. Set the image field to match your inference framework:

    • vLLM: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0

    • SGLang: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104

    YAML template

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      labels:
        app: llm-benchmark
      name: llm-benchmark
    spec:
      selector:
        matchLabels:
          app: llm-benchmark
      template:
        metadata:
          labels:
            app: llm-benchmark
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: # Replace with the vLLM or SGLang image listed above.
            imagePullPolicy: IfNotPresent
            name: llm-benchmark
            resources:
              limits:
                cpu: "8"
                memory: 40Gi
              requests:
                cpu: "8"
                memory: 40Gi
            volumeMounts:
            - mountPath: /models/Qwen3-32B
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
  2. Deploy the benchmark client pod:

    kubectl create -f benchmark.yaml
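    Open a shell in the client pod. Because the manifest above is a StatefulSet, the first pod is named llm-benchmark-0:

    kubectl exec -it llm-benchmark-0 -- bash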
  3. Run the benchmark script from within the client pod to generate load on your inference service.

    vLLM

    python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
            --model /models/Qwen3-32B \
            --host inference-service \
            --port 8000 \
            --dataset-name random \
            --random-input-len 1500 \
            --random-output-len 100 \
            --random-range-ratio 1 \
            --num-prompts 400 \
            --max-concurrency 20

    SGLang

    python3 -m sglang.bench_serving --backend sglang \
            --model /models/Qwen3-32B \
            --host inference-service \
            --port 8000 \
            --dataset-name random \
            --random-input-len 1500 \
            --random-output-len 100 \
            --random-range-ratio 1 \
            --num-prompts 400 \
            --max-concurrency 20
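    While the benchmark runs, you can also watch the replica count of the inference workload change in real time. The StatefulSet name below matches the vLLM HPA example; replace it with your own:

    kubectl get statefulset vllm-inference -w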
  4. While the load test runs, open a new terminal and check the HPA status:

    kubectl describe hpa llm-inference-hpa

    When the HPA detects that the average number of waiting requests exceeds the target threshold, it scales up the StatefulSet. A successful scale-up produces a SuccessfulRescale event in the output:

    Name:                                   llm-inference-hpa
    Namespace:                              default
    Labels:                                 <none>
    Annotations:                            <none>
    CreationTimestamp:                      Fri, 25 Jul 2025 11:29:20 +0800
    Reference:                              StatefulSet/vllm-inference
    Metrics:                                ( current / target )
      "vllm:num_requests_waiting" on pods:  11 / 5
    Min replicas:                           1
    Max replicas:                           3
    StatefulSet pods:                       1 current / 3 desired
    Conditions:
      Type            Status  Reason              Message
      ----            ------  ------              -------
      AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 3
      ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
      ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  1s    horizontal-pod-autoscaler  New size: 3; reason: pods metric vllm:num_requests_waiting above target
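    After the benchmark completes and the request queue drains, the HPA waits out the scale-down stabilization window (300 seconds in these examples) before reducing replicas back toward minReplicas. You can watch the scale-in happen:

    kubectl get hpa llm-inference-hpa -w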
