Container Service for Kubernetes: Configure auto scaling for standalone or distributed LLM inference services

Last Updated: Sep 18, 2025

When managing large language model (LLM) inference services, it is crucial to handle the highly dynamic fluctuations in workload. This topic describes how to combine custom metrics from your inference framework with the Kubernetes Horizontal Pod Autoscaler (HPA) to automatically and flexibly scale your inference service pods. This ensures high availability and stability for your LLM services.

Prerequisites

Billing

Integrating with Managed Service for Prometheus causes custom metrics from your service to be collected and reported, which may incur additional fees. These fees vary based on factors such as your cluster size, the number of applications, and the data volume. You can monitor and manage your resources by querying usage data.

Step 1: Configure metric collection

Unlike traditional microservices, LLM inference services are usually bottlenecked by GPU computing power and GPU memory rather than CPU or system memory. Standard resource metrics, such as GPU utilization and memory usage, can be misleading indicators of the actual load on an inference service. A more effective approach is therefore to scale based on performance metrics exposed directly by the inference engine, such as request latency or queue depth.
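
For reference, you can inspect the raw metrics that the engine exposes before wiring them into Prometheus. The following is a minimal sketch that assumes a vLLM pod named vllm-inference-0 serving its API and /metrics endpoint on port 8000; adjust the pod name, port, and metric names for your deployment (for example, SGLang or Dynamo):

# Forward the inference pod's serving port to your local machine
kubectl port-forward pod/vllm-inference-0 8000:8000

# In another terminal, list the engine's load-related metrics
curl -s http://localhost:8000/metrics | grep -E 'num_requests_waiting|num_requests_running|kv_cache'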

If you have configured monitoring for LLM inference services, you can skip this step.
  1. Create a file named podmonitor.yaml to instruct Prometheus to scrape metrics from your inference pods.

    YAML template

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: llm-serving-podmonitor
      namespace: default
      annotations:
        arms.prometheus.io/discovery: "true"
        arms.prometheus.io/resource: "arms"
    spec:
      selector:
        matchExpressions:
        - key: alibabacloud.com/inference-workload
          operator: Exists
      namespaceSelector:
        any: true
      podMetricsEndpoints:
      - interval: 15s
        path: /metrics
        port: "http"
        relabelings:
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_name
          targetLabel: pod_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_namespace
          targetLabel: pod_namespace
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_rolebasedgroup_workloads_x_k8s_io_role
          regex: (.+)
          targetLabel: rbg_role
        # Allow the workload name to be overridden with a specific label
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_workload
          regex: (.+)
          targetLabel: workload_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_backend
          regex: (.+)
          targetLabel: backend
    
  2. Apply the configuration.

    kubectl apply -f ./podmonitor.yaml
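
    Optionally, verify that the PodMonitor was created:

    kubectl get podmonitor llm-serving-podmonitor -n default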

Step 2: Configure ack-alibaba-cloud-metrics-adapter

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Applications > Helm.

  3. On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.

  4. In the Update Release panel, update the YAML configuration as shown in the following example and click OK. The metrics in the YAML are for demonstration purposes only. Modify them as needed.

    Refer to the official documentation for a complete list of metrics for vLLM, SGLang, and Dynamo.

    YAML template

    AlibabaCloudMetricsAdapter:
    
      prometheus:
        enabled: true    # Set this to true to enable the Prometheus adapter feature.
        # Enter the URL of your Managed Service for Prometheus.
        url: http://cn-beijing.arms.aliyuncs.com:9090/api/v1/prometheus/xxxx/xxxx/xxx/cn-beijing
        # If token-based authentication is enabled for Managed Service for Prometheus, configure the Authorization parameter of the prometheusHeader field.
    #    prometheusHeader:
    #    - Authorization: xxxxxxx
    
        adapter:
          rules:
            default: false    # Default metric collection configuration. Keep this set to false.
            custom:
    
            # ** Example 1: This is an example for vLLM **
            # vllm:num_requests_waiting: The number of waiting requests.
            # Run the following command to check whether the metrics are collected.
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"
            - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:num_requests_running: The number of requests being processed.
            # Run the following command to check whether the metrics are collected.
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_running"
            - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:kv_cache_usage_perc: The KV cache usage.
            # Run the following command to check whether the metrics are collected.
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:kv_cache_usage_perc"
            - seriesQuery: 'vllm:kv_cache_usage_perc{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # ** Example 2: This is an example for SGLang **
            # sglang:num_queue_reqs: The number of waiting requests.
            # Run the following command to check whether the metrics are collected.
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_queue_reqs"
            - seriesQuery: 'sglang:num_queue_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
            # sglang:num_running_reqs: The number of requests being processed.
            # Run the following command to check whether the metrics are collected.
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_running_reqs"
            - seriesQuery: 'sglang:num_running_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
            # sglang:token_usage: The token usage in the system, which can reflect the KV cache utilization.
            # Run the following command to check whether the metrics are collected.
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:token_usage"
            - seriesQuery: 'sglang:token_usage{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # ** Example 3: This is an example for Dynamo **
            # nv_llm_http_service_inflight_requests: The number of requests being processed.
            # Run the following command to check whether the metrics are collected.
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/nv_llm_http_service_inflight_requests"
            - seriesQuery: 'nv_llm_http_service_inflight_requests{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
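
    After the update is complete, you can check whether the adapter exposes the configured metrics through the custom metrics API. A minimal check (replace the metric name and namespace with the ones you configured):

    # List all custom metrics registered by the adapter
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

    # Query a single metric for pods in the default namespace, for example vllm:num_requests_waiting
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"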
    
    

Step 3: Configure HPA

Create an HPA resource that targets your inference service and uses one of the custom metrics you configured.

Note

The parameter configurations in the following scaling policies are for demonstration purposes only. Determine the appropriate thresholds for your specific use case based on performance testing, resource costs, and service-level objectives (SLOs).

  1. Create a file named hpa.yaml. Choose the example that matches your inference framework.

    vLLM

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: vllm-inference # Replace with your vLLM inference service name.
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: vllm:num_requests_waiting
          target:
            type: AverageValue
            averageValue: 5
    

    SGLang

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: sgl-inference
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: sglang:num_queue_reqs
          target:
            type: AverageValue
            averageValue: 5

  2. Apply the HPA configuration.

    kubectl apply -f hpa.yaml
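
    Before starting the load test, you can confirm that the HPA reads the metric. With a Pods metric and an averageValue target, the HPA roughly computes the desired replica count as ceil(sum of the metric across pods / averageValue); for example, 11 waiting requests against a target of 5 gives ceil(11 / 5) = 3 replicas, capped at maxReplicas.

    kubectl get hpa llm-inference-hpa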

Step 4: Test the auto scaling configuration

Use a benchmark tool to apply load to your service and trigger the HPA.

For details about the benchmark tools and how to use them, see vLLM Benchmark and SGLang Benchmark.
  1. Create a file named benchmark.yaml.

    • Specify the container image that matches the inference framework you are testing. Choose one of the following options:

      • For vLLM: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0

      • For SGLang: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104

    YAML template

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      labels:
        app: llm-benchmark
      name: llm-benchmark
    spec:
      selector:
        matchLabels:
          app: llm-benchmark
      template:
        metadata:
          labels:
            app: llm-benchmark
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: # The SGLang or vLLM container image used to deploy the inference service
            imagePullPolicy: IfNotPresent
            name: llm-benchmark
            resources:
              limits:
                cpu: "8"
                memory: 40Gi
              requests:
                cpu: "8"
                memory: 40Gi
            volumeMounts:
            - mountPath: /models/Qwen3-32B
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
  2. Deploy a benchmark client pod to generate traffic.

    kubectl create -f benchmark.yaml
  3. Run a benchmark script from within the client pod to generate a high load on your inference service.
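
    Before running the script, confirm that the benchmark pod is Running and open a shell in it. This is a minimal sketch assuming the default pod name llm-benchmark-0 created by the StatefulSet above:

    kubectl get pod -l app=llm-benchmark
    kubectl exec -it llm-benchmark-0 -- bash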

    vLLM

    python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
            --model /models/Qwen3-32B \
            --host inference-service \
            --port 8000 \
            --dataset-name random \
            --random-input-len 1500 \
            --random-output-len 100 \
            --random-range-ratio 1 \
            --num-prompts 400 \
            --max-concurrency 20

    SGLang

    python3 -m sglang.bench_serving --backend sglang \
            --model /models/Qwen3-32B \
            --host inference-service \
            --port 8000 \
            --dataset-name random \
            --random-input-len 1500 \
            --random-output-len 100 \
            --random-range-ratio 1 \
            --num-prompts 400 \
            --max-concurrency 20

While the load test is running, open a new terminal and monitor the HPA's status.

kubectl describe hpa llm-inference-hpa

In the event log, you should see a SuccessfulRescale event, which indicates that the HPA detected the high number of waiting requests and scaled the replicas from 1 to 3.

Name:                                   llm-inference-hpa
Namespace:                              default
Labels:                                 <none>
Annotations:                            <none>
CreationTimestamp:                      Fri, 25 Jul 2025 11:29:20 +0800
Reference:                              StatefulSet/vllm-inference
Metrics:                                ( current / target )
  "vllm:num_requests_waiting" on pods:  11 / 5
Min replicas:                           1
Max replicas:                           3
StatefulSet pods:                       1 current / 3 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 3
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  1s    horizontal-pod-autoscaler  New size: 3; reason: pods metric vllm:num_requests_waiting above target
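
After the benchmark finishes and the number of waiting requests falls back below the target, the HPA gradually scales the workload back toward minReplicas; by default, scale-down waits for a 5-minute stabilization window. You can watch the replica count change over time:

kubectl get hpa llm-inference-hpa --watch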