
Container Service for Kubernetes: Configure auto scaling for standalone or distributed LLM inference services

Last Updated: Mar 26, 2026

Configure the Kubernetes Horizontal Pod Autoscaler (HPA) to scale LLM inference pods automatically based on inference-specific metrics, such as request queue depth and KV cache usage, exposed by your inference framework.

How it works

LLM inference services are bottlenecked by GPU compute and GPU memory, not CPU or system memory. Scaling on GPU utilization or memory usage is therefore misleading: a GPU reporting 90% utilization may be working through a single long decode sequence rather than handling more requests. Inference frameworks such as vLLM, SGLang, and Dynamo expose metrics that directly reflect service load, such as the number of waiting requests (num_requests_waiting) and KV cache usage (kv_cache_usage_perc). These are the right signals for scaling decisions.
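You can inspect these signals directly on a pod's metrics endpoint. A minimal check, assuming a vLLM pod named vllm-inference-0 that serves on port 8000 (both names are placeholders for your deployment):

    # Forward the serving port of one inference pod to your local machine.
    kubectl port-forward pod/vllm-inference-0 8000:8000 &

    # vLLM exposes Prometheus-format metrics on the /metrics path of the serving port.
    curl -s http://localhost:8000/metrics | grep -E 'num_requests_waiting|num_requests_running|kv_cache_usage_perc'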

The scaling pipeline works as follows:

  1. A PodMonitor instructs Managed Service for Prometheus to scrape metrics from inference pods.

  2. The ack-alibaba-cloud-metrics-adapter component bridges Prometheus metrics to the Kubernetes Custom Metrics API.

  3. The HPA reads custom metrics from that API and scales your StatefulSet up or down.
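You can verify that the adapter has registered the Custom Metrics API with your cluster before wiring up the rest of the pipeline:

    kubectl get apiservice v1beta1.custom.metrics.k8s.io

The AVAILABLE column should read True. If the APIService is missing, install or update the ack-alibaba-cloud-metrics-adapter component first.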

Prerequisites

Before you begin, make sure you have:

  • An ACK cluster with a standalone or distributed LLM inference service (vLLM, SGLang, or Dynamo) deployed as a StatefulSet that exposes framework metrics.

  • Managed Service for Prometheus enabled for the cluster.

  • The ack-alibaba-cloud-metrics-adapter component installed in the cluster.

Billing

Collecting custom metrics through Managed Service for Prometheus may incur additional fees. Fees vary based on your cluster size, the number of applications, and the data volume. To monitor your usage, see Query usage data.

Step 1: Configure metric collection

If you have already configured monitoring for LLM inference services, skip this step.

Create a PodMonitor resource to instruct Prometheus to scrape metrics from your inference pods.

  1. Create a file named podmonitor.yaml with the following content:

    YAML template

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: llm-serving-podmonitor
      namespace: default
      annotations:
        arms.prometheus.io/discovery: "true"
        arms.prometheus.io/resource: "arms"
    spec:
      selector:
        matchExpressions:
        - key: alibabacloud.com/inference-workload
          operator: Exists
      namespaceSelector:
        any: true
      podMetricsEndpoints:
      - interval: 15s
        path: /metrics
        port: "http"
        relabelings:
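        # The rules below copy pod metadata into stable label names (pod_name, pod_namespace, rbg_role) used in adapter queries.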
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_name
          targetLabel: pod_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_namespace
          targetLabel: pod_namespace
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_rolebasedgroup_workloads_x_k8s_io_role
          regex: (.+)
          targetLabel: rbg_role
        # Allow overriding the workload name with a dedicated label.
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_workload
          regex: (.+)
          targetLabel: workload_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_backend
          regex: (.+)
          targetLabel: backend
  2. Apply the configuration:

    kubectl apply -f ./podmonitor.yaml
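    Optionally, confirm that the resource exists:

    kubectl get podmonitor llm-serving-podmonitor -n default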

Step 2: Configure ack-alibaba-cloud-metrics-adapter

Configure the metrics adapter to expose inference framework metrics through the Kubernetes Custom Metrics API. The adapter uses two fields per metric rule:

  • seriesQuery: selects which Prometheus time series to include (filtered by label selectors such as namespace and pod)

  • metricsQuery: defines how to aggregate those series (typically sum ... by (<<.GroupBy>>))
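As an illustration of how these templates expand: when the HPA requests vllm:num_requests_waiting (configured below) for pods in the default namespace, the adapter issues a PromQL query along these lines (the pod names are illustrative):

    sum(vllm:num_requests_waiting{namespace="default",pod=~"vllm-inference-0|vllm-inference-1"}) by (pod)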

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Applications > Helm.

  3. On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.

  4. In the Update Release panel, update the YAML configuration as shown below, then click OK. The following example covers metrics for vLLM, SGLang, and Dynamo. Include only the metrics for your inference framework. For a complete list of available metrics, refer to the official documentation: vLLM metrics, SGLang production metrics, and Dynamo metrics.

    The metrics in this example are for demonstration. Modify them based on your inference framework and scaling requirements.

    YAML template

    AlibabaCloudMetricsAdapter:
    
      prometheus:
        enabled: true    # Set to true to enable the Prometheus adapter feature.
        # Enter the URL of your Managed Service for Prometheus.
        url: http://cn-beijing.arms.aliyuncs.com:9090/api/v1/prometheus/xxxx/xxxx/xxx/cn-beijing
        # If token-based authentication is enabled, configure the Authorization header.
    #    prometheusHeader:
    #    - Authorization: xxxxxxx
    
        adapter:
          rules:
            default: false  # Keep this false to disable default metric collection.
            custom:
    
            # vLLM metrics
            # vllm:num_requests_waiting — number of requests queued, waiting for a GPU slot.
            # Recommended for scale-up triggers: a growing queue means the service is overloaded.
            - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:num_requests_running — number of requests actively being processed.
            - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:kv_cache_usage_perc — KV cache utilization.
            # High values (close to 1.0) indicate memory pressure and potential request dropping.
            - seriesQuery: 'vllm:kv_cache_usage_perc{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # SGLang metrics
            # sglang:num_queue_reqs — number of requests waiting in queue.
            - seriesQuery: 'sglang:num_queue_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # sglang:num_running_reqs — number of requests being actively processed.
            - seriesQuery: 'sglang:num_running_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # sglang:token_usage — token usage, which reflects KV cache utilization.
            - seriesQuery: 'sglang:token_usage{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # Dynamo metrics
            # nv_llm_http_service_inflight_requests — number of requests being processed.
            - seriesQuery: 'nv_llm_http_service_inflight_requests{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
  5. Verify that the metrics adapter is exposing the metrics correctly. Run the following command for each metric you configured:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"

    A successful response returns a JSON object with a value field for each pod. If the command returns an error, check that the PodMonitor is deployed and that Prometheus is scraping your inference pods.
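    An abridged, illustrative response (the pod name and value will differ in your cluster):

    {
      "kind": "MetricValueList",
      "apiVersion": "custom.metrics.k8s.io/v1beta1",
      "items": [
        {
          "describedObject": { "kind": "Pod", "namespace": "default", "name": "vllm-inference-0" },
          "metricName": "vllm:num_requests_waiting",
          "value": "3"
        }
      ]
    }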

Step 3: Configure HPA

Create an HPA resource that targets your inference StatefulSet and triggers scaling based on a custom metric.

Important

The parameter values in the following examples are for demonstration only. Set thresholds based on your own performance testing, resource costs, and service-level objectives (SLOs). LLM inference pods take time to start, and individual requests can run for minutes. To prevent the HPA from scaling down while long-running requests are still in progress, configure behavior.scaleDown.stabilizationWindowSeconds. The default value is 300 seconds. Increase this value if your workload has requests that exceed five minutes.
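For example, a workload whose longest requests run about ten minutes could extend the window and limit how quickly replicas are removed. A sketch of the spec.behavior block (the values are illustrative, not recommendations):

    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600  # Wait 10 minutes of low load before scaling in.
        policies:
        - type: Pods
          value: 1           # Remove at most one pod...
          periodSeconds: 60  # ...per minute.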

  1. Create a file named hpa.yaml. Use the example that matches your inference framework.

    vLLM

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: vllm-inference # Replace with your vLLM inference service name.
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: vllm:num_requests_waiting
          target:
            type: AverageValue
            averageValue: 5
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300 # Adjust based on your longest expected request duration.

    SGLang

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: sgl-inference
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: sglang:num_queue_reqs
          target:
            type: AverageValue
            averageValue: 5
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300 # Adjust based on your longest expected request duration.
  2. Apply the HPA configuration:

    kubectl apply -f hpa.yaml
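    Confirm that the HPA can read its metric. The TARGETS column should show a numeric current value rather than <unknown>:

    kubectl get hpa llm-inference-hpa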

Step 4: Test the auto scaling configuration

Apply a load to your inference service to trigger the HPA.

For benchmark tool details and usage, see the vLLM Benchmark guide and the SGLang Benchmark guide.

  1. Create a file named benchmark.yaml. Set the image field to match your inference framework:

    • vLLM: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0

    • SGLang: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104

    YAML template

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      labels:
        app: llm-benchmark
      name: llm-benchmark
    spec:
      selector:
        matchLabels:
          app: llm-benchmark
      template:
        metadata:
          labels:
            app: llm-benchmark
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - command:
            - sh
            - -c
            - sleep inf
            image: # Replace with the vLLM or SGLang image listed above.
            imagePullPolicy: IfNotPresent
            name: llm-benchmark
            resources:
              limits:
                cpu: "8"
                memory: 40Gi
              requests:
                cpu: "8"
                memory: 40Gi
            volumeMounts:
            - mountPath: /models/Qwen3-32B
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model
  2. Deploy the benchmark client pod:

    kubectl create -f benchmark.yaml
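    Open a shell in the client pod. Because the manifest above is a StatefulSet, the first pod is named llm-benchmark-0:

    kubectl exec -it llm-benchmark-0 -- bash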
  3. Run the benchmark script from within the client pod to generate load on your inference service.

    vLLM

    python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
            --model /models/Qwen3-32B \
            --host inference-service \
            --port 8000 \
            --dataset-name random \
            --random-input-len 1500 \
            --random-output-len 100 \
            --random-range-ratio 1 \
            --num-prompts 400 \
            --max-concurrency 20

    SGLang

    python3 -m sglang.bench_serving --backend sglang \
            --model /models/Qwen3-32B \
            --host inference-service \
            --port 8000 \
            --dataset-name random \
            --random-input-len 1500 \
            --random-output-len 100 \
            --random-range-ratio 1 \
            --num-prompts 400 \
            --max-concurrency 20
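    While the benchmark runs, you can also watch the replica count of the inference workload change in real time. The StatefulSet name below matches the vLLM HPA example; replace it with your own:

    kubectl get statefulset vllm-inference -w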
  4. While the load test runs, open a new terminal and check the HPA status:

    kubectl describe hpa llm-inference-hpa

    When the HPA detects that the average number of waiting requests exceeds the target threshold, it scales up the StatefulSet. A successful scale-up produces a SuccessfulRescale event in the output:

    Name:                                   llm-inference-hpa
    Namespace:                              default
    Labels:                                 <none>
    Annotations:                            <none>
    CreationTimestamp:                      Fri, 25 Jul 2025 11:29:20 +0800
    Reference:                              StatefulSet/vllm-inference
    Metrics:                                ( current / target )
      "vllm:num_requests_waiting" on pods:  11 / 5
    Min replicas:                           1
    Max replicas:                           3
    StatefulSet pods:                       1 current / 3 desired
    Conditions:
      Type            Status  Reason              Message
      ----            ------  ------              -------
      AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 3
      ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
      ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  1s    horizontal-pod-autoscaler  New size: 3; reason: pods metric vllm:num_requests_waiting above target
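    After the benchmark completes and the request queue drains, the HPA waits out the scale-down stabilization window (300 seconds in these examples) before reducing replicas back toward minReplicas. You can watch the scale-in happen:

    kubectl get hpa llm-inference-hpa -w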
