Configure Kubernetes Horizontal Pod Autoscaler (HPA) to scale LLM inference pods automatically based on inference-specific metrics, such as request queue depth and KV cache usage, exposed by your inference framework.
How it works
LLM inference services are bottlenecked by GPU compute and GPU memory, not CPU or system memory. Scaling on GPU utilization or GPU memory usage is misleading: a GPU reporting 90% utilization may be stalled on a long decode sequence rather than serving more requests. Inference frameworks such as vLLM, SGLang, and Dynamo expose metrics that directly reflect service load, such as the number of waiting requests (`num_requests_waiting`) and KV cache usage (`kv_cache_usage_perc`). These are the right signals for scaling decisions.
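You can inspect these signals directly on a running pod. A minimal check, assuming a vLLM pod named `vllm-inference-0` that serves Prometheus metrics on its API port 8000 (the pod name, port, and exact metric names vary by framework and version):

```bash
# Print the queue-depth and KV cache gauges straight from the pod's /metrics endpoint.
kubectl exec -it vllm-inference-0 -- \
  curl -s localhost:8000/metrics | grep -E 'num_requests_waiting|cache_usage'
```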
The scaling pipeline works as follows:

1. A PodMonitor instructs Managed Service for Prometheus to scrape metrics from inference pods.
2. The ack-alibaba-cloud-metrics-adapter component bridges Prometheus metrics to the Kubernetes Custom Metrics API.
3. The HPA reads custom metrics from that API and scales your StatefulSet up or down.
Prerequisites
Before you begin, make sure you have:

- A standalone or distributed inference service deployed in your cluster. See Deploy a standalone LLM inference service or Deploy a distributed LLM inference service.
- Managed Service for Prometheus enabled in your Container Service for Kubernetes (ACK) cluster.
- The ack-alibaba-cloud-metrics-adapter component installed, with its `AlibabaCloudMetricsAdapter.prometheus.url` parameter pointing to your Managed Service for Prometheus endpoint. For details, see Modify the configuration of the ack-alibaba-cloud-metrics-adapter component.
Billing
Managed Service for Prometheus collects the custom metrics that your service emits, which may incur additional fees. Fees vary based on your cluster size, number of applications, and data volume. To monitor your usage, see query usage data.
Step 1: Configure metric collection
If you have already configured monitoring for LLM inference services, skip this step.
Create a PodMonitor resource to instruct Prometheus to scrape metrics from your inference pods.
1. Create a file named `podmonitor.yaml` that tells Prometheus which pods and port to scrape (a minimal example follows this list).

2. Apply the configuration:

   ```bash
   kubectl apply -f ./podmonitor.yaml
   ```
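A minimal PodMonitor sketch, assuming the inference pods are labeled `app: vllm-inference` and expose Prometheus metrics at `/metrics` on a container port named `http`; adjust the selector, port, and namespace to match your deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llm-inference-monitor
  namespace: default
spec:
  selector:
    matchLabels:
      app: vllm-inference        # hypothetical label; match your inference pods
  podMetricsEndpoints:
  - port: http                   # hypothetical port name; vLLM serves /metrics on its API port
    path: /metrics
    interval: 15s
```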
Step 2: Configure ack-alibaba-cloud-metrics-adapter
Configure the metrics adapter to expose inference framework metrics through the Kubernetes Custom Metrics API. The adapter uses two fields per metric rule:

- `seriesQuery`: selects which Prometheus time series to include (filtered by label selectors such as `namespace` and `pod`).
- `metricsQuery`: defines how to aggregate those series (typically `sum(...) by (<<.GroupBy>>)`).
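As an illustration of how the two fields combine (assuming standard prometheus-adapter template semantics), a rule for `vllm:num_requests_waiting` expands at query time into a concrete PromQL query:

```
# metricsQuery template in the rule:
sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)

# what the adapter actually sends to Prometheus for pods in the default namespace:
sum(vllm:num_requests_waiting{namespace="default",pod=~"vllm-inference-0|vllm-inference-1"}) by (pod)
```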
1. Log on to the ACK console. In the left navigation pane, click Clusters.

2. On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Applications > Helm.

3. On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.

4. In the Update Release panel, update the YAML configuration as sketched below, then click OK. The sketch shows rules for vLLM and SGLang; Dynamo metrics follow the same pattern. Include only the metrics for your inference framework. For a complete list of available metrics, refer to the official documentation: vLLM metrics, SGLang production metrics, and Dynamo metrics.

   The metrics in this example are for demonstration. Modify them based on your inference framework and scaling requirements.
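   A sketch of the relevant Helm values, assuming the component accepts standard prometheus-adapter rules under `AlibabaCloudMetricsAdapter.prometheus.adapter`; verify the exact structure against your release's current values before applying:

   ```yaml
   AlibabaCloudMetricsAdapter:
     prometheus:
       url: http://<your-prometheus-endpoint>    # keep your existing endpoint setting
       adapter:
         rules:
           custom:
           # vLLM: requests waiting in the scheduler queue
           - seriesQuery: vllm:num_requests_waiting{namespace!="",pod!=""}
             resources:
               overrides:
                 namespace: {resource: "namespace"}
                 pod: {resource: "pod"}
             metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
           # SGLang: requests waiting in the queue
           - seriesQuery: sglang:num_queue_reqs{namespace!="",pod!=""}
             resources:
               overrides:
                 namespace: {resource: "namespace"}
                 pod: {resource: "pod"}
             metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
   ```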
5. Verify that the metrics adapter exposes the metrics correctly. Run the following command for each metric you configured:

   ```bash
   kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"
   ```

   A successful response returns a JSON object with a `value` field for each pod, as shown below. If the command returns an error, check that the PodMonitor is deployed and that Prometheus is scraping your inference pods.
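   For reference, a successful response follows the Custom Metrics API `MetricValueList` shape; the pod name, timestamp, and value below are illustrative:

   ```json
   {
     "kind": "MetricValueList",
     "apiVersion": "custom.metrics.k8s.io/v1beta1",
     "metadata": {},
     "items": [
       {
         "describedObject": {
           "kind": "Pod",
           "namespace": "default",
           "name": "vllm-inference-0",
           "apiVersion": "/v1"
         },
         "metricName": "vllm:num_requests_waiting",
         "timestamp": "2025-07-25T03:30:00Z",
         "value": "0"
       }
     ]
   }
   ```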
Step 3: Configure HPA
Create an HPA resource that targets your inference StatefulSet and triggers scaling based on a custom metric.
The parameter values in the following examples are for demonstration only. Set thresholds based on your own performance testing, resource costs, and service-level objectives (SLOs). LLM inference pods take time to start, and individual requests can run for minutes. To prevent the HPA from scaling down while long-running requests are still in progress, configure `behavior.scaleDown.stabilizationWindowSeconds`. The default value is 300 seconds. Increase this value if your workload has requests that exceed five minutes.
1. Create a file named `hpa.yaml`. Use the example that matches your inference framework.

   vLLM:

   ```yaml
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: llm-inference-hpa
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: StatefulSet
       name: vllm-inference          # Replace with your vLLM inference service name.
     minReplicas: 1
     maxReplicas: 3
     metrics:
     - type: Pods
       pods:
         metric:
           name: vllm:num_requests_waiting
         target:
           type: AverageValue        # Pods metrics support only AverageValue targets.
           averageValue: 5
     behavior:
       scaleDown:
         stabilizationWindowSeconds: 300   # Adjust based on your longest expected request duration.
   ```

   SGLang:

   ```yaml
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: llm-inference-hpa
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: StatefulSet
       name: sgl-inference           # Replace with your SGLang inference service name.
     minReplicas: 1
     maxReplicas: 3
     metrics:
     - type: Pods
       pods:
         metric:
           name: sglang:num_queue_reqs
         target:
           type: AverageValue        # Pods metrics support only AverageValue targets.
           averageValue: 5
     behavior:
       scaleDown:
         stabilizationWindowSeconds: 300   # Adjust based on your longest expected request duration.
   ```
2. Apply the HPA configuration:

   ```bash
   kubectl apply -f hpa.yaml
   ```
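Before generating load, you can confirm that the HPA is reading the metric. If the TARGETS column shows `<unknown>`, the adapter is not serving the metric; the output below is illustrative:

```bash
kubectl get hpa llm-inference-hpa
# NAME                REFERENCE                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# llm-inference-hpa   StatefulSet/vllm-inference   0/5       1         3         1          1m
```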
Step 4: Test the auto scaling configuration
Generate load against your inference service to trigger the HPA.
For benchmark tool details and usage, see the vLLM Benchmark guide and the SGLang Benchmark guide.
1. Create a file named `benchmark.yaml` for the benchmark client pod (a sketch of the manifest follows this step). Set the `image` field to match your inference framework:

   - vLLM: `kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0`
   - SGLang: `anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104`
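   A minimal sketch of the client pod, assuming the model files referenced by the benchmark commands are available through a PersistentVolumeClaim (the claim name `models-pvc` is hypothetical; mount whatever storage holds your model):

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: benchmark-client
   spec:
     restartPolicy: Never
     containers:
     - name: benchmark
       image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0   # or the SGLang image above
       command: ["sleep", "infinity"]        # keep the pod alive so you can exec into it
       volumeMounts:
       - name: models
         mountPath: /models
     volumes:
     - name: models
       persistentVolumeClaim:
         claimName: models-pvc               # hypothetical PVC holding Qwen3-32B
   ```

   After the pod is running, open a shell in it with `kubectl exec -it benchmark-client -- bash` to run the commands in step 3.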
2. Deploy the benchmark client pod:

   ```bash
   kubectl create -f benchmark.yaml
   ```
3. Run the benchmark script from within the client pod to generate load on your inference service.

   vLLM:

   ```bash
   python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
     --model /models/Qwen3-32B \
     --host inference-service \
     --port 8000 \
     --dataset-name random \
     --random-input-len 1500 \
     --random-output-len 100 \
     --random-range-ratio 1 \
     --num-prompts 400 \
     --max-concurrency 20
   ```

   SGLang:

   ```bash
   python3 -m sglang.bench_serving --backend sglang \
     --model /models/Qwen3-32B \
     --host inference-service \
     --port 8000 \
     --dataset-name random \
     --random-input-len 1500 \
     --random-output-len 100 \
     --random-range-ratio 1 \
     --num-prompts 400 \
     --max-concurrency 20
   ```
4. While the load test runs, open a new terminal and check the HPA status:

   ```bash
   kubectl describe hpa llm-inference-hpa
   ```

   When the HPA detects that the average number of waiting requests exceeds the target threshold, it scales up the StatefulSet. A successful scale-up produces a `SuccessfulRescale` event in the output:

   ```
   Name:                   llm-inference-hpa
   Namespace:              default
   Labels:                 <none>
   Annotations:            <none>
   CreationTimestamp:      Fri, 25 Jul 2025 11:29:20 +0800
   Reference:              StatefulSet/vllm-inference
   Metrics:                ( current / target )
     "vllm:num_requests_waiting" on pods:  11 / 5
   Min replicas:           1
   Max replicas:           3
   StatefulSet pods:       1 current / 3 desired
   Conditions:
     Type            Status  Reason              Message
     ----            ------  ------              -------
     AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 3
     ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
     ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
   Events:
     Type    Reason             Age   From                       Message
     ----    ------             ----  ----                       -------
     Normal  SuccessfulRescale  1s    horizontal-pod-autoscaler  New size: 3; reason: pods metric vllm:num_requests_waiting above target
   ```
What's next
- Learn how to configure monitoring for LLM inference services.
- Review the full list of available metrics to tune your scaling thresholds: vLLM metrics, SGLang production metrics, and Dynamo metrics.