When managing large language model (LLM) inference services, it is crucial to handle the highly dynamic fluctuations in workload. This topic describes how to combine custom metrics from your inference framework with the Kubernetes Horizontal Pod Autoscaler (HPA) to automatically and flexibly scale your inference service pods. This ensures high availability and stability for your LLM services.
Prerequisites
You have deployed a standalone inference service or a distributed inference service.
You have enabled Managed Service for Prometheus in your Container Service for Kubernetes (ACK) cluster.
You have installed the ack-alibaba-cloud-metrics-adapter component and configured its AlibabaCloudMetricsAdapter.prometheus.url parameter to point to your Managed Service for Prometheus endpoint. For more information, see Modify the configuration of the ack-alibaba-cloud-metrics-adapter component.
Billing
The custom metrics that your service reports to Managed Service for Prometheus may incur additional fees. These fees vary based on factors such as your cluster size, the number of applications, and the volume of reported data. You can monitor and manage your costs by querying usage data.
Step 1: Configure metric collection
Unlike traditional microservices, LLM inference services are often bottlenecked by GPU computing power and memory, not CPU or system memory. Standard metrics such as GPU usage and memory usage can be misleading for determining the actual load on an inference service. Therefore, a more effective approach is to scale based on performance metrics exposed directly by the inference engine, such as request latency or queue depth.
If you have configured monitoring for LLM inference services, you can skip this step.
Create a file named podmonitor.yaml to instruct Prometheus to scrape metrics from your inference pods.
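The following is a minimal sketch of such a PodMonitor, not the exact manifest from this guide. The label selector (app: vllm-inference), the metrics port name, and the scrape interval are assumptions; adjust them to match the labels and metrics endpoint that your inference workload actually exposes.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llm-inference-podmonitor
  namespace: default
spec:
  selector:
    matchLabels:
      app: vllm-inference        # Assumed pod label; replace with the labels on your inference pods.
  podMetricsEndpoints:
  - port: metrics                # Assumed name of the container port that serves /metrics.
    path: /metrics
    interval: 15s
After saving the file, apply the configuration.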
kubectl apply -f ./podmonitor.yaml
Step 2: Configure ack-alibaba-cloud-metrics-adapter
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Applications > Helm.
On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.
In the Update Release panel, update the YAML configuration as shown in the following example and click OK. The metrics in the YAML are for demonstration purposes only. Modify them as needed.
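The following is a minimal sketch of what the updated values might look like, not the component's authoritative configuration. It assumes that the adapter accepts the open source prometheus-adapter rule syntax under AlibabaCloudMetricsAdapter.prometheus.adapter.rules.custom; the exact nesting may differ in your component version, so keep the structure that the panel already shows and only add the rule entries for the metrics you need.
AlibabaCloudMetricsAdapter:
  prometheus:
    url: http://<your-managed-prometheus-endpoint>  # Keep the endpoint that you configured in the prerequisites.
    adapter:
      rules:
        custom:
        # Expose the vLLM queue depth to the custom metrics API under its original name.
        - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: {resource: "namespace"}
              pod: {resource: "pod"}
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
        # Expose the SGLang queue depth in the same way.
        - seriesQuery: 'sglang:num_queue_reqs{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: {resource: "namespace"}
              pod: {resource: "pod"}
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Each rule maps a Prometheus series to a pod-level custom metric that the HPA in the next step can consume.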
Refer to the official documentation for a complete list of metrics for vLLM, SGLang, and Dynamo.
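Once the component update takes effect, you can optionally confirm that the adapter is serving the configured metrics through the custom metrics API. The jq pipe is only for readability and can be omitted.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
Metrics such as pods/vllm:num_requests_waiting should appear in the returned resource list.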
Step 3: Configure HPA
Create an HPA resource that targets your inference service and uses one of the custom metrics you configured.
The parameter configurations in the following scaling policies are for demonstration purposes only. Determine the appropriate thresholds for your specific use case based on performance testing, resource costs, and service-level objectives (SLOs).
Create a file named hpa.yaml. Choose the example that matches your inference framework.
vLLM
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: vllm-inference # Replace with your vLLM inference service name.
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:num_requests_waiting
      target:
        type: AverageValue # Pods metrics must use the AverageValue target type.
        averageValue: 5
SGLang
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: sgl-inference # Replace with your SGLang inference service name.
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: sglang:num_queue_reqs
      target:
        type: AverageValue # Pods metrics must use the AverageValue target type.
        averageValue: 5
Apply the HPA configuration.
kubectl apply -f hpa.yaml
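Optionally, check that the HPA can read the metric before you apply any load. A TARGETS value of <unknown> usually means the adapter is not yet serving the metric.
kubectl get hpa llm-inference-hpa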
Step 4: Test the auto scaling configuration
Use a benchmark tool to apply load to your service and trigger the HPA.
For details about the benchmark tools and how to use them, see vLLM Benchmark and SGLang Benchmark.
Create a file named benchmark.yaml. Specify the container image that matches the inference framework you are testing. Choose one of the following options (a sketch of the manifest follows the list):
For vLLM:
kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
For SGLang:
anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104
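The following is a minimal sketch of benchmark.yaml, not the exact manifest from this guide. The pod name (llm-benchmark), the model mount path, and the PVC name (llm-model) are assumptions; mount the same model files that your inference service uses, and switch the image when testing SGLang.
apiVersion: v1
kind: Pod
metadata:
  name: llm-benchmark              # Assumed pod name; referenced by the exec command below.
spec:
  restartPolicy: Never
  containers:
  - name: benchmark
    image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0  # Use the SGLang image instead when testing SGLang.
    command: ["sleep", "infinity"] # Keep the pod running so that you can exec into it.
    volumeMounts:
    - name: model
      mountPath: /models/Qwen3-32B # Assumed path; must match the --model argument of the benchmark script.
  volumes:
  - name: model
    persistentVolumeClaim:
      claimName: llm-model         # Hypothetical PVC that holds the Qwen3-32B model files.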
Deploy a benchmark client pod to generate traffic.
kubectl create -f benchmark.yaml
Run a benchmark script from within the client pod to generate a high load on your inference service.
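For example, assuming the pod name llm-benchmark from the sketch above, open a shell in the client pod first:
kubectl exec -it llm-benchmark -- bash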
vLLM
python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
    --model /models/Qwen3-32B \
    --host inference-service \
    --port 8000 \
    --dataset-name random \
    --random-input-len 1500 \
    --random-output-len 100 \
    --random-range-ratio 1 \
    --num-prompts 400 \
    --max-concurrency 20
SGLang
python3 -m sglang.bench_serving --backend sglang \
    --model /models/Qwen3-32B \
    --host inference-service \
    --port 8000 \
    --dataset-name random \
    --random-input-len 1500 \
    --random-output-len 100 \
    --random-range-ratio 1 \
    --num-prompts 400 \
    --max-concurrency 20
While the load test is running, open a new terminal and monitor the HPA's status.
kubectl describe hpa llm-inference-hpa
In the event log, you should see a SuccessfulRescale event, indicating that the HPA has detected the high number of waiting requests and scaled the number of replicas from 1 to 3.
Name: llm-inference-hpa
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 25 Jul 2025 11:29:20 +0800
Reference: StatefulSet/vllm-inference
Metrics: ( current / target )
"vllm:num_requests_waiting" on pods: 11 / 5
Min replicas: 1
Max replicas: 3
StatefulSet pods: 1 current / 3 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededRescale the HPA controller was able to update the target scale to 3
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 1s horizontal-pod-autoscaler New size: 3; reason: pods metric vllm:num_requests_waiting above target