KServe integrates with the Kubernetes Horizontal Pod Autoscaler (HPA) and the ACK CronHPA controller, letting you automatically adjust the number of model service pods based on CPU utilization, memory usage, GPU utilization, or a time-based schedule.
This topic shows how to configure elastic scaling for a KServe inference service, using an sklearn-iris model for CPU-based scaling and a Qwen-7B-Chat-Int8 model on a V100 GPU for GPU-based and scheduled scaling.
Choose a scaling type
| Scaling type | Trigger | Deployment mode | When to use |
|---|---|---|---|
| CPU/memory-based HPA | CPU or memory utilization exceeds a threshold | Raw Deployment | Unpredictable traffic with CPU- or memory-bound inference workloads |
| GPU utilization-based HPA | Custom GPU metric (DCGM) exceeds a threshold | Raw Deployment | GPU-bound inference workloads, such as LLM serving |
| Scheduled scaling (CronHPA) | Time schedule (cron expression) | Raw Deployment | Predictable traffic patterns, such as business-hours peaks |
HPA does not support scaling to 0. The `--min-replicas` value must be an integer greater than 0.
Prerequisites
Before you begin, ensure that you have:
- The Arena client, version 0.9.15 or later. For more information, see Configure the Arena client.
- The ack-kserve component installed. For more information, see Install the ack-kserve component.
Configure CPU/memory-based elastic scaling
This scaling type uses the Kubernetes HPA mechanism in Raw Deployment mode. HPA dynamically adjusts the number of pod replicas in the predictor Deployment based on CPU or memory utilization. The following example configures CPU-based scaling using an sklearn-iris model.
For background on HPA, see the Kubernetes documentation on Horizontal Pod Autoscaling.
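Under Raw Deployment mode, the arena scaling flags used in the next step map onto a standard `autoscaling/v2` HorizontalPodAutoscaler that targets the predictor Deployment (named `sklearn-iris-predictor` in this example). The following is a minimal, illustrative sketch of the equivalent object; the object arena actually generates may differ in names and defaults:

```yaml
# Illustrative HPA equivalent to the arena flags used below.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sklearn-iris-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sklearn-iris-predictor   # the predictor Deployment created by KServe
  minReplicas: 1                   # --min-replicas (must be greater than 0)
  maxReplicas: 10                  # --max-replicas
  metrics:
  - type: Resource
    resource:
      name: cpu                    # --scale-metric=cpu
      target:
        type: Utilization
        averageUtilization: 10     # --scale-target=10
```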
1. Submit the inference service with scaling parameters.

   | Parameter | Description |
   | --- | --- |
   | `--scale-metric` | The scaling metric. Valid values: `cpu`, `memory`. |
   | `--scale-target` | The scaling threshold, as a percentage. |
   | `--min-replicas` | The minimum number of replicas. Must be an integer greater than 0. HPA does not support scaling to 0. |
   | `--max-replicas` | The maximum number of replicas. Must be an integer greater than the value of `--min-replicas`. |

   ```shell
   arena serve kserve \
       --name=sklearn-iris \
       --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
       --cpu=1 \
       --memory=200Mi \
       --scale-metric=cpu \
       --scale-target=10 \
       --min-replicas=1 \
       --max-replicas=10 \
       "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"
   ```

   Expected output:

   ```
   inferenceservice.serving.kserve.io/sklearn-iris created
   INFO[0002] The Job sklearn-iris has been submitted successfully
   INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status
   ```
2. Create an inference input file named `iris-input.json` with the following content:

   ```shell
   cat <<EOF > "./iris-input.json"
   {
     "instances": [
       [6.8, 2.8, 4.8, 1.4],
       [6.0, 3.4, 4.5, 1.6]
     ]
   }
   EOF
   ```
3. Test the inference service.

   ```shell
   # Get the load balancer IP of the nginx-ingress-lb service in the kube-system namespace.
   NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
   # Get the hostname of the sklearn-iris InferenceService.
   SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
   # Send a prediction request using the iris-input.json file.
   curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
       http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json
   ```

   Expected output:

   ```
   {"predictions":[1,1]}
   ```
4. Run a stress test to trigger scaling.

   Note: For more information about the Hey stress testing tool, see Hey.

   ```shell
   hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" \
       -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict
   ```
5. While the stress test is running, open a separate terminal and check the HPA scaling status.

   ```shell
   kubectl describe hpa sklearn-iris-predictor
   ```

   The `Events` section confirms that HPA automatically adjusted the replica count based on CPU utilization: scaling out to 8 replicas under load, then scaling in to 7 and finally 1 as load decreased.
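   If you prefer a live view, the following commands watch the HPA and the predictor pods directly (the pod selector assumes KServe's standard `serving.kserve.io/inferenceservice` label):

   ```shell
   # Watch the HPA's current and desired replica counts update under load.
   kubectl get hpa sklearn-iris-predictor --watch
   # Watch the predictor pods themselves being added and removed.
   kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris --watch
   ```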
Configure GPU utilization-based elastic scaling
This scaling type also uses the Kubernetes HPA mechanism in Raw Deployment mode, but relies on the ack-alibaba-cloud-metrics-adapter component to expose custom GPU metrics to HPA. For more information, see Horizontal pod autoscaling based on Alibaba Cloud Prometheus metrics.
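For orientation, custom GPU metrics reach HPA through adapter rules that map Prometheus series to Kubernetes metrics. The following is a minimal sketch in the standard prometheus-adapter rule format; the label names (`NamespaceName`, `PodName`) are assumptions about the DCGM exporter's series labels, and the authoritative configuration is described in the linked topic:

```yaml
# Sketch: expose the DCGM SM-utilization series as a per-pod custom metric
# that HPA can consume. Verify label names against the linked topic.
rules:
- seriesQuery: 'DCGM_CUSTOM_PROCESS_SM_UTIL{NamespaceName!="",PodName!=""}'
  resources:
    overrides:
      NamespaceName: {resource: "namespace"}
      PodName: {resource: "pod"}
  name:
    matches: "DCGM_CUSTOM_PROCESS_SM_UTIL"
    as: "DCGM_CUSTOM_PROCESS_SM_UTIL"
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```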
The following example demonstrates how to configure auto scaling based on a custom metric: the GPU utilization of the pods.
1. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.
2. Configure custom GPU metrics. For more information, see Implement elastic scaling based on GPU metrics.
3. Deploy the vLLM inference service with GPU-based scaling.

   ```shell
   arena serve kserve \
       --name=qwen \
       --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
       --gpus=1 \
       --cpu=4 \
       --memory=12Gi \
       --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
       --scale-target=50 \
       --min-replicas=1 \
       --max-replicas=2 \
       --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
       "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"
   ```

   Expected output:

   ```
   inferenceservice.serving.kserve.io/qwen created
   INFO[0002] The Job qwen has been submitted successfully
   INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status
   ```
4. Test the inference service.

   ```shell
   # Get the IP address of the Nginx Ingress.
   NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
   # Get the hostname of the InferenceService.
   SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
   # Send a test request to the vLLM chat completions endpoint.
   curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
       http://$NGINX_INGRESS_IP:80/v1/chat/completions \
       -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
   ```

   Expected output:

   ```
   {"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}
   ```
5. Run a stress test to trigger GPU-based scaling.

   Note: For more information about the Hey stress testing tool, see Hey.

   ```shell
   hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" \
       -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
       http://$NGINX_INGRESS_IP:80/v1/chat/completions
   ```
6. While the stress test is running, open a separate terminal and check the HPA scaling status.

   ```shell
   kubectl describe hpa qwen-hpa
   ```

   The output shows that the service scaled out to 2 pods during the stress test and scaled back in to 1 pod approximately 5 minutes after the test ended.
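   You can also query the current metric value through the custom metrics API, assuming the adapter exposes `DCGM_CUSTOM_PROCESS_SM_UTIL` as a per-pod custom metric as sketched earlier:

   ```shell
   # Read the current per-pod GPU utilization metric that HPA sees.
   kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_CUSTOM_PROCESS_SM_UTIL"
   ```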
Configure scheduled elastic scaling
Scheduled elastic scaling uses the ack-kubernetes-cronhpa-controller component (CronHPA) provided by ACK. CronHPA lets you adjust the number of pod replicas at specific times or intervals using cron expressions, making it suitable for predictable traffic patterns.
1. Install the CronHPA component. For more information, see Use CronHPA for scheduled horizontal scaling of containers.
2. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.
3. Deploy the vLLM inference service. The `serving.kserve.io/autoscalerClass=external` annotation disables the default HPA so that an external autoscaler, in this case CronHPA, can manage the replica count.

   ```shell
   arena serve kserve \
       --name=qwen-cronhpa \
       --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
       --gpus=1 \
       --cpu=4 \
       --memory=12Gi \
       --annotation="serving.kserve.io/autoscalerClass=external" \
       --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
       "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"
   ```

   Expected output:

   ```
   inferenceservice.serving.kserve.io/qwen-cronhpa created
   INFO[0004] The Job qwen-cronhpa has been submitted successfully
   INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status
   ```
4. Test the inference service.

   ```shell
   # Get the IP address of the Nginx Ingress.
   NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
   # Get the hostname of the InferenceService.
   SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
   # Send a test request to the vLLM chat completions endpoint.
   curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
       http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
       -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'
   ```

   Expected output:

   ```
   {"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}
   ```
5. Apply the CronHPA configuration. The following example scales out to 2 pods at 10:30 every day and scales in to 1 pod at 12:00 every day.
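   A minimal sketch of a CronHPA manifest that implements this schedule, assuming the `CronHorizontalPodAutoscaler` CRD (API group `autoscaling.alibabacloud.com/v1beta1`) shipped with ack-kubernetes-cronhpa-controller and the predictor Deployment name `qwen-cronhpa-predictor`; CronHPA schedules use six-field cron expressions with a leading seconds field:

   ```yaml
   # cronhpa.yaml -- illustrative; adjust names and namespace to your cluster.
   apiVersion: autoscaling.alibabacloud.com/v1beta1
   kind: CronHorizontalPodAutoscaler
   metadata:
     name: qwen-cronhpa
     namespace: default
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: Deployment
       name: qwen-cronhpa-predictor   # the predictor Deployment created by KServe
     jobs:
     - name: scale-out
       schedule: "0 30 10 * * *"      # seconds minutes hours day month week: 10:30 daily
       targetSize: 2
     - name: scale-in
       schedule: "0 0 12 * * *"       # 12:00 daily
       targetSize: 1
   ```

   ```shell
   kubectl apply -f cronhpa.yaml
   ```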
   The output confirms that the CronHPA resource is created with two scheduled jobs. The `qwen-cronhpa-predictor` Deployment automatically scales out to 2 pods at 10:30 and back in to 1 pod at 12:00 each day.
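   To verify the schedule and see when each job last ran, describe the CronHPA resource (the `cronhpa` short name is assumed to be registered by the controller's CRD):

   ```shell
   kubectl describe cronhpa qwen-cronhpa
   ```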
What's next
For more information about elastic scaling in ACK, see Auto Scaling.