Model inference services that you deploy and manage with KServe often face highly dynamic load fluctuations. KServe integrates with the native Kubernetes Horizontal Pod Autoscaler (HPA) and related scaling controllers, so you can automatically adjust the number of model service pods based on CPU utilization, memory usage, GPU utilization, or custom performance metrics to ensure service performance and stability. This topic uses the Qwen-7B-Chat-Int8 model on a V100 GPU to demonstrate how to configure auto scaling for a service using KServe.
Prerequisites
The Arena client, version 0.9.15 or later, is installed. For more information, see Configure the Arena client.
The ack-kserve component is installed. For more information, see Install the ack-kserve component.
Configure an auto scaling policy based on CPU or memory
Auto scaling in Raw Deployment mode relies on the Kubernetes Horizontal Pod Autoscaler (HPA) mechanism. HPA is a basic auto scaling method that dynamically adjusts the number of pod replicas in a ReplicaSet based on the CPU or memory utilization of pods.
This section demonstrates how to configure auto scaling based on CPU utilization. For more information about the HPA mechanism, see the Kubernetes documentation on Horizontal Pod Autoscaling.
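For reference, an HPA that scales a predictor deployment on CPU utilization has roughly the following shape. The object names and threshold below mirror the sklearn-iris example in this topic, but this manifest is an illustrative sketch, not the exact object that KServe generates:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sklearn-iris-predictor   # assumed name; KServe derives it from the InferenceService
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sklearn-iris-predictor
  minReplicas: 1                 # HPA cannot scale to 0
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 10   # corresponds to --scale-target=10
```

In practice, the arena command in the next step creates this object for you; you do not need to apply it manually.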
Run the following command to submit the service.
arena serve kserve \
    --name=sklearn-iris \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
    --cpu=1 \
    --memory=200Mi \
    --scale-metric=cpu \
    --scale-target=10 \
    --min-replicas=1 \
    --max-replicas=10 \
    "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

The parameters are as follows:
Parameter
Description
--scale-metric
The scaling metric. Valid values are cpu and memory. This example uses cpu.
--scale-target
The scaling threshold, as a percentage.
--min-replicas
The minimum number of replicas. This must be an integer greater than 0. HPA policies do not support scaling to 0.
--max-replicas
The maximum number of replicas. This must be an integer greater than the value of
minReplicas.

Expected output:
inferenceservice.serving.kserve.io/sklearn-iris created
INFO[0002] The Job sklearn-iris has been submitted successfully
INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status

The output indicates that the sklearn-iris service is created.
Run the following command to prepare an inference input request.
Create a file named iris-input.json and add the following JSON content. This content is the input data for model prediction.
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF

Run the following command to access the service and perform inference.
# Get the load balancer IP address of the service named nginx-ingress-lb in the kube-system namespace. This is the entry point for external access.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Get the URL of the InferenceService named sklearn-iris and extract the hostname for later use.
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to the model service. The request headers specify the target hostname (the SERVICE_HOSTNAME obtained earlier) and the JSON content type. -d @./iris-input.json reads the request body from the local file iris-input.json, which contains the input data for model prediction.
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
    http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json

Expected output:
{"predictions":[1,1]}

The output shows that predictions were returned for both input instances and the results are consistent.
Run the following command to start stress testing.
Note: For more information about the Hey stress testing tool, see Hey.
hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict

While the stress test is running, open another terminal and run the following command to view the scaling status of the service.
kubectl describe hpa sklearn-iris-predictor

Expected output:
The Events section in the output shows that HPA automatically adjusted the number of replicas based on CPU usage: for example, the replica count changed to 8, 7, and then 1 at different times during and after the stress test.
Configure an auto scaling policy based on a custom GPU utilization metric
Custom metric-based auto scaling relies on the ack-alibaba-cloud-metrics-adapter component provided by ACK and the Kubernetes HPA mechanism. For more information, see Horizontal pod autoscaling based on Alibaba Cloud Prometheus metrics.
The following example demonstrates how to configure custom metric-based auto scaling based on the GPU utilization of pods.
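With the metrics adapter in place, an HPA driven by an external GPU metric has roughly the following shape. This is an illustrative sketch only: the object names are assumptions, and the exact metric name and target format depend on how the ack-alibaba-cloud-metrics-adapter component is configured in your cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-hpa                 # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-predictor         # assumed name of the predictor deployment
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: External
    external:
      metric:
        name: DCGM_CUSTOM_PROCESS_SM_UTIL   # GPU SM utilization exposed through the adapter
      target:
        type: Value
        value: "50"              # corresponds to --scale-target=50
```

As with the CPU example, the arena command below configures this scaling behavior for you through the --scale-metric and --scale-target flags.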
Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.
Configure custom GPU metrics. For more information, see Implement elastic scaling based on GPU metrics.
Run the following command to deploy the vLLM service.
arena serve kserve \
    --name=qwen \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=50 \
    --min-replicas=1 \
    --max-replicas=2 \
    --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

Expected output:
inferenceservice.serving.kserve.io/qwen created
INFO[0002] The Job qwen has been submitted successfully
INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status

The output indicates that the inference service is deployed.
Run the following commands to use the NGINX Ingress gateway address to access the inference service and test whether the vLLM service is running correctly.
# Get the IP address of the NGINX Ingress.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Get the hostname of the InferenceService.
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to the inference service.
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
    http://$NGINX_INGRESS_IP:80/v1/chat/completions \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:
{"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}

The output indicates that the request reached the server and that the server returned an expected JSON response.
Run the following command to perform stress testing on the service.
Note: For more information about the Hey stress testing tool, see Hey.
hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions

During the stress test, open a new terminal and run the following command to view the scaling status of the service.
kubectl describe hpa qwen-hpa

Expected output:
The output shows that the number of pods scales out to 2 during the stress test. After the test ends, the number of pods scales in to 1 after about 5 minutes. This indicates that KServe can perform custom metric-based auto scaling based on the GPU utilization of pods.
Configure a scheduled auto scaling policy
Scheduled auto scaling requires the ack-kubernetes-cronhpa-controller component provided by ACK. This component lets you change the number of application replicas at specific times or intervals to handle predictable load changes.
Install the CronHPA component. For more information, see Use CronHPA for scheduled horizontal scaling of containers.
Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.
Run the following command to deploy the vLLM service.
arena serve kserve \
    --name=qwen-cronhpa \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --annotation="serving.kserve.io/autoscalerClass=external" \
    --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

Expected output:
inferenceservice.serving.kserve.io/qwen-cronhpa created
INFO[0004] The Job qwen-cronhpa has been submitted successfully
INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status

Run the following commands to test whether the vLLM service is running correctly.
# Get the IP address of the NGINX Ingress.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Get the hostname of the InferenceService.
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to access the inference service.
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
    http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'

Expected output:
{"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}

The output indicates that the request reached the service and that the service returned an expected JSON response.
Run the following command to configure scheduled auto scaling.
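The exact manifest used in this step is not preserved in this snapshot. For illustration only, a CronHorizontalPodAutoscaler managed by the ack-kubernetes-cronhpa-controller component generally looks like the following; the schedule times and replica counts here are assumptions, not the values used in this topic:

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: qwen-cronhpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-cronhpa-predictor
  jobs:
  # Schedules use a six-field cron expression (seconds first).
  # The times and sizes below are illustrative assumptions.
  - name: scale-up
    schedule: "0 0 8 * * *"     # scale out every day at 08:00
    targetSize: 2
  - name: scale-down
    schedule: "0 0 20 * * *"    # scale in every day at 20:00
    targetSize: 1
```

Apply a manifest like this with `kubectl apply -f`, then verify it with `kubectl describe cronhpa qwen-cronhpa`.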
Expected output:
The output shows that a scheduled auto scaling policy is configured through the qwen-cronhpa CronHorizontalPodAutoscaler resource. Based on the schedule, the number of pods in the qwen-cronhpa-predictor deployment is automatically adjusted at specific times each day to meet the preset scaling requirements.
References
For more information about ACK elastic scaling, see Auto Scaling.