When you deploy a model as an inference service by using KServe, the service must handle loads that fluctuate dynamically. KServe integrates the Kubernetes-native Horizontal Pod Autoscaler (HPA) and its scaling controller to automatically and flexibly scale the pods of a model based on CPU utilization, memory usage, GPU utilization, and custom performance metrics. This ensures the performance and stability of the service. This topic describes how to configure an auto scaling policy for a service by using KServe. In this example, a Qwen-7B-Chat-Int8 model that uses NVIDIA V100 GPUs is used.
Prerequisites
The Arena client of version 0.9.15 or later is installed. For more information, see Configure the Arena client.
The ack-kserve component is installed. For more information, see Install ack-kserve.
Configure an auto scaling policy based on CPU utilization or memory usage
Auto scaling in Raw Deployment mode is implemented based on the HPA mechanism of Kubernetes, which is the most basic method for auto scaling. HPA dynamically adjusts the number of pod replicas in a ReplicaSet based on the CPU utilization or memory usage of pods.
The following example shows how to configure an auto scaling policy based on CPU utilization. For more information about the HPA mechanism, see Horizontal Pod Autoscaling.
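In Raw Deployment mode, submitting the service with the scaling flags shown in the next step results in a standard Kubernetes HPA object for the predictor Deployment (the HPA named sklearn-iris-predictor is inspected later in this example). The following is only a minimal, hand-written sketch of a roughly equivalent HPA for reference; the exact object that ack-kserve generates may differ, and the Deployment name sklearn-iris-predictor follows KServe's <service>-predictor naming convention and is an assumption here.

# Sketch only, for reference: a plain Kubernetes HPA that keeps average CPU utilization around 10%
# and scales the predictor Deployment between 1 and 10 replicas. You do not need to create this
# object yourself; Arena and ack-kserve create the actual HPA when you submit the service.
cat <<EOF > sklearn-iris-hpa-sketch.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sklearn-iris-predictor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sklearn-iris-predictor
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 10
EOF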
Run the following command to submit a service:
arena serve kserve \
    --name=sklearn-iris \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
    --cpu=1 \
    --memory=200Mi \
    --scale-metric=cpu \
    --scale-target=10 \
    --min-replicas=1 \
    --max-replicas=10 \
    "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"
The following list describes the parameters.

--scale-metric: The metric based on which auto scaling is triggered. Valid values: cpu and memory. In this example, this parameter is set to cpu.
--scale-target: The scaling threshold, expressed as a percentage.
--min-replicas: The minimum number of pod replicas for scaling. The value must be an integer greater than 0. The value 0 is not supported.
--max-replicas: The maximum number of pod replicas for scaling. The value must be an integer greater than the value of --min-replicas.

Expected output:
inferenceservice.serving.kserve.io/sklearn-iris created
INFO[0002] The Job sklearn-iris has been submitted successfully
INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status
The preceding output indicates that the sklearn-iris service is created.
Run the following command to prepare an inference request.
Create a file named iris-input.json and copy the following JSON data to the file as the input data for inference.
cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
Run the following command to access the service for inference.
# Obtain the IP address of the Server Load Balancer (SLB) instance that is configured for the nginx-ingress-lb service in the kube-system namespace. The IP address is used for external access to the service.
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
# Obtain the URL of the sklearn-iris inference service and extract the hostname from the URL for subsequent use.
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to the inference service. The Host header carries the hostname extracted above, and -d @./iris-input.json sends the local file iris-input.json, which contains the input data required for model inference, as the request body.
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
    http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json
Expected output:
{"predictions":[1,1]}%
The preceding output indicates that the request contains two input instances and the service returns a prediction for each of them; both instances are assigned the same class.
Run the following command to initiate stress testing.
Note: For more information about the hey tool that is used for stress testing, see hey.
hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict
Open another terminal during stress testing and run the following command to check the scaling status of the service.
kubectl describe hpa sklearn-iris-predictor
Expected output:
The Events section in the expected output shows that HPA automatically adjusted the number of pod replicas based on CPU utilization. For example, HPA adjusted the number of pod replicas to 8, 7, and 1 at different points in time. This indicates that HPA can automatically scale pods based on CPU utilization.
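During the stress test, you can also watch the replica count change in real time instead of repeatedly describing the HPA. This is optional; the label serving.kserve.io/inferenceservice used below is the label that KServe typically adds to predictor pods, so verify the labels on your own pods first.

# Watch the HPA status refresh as replicas are added and removed.
kubectl get hpa sklearn-iris-predictor -w
# In another terminal, watch the pods of the inference service (assumed label).
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris -w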
Configure an auto scaling policy based on GPU utilization
Custom metric-based auto scaling is implemented based on the ack-alibaba-cloud-metrics-adapter component and the Kubernetes HPA mechanism provided by Container Service for Kubernetes (ACK). For more information, see Horizontal pod scaling based on Managed Service for Prometheus metrics.
The following example shows how to configure an auto scaling policy based on the GPU utilization of pods.
Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM model as an inference service.
Configure custom GPU metrics. For more information, see Enable auto scaling based on GPU metrics.
Run the following command to deploy a vLLM model as an inference service.
arena serve kserve \
    --name=qwen \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"
Expected output:
inferenceservice.serving.kserve.io/qwen created
INFO[0002] The Job qwen has been submitted successfully
INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status
The preceding output indicates that the vLLM model is deployed as an inference service.
Run the following commands to obtain the IP address of the NGINX Ingress controller and then access the inference service through that IP address to verify that the vLLM model runs as expected.
# Obtain the IP address of the NGINX Ingress controller.
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
# Obtain the hostname of the inference service.
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to access the inference service.
curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
    http://$NGINX_INGRESS_IP:80/v1/chat/completions \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Perform a test."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test? <|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}%
The preceding output indicates that the request is correctly sent to the server and the server returns an expected response in the JSON format.
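The HPA that drives GPU-based scaling (referred to as qwen-hpa in the following steps) is created as described in Enable auto scaling based on GPU metrics. As an illustration only, a rough sketch of such an HPA is shown below; the external metric name DCGM_FI_DEV_GPU_UTIL, the target value, and the use of an External metric are assumptions and must be replaced with the metric that your ack-alibaba-cloud-metrics-adapter configuration actually exposes before you apply anything.

# Sketch only: an HPA named qwen-hpa that scales the predictor Deployment on a GPU utilization metric.
# The metric name and target value are assumptions; use the metric exposed by your metrics adapter.
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-predictor
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: External
    external:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "50"
EOF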
Run the following command to perform a stress test on the service.
Note: For more information about the hey tool that is used for stress testing, see hey.
hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "qwen", "messages": [{"role": "user", "content": "Perform a test."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions
Open another terminal during stress testing and run the following command to check the scaling status of the service.
kubectl describe hpa qwen-hpa
Expected output:
The preceding output indicates that the number of pods increases to 2 during the stress test and drops back to 1 about 5 minutes after the test ends. This indicates that KServe can scale pods based on the GPU utilization of pods.
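If the HPA does not scale as expected, you can check which metrics the metrics adapter exposes to the HPA controller. The two commands below query the standard custom and external metrics APIs; whether the GPU metric appears under custom.metrics.k8s.io or external.metrics.k8s.io depends on how the adapter is configured.

# List the metrics that are registered with the custom and external metrics APIs.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"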
Configure a scheduled auto scaling policy
Scheduled auto scaling is implemented based on the ack-kubernetes-cronhpa-controller component provided by ACK. The component lets you scale pods to a specified number of replicas at specific points in time or on a recurring schedule, which helps you handle predictable load changes.
Install the CronHPA component. For more information, see Use CronHPA for scheduled horizontal scaling.
Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM model as an inference service.
Run the following command to deploy a vLLM model as an inference service.
arena serve kserve \
    --name=qwen-cronhpa \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"
Expected output:
inferenceservice.serving.kserve.io/qwen-cronhpa created
INFO[0004] The Job qwen-cronhpa has been submitted successfully
INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status
Run the following commands to verify that the vLLM model runs as expected.
# Obtain the IP address of the NGINX Ingress controller.
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
# Obtain the hostname of the inference service.
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Send a request to access the inference service.
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
    http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop": ["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'
Expected output:
{"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test? <|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}%
The preceding output indicates that the request is correctly sent to the service and the service returns an expected response in the JSON format.
Configure a scheduled scaling policy for the inference service.
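The manifest for the scheduled policy is not reproduced in this topic. The following is a minimal sketch of a CronHorizontalPodAutoscaler resource, assuming the autoscaling.alibabacloud.com/v1beta1 CRD that is installed by ack-kubernetes-cronhpa-controller; the schedules (6-field cron expressions that include seconds) and target sizes are placeholder values that you should replace with your own.

# Sketch only: scale the predictor Deployment up to 2 replicas at 08:00 and back down to 1 replica at 20:00 every day.
# The apiVersion, schedules, and target sizes are assumptions; adjust them to your workload.
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: qwen-cronhpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-cronhpa-predictor
  jobs:
  - name: scale-up
    schedule: "0 0 8 * * *"
    targetSize: 2
  - name: scale-down
    schedule: "0 0 20 * * *"
    targetSize: 1
EOF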
Expected output:
The preceding output indicates that a scheduled scaling policy is configured through the CronHorizontalPodAutoscaler resource named qwen-cronhpa. Based on the policy, the number of pods in the Deployment named qwen-cronhpa-predictor is automatically adjusted at specific points in time every day to meet the preset scaling requirements.
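To check whether the scheduled jobs fire as expected, you can also inspect the CronHPA resource itself; the resource name cronhorizontalpodautoscaler used below is an assumption based on the CRD kind and may differ in your cluster.

# Inspect the CronHPA resource to see its jobs, last schedule times, and events.
kubectl describe cronhorizontalpodautoscaler qwen-cronhpa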
References
For more information about auto scaling of ACK, see Auto scaling overview.