
Container Service for Kubernetes: Configure an auto scaling policy for a service by using KServe

Last Updated: Dec 26, 2025

Model inference services often face highly dynamic load fluctuations. KServe integrates the native Kubernetes Horizontal Pod Autoscaler (HPA) and related scaling controllers, so you can automatically adjust the number of model service pods based on CPU utilization, memory usage, GPU utilization, or custom performance metrics to keep services performant and stable. This topic uses the Qwen-7B-Chat-Int8 model on a V100 GPU to demonstrate how to configure auto scaling for a service by using KServe.

Prerequisites

Configure an auto scaling policy based on CPU or memory

Auto scaling in Raw Deployment mode relies on the Kubernetes Horizontal Pod Autoscaler (HPA) mechanism. HPA is a basic auto scaling method that dynamically adjusts the number of pod replicas in a ReplicaSet based on the CPU or memory utilization of pods.

This section demonstrates how to configure auto scaling based on CPU utilization. For more information about the HPA mechanism, see the Kubernetes documentation on Horizontal Pod Autoscaling.
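As background, HPA computes the desired replica count as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). The following illustrative sketch (not part of the deployment steps, with hypothetical utilization numbers) shows the arithmetic:

```shell
# Illustrative only: HPA's core scaling formula,
#   desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
# With 3 replicas averaging 25% CPU against a 10% target, HPA scales to 8 replicas.
current_replicas=3
current_cpu=25   # hypothetical observed average CPU utilization, percent of request
target_cpu=10    # the HPA target (set with --scale-target in the steps below)
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))  # integer ceil
echo "desired replicas: $desired"
```

With these numbers the sketch prints `desired replicas: 8`, which is how a burst of load drives the replica count up until average utilization falls back under the target.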

  1. Run the following command to submit the service.

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --scale-metric=cpu \
        --scale-target=10 \
        --min-replicas=1 \
        --max-replicas=10 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    The parameters are as follows:

    --scale-metric: The scaling metric. Valid values: cpu and memory. This example uses cpu.

    --scale-target: The scaling threshold, as a percentage of the resource request.

    --min-replicas: The minimum number of replicas. Must be an integer greater than 0. HPA policies do not support scaling to 0.

    --max-replicas: The maximum number of replicas. Must be an integer greater than the value of --min-replicas.

    Expected output:

    inferenceservice.serving.kserve.io/sklearn-iris created
    INFO[0002] The Job sklearn-iris has been submitted successfully 
    INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status 

    The output indicates that the sklearn-iris service is created.
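Under the hood, these arena flags configure a standard HorizontalPodAutoscaler that targets the predictor Deployment. The resulting object is roughly equivalent to the following sketch (field values taken from the flags above; the exact object KServe generates may differ):

```yaml
# Approximate HPA equivalent of the arena flags above (a sketch, not the generated object verbatim).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sklearn-iris-predictor
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sklearn-iris-predictor
  minReplicas: 1     # --min-replicas
  maxReplicas: 10    # --max-replicas
  metrics:
  - type: Resource
    resource:
      name: cpu                 # --scale-metric
      target:
        type: Utilization
        averageUtilization: 10  # --scale-target
```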

  2. Run the following command to prepare an inference input request.

    Create a file named iris-input.json and add the following JSON content. This content is the input data for model prediction.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the service and perform inference.

    # Get the load balancer IP address of the service named nginx-ingress-lb from the kube-system namespace. This is the entry point for external access.
    NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
    # Get the URL of the Inference Service named sklearn-iris and extract the hostname for later use.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Use the curl command to send a request to the model service. The request header specifies the target hostname (the SERVICE_HOSTNAME obtained earlier) and the JSON content type. -d @./iris-input.json specifies that the request body comes from the local file iris-input.json, which contains the input data required for model prediction.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json

    Expected output:

    {"predictions":[1,1]}%

    The output shows that the model returned class 1 for both input instances.

  4. Run the following command to start stress testing.

    Note

    For more information about the Hey stress testing tool, see Hey.

    # -z 2m: run for 2 minutes. -c 20: 20 concurrent workers. -D: read the request body from a local file.
    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict
  5. While the stress test is running, open another terminal and run the following command to view the scaling status of the service.

    kubectl describe hpa sklearn-iris-predictor

    Expected output:


    Name:                                                  sklearn-iris-predictor
    Namespace:                                             default
    Labels:                                                app=isvc.sklearn-iris-predictor
                                                           arena.kubeflow.org/uid=3399d840e8b371ed7ca45dda29debeb1
                                                           chart=kserve-0.1.0
                                                           component=predictor
                                                           heritage=Helm
                                                           release=sklearn-iris
                                                           serving.kserve.io/inferenceservice=sklearn-iris
                                                           servingName=sklearn-iris
                                                           servingType=kserve
    Annotations:                                           arena.kubeflow.org/username: kubecfg:certauth:admin
                                                           serving.kserve.io/deploymentMode: RawDeployment
    CreationTimestamp:                                     Sat, 11 May 2024 17:15:47 +0800
    Reference:                                             Deployment/sklearn-iris-predictor
    Metrics:                                               ( current / target )
      resource cpu on pods  (as a percentage of request):  0% (2m) / 10%
    Min replicas:                                          1
    Max replicas:                                          10
    Behavior:
      Scale Up:
        Stabilization Window: 0 seconds
        Select Policy: Max
        Policies:
          - Type: Pods     Value: 4    Period: 15 seconds
          - Type: Percent  Value: 100  Period: 15 seconds
      Scale Down:
        Select Policy: Max
        Policies:
          - Type: Percent  Value: 100  Period: 15 seconds
    Deployment pods:       10 current / 10 desired
    Conditions:
      Type            Status  Reason               Message
      ----            ------  ------               -------
      AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
      ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
      ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
    Events:
      Type    Reason             Age                  From                       Message
      ----    ------             ----                 ----                       -------
      Normal  SuccessfulRescale  38m                  horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) above target
      Normal  SuccessfulRescale  28m                  horizontal-pod-autoscaler  New size: 7; reason: All metrics below target
      Normal  SuccessfulRescale  27m                  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The Events section in the output shows that HPA automatically adjusted the number of replicas based on CPU utilization: the replica count rose to 8 under load and then dropped to 7 and finally 1 as the load subsided.

Configure an auto scaling policy based on custom GPU metrics

Custom metric-based auto scaling relies on the ack-alibaba-cloud-metrics-adapter component provided by ACK and the Kubernetes HPA mechanism. For more information, see Horizontal pod autoscaling based on Alibaba Cloud Prometheus metrics.

The following example demonstrates how to configure auto scaling based on a custom metric: the GPU utilization of pods.

  1. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.

  2. Configure custom GPU metrics. For more information, see Implement elastic scaling based on GPU metrics.

  3. Run the following command to deploy the vLLM service.

    arena serve kserve \
        --name=qwen \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
        --scale-target=50 \
        --min-replicas=1 \
        --max-replicas=2 \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen created
    INFO[0002] The Job qwen has been submitted successfully 
    INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status 

    The output indicates that the inference service is deployed.

  4. Run the following commands to use the NGINX Ingress gateway address to access the inference service and test whether the vLLM service is running correctly.

    # Get the IP address of the Nginx Ingress.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the hostname of the Inference Service.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to access the inference service.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}%    

    The output indicates that the request was sent to the server and that the server returned an expected JSON response.

  5. Run the following command to perform stress testing on the service.

    Note

    For more information about the Hey stress testing tool, see Hey.

    # -z 2m: run for 2 minutes. -c 5: 5 concurrent workers. -d: inline request body.
    hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions
  6. During the stress test, open a new terminal and run the following command to view the scaling status of the service.

    kubectl describe hpa qwen-hpa

    Expected output:


    Name:                                     qwen-hpa
    Namespace:                                default
    Labels:                                   <none>
    Annotations:                              <none>
    CreationTimestamp:                        Tue, 14 May 2024 14:57:03 +0800
    Reference:                                Deployment/qwen-predictor
    Metrics:                                  ( current / target )
      "DCGM_CUSTOM_PROCESS_SM_UTIL" on pods:  0 / 50
    Min replicas:                             1
    Max replicas:                             2
    Deployment pods:                          1 current / 1 desired
    Conditions:
      Type            Status  Reason            Message
      ----            ------  ------            -------
      AbleToScale     True    ReadyForNewScale  recommended size matches current size
      ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric DCGM_CUSTOM_PROCESS_SM_UTIL
      ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  43m   horizontal-pod-autoscaler  New size: 2; reason: pods metric DCGM_CUSTOM_PROCESS_SM_UTIL above target
      Normal  SuccessfulRescale  34m   horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The output shows that the number of pods scales out to 2 during the stress test. After the test ends, the number of pods scales in to 1 after about 5 minutes. This indicates that KServe can perform custom metric-based auto scaling based on the GPU utilization of pods.
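The same HPA formula drives these rescale events, but for a pods metric such as DCGM_CUSTOM_PROCESS_SM_UTIL the target (50) is an absolute average value rather than a percentage of the resource request. An illustrative sketch with hypothetical utilization numbers:

```shell
# Illustrative only: pods-metric scaling uses the same ceil() formula as resource
# metrics, with an absolute averageValue target instead of a utilization percentage.
current_replicas=1
sm_util=80    # hypothetical average GPU SM utilization reported via DCGM
target=50     # the HPA target (set with --scale-target above)
desired=$(( (current_replicas * sm_util + target - 1) / target ))  # integer ceil
echo "desired replicas: $desired"
```

With these numbers the sketch prints `desired replicas: 2`, consistent with the "New size: 2" event recorded during the stress test (the result is also capped by --max-replicas=2).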

Configure a scheduled auto scaling policy

Scheduled auto scaling requires the ack-kubernetes-cronhpa-controller component provided by ACK. This component lets you change the number of application replicas at specific times or intervals to handle predictable load changes.
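Note that CronHPA schedules use a six-field cron expression with a leading seconds field (second, minute, hour, day of month, month, day of week), unlike the standard five-field crontab format. A quick sketch of how such an expression breaks down:

```shell
# Illustrative only: CronHPA cron expressions carry six fields, seconds first.
schedule="0 30 10 * * *"   # fires at 10:30:00 every day
set -f                     # disable globbing so the asterisks stay literal
set -- $schedule
echo "second=$1 minute=$2 hour=$3 day-of-month=$4 month=$5 day-of-week=$6"
```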

  1. Install the CronHPA component. For more information, see Use CronHPA for scheduled horizontal scaling of containers.

  2. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.

  3. Run the following command to deploy the vLLM service. The --annotation="serving.kserve.io/autoscalerClass=external" setting tells KServe not to create its built-in HPA, so that an external autoscaler (CronHPA in this example) can manage the replica count.

    arena serve kserve \
        --name=qwen-cronhpa \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --annotation="serving.kserve.io/autoscalerClass=external" \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen-cronhpa created
    INFO[0004] The Job qwen-cronhpa has been submitted successfully 
    INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status 
  4. Run the following commands to test whether the vLLM service is running correctly.

    # Get the IP address of the Nginx Ingress.
    NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
    # Get the hostname of the Inference Service.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to access the inference service.
    curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
         -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'

    Expected output:

    {"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}% 

    The output indicates that the request was sent to the service and that the service returned an expected JSON response.

  5. Run the following command to configure scheduled auto scaling.


    kubectl apply -f- <<EOF
    apiVersion: autoscaling.alibabacloud.com/v1beta1
    kind: CronHorizontalPodAutoscaler
    metadata:
      name: qwen-cronhpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: qwen-cronhpa-predictor
      jobs:
      # Scale out to 2 replicas at 10:30 every day.
      - name: "scale-up"
        schedule: "0 30 10 * * *"
        targetSize: 2
        runOnce: false
      # Scale in to 1 replica at 12:00 every day.
      - name: "scale-down"
        schedule: "0 0 12 * * *"
        targetSize: 1
        runOnce: false
    EOF

    The command creates a CronHorizontalPodAutoscaler resource named qwen-cronhpa. To view the preset scaling configuration, run `kubectl describe cronhorizontalpodautoscaler qwen-cronhpa`. Expected output:

    Name:         qwen-cronhpa
    Namespace:    default
    Labels:       <none>
    Annotations:  <none>
    API Version:  autoscaling.alibabacloud.com/v1beta1
    Kind:         CronHorizontalPodAutoscaler
    Metadata:
      Creation Timestamp:  2024-05-12T14:06:49Z
      Generation:          2
      Resource Version:    9205625
      UID:                 b9e72da7-262e-4***-b***-26586b7****c
    Spec:
      Jobs:
        Name:         scale-up
        Schedule:     0 30 10 * * *
        Target Size:  2
        Name:         scale-down
        Schedule:     0 0 12 * * *
        Target Size:  1
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Status:
      Conditions:
        Job Id:           3972f7cc-bab0-482e-8cbe-7c4*******5
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:          
        Name:             scale-up
        Run Once:         false
        Schedule:         0 30 10 * * *
        State:            Submitted
        Target Size:      2
        Job Id:           36a04605-0233-4420-967c-ac2********6
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:          
        Name:             scale-down
        Run Once:         false
        Schedule:         0 0 12 * * *
        State:            Submitted
        Target Size:      1
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Events:           <none>
    

    The output shows that a scheduled scaling plan is configured for the qwen-cronhpa resource. Based on the schedule, the number of pods in the qwen-cronhpa-predictor Deployment is automatically adjusted at the specified times each day to meet the preset scaling requirements.

References

For more information about ACK elastic scaling, see Auto Scaling.