Container Service for Kubernetes: Configure an auto scaling policy for a service by using KServe

Last Updated: Mar 26, 2026

KServe integrates Kubernetes Horizontal Pod Autoscaler (HPA) and the ACK CronHPA controller, letting you automatically adjust the number of model service pods based on CPU utilization, memory usage, GPU utilization, or a time-based schedule.

This topic shows how to configure elastic scaling for a KServe inference service. The CPU/memory-based example uses an sklearn-iris model; the GPU utilization-based and scheduled scaling examples use a Qwen-7B-Chat-Int8 model served on a V100 GPU.

Choose a scaling type

CPU/memory-based HPA: scales when CPU or memory utilization exceeds a threshold (Raw Deployment mode). Use for unpredictable traffic with CPU- or memory-bound inference workloads.
GPU utilization-based HPA: scales when a custom GPU metric (DCGM) exceeds a threshold (Raw Deployment mode). Use for GPU-bound inference workloads, such as LLM serving.
Scheduled scaling (CronHPA): scales on a time schedule defined by a cron expression (Raw Deployment mode). Use for predictable traffic patterns, such as business-hours peaks.
Note

HPA does not support scaling to 0. The --min-replicas value must be an integer greater than 0.

Prerequisites

Before you begin, ensure that an ACK cluster with KServe installed (Raw Deployment mode) is available, that the Arena client is installed, and that an Nginx Ingress controller is deployed in the cluster (the examples use the nginx-ingress-lb service in the kube-system namespace). The GPU utilization-based and scheduled scaling examples additionally require GPU-accelerated nodes and a PVC named llm-model that contains the Qwen-7B-Chat-Int8 model.

Configure CPU/memory-based elastic scaling

This scaling type uses the Kubernetes HPA mechanism in Raw Deployment mode. HPA dynamically adjusts the number of pod replicas in a ReplicaSet based on CPU or memory utilization. The following example configures CPU-based scaling using an sklearn-iris model.

For background on HPA, see the Kubernetes documentation on Horizontal Pod Autoscaling.
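
The --scale-metric, --scale-target, --min-replicas, and --max-replicas flags used in the procedure below are turned into a standard Kubernetes HPA object that targets the predictor Deployment. After you submit the service, you can inspect that object directly; the sklearn-iris-predictor name follows the <service-name>-predictor pattern shown in the kubectl describe output later in this procedure:

    # View the HPA that is created for the predictor Deployment.
    kubectl get hpa sklearn-iris-predictor -o yaml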

  1. Submit the inference service with scaling parameters.

    The following parameters control the scaling behavior:
    --scale-metric: The scaling metric. Valid values: cpu, memory.
    --scale-target: The scaling threshold, as a percentage.
    --min-replicas: The minimum number of replicas. Must be an integer greater than 0. HPA does not support scaling to 0.
    --max-replicas: The maximum number of replicas. Must be an integer greater than the value of --min-replicas.
    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --scale-metric=cpu \
        --scale-target=10 \
        --min-replicas=1 \
        --max-replicas=10 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    Expected output:

    inferenceservice.serving.kserve.io/sklearn-iris created
    INFO[0002] The Job sklearn-iris has been submitted successfully
    INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status
  2. Create an inference input file named iris-input.json with the following content:

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Test the inference service.

    # Get the load balancer IP of the nginx-ingress-lb service in the kube-system namespace.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the hostname of the sklearn-iris InferenceService.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a prediction request using the iris-input.json file.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json

    Expected output:

    {"predictions":[1,1]}
  4. Run a stress test to trigger scaling.

    Note

    For more information about the Hey stress testing tool, see Hey.

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict
  5. While the stress test is running, open a separate terminal and check the HPA scaling status.

    kubectl describe hpa sklearn-iris-predictor

    The following output shows the scaling status of the service:

    Name:                                                  sklearn-iris-predictor
    Namespace:                                             default
    Labels:                                                app=isvc.sklearn-iris-predictor
                                                           arena.kubeflow.org/uid=3399d840e8b371ed7ca45dda29debeb1
                                                           chart=kserve-0.1.0
                                                           component=predictor
                                                           heritage=Helm
                                                           release=sklearn-iris
                                                           serving.kserve.io/inferenceservice=sklearn-iris
                                                           servingName=sklearn-iris
                                                           servingType=kserve
    Annotations:                                           arena.kubeflow.org/username: kubecfg:certauth:admin
                                                           serving.kserve.io/deploymentMode: RawDeployment
    CreationTimestamp:                                     Sat, 11 May 2024 17:15:47 +0800
    Reference:                                             Deployment/sklearn-iris-predictor
    Metrics:                                               ( current / target )
      resource cpu on pods  (as a percentage of request):  0% (2m) / 10%
    Min replicas:                                          1
    Max replicas:                                          10
    Behavior:
      Scale Up:
        Stabilization Window: 0 seconds
        Select Policy: Max
        Policies:
          - Type: Pods     Value: 4    Period: 15 seconds
          - Type: Percent  Value: 100  Period: 15 seconds
      Scale Down:
        Select Policy: Max
        Policies:
          - Type: Percent  Value: 100  Period: 15 seconds
    Deployment pods:       10 current / 10 desired
    Conditions:
      Type            Status  Reason               Message
      ----            ------  ------               -------
      AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
      ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
      ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
    Events:
      Type    Reason             Age                  From                       Message
      ----    ------             ----                 ----                       -------
      Normal  SuccessfulRescale  38m                  horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) above target
      Normal  SuccessfulRescale  28m                  horizontal-pod-autoscaler  New size: 7; reason: All metrics below target
      Normal  SuccessfulRescale  27m                  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The Events section confirms that HPA automatically adjusted the replica count based on CPU utilization: scaling out to 8, then scaling in to 7 and 1 as load decreased.
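
    You can also watch the scaling happen in real time while the stress test runs. The commands below are a convenience sketch, assuming the predictor pods carry the same serving.kserve.io/inferenceservice label shown on the HPA above:

    # Watch predictor pods being added and removed as the HPA reacts to load.
    kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris -w
    # Alternatively, watch the HPA object update its current metric value and replica count.
    kubectl get hpa sklearn-iris-predictor -w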

Configure GPU utilization-based elastic scaling

This scaling type also uses the Kubernetes HPA mechanism in Raw Deployment mode, but relies on the ack-alibaba-cloud-metrics-adapter component to expose custom GPU metrics to HPA. For more information, see Horizontal pod autoscaling based on Alibaba Cloud Prometheus metrics.

The following example shows how to configure auto scaling based on a custom metric: the GPU utilization of the pods.
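
After you configure the custom GPU metrics in step 2 below, you can verify that the metrics adapter exposes the GPU metric before relying on it for scaling. This is a minimal check, assuming the metric is registered as a pods custom metric, as the HPA output in step 6 suggests:

    # List the custom metrics served by the metrics adapter and confirm that the DCGM metric is present.
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | grep DCGM_CUSTOM_PROCESS_SM_UTIL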

  1. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.

  2. Configure custom GPU metrics. For more information, see Implement elastic scaling based on GPU metrics.

  3. Deploy the vLLM inference service with GPU-based scaling.

    arena serve kserve \
        --name=qwen \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
        --scale-target=50 \
        --min-replicas=1 \
        --max-replicas=2 \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen created
    INFO[0002] The Job qwen has been submitted successfully
    INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status
  4. Test the inference service.

    # Get the IP address of the Nginx Ingress.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the hostname of the InferenceService.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a test request to the vLLM chat completions endpoint.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}
  5. Run a stress test to trigger GPU-based scaling.

    Note

    For more information about the Hey stress testing tool, see Hey.

    hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions
  6. While the stress test is running, open a separate terminal and check the HPA scaling status.

    kubectl describe hpa qwen-hpa

    The following output shows the scaling status of qwen-hpa:

    Name:                                     qwen-hpa
    Namespace:                                default
    Labels:                                   <none>
    Annotations:                              <none>
    CreationTimestamp:                        Tue, 14 May 2024 14:57:03 +0800
    Reference:                                Deployment/qwen-predictor
    Metrics:                                  ( current / target )
      "DCGM_CUSTOM_PROCESS_SM_UTIL" on pods:  0 / 50
    Min replicas:                             1
    Max replicas:                             2
    Deployment pods:                          1 current / 1 desired
    Conditions:
      Type            Status  Reason            Message
      ----            ------  ------            -------
      AbleToScale     True    ReadyForNewScale  recommended size matches current size
      ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric DCGM_CUSTOM_PROCESS_SM_UTIL
      ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  43m   horizontal-pod-autoscaler  New size: 2; reason: pods metric DCGM_CUSTOM_PROCESS_SM_UTIL above target
      Normal  SuccessfulRescale  34m   horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The output shows that the service scaled out to 2 pods during the stress test and scaled back in to 1 pod a few minutes after the test ended, once the HPA scale-down stabilization window (5 minutes by default) had elapsed.
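
    If you need faster scale-in while testing, the scale-down stabilization window can be shortened on the HPA object. This is a sketch only; edits made directly to the HPA may be reverted when KServe reconciles the service:

    # Shorten the scale-down stabilization window from the default 300 seconds to 60 seconds.
    kubectl patch hpa qwen-hpa --type merge -p '{"spec":{"behavior":{"scaleDown":{"stabilizationWindowSeconds":60}}}}'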

Configure scheduled elastic scaling

Scheduled elastic scaling uses the ack-kubernetes-cronhpa-controller component (CronHPA) provided by ACK. CronHPA lets you adjust the number of pod replicas at specific times or intervals using cron expressions, making it suitable for predictable traffic patterns.
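
The schedule field of each CronHPA job is a six-field cron expression with a leading seconds field, which is how the schedules in the following procedure are written:

    # Field order: Seconds Minutes Hours DayOfMonth Month DayOfWeek
    # "0 30 10 * * *"  ->  every day at 10:30:00
    # "0 0 12 * * *"   ->  every day at 12:00:00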

  1. Install the CronHPA component. For more information, see Use CronHPA for scheduled horizontal scaling of containers.

  2. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM inference service.

  3. Deploy the vLLM inference service. The serving.kserve.io/autoscalerClass=external annotation prevents KServe from creating its own HPA, so that CronHPA can manage the replica count.

    arena serve kserve \
        --name=qwen-cronhpa \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --annotation="serving.kserve.io/autoscalerClass=external" \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
       "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen-cronhpa created
    INFO[0004] The Job qwen-cronhpa has been submitted successfully
    INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status
  4. Test the inference service.

    # Get the IP address of the Nginx Ingress.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the hostname of the InferenceService.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a test request to the vLLM chat completions endpoint.
    curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
         -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'

    Expected output:

    {"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK, what do you need to test?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}
  5. Apply the CronHPA configuration. The following example scales out to 2 pods at 10:30 every day and scales in to 1 pod at 12:00 every day.

    kubectl apply -f- <<EOF
    apiVersion: autoscaling.alibabacloud.com/v1beta1
    kind: CronHorizontalPodAutoscaler
    metadata:
      name: qwen-cronhpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: qwen-cronhpa-predictor
      jobs:
      # Scale out at 10:30 every day.
      - name: "scale-up"
        schedule: "0 30 10 * * *"
        targetSize: 2
        runOnce: false
      # Scale in at 12:00 every day.
      - name: "scale-down"
        schedule: "0 0 12 * * *"
        targetSize: 1
        runOnce: false
    EOF

    To view the scheduled scaling configuration that was created, describe the CronHorizontalPodAutoscaler resource, for example by running kubectl describe cronhorizontalpodautoscaler qwen-cronhpa. The output is similar to the following:

    Name:         qwen-cronhpa
    Namespace:    default
    Labels:       <none>
    Annotations:  <none>
    API Version:  autoscaling.alibabacloud.com/v1beta1
    Kind:         CronHorizontalPodAutoscaler
    Metadata:
      Creation Timestamp:  2024-05-12T14:06:49Z
      Generation:          2
      Resource Version:    9205625
      UID:                 b9e72da7-262e-4***-b***-26586b7****c
    Spec:
      Jobs:
        Name:         scale-up
        Schedule:     0 30 10 * * *
        Target Size:  2
        Name:         scale-down
        Schedule:     0 0 12 * * *
        Target Size:  1
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Status:
      Conditions:
        Job Id:           3972f7cc-bab0-482e-8cbe-7c4*******5
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:
        Name:             scale-up
        Run Once:         false
        Schedule:         0 30 10 * * *
        State:            Submitted
        Target Size:      2
        Job Id:           36a04605-0233-4420-967c-ac2********6
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:
        Name:             scale-down
        Run Once:         false
        Schedule:         0 0 12 * * *
        State:            Submitted
        Target Size:      1
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Events:           <none>

    The output confirms the CronHPA resource is created with two scheduled jobs. The qwen-cronhpa-predictor Deployment automatically scales to 2 pods at 10:30 and back to 1 pod at 12:00 each day.
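
    To confirm that the schedule takes effect, check the replica count of the predictor Deployment shortly after each scheduled time:

    # Expect 2 replicas after 10:30 and 1 replica after 12:00.
    kubectl get deployment qwen-cronhpa-predictor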

What's next

For more information about elastic scaling in ACK, see Auto Scaling.