
Container Service for Kubernetes:Configure an auto scaling policy for a service by using KServe

Last Updated: Jul 24, 2024

When you use KServe to deploy a model as an inference service, the service must handle loads that fluctuate dynamically. KServe integrates the Kubernetes-native Horizontal Pod Autoscaler (HPA) and a scaling controller to automatically and flexibly scale the pods of a model based on CPU utilization, memory usage, GPU utilization, and custom metrics. This ensures the performance and stability of the service. This topic describes how to configure an auto scaling policy for a service by using KServe. In this example, a Qwen-7B-Chat-Int8 model that runs on NVIDIA V100 GPUs is used.

Prerequisites

Configure an auto scaling policy based on CPU utilization or memory usage

Auto scaling in Raw Deployment mode is implemented based on the HPA mechanism of Kubernetes, which is the most basic method for auto scaling. HPA dynamically adjusts the number of pod replicas in a ReplicaSet based on the CPU utilization or memory usage of pods.

The following example shows how to configure an auto scaling policy based on CPU utilization. For more information about the HPA mechanism, see Horizontal Pod Autoscaling.
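
The HPA controller periodically compares the current metric value of the pods with the configured target and computes the desired number of replicas. A minimal sketch of the calculation, as defined by the Kubernetes HPA algorithm:

    desiredReplicas = ceil( currentReplicas × currentMetricValue / targetMetricValue )

For example, if one replica reports a CPU utilization of 80% against a 10% target, HPA scales the workload toward ceil(1 × 80 / 10) = 8 replicas. The 80% value is hypothetical and used only for illustration; the actual utilization depends on the load.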

  1. Run the following command to submit a service:

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --scale-metric=cpu \
        --scale-target=10 \
        --min-replicas=1 \
        --max-replicas=10 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    The following table describes the parameters.

    Parameter         Description
    --scale-metric    The metric based on which auto scaling is triggered. Valid values: cpu and memory. In this example, this parameter is set to cpu.
    --scale-target    The scaling threshold. For the cpu metric in this example, the value 10 specifies a target CPU utilization of 10% of the CPU request of each pod.
    --min-replicas    The minimum number of pod replicas for scaling. The value must be an integer greater than 0. The value 0 is not supported.
    --max-replicas    The maximum number of pod replicas for scaling. The value must be an integer greater than the value of --min-replicas.

    Expected output:

    inferenceservice.serving.kserve.io/sklearn-iris created
    INFO[0002] The Job sklearn-iris has been submitted successfully 
    INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status 

    The preceding output indicates that the sklearn-iris service is created.

  2. Run the following command to prepare an inference request.

    The following command creates a file named iris-input.json that contains the input data for inference.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following command to access the service for inference.

    # Obtain the IP address of the Server Load Balancer (SLB) instance that is configured for the service named nginx-ingress-lb from the kube-system namespace. The IP address is used for external access to the service. 
    NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
    # Obtain the URL of the inference service named sklearn-iris and extract the hostname from the URL for subsequent use. 
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Run the curl command to send a request to the inference service. The hostname that is extracted from the URL of the sklearn-iris service and JSON content are contained in the request header. -d @./iris-input.json specifies that the request body contains the content of the local file iris-input.json, which contains the input data that is required for model inference. 
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json

    Expected output:

    {"predictions":[1,1]}%

    The preceding output indicates that the service performed inference on both input instances in the request and returned the same prediction for each.

  4. Run the following command to initiate stress testing.

    Note

    For more information about the hey tool that is used for stress testing, see hey.

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict
  5. Open another terminal during stress testing and run the following command to check the scaling status of the service.

    kubectl describe hpa sklearn-iris-predictor

    Expected output:

    Name:                                                  sklearn-iris-predictor
    Namespace:                                             default
    Labels:                                                app=isvc.sklearn-iris-predictor
                                                           arena.kubeflow.org/uid=3399d840e8b371ed7ca45dda29debeb1
                                                           chart=kserve-0.1.0
                                                           component=predictor
                                                           heritage=Helm
                                                           release=sklearn-iris
                                                           serving.kserve.io/inferenceservice=sklearn-iris
                                                           servingName=sklearn-iris
                                                           servingType=kserve
    Annotations:                                           arena.kubeflow.org/username: kubecfg:certauth:admin
                                                           serving.kserve.io/deploymentMode: RawDeployment
    CreationTimestamp:                                     Sat, 11 May 2024 17:15:47 +0800
    Reference:                                             Deployment/sklearn-iris-predictor
    Metrics:                                               ( current / target )
      resource cpu on pods  (as a percentage of request):  0% (2m) / 10%
    Min replicas:                                          1
    Max replicas:                                          10
    Behavior:
      Scale Up:
        Stabilization Window: 0 seconds
        Select Policy: Max
        Policies:
          - Type: Pods     Value: 4    Period: 15 seconds
          - Type: Percent  Value: 100  Period: 15 seconds
      Scale Down:
        Select Policy: Max
        Policies:
          - Type: Percent  Value: 100  Period: 15 seconds
    Deployment pods:       10 current / 10 desired
    Conditions:
      Type            Status  Reason               Message
      ----            ------  ------               -------
      AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
      ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
      ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
    Events:
      Type    Reason             Age                  From                       Message
      ----    ------             ----                 ----                       -------
      Normal  SuccessfulRescale  38m                  horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) above target
      Normal  SuccessfulRescale  28m                  horizontal-pod-autoscaler  New size: 7; reason: All metrics below target
      Normal  SuccessfulRescale  27m                  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The Events section in the preceding output shows that HPA automatically adjusted the number of pod replicas based on the CPU utilization: HPA scaled the service to 8 replicas when the load increased, and then to 7 and 1 replicas after the metrics dropped below the target. This indicates that HPA can automatically scale pods based on the CPU utilization. A sketch of the HPA object that implements this policy is provided after this procedure.
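
For reference, the --scale-metric, --scale-target, --min-replicas, and --max-replicas parameters in step 1 map onto a standard Kubernetes autoscaling/v2 HorizontalPodAutoscaler. The following manifest is a sketch reconstructed from the values in the preceding kubectl describe output; the object that arena and KServe actually create may differ in details such as labels and scaling behavior settings.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: sklearn-iris-predictor
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: sklearn-iris-predictor
      minReplicas: 1                 # --min-replicas
      maxReplicas: 10                # --max-replicas
      metrics:
      - type: Resource
        resource:
          name: cpu                  # --scale-metric
          target:
            type: Utilization
            averageUtilization: 10   # --scale-target

In this sketch, the averageUtilization value of 10 corresponds to the --scale-target=10 setting in step 1.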

Configure an auto scaling policy based on GPU utilization

Custom metric-based auto scaling is implemented based on the ack-alibaba-cloud-metrics-adapter component and the Kubernetes HPA mechanism provided by Container Service for Kubernetes (ACK). For more information, see Horizontal pod scaling based on Managed Service for Prometheus metrics.

The following example shows how to configure an auto scaling policy based on the GPU utilization of pods.

  1. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM model as an inference service.

  2. Configure custom GPU metrics. For more information, see Enable auto scaling based on GPU metrics.

  3. Run the following command to deploy a vLLM model as an inference service.

    arena serve kserve \
        --name=qwen \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen created
    INFO[0002] The Job qwen has been submitted successfully 
    INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status 

    The preceding output indicates that the vLLM model is deployed as an inference service.

  4. Run the following commands to obtain the IP address of the NGINX Ingress controller and send a request to the inference service to check whether the vLLM model runs as expected.

    # Obtain the IP address of the NGINX Ingress controller. 
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Obtain the hostname of the inference service. 
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to access the inference service. 
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Perform a test."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test? <|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}%    

    The preceding output indicates that the request is correctly sent to the server and the server returns an expected response in the JSON format.

  5. Run the following command to perform a stress test on the service.

    Note

    For more information about the hey tool that is used for stress testing, see hey.

    hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "qwen", "messages": [{"role": "user", "content": "Perform a test."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions
  6. Open another terminal during stress testing and run the following command to check the scaling status of the service.

    kubectl describe hpa qwen-hpa

    Expected output:

    Name:                                     qwen-hpa
    Namespace:                                default
    Labels:                                   <none>
    Annotations:                              <none>
    CreationTimestamp:                        Tue, 14 May 2024 14:57:03 +0800
    Reference:                                Deployment/qwen-predictor
    Metrics:                                  ( current / target )
      "DCGM_CUSTOM_PROCESS_SM_UTIL" on pods:  0 / 50
    Min replicas:                             1
    Max replicas:                             2
    Deployment pods:                          1 current / 1 desired
    Conditions:
      Type            Status  Reason            Message
      ----            ------  ------            -------
      AbleToScale     True    ReadyForNewScale  recommended size matches current size
      ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric DCGM_CUSTOM_PROCESS_SM_UTIL
      ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  43m   horizontal-pod-autoscaler  New size: 2; reason: pods metric DCGM_CUSTOM_PROCESS_SM_UTIL above target
      Normal  SuccessfulRescale  34m   horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The preceding output indicates that the number of pods was increased to 2 during stress testing and was scaled back down to 1 after the load subsided. This indicates that KServe can scale pods based on the GPU utilization of pods. A sketch of the qwen-hpa object is provided after this procedure.
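
The qwen-hpa object that is described in the preceding step is a standard Kubernetes autoscaling/v2 HorizontalPodAutoscaler that scales the qwen-predictor Deployment on the DCGM_CUSTOM_PROCESS_SM_UTIL pod metric, which becomes available after you configure GPU metrics in step 2. If the HPA does not already exist in your cluster, you can create it with a manifest similar to the following sketch, which is reconstructed from the values in the preceding kubectl describe output.

    kubectl apply -f- <<EOF
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: qwen-hpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: qwen-predictor
      minReplicas: 1
      maxReplicas: 2
      metrics:
      # Scale based on the average GPU SM utilization that is reported for the pods.
      - type: Pods
        pods:
          metric:
            name: DCGM_CUSTOM_PROCESS_SM_UTIL
          target:
            type: AverageValue
            averageValue: "50"
    EOF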

Configure a scheduled auto scaling policy

Scheduled auto scaling is implemented based on the ack-kubernetes-cronhpa-controller component provided by ACK. The component allows you to specify the number of pod replicas to which to scale at a specific point in time or on a periodic basis. This helps you handle predictable load changes.
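
A CronHPA job is triggered by a six-field cron expression that includes a seconds field. The following annotated line is a sketch of the format, inferred from the schedules that are used in this example; see the component documentation for the authoritative definition.

    # Field order in a CronHPA schedule expression:
    # second  minute  hour  day-of-month  month  day-of-week
    schedule: "0 30 10 * * *"   # 10:30:00 every day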

  1. Install the CronHPA component. For more information, see Use CronHPA for scheduled horizontal scaling.

  2. Prepare the Qwen-7B-Chat-Int8 model data. For more information, see Deploy a vLLM model as an inference service.

  3. Run the following command to deploy a vLLM model as an inference service.

    arena serve kserve \
        --name=qwen-cronhpa \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen-cronhpa created
    INFO[0004] The Job qwen-cronhpa has been submitted successfully 
    INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status 
  4. Run the following commands to test whether the vLLM model runs as expected.

    # Obtain the IP address of the NGINX Ingress controller. 
    NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
    # Obtain the hostname of the inference service. 
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to access the inference service. 
    curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
         -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'

    Expected output:

    {"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test? <|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}% 

    The preceding output indicates that the request is correctly sent to the service and the service returns an expected response in the JSON format.

  5. Run the following command to configure a scheduled scaling policy.

    kubectl apply -f- <<EOF
    apiVersion: autoscaling.alibabacloud.com/v1beta1
    kind: CronHorizontalPodAutoscaler
    metadata:
      name: qwen-cronhpa
      namespace: default 
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: qwen-cronhpa-predictor
      jobs:
      # Scale in to one pod replica at 12:00 every day.
      - name: "scale-down"
        schedule: "0 0 12 * * *"
        targetSize: 1
        runOnce: false
      # Scale out to two pod replicas at 10:30 every day.
      - name: "scale-up"
        schedule: "0 30 10 * * *"
        targetSize: 2
        runOnce: false
    EOF

    The following output shows the configuration of the qwen-cronhpa CronHorizontalPodAutoscaler that is created, as returned by the kubectl describe command:

    Name:         qwen-cronhpa
    Namespace:    default
    Labels:       <none>
    Annotations:  <none>
    API Version:  autoscaling.alibabacloud.com/v1beta1
    Kind:         CronHorizontalPodAutoscaler
    Metadata:
      Creation Timestamp:  2024-05-12T14:06:49Z
      Generation:          2
      Resource Version:    9205625
      UID:                 b9e72da7-262e-4042-b7f8-26586b75ecac
    Spec:
      Jobs:
        Name:         scale-down
        Schedule:     0 0 12 * * *
        Target Size:  1
        Name:         scale-up
        Schedule:     0 30 10 * * *
        Target Size:  2
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Status:
      Conditions:
        Job Id:           3972f7cc-bab0-482e-8cbe-7c41661b07f5
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:          
        Name:             scale-down
        Run Once:         false
        Schedule:         0 0 12 * * *
        State:            Submitted
        Target Size:      1
        Job Id:           36a04605-0233-4420-967c-ac2615f43de6
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:          
        Name:             scale-up
        Run Once:         false
        Schedule:         0 30 10 * * *
        State:            Submitted
        Target Size:      2
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Events:           <none>
    

    The preceding output indicates that a scheduled scaling policy named qwen-cronhpa is created. Based on the policy, the number of pods in the Deployment named qwen-cronhpa-predictor is automatically adjusted at the specified points in time every day to meet the preset scaling requirements.
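
To verify that the scheduled policy takes effect, you can check the replica count of the target Deployment around the scheduled points in time. For example:

    # Check the current and desired replica counts of the predictor Deployment.
    kubectl get deployment qwen-cronhpa-predictor -n default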

References

For more information about auto scaling in ACK, see Auto scaling overview.