
Container Service for Kubernetes:Deploy LLMs as elastic inference services in ACK Edge clusters in hybrid cloud environments

Last Updated: Mar 28, 2025

When you deploy large language models (LLMs) as elastic inference services in hybrid cloud environments, traffic may be unevenly distributed over time, which can cause GPU allocation issues in data centers. To resolve these issues, ACK Edge clusters provide a solution for deploying LLMs as elastic inference services in hybrid cloud environments. The solution lets you centrally manage GPU resources in the cloud and in data centers: your workloads preferably use data center resources during off-peak hours and launch cloud resources during peak hours. This greatly reduces the operational costs of LLM inference services and dynamically adjusts the resource supply, which ensures service stability and prevents idle resources.

Solution overview

Architecture

This solution is developed based on the cloud-edge collaboration capability of ACK Edge clusters. You can use this solution to centrally manage computing resources in the cloud and data centers and dynamically allocate resources to computing tasks. After you deploy an LLM as an inference service in your cluster, you can use KServe to configure a scaling policy for the inference service.

  • During off-peak hours, you can create a ResourcePolicy in your cluster to enable priority-based resource scheduling for your inference service. You can assign computing resources in data centers a higher priority than computing resources in the cloud. This way, your inference service preferably uses on-premises computing resources.

  • During peak hours, KServe can leverage the monitoring capability of ACK Edge clusters to monitor the GPU utilization and workload status in real time. When the scaling conditions are met, KServe dynamically scales out the pods in which the inference service is deployed. When on-premises GPU resources become insufficient, the system allocates the GPU resources provided by a pre-configured elastic node pool in the cloud to the inference service. This ensures service stability and continuity.

[Figure: solution architecture]

  1. Inference requests: A large number of inference requests are sent to the inference service.

  2. Resource scheduling: The system preferably schedules the inference service to the resource pools in the data center.

  3. Use on-cloud resources for scale-out: When resources in the data center become insufficient, the system allocates the resources provided by the pre-configured elastic node pool in the cloud to the inference service.

Key components

This solution includes the following key components: ACK Edge cluster, KServe, elastic node pool (node auto scaling), and ResourcePolicy (priority-based resource scheduling).


ACK Edge cluster

An ACK Edge cluster is a cloud-hosted Kubernetes cluster that serves as a cloud-native cloud-edge collaboration platform where you can connect and manage edge computing resources and services in a unified manner.

KServe

KServe is an open source cloud-native model service platform designed to simplify the process of deploying and running machine learning (ML) models on Kubernetes. KServe supports multiple ML frameworks and provides scaling capabilities. KServe makes it easier to configure and manage model services by defining simple YAML files and providing declarative APIs for model deployment and management. In addition, KServe provides a series of CustomResourceDefinitions (CRDs) to manage and deliver ML model services.
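
For reference, a minimal KServe InferenceService manifest looks like the following sketch. It is only an illustration: the service name, model format, and storage URI shown here are assumptions, and in this topic the service is instead created through the Arena CLI, which generates a similar resource.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: example-llm                        # Illustrative name, not used elsewhere in this topic.
    spec:
      predictor:
        minReplicas: 1
        maxReplicas: 3
        model:
          modelFormat:
            name: pytorch                      # Assumed model format, for illustration only.
          storageUri: pvc://llm-model/Qwen     # Assumed: model files stored on a PVC named llm-model.
          resources:
            limits:
              nvidia.com/gpu: "1"              # Request one GPU for the predictor pod.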

Elastic node pool (node auto scaling)

Node auto scaling is a feature that automatically scales computing resources in a cluster. The feature is implemented based on the cluster-autoscaler component, which regularly monitors the cluster status and automatically scales nodes. When pods cannot be scheduled onto existing nodes, the node auto scaling feature simulates the scheduling process to check whether a scale-out activity is required. If a scale-out activity is required, nodes are automatically added to the elastic node pool to meet the resource requirements of the pods. This achieves efficient resource management and ensures application stability.
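
When node auto scaling is expected but pods remain unschedulable, you can check the events on a pending pod. The following commands are a minimal sketch; the TriggeredScaleUp event name reflects the default behavior of the open source cluster-autoscaler and is listed here as an assumption.

    # List pods that cannot be scheduled yet.
    kubectl get pods --field-selector=status.phase=Pending

    # Inspect the events of a pending pod. If node auto scaling was triggered,
    # cluster-autoscaler typically records a TriggeredScaleUp event.
    kubectl describe pod <pending-pod-name>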

ResourcePolicy (priority-based resource scheduling)

The ResourcePolicy resource is used to configure priority-based resource scheduling for the scheduler in ACK Edge clusters. Priority-based resource scheduling provides fine-grained resource scheduling for scenarios where multiple types of resources are used. You can use a ResourcePolicy to specify the usage rules and priorities of multiple resource types in descending order. This way, the system schedules pods to different types of nodes based on their usage rules and priorities.

When you create pods, you can configure the scheduling priority of the pods based on your business requirements and resource characteristics. For example, you can configure the system to preferably schedule compute-intensive applications to high-performance computing (HPC) nodes and data-intensive applications to nodes that provide rich storage resources. In this example, compute-intensive applications have a higher scheduling priority than data-intensive applications. During scale-in activities, pods are deleted from nodes in ascending order of priority. In this case, pods of data-intensive applications are deleted first. After all pods of data-intensive applications are deleted, the system starts to delete pods of compute-intensive applications.
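
As a conceptual sketch, the following ResourcePolicy uses the same CRD schema as the example later in this topic to prefer one group of nodes over another. The app label, the node-type labels, and the resource values are assumptions used only to illustrate the priority order.

    apiVersion: scheduling.alibabacloud.com/v1alpha1
    kind: ResourcePolicy
    metadata:
      name: compute-intensive-app
      namespace: default
    spec:
      selector:
        app: compute-intensive-app   # Assumed label of the pods to which the policy applies.
      strategy: prefer
      units:
      - resource: ecs
        nodeSelector:
          node-type: hpc             # Hypothetical label for high-performance nodes; scheduled first.
      - resource: ecs
        nodeSelector:
          node-type: storage         # Hypothetical label for storage-rich nodes; used when the first unit is full.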

Example

  1. Prepare the environment.

    1. Create an ACK Edge cluster.

    2. Create an elastic node pool.

    3. Install KServe in the ACK Edge cluster.

    4. Configure the Arena client.

    5. Deploy a monitoring component and configure GPU metrics.

    After you complete the preceding operations, classify the resources in the cluster into the following three node pools:

      • On-cloud control resource pool (on-cloud node pool): An on-cloud node pool used to deploy the control components of the ACK Edge cluster, such as KServe. Example: default-nodepool.

      • On-premises resource pool (edge or dedicated node pool): Computing resources in data centers that host the LLM inference service. Example: GPU-V100-Edge.

      • On-cloud elastic resource pool (on-cloud node pool): A scalable resource pool that can dynamically scale to meet the GPU resource requirements of the cluster and host the LLM inference service during peak hours. Example: GPU-V100-Elastic.
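
    To confirm which node pool each node belongs to, you can list the nodes together with their node pool ID label. This is a minimal sketch that relies on the alibabacloud.com/nodepool-id label, which is the same label referenced by the ResourcePolicy later in this topic.

    # List cluster nodes and the node pool that each node belongs to.
    kubectl get nodes -L alibabacloud.com/nodepool-id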

  2. Prepare an AI model.

    You can use Object Storage Service (OSS) or File Storage NAS (NAS) to prepare the model data. For more information, see Prepare model data and upload the model data to an OSS bucket.
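
    If the model is stored in an OSS bucket, it is typically exposed to the cluster as a statically provisioned persistent volume and a PersistentVolumeClaim named llm-model, which is the volume name referenced by the --data parameter later in this topic. The following manifest is only a sketch: the ossplugin.csi.alibabacloud.com driver, the bucket, the endpoint, and the Secret name are assumptions, and the linked topic remains the authoritative procedure.

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes: ["ReadOnlyMany"]
      persistentVolumeReclaimPolicy: Retain
      storageClassName: ""
      csi:
        driver: ossplugin.csi.alibabacloud.com       # Assumed: OSS CSI driver provided by ACK.
        volumeHandle: llm-model                      # Assumed to match the PV name for static OSS volumes.
        nodePublishSecretRef:
          name: oss-secret                           # Placeholder Secret that stores the AccessKey pair.
          namespace: default
        volumeAttributes:
          bucket: example-bucket                     # Placeholder OSS bucket name.
          url: oss-cn-hangzhou-internal.aliyuncs.com # Placeholder OSS endpoint.
          path: /models/Qwen                         # Placeholder path to the model files in the bucket.
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
      namespace: default
    spec:
      accessModes: ["ReadOnlyMany"]
      storageClassName: ""
      resources:
        requests:
          storage: 30Gi
      volumeName: llm-model                          # Bind the claim to the PV defined above.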

  3. Specify resource priorities.

    Create a ResourcePolicy to specify resource priorities. In this example, the selector field of the ResourcePolicy is set to app: isvc.qwen-predictor to select the pods to which the ResourcePolicy applies. The following ResourcePolicy specifies that the matching pods are first scheduled to the on-premises resource pool. When the resources provided by the on-premises resource pool become insufficient, the system schedules the matching pods to the on-cloud elastic resource pool. For more information about how to configure a ResourcePolicy, see Configure priority-based resource scheduling.

    Important

    When you create the application pods later, you must add labels that match the following selector so that the pods are associated with the scheduling policy defined here.

    apiVersion: scheduling.alibabacloud.com/v1alpha1
    kind: ResourcePolicy
    metadata:
      name: qwen-chat
      namespace: default
    spec:
      selector:
        app: isvc.qwen-predictor # You must specify a label of the pods to which you want to apply the ResourcePolicy. 
      strategy: prefer
      units:
      - resource: ecs
        nodeSelector:
          alibabacloud.com/nodepool-id: npxxxxxx  # Replace the value with the ID of the on-premises resource pool. 
      - resource: elastic
        nodeSelector:
          alibabacloud.com/nodepool-id: npxxxxxy  # Replace the value with the ID of the on-cloud resource pool. 
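
    After the ResourcePolicy is created, you can confirm that the cluster accepted it. This is a minimal sketch that assumes the ResourcePolicy CRD is registered with its default resource name.

    # Confirm that the ResourcePolicy exists and inspect its units.
    kubectl get resourcepolicy qwen-chat -n default -o yaml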
  4. Deploy the LLM as an inference service.

    Run the following command on the Arena client to use KServe to deploy an inference service based on the LLM.

     arena serve kserve \
        --name=qwen-chat \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
        --scale-target=50 \
        --min-replicas=1  \
        --max-replicas=3  \
        --gpus=1 \
        --cpu=4  \
        --memory=12Gi \
        --data="llm-model:/mnt/models/Qwen" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    The following list describes the parameters in the command:

      • --name (required): The name of the inference service. The name must be globally unique. Example: qwen-chat.

      • --image (required): The address of the inference service image. In this example, the vLLM inference framework is used. Example: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1.

      • --scale-metric (optional): The scaling metric. In this example, the GPU utilization metric DCGM_CUSTOM_PROCESS_SM_UTIL is used. For more information, see Configure HPA. Example: DCGM_CUSTOM_PROCESS_SM_UTIL.

      • --scale-target (optional): The scaling threshold. In this example, the threshold is 50%. When the GPU utilization exceeds 50%, the system scales out the pod replicas. Example: 50.

      • --min-replicas (optional): The minimum number of pod replicas. Example: 1.

      • --max-replicas (optional): The maximum number of pod replicas. Example: 3.

      • --gpus (optional): The number of GPUs requested by the inference service. Default value: 0. Example: 1.

      • --cpu (optional): The number of vCPU cores requested by the inference service. Example: 4.

      • --memory (optional): The amount of memory requested by the inference service. Example: 12Gi.

      • --data (optional): The model data mounted to the inference service. In this example, the volume named llm-model is mounted to the /mnt/models/ directory of the container. Example: "llm-model:/mnt/models/Qwen".

    The trailing "python3 -m vllm.entrypoints.openai.api_server ..." string is the startup command that the inference service container runs.

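    After the command is run, you can check the deployment status of the inference service. The following commands are a sketch: isvc is the short name of the KServe InferenceService resource, and arena serve get is assumed to be available in your Arena version.

    # Check the status of the inference service deployed by Arena.
    arena serve get qwen-chat

    # Alternatively, check the KServe InferenceService resource directly.
    kubectl get isvc qwen-chat -n default
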
  5. Check whether the elastic inference service is deployed.

    # Obtain the access address and the host from the details of the Ingress automatically created by KServe.
    curl -H "Host: qwen-chat-default.example.com" \
         -H "Content-Type: application/json" \
         -X POST \
         -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop": ["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}' \
         http://xx.xx.xx.xx:80/v1/chat/completions
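
    If you are not sure which host and address to use, you can read them from the Ingress that KServe created for the service. This is a minimal sketch; the exact Ingress name depends on how KServe generated it.

    # List the Ingresses in the namespace and note the HOSTS and ADDRESS columns.
    kubectl get ingress -n default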
  6. Use the stress testing tool hey to send a large number of requests to the inference service to simulate traffic spikes during peak hours and test whether on-cloud resources are launched.

    hey -z 2m -c 5 \
    -m POST -host qwen-chat-default.example.com \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
    http://xx.xx.xx.xx:80/v1/chat/completions

    After the requests are sent to the pods, the GPU utilization of the inference service exceeds the scaling threshold (50%). In this case, HPA scales out the pods based on the predefined scaling rules, and the number of pods created for the inference service increases to three.

    However, the data center in the test environment provides only one GPU. As a result, the two newly created pods cannot be scheduled and remain in the Pending state. In this case, cluster-autoscaler automatically launches two GPU-accelerated nodes in the cloud to host the two pending pods.
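
    You can observe the scale-out behavior with standard kubectl commands. The following sketch assumes that the pods of the inference service carry the app=isvc.qwen-predictor label referenced by the ResourcePolicy earlier in this topic.

    # Check the HPA that drives the scale-out.
    kubectl get hpa -n default

    # Check where the inference service pods are scheduled and which pods are Pending.
    kubectl get pods -n default -l app=isvc.qwen-predictor -o wide

    # Confirm that new on-cloud nodes were added by the elastic node pool.
    kubectl get nodes -L alibabacloud.com/nodepool-id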

References