When you deploy large language models (LLMs) as elastic inference services in hybrid cloud environments, traffic distribution may become imbalanced, which can cause GPU allocation issues in data centers. To resolve these issues, ACK Edge clusters provide a solution for deploying LLMs as elastic inference services in hybrid cloud environments. The solution lets you centrally manage GPU resources in the cloud and in data centers, so that your workloads preferably use data center resources during off-peak hours and launch cloud resources during peak hours. This greatly reduces the operational costs of LLM inference services and dynamically adjusts resource supply, which ensures service stability and prevents idle resources.
Solution overview
Architecture
This solution is developed based on the cloud-edge collaboration capability of ACK Edge clusters. You can use this solution to centrally manage computing resources in the cloud and data centers and dynamically allocate resources to computing tasks. After you deploy an LLM as an inference service in your cluster, you can use KServe to configure a scaling policy for the inference service.
During off-peak hours, you can create a ResourcePolicy in your cluster to enable priority-based resource scheduling for your inference service. You can assign computing resources in data centers a higher priority than computing resources in the cloud. This way, your inference service preferably uses on-premises computing resources.
During peak hours, KServe can leverage the monitoring capability of ACK Edge clusters to monitor the GPU utilization and workload status in real time. When the scaling conditions are met, KServe dynamically scales out the pods in which the inference service is deployed. When on-premises GPU resources become insufficient, the system allocates the GPU resources provided by a pre-configured elastic node pool in the cloud to the inference service. This ensures service stability and continuity.
The workflow is as follows:
1. Inference requests: A large number of inference requests are sent to the inference service.
2. Resource scheduling: The system preferably schedules the inference service to the resource pools in the data center.
3. Scale-out with on-cloud resources: When resources in the data center become insufficient, the system allocates the resources provided by the pre-configured elastic node pool in the cloud to the inference service.
Key components
This solution includes the following key components: ACK Edge cluster, KServe, elastic node pool (node auto scaling), and ResourcePolicy (priority-based resource scheduling).
Example
Prepare the environment.
After you complete the preceding operations, the resources in the cluster are classified into the following three types and added to the corresponding node pools. You can verify the node pool assignment as shown after the list.
On-cloud control resource pool (node pool type: on-cloud): An on-cloud node pool that is used to deploy the control components of the ACK Edge cluster, such as KServe. Example: default-nodepool.
On-premises resource pool (node pool type: edge/dedicated): Computing resources in data centers that are used to host LLM inference services. Example: GPU-V100-Edge.
On-cloud elastic resource pool (node pool type: on-cloud): A scalable resource pool that can dynamically scale out to meet the GPU resource requirements of the cluster and host LLM inference services during peak hours. Example: GPU-V100-Elastic.
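The following command is a minimal sketch, not part of the solution itself, that you can use to check which node pool each node belongs to. It assumes that you can access the cluster with kubectl and that the nodes carry the alibabacloud.com/nodepool-id label, which is the same label that the ResourcePolicy in a later step uses to select node pools.
# List all nodes and show the node pool that each node belongs to.
kubectl get nodes -L alibabacloud.com/nodepool-id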
Prepare an AI model.
You can use Object Storage Service (OSS) or File Storage NAS (NAS) to prepare the model data. For more information, see Prepare model data and upload the model data to an OSS bucket.
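If you store the model in OSS, you can expose it to the cluster as a volume named llm-model, which the --data parameter in the deployment step mounts into the container. The following manifest is only a minimal sketch of a statically provisioned OSS volume: the bucket name (my-llm-models), endpoint, Secret name (oss-secret), and object path are placeholders, and the exact csi and volumeAttributes fields depend on the OSS CSI plugin version in your cluster, so follow Prepare model data and upload the model data to an OSS bucket for the authoritative steps.
# Sketch: statically provisioned OSS volume named llm-model.
# Bucket, endpoint, Secret, and path below are assumed placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret            # Secret that stores the AccessKey pair (assumed name).
      namespace: default
    volumeAttributes:
      bucket: my-llm-models       # Replace with your OSS bucket (assumed).
      url: oss-cn-hangzhou-internal.aliyuncs.com   # Replace with your bucket endpoint.
      otherOpts: "-o umask=022 -o allow_other"
      path: /models/Qwen          # Path to the model files in the bucket (assumed).
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model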
Specify resource priorities.
Create a ResourcePolicy to specify resource priorities. In this example, the label selector of the ResourcePolicy is set to app: isvc.qwen-predictor to select the application to which the ResourcePolicy is applied. The following ResourcePolicy specifies that the matching pods are scheduled to the on-premises resource pool first. When the resources provided by the on-premises resource pool become insufficient, the system schedules the matching pods to the on-cloud elastic resource pool. For more information about how to configure a ResourcePolicy, see Configure priority-based resource scheduling.
Important: When you create application pods later, you must add labels that match the following label selector to associate the pods with the scheduling policy defined here.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: qwen-chat
  namespace: default
spec:
  selector:
    app: isvc.qwen-predictor # You must specify a label of the pods to which you want to apply the ResourcePolicy.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npxxxxxx # Replace the value with the ID of the on-premises resource pool.
  - resource: elastic
    nodeSelector:
      alibabacloud.com/nodepool-id: npxxxxxy # Replace the value with the ID of the on-cloud resource pool.
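The topic does not prescribe how to apply the manifest. The following commands are a minimal sketch, assuming that you saved the preceding manifest as resource-policy.yaml (a hypothetical file name) and that the ResourcePolicy CRD is installed in the cluster.
# Apply the ResourcePolicy and confirm that it was created.
kubectl apply -f resource-policy.yaml
kubectl get resourcepolicy qwen-chat -n default -o yaml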
Deploy the LLM as an inference service.
Run the following command on the Arena client to use KServe to deploy an inference service based on the LLM.
arena serve kserve \
  --name=qwen-chat \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
  --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
  --scale-target=50 \
  --min-replicas=1 \
  --max-replicas=3 \
  --gpus=1 \
  --cpu=4 \
  --memory=12Gi \
  --data="llm-model:/mnt/models/Qwen" \
  "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"
The following list describes the parameters in the command.
--name (required): The name of the inference service. The name must be globally unique. Example: qwen-chat.
--image (required): The address of the inference service image. In this example, the vLLM inference framework is used. Example: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1.
--scale-metric (optional): The scaling metric. In this example, the GPU utilization metric DCGM_CUSTOM_PROCESS_SM_UTIL is used as the scaling metric. For more information, see Configure HPA. Example: DCGM_CUSTOM_PROCESS_SM_UTIL.
--scale-target (optional): The scaling threshold. In this example, the scaling threshold is 50%. When the GPU utilization exceeds 50%, the system scales out the pod replicas. Example: 50.
--min-replicas (optional): The minimum number of pod replicas. Example: 1.
--max-replicas (optional): The maximum number of pod replicas. Example: 3.
--gpus (optional): The number of GPUs requested by the inference service. Default value: 0. Example: 1.
--cpu (optional): The number of vCPUs requested by the inference service. Example: 4.
--memory (optional): The amount of memory requested by the inference service. Example: 12Gi.
--data (optional): The address of the model used by the inference service. In this example, the model volume llm-model is mounted to the /mnt/models/ directory of the container. Example: "llm-model:/mnt/models/Qwen".
Startup command (the last positional argument): The command that starts the inference service in the container. In this example, the vLLM OpenAI-compatible API server serves the Qwen model on port 8080. Example: "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144".
Check whether the elastic inference service is deployed.
Obtain the request address from the details of the Ingress that is automatically created by KServe, and then send a test request.
curl -H "Host: qwen-chat-default.example.com" \
     -H "Content-Type: application/json" \
     http://xx.xx.xx.xx:80/v1/chat/completions \
     -X POST \
     -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'
Use the stress testing tool hey to send a large number of requests to the inference service to simulate traffic spikes during peak hours and test whether on-cloud resources are launched.
hey -z 2m -c 5 \
    -m POST -host qwen-chat-default.example.com \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
    http://xx.xx.xx.xx:80/v1/chat/completions
After the requests are sent to the pods, the GPU utilization of the inference service exceeds the scaling threshold (50%). In this case, HPA scales out the pods based on the predefined scaling rules, and the number of pods created for the inference service increases to three.
However, the data center in the test environment provides only one GPU. As a result, the two newly created pods cannot be scheduled and remain in the Pending state. In this case, cluster-autoscaler automatically launches two on-cloud GPU-accelerated nodes to host the two Pending pods.
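To observe the scale-out while the stress test runs, you can watch the pods, the HPA, and the nodes. The following commands are a minimal sketch that reuses the labels from the earlier steps and assumes the default namespace.
# Watch the predictor pods: new replicas first appear in the Pending state.
kubectl get pods -l app=isvc.qwen-predictor -n default -o wide -w
# Check the HPA status and the current value of the scaling metric.
kubectl get hpa -n default
# Check the nodes that cluster-autoscaler adds from the on-cloud elastic node pool.
kubectl get nodes -L alibabacloud.com/nodepool-id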
References
For more information about how to deploy inference services, see Deploy AI inference services on Kubernetes.
For more information about the on-cloud elasticity capability of ACK Edge clusters, see Cloud elasticity.
For more information about how to accelerate access to OSS buckets from edge nodes, see Use Fluid to accelerate access to OSS buckets from edge nodes.