In multi-region application deployments, managing resource allocation across different regions can be challenging. To address this, ACK One provides an intelligent, inventory-aware scheduler for multi-cluster fleets. This topic describes how inventory-aware scheduling works and how to enable it for your fleet.
Overview
When serving GPU inference workloads, users face two primary challenges:
Dynamic GPU availability: The supply of GPU resources fluctuates among regions, making it difficult to guarantee real-time availability.
High cost of GPUs: Pre-provisioning GPU nodes to handle potential demand can lead to significant and unnecessary costs.
The inventory-aware scheduling mechanism, combined with instant scaling, effectively addresses these two challenges. When the existing resources in a fleet's member clusters are insufficient to schedule a new application, the scheduler intelligently places the workload onto a cluster located in a region where GPU inventory is available. That cluster's instant scaling feature then provisions the required nodes on-demand.
This capability maximizes the successful scheduling of applications that depend on scarce resources, such as GPUs, while significantly reducing operational costs.
This feature is currently in preview. To use it, submit a ticket.
How it works
When an application is deployed to a fleet and a member cluster has insufficient resources, the following workflow is triggered:
An application and its propagation policy are created in the fleet's control plane.
The scheduler detects that the target member cluster lacks the necessary resources.
The scheduler queries the member cluster's scaler (ACK GOATScaler) to check for available GPU inventory in its region.
Based on the inventory report, the scheduler re-evaluates its placement decision and dispatches the application to a cluster with available inventory.
Once the application is dispatched to the target cluster, the instant scaling feature provisions new nodes, and the application's pods are scheduled and start running.
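Concretely, this behavior is switched on per workload by the autoScaling.ecsProvision field of the PropagationPolicy, which Step 3 configures in full. The relevant fragment looks like this:

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-policy
spec:
  autoScaling:
    ecsProvision: true   # enables inventory-aware elastic scheduling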
Prerequisites
You have associated multiple member clusters with your fleet instance.
You have enabled node instant scaling for the member clusters.
Important: If your member clusters are currently configured with auto scaling, switch them to node instant scaling.
You have installed the AMC command line tool.
GPU-accelerated instance specification and estimated cost
GPU memory is occupied by model parameters during the inference phase. The usage is estimated by using the following formula:

GPU memory occupied by model weights = number of parameters × bytes per parameter

Take a 7B model with default FP16 precision as an example: the model parameter count is 7 billion and each parameter occupies 2 bytes (16-bit floating-point number / 8 bits per byte), so loading the weights alone requires about 14 GB of GPU memory.
In addition to the memory used to load the model, you also need to account for the key-value (KV) cache and the GPU memory utilization setting. Typically, a proportion of memory is reserved for buffering. Therefore, we recommend using instance types that provide 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge. See GPU-accelerated compute-optimized instance families and Billing for Elastic GPU Service.
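Applying the same formula to the Qwen3-8B model deployed in this topic gives a rough sizing check (an estimate only; the exact figure also depends on the KV cache, context length, and runtime overhead):

8 × 10⁹ parameters × 2 bytes per parameter ≈ 16 GB of model weights

With headroom for the KV cache and buffers, this fits on a single GPU with 24 GiB of memory, which is why the gn7i instance types above are sufficient.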
Step 1: Prepare the model data
In this step, you will prepare the Qwen3-8B model files and create corresponding Object Storage Service (OSS) persistent volumes (PVs) for them in each member cluster.
Download the model.

Note: Check whether the git-lfs plug-in is installed. If it is not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B
git lfs pull

Create a folder in OSS and upload the model to it.

Note: For detailed steps of how to install and use ossutil, see Install ossutil.

ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B

Create a PV and a PersistentVolumeClaim (PVC) in each member cluster to mount the model files from OSS, as sketched after this procedure. For more information, see Use an ossfs 1.0 statically provisioned volume.
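The following is a minimal sketch of such a statically provisioned PV and PVC pair, assuming the ossfs 1.0 CSI driver and a Secret named oss-secret that holds your AccessKey pair. The bucket name, endpoint URL, capacity, and secret name are placeholders; the authoritative parameter list is in the linked topic. The PVC name must match the claimName (qwen3-8b) referenced by the Deployment in Step 3.

# Sketch only: replace the bucket, endpoint, and secret with your own values.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: qwen3-8b
  labels:
    alicloud-pvname: qwen3-8b
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: qwen3-8b        # must match the PV name
    nodePublishSecretRef:
      name: oss-secret            # assumed Secret with akId and akSecret keys
      namespace: default
    volumeAttributes:
      bucket: "<your-bucket-name>"
      url: "oss-<your-region-id>-internal.aliyuncs.com"
      path: "/models/Qwen3-8B"
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-8b
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: qwen3-8b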
Step 2: Configure node pools in member clusters
In each member cluster, create or edit a node pool with the following settings:
Instance Type: ecs.gn7i-c8g1.2xlarge (or another suitable GPU instance type)
Scaling Mode: Auto
Expected Nodes: 0
For more operations and parameter configurations, see Create and manage node pools.
When adjusting the node scaling configuration, you can lower the Defer Scale-in For parameter to shorten the wait before idle nodes are released in the later steps.
Step 3: Create the application and propagation policy in the fleet cluster
Create a file named deploy.yaml to define the inference service deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3-8b
  name: qwen3-8b
  namespace: default
spec:
  replicas: 4
  selector:
    matchLabels:
      app: qwen3-8b
  template:
    metadata:
      labels:
        app: qwen3-8b
    spec:
      volumes:
      - name: qwen3-8b
        persistentVolumeClaim:
          claimName: qwen3-8b
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 20Gi
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/qwen3-8b --port 8000 --trust-remote-code --served-model-name qwen3-8b --tensor-parallel-size 1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
        name: vllm
        ports:
        - containerPort: 8000
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /models/qwen3-8b
          name: qwen3-8b
        - mountPath: /dev/shm
          name: dshm
Create a file named PropagationPolicy.yaml. The key field autoScaling.ecsProvision: true enables inventory-aware scheduling.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-policy
spec:
  # This field enables inventory-aware elastic scheduling.
  autoScaling:
    ecsProvision: true
  preserveResourcesOnDeletion: false
  conflictResolution: Overwrite
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: qwen3-8b
    namespace: default
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      weightPreference:
        dynamicWeight: AvailableReplicas
    clusterAffinity:
      clusterNames:
      - ${cluster1-id} # Replace with your member cluster ID.
      - ${cluster2-id} # Replace with your member cluster ID.

Use the kubeconfig file of the fleet to deploy the application and its propagation policy.
kubectl apply -f deploy.yaml
kubectl apply -f PropagationPolicy.yaml

After a few moments, the GPU node pools in your member clusters will begin to scale up automatically.
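If you want to watch the scale-out happen, the following read-only checks can help. The AMC command mirrors the one used in Step 4; the plain kubectl command assumes you are using a member cluster's kubeconfig.

# From the fleet: check how replicas are being distributed across member clusters.
kubectl amc get deploy qwen3-8b -M

# From a member cluster (member kubeconfig): watch the new GPU nodes register.
kubectl get nodes -w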
Step 4: Validate the elastic scaling
Check the scheduling status of the workload in the fleet.
kubectl get resourcebinding

Expected output:

NAME                  SCHEDULED   FULLYAPPLIED   OVERRIDDEN   ALLAVAILABLE   AGE
qwen3-8b-deployment   True        True           True         False          7m47s

The output shows that SCHEDULED is True, indicating that the workload was successfully scheduled.
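If you need more detail than the table above, you can dump the full binding object. This is a generic kubectl call rather than a feature-specific command; the exact field layout may vary by version, but the binding records which member clusters the replicas were assigned to.

kubectl get resourcebinding qwen3-8b-deployment -o yaml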
Once the pods are in the Running state, check their distribution across the member clusters.

kubectl amc get deploy qwen3-8b -M

Expected output:

NAME       CLUSTER           READY   UP-TO-DATE   AVAILABLE   AGE     ADOPTION
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y

The output shows that all replicas are scheduled and running, even though the member clusters initially had no available GPU nodes.
Update the deploy.yaml file to scale down the number of replicas for qwen3-8b to 2, and re-apply it. Alternatively, delete the workload to simulate a scenario where the number of replicas is scaled down to 0.

kubectl apply -f deploy.yaml

After about 10 minutes, the GPU node pools in the member clusters automatically scale down to release the unused nodes, thereby reducing costs.
If you delete the workload, the number of nodes is scaled down to 0.
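To confirm the scale-in for either path, you can reuse the same read-only checks as before (the plain kubectl command assumes a member cluster's kubeconfig):

# From the fleet: confirm the remaining replica count per member cluster.
kubectl amc get deploy qwen3-8b -M

# From a member cluster: confirm that the idle GPU nodes have been released.
kubectl get nodes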