In multi-region application deployments, managing resource allocation across different regions can be challenging. To address this, ACK One provides an intelligent, inventory-aware scheduler for multi-cluster fleets. This topic describes how inventory-aware scheduling works and how to enable it for your fleet.
Overview
When serving GPU inference workloads, users face two primary challenges:
Dynamic GPU availability: The supply of GPU resources fluctuates among regions, making it difficult to guarantee real-time availability.
High cost of GPUs: Pre-provisioning GPU nodes to handle potential demand can lead to significant and unnecessary costs.
The inventory-aware scheduling mechanism, combined with instant scaling, effectively addresses these two challenges. When the existing resources in a fleet's member clusters are insufficient to schedule a new application, the scheduler intelligently places the workload onto a cluster located in a region where GPU inventory is available. That cluster's instant scaling feature then provisions the required nodes on-demand.
This capability maximizes the successful scheduling of applications that depend on scarce resources, such as GPUs, while significantly reducing operational costs.
This feature is currently in preview. To use it, submit a ticket.
How it works
When an application is deployed to a fleet and a member cluster has insufficient resources, the following workflow is triggered:
An application and its propagation policy are created in the fleet's control plane.
The scheduler detects that the target member cluster lacks the necessary resources.
The scheduler queries the member cluster's scaler (ACK GOATScaler) to check for available GPU inventory in its region.
Based on the inventory report, the scheduler re-evaluates its placement decision and dispatches the application to a cluster with available inventory.
Once the application is dispatched to the target cluster, the instant scaling feature provisions new nodes, and the application's pods are scheduled and start running.
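Concretely, this behavior is switched on per workload by the autoScaling.ecsProvision field of the PropagationPolicy, which Step 3 configures in full. The relevant fragment looks like this:

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-policy
spec:
  autoScaling:
    ecsProvision: true   # enables inventory-aware elastic scheduling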
Prerequisites
You have associated multiple member clusters with your fleet instance.
You have enabled node instant scaling for the member clusters.
Important: If your member clusters are currently configured with auto scaling, switch them to node instant scaling.
You have installed the AMC command line tool.
GPU-accelerated instance specification and estimated cost
GPU memory is occupied by model parameters during the inference phase. The usage is estimated by using the following formula:

GPU memory occupied by model weights = number of parameters × bytes per parameter

Take a 7B model with default FP16 precision as an example: the model parameter count is 7 billion and each parameter occupies 2 bytes (16-bit floating-point number / 8 bits per byte), so loading the weights alone requires about 14 GB of GPU memory.
In addition to the memory used to load the model, you also need to account for the key-value (KV) cache and the GPU memory utilization setting. Typically, a proportion of memory is reserved for buffering. Therefore, we recommend using instance types that provide 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge. See GPU-accelerated compute-optimized instance families and Billing for Elastic GPU Service.
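Applying the same formula to the Qwen3-8B model deployed in this topic gives a rough sizing check (an estimate only; the exact figure also depends on the KV cache, context length, and runtime overhead):

8 × 10⁹ parameters × 2 bytes per parameter ≈ 16 GB of model weights

With headroom for the KV cache and buffers, this fits on a single GPU with 24 GiB of memory, which is why the gn7i instance types above are sufficient.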
Step 1: Prepare the model data
In this step, you will prepare the Qwen3-8B model files and create corresponding Object Storage Service (OSS) persistent volumes (PVs) for them in each member cluster.
Download the model.

Note: Check whether the git-lfs plug-in is installed. If it is not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B
git lfs pull

Create a folder in OSS and upload the model to it.

Note: For detailed steps of how to install and use ossutil, see Install ossutil.

ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B

Create a PV and a PersistentVolumeClaim (PVC) in each member cluster to mount the model files from OSS, as sketched after this procedure. For more information, see Use an ossfs 1.0 statically provisioned volume.
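The following is a minimal sketch of such a statically provisioned PV and PVC pair, assuming the ossfs 1.0 CSI driver and a Secret named oss-secret that holds your AccessKey pair. The bucket name, endpoint URL, capacity, and secret name are placeholders; the authoritative parameter list is in the linked topic. The PVC name must match the claimName (qwen3-8b) referenced by the Deployment in Step 3.

# Sketch only: replace the bucket, endpoint, and secret with your own values.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: qwen3-8b
  labels:
    alicloud-pvname: qwen3-8b
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: qwen3-8b        # must match the PV name
    nodePublishSecretRef:
      name: oss-secret            # assumed Secret with akId and akSecret keys
      namespace: default
    volumeAttributes:
      bucket: "<your-bucket-name>"
      url: "oss-<your-region-id>-internal.aliyuncs.com"
      path: "/models/Qwen3-8B"
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-8b
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: qwen3-8b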
Step 2: Configure node pools in member clusters
In each member cluster, create or edit a node pool with the following settings:
Instance Type: ecs.gn7i-c8g1.2xlarge (or another suitable GPU instance type)
Scaling Mode: Auto
Expected Nodes: 0
For more operations and parameter configurations, see Create and manage node pools.
When adjusting the node scaling configuration, you can lower the Defer Scale-in For parameter to shorten the wait before idle nodes are released in the later steps.
Step 3: Create the application and propagation policy in the fleet cluster
Create a file named deploy.yaml to define the inference service deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3-8b
  name: qwen3-8b
  namespace: default
spec:
  replicas: 4
  selector:
    matchLabels:
      app: qwen3-8b
  template:
    metadata:
      labels:
        app: qwen3-8b
    spec:
      volumes:
      - name: qwen3-8b
        persistentVolumeClaim:
          claimName: qwen3-8b
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 20Gi
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/qwen3-8b --port 8000 --trust-remote-code --served-model-name qwen3-8b --tensor-parallel-size 1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
        name: vllm
        ports:
        - containerPort: 8000
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /models/qwen3-8b
          name: qwen3-8b
        - mountPath: /dev/shm
          name: dshm
Create a file named PropagationPolicy.yaml. The key field autoScaling.ecsProvision: true enables inventory-aware scheduling.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-policy
spec:
  # This field enables inventory-aware elastic scheduling.
  autoScaling:
    ecsProvision: true
  preserveResourcesOnDeletion: false
  conflictResolution: Overwrite
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: qwen3-8b
    namespace: default
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      weightPreference:
        dynamicWeight: AvailableReplicas
    clusterAffinity:
      clusterNames:
      - ${cluster1-id} # Replace with your member cluster ID.
      - ${cluster2-id} # Replace with your member cluster ID.

Use the kubeconfig file of the fleet to deploy the application and its propagation policy.
kubectl apply -f deploy.yaml
kubectl apply -f PropagationPolicy.yaml

After a few moments, the GPU node pools in your member clusters will begin to scale up automatically.
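If you want to watch the scale-out happen, the following read-only checks can help. The AMC command mirrors the one used in Step 4; the plain kubectl command assumes you are using a member cluster's kubeconfig.

# From the fleet: check how replicas are being distributed across member clusters.
kubectl amc get deploy qwen3-8b -M

# From a member cluster (member kubeconfig): watch the new GPU nodes register.
kubectl get nodes -w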
Step 4: Validate the elastic scaling
Check the scheduling status of the workload in the fleet.
kubectl get resourcebinding

Expected output:

NAME                  SCHEDULED   FULLYAPPLIED   OVERRIDDEN   ALLAVAILABLE   AGE
qwen3-8b-deployment   True        True           True         False          7m47s

The output shows that SCHEDULED is True, indicating that the workload was successfully scheduled.
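If you need more detail than the table above, you can dump the full binding object. This is a generic kubectl call rather than a feature-specific command; the exact field layout may vary by version, but the binding records which member clusters the replicas were assigned to.

kubectl get resourcebinding qwen3-8b-deployment -o yaml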
Once the pods are in the Running state, check their distribution across the member clusters.

kubectl amc get deploy qwen3-8b -M

Expected output:

NAME       CLUSTER           READY   UP-TO-DATE   AVAILABLE   AGE     ADOPTION
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y

The output shows that all replicas are scheduled and running, even though the member clusters initially had no available GPU nodes.
Update the deploy.yaml file to scale down the number of replicas for qwen3-8b to 2, and re-apply it. Alternatively, delete the workload to simulate a scenario where the number of replicas is scaled down to 0.

kubectl apply -f deploy.yaml

After about 10 minutes, the GPU node pools in the member clusters automatically scale down to release the unused nodes, thereby reducing costs.
If you delete the workload, the number of nodes is scaled down to 0.
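To confirm the scale-in for either path, you can reuse the same read-only checks as before (the plain kubectl command assumes a member cluster's kubeconfig):

# From the fleet: confirm the remaining replica count per member cluster.
kubectl amc get deploy qwen3-8b -M

# From a member cluster: confirm that the idle GPU nodes have been released.
kubectl get nodes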