Container Service for Kubernetes: Inventory-aware cross-region multi-cluster elastic scheduling

Last Updated: Mar 26, 2026

GPU availability across regions is unpredictable. Pre-provisioning nodes to guarantee capacity is expensive. ACK One multi-cluster fleets address this with an inventory-aware scheduler: when all child clusters in a fleet lack available GPU nodes, the scheduler queries real-time ECS inventory, selects a cluster with available capacity, and triggers instant node elasticity to scale out nodes on demand. Workloads run without requiring idle GPU nodes to be standing by.

Important

This feature is currently in invitational preview. To try it, submit a ticket.

How it works

Three components collaborate to deliver inventory-aware elastic scheduling:

| Component | Role |
| --- | --- |
| Fleet scheduler | Detects resource shortfalls in child clusters, queries inventory via ACK GOATScaler, and distributes replicas based on available capacity, including inventory not yet provisioned as running nodes. |
| ACK GOATScaler | Runs in each child cluster. Checks real-time ECS inventory and returns available instance counts to the fleet scheduler. |
| Child cluster node pools | Configured with instant node elasticity and zero desired nodes. Nodes scale out only when the scheduler assigns workloads, and scale back in when workloads are removed. |

When you create an application in a fleet and no child cluster has enough running resources, the following sequence runs:

  1. The scheduler detects that child clusters lack resources and cannot schedule the workload.

  2. The scheduler triggers ACK GOATScaler in each child cluster to check inventory.

  3. Based on the inventory result, the scheduler reschedules and distributes the application to the cluster with available capacity.

  4. The selected child cluster scales out nodes and runs the application.
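
The sequence above can be sketched in Python as follows. This is an illustrative model only, not the fleet scheduler's implementation: the `Cluster` type, its fields, and `pick_cluster` are hypothetical names, and the real scheduler may split replicas across several clusters rather than pick a single one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """Hypothetical view of a child cluster as seen by the fleet scheduler."""
    name: str
    free_gpus: int       # GPUs free on nodes that are already running
    inventory_gpus: int  # GPUs obtainable from ECS inventory (assumed to be reported per cluster)


def pick_cluster(clusters: List[Cluster], gpus_needed: int) -> Optional[str]:
    # Step 1: try to place the workload on capacity that is already running.
    for c in clusters:
        if c.free_gpus >= gpus_needed:
            return c.name
    # Steps 2-3: no running capacity anywhere, so take inventory into account.
    candidates = [c for c in clusters if c.free_gpus + c.inventory_gpus >= gpus_needed]
    if not candidates:
        return None  # No capacity in any cluster; the workload stays unscheduled.
    # Step 4 then happens inside the chosen cluster: its node pool scales out.
    best = max(candidates, key=lambda c: c.free_gpus + c.inventory_gpus)
    return best.name


clusters = [Cluster("cluster1", 0, 0), Cluster("cluster2", 0, 8)]
print(pick_cluster(clusters, 4))
```

In this sample, neither cluster has running GPUs, so the choice falls to `cluster2`, the only one with inventory.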

Prerequisites

Before you begin, make sure you have:

  • Multiple associated clusters in the fleet — the scheduler distributes workloads across these child clusters.

  • Instant node elasticity enabled for each child cluster — this allows child clusters to scale out GPU nodes when the scheduler assigns workloads to them.

  • The AMC command-line tool installed — used in Step 3 to verify pod distribution across clusters.

Important

If node autoscaling is already enabled for a child cluster, switch to instant node elasticity before proceeding. See Enable instant node elasticity.

GPU instance specifications and cost estimation

Model parameters are the main consumer of GPU memory during inference. Use the following formula to estimate required GPU memory:

GPU memory = Number of parameters × Bytes per parameter

Example: 7B model at FP16 precision

| Factor | Value |
| --- | --- |
| Parameters | 7 × 10⁹ |
| Bytes per parameter (FP16) | 2 bytes |
| Model memory | 7 × 10⁹ × 2 bytes ≈ 13.04 GiB |

Beyond model loading, account for KV cache and computation buffers. For a 7B model at FP16, use a GPU instance with at least 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge.
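
As a quick check of the formula, the following Python sketch computes the memory needed for the weights alone. The function name is illustrative, and the nominal 8 × 10⁹ parameter count used for an 8B-class model is an assumption; KV cache and computation buffers come on top of these figures.

```python
GIB = 2 ** 30  # bytes per GiB


def model_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the model weights alone, in GiB (FP16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / GIB


print(f"7B at FP16: {model_memory_gib(7e9):.2f} GiB")  # 13.04 GiB
print(f"8B at FP16: {model_memory_gib(8e9):.2f} GiB")  # 14.90 GiB (nominal parameter count)
```

Both results leave headroom on a 24 GiB GPU for KV cache and buffers.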

For full instance type details and pricing, see GPU-accelerated compute optimized instance family and Elastic GPU Service billing.

Preparations

In this section, you download the Qwen3-8B model files, upload them to OSS, and create the corresponding OSS PersistentVolume and PersistentVolumeClaim in each child cluster.

1. Download the model

Note

Make sure Git Large File Storage (LFS) is installed. If not, run yum install git-lfs or apt-get install git-lfs. For other installation methods, see Install Git Large File Storage.

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B
git lfs pull

2. Upload the model to OSS

Note

For ossutil installation and usage, see Install ossutil.

ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B

3. Create a PersistentVolume and PersistentVolumeClaim in each child cluster

For detailed steps, see Use ossfs 1.0 static persistent volume.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default  # Must match nodePublishSecretRef.namespace in the PersistentVolume
stringData:
  akId: <your-oss-ak>      # AccessKey ID for accessing OSS
  akSecret: <your-oss-sk>  # AccessKey Secret for accessing OSS
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-8b
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: qwen3-8b
  storageClassName: oss
  volumeMode: Filesystem
  volumeName: qwen3-8b
---
apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    alicloud-pvname: qwen3-8b
  name: qwen3-8b
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 20Gi
  csi:
    driver: ossplugin.csi.alibabacloud.com
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name>       # Bucket name
      otherOpts: '-o allow_other -o umask=000'
      path: <your-model-path>          # Example: /models/Qwen3-8B/
      url: <your-bucket-endpoint>      # Example: oss-cn-hangzhou-internal.aliyuncs.com
    volumeHandle: qwen3-8b
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
  volumeMode: Filesystem

Step 1: Configure node pools for child clusters

Create or edit a GPU node pool in each child cluster with the following settings:

| Setting | Value |
| --- | --- |
| Instance type | ecs.gn7i-c8g1.2xlarge |
| Scaling Mode | Auto |
| Desired nodes | 0 |

Starting at zero nodes eliminates idle GPU costs. The node pool scales out only when the fleet scheduler assigns workloads.

For detailed configuration steps, see Create and manage node pools.

Note

To reduce wait time in Step 3, shorten the scale-in trigger delay in the node pool settings.

Step 2: Create an application and distribution policy in the fleet cluster

All resources in this step use the fleet cluster's kubeconfig.

1. Create deploy.yaml

The Deployment runs 4 replicas of the Qwen3-8B model using vLLM, with each pod requesting 1 GPU.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3-8b
  name: qwen3-8b
  namespace: default
spec:
  replicas: 4
  selector:
    matchLabels:
      app: qwen3-8b
  template:
    metadata:
      labels:
        app: qwen3-8b
    spec:
      volumes:
        - name: qwen3-8b
          persistentVolumeClaim:
            claimName: qwen3-8b
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 20Gi
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/qwen3-8b --port 8000 --trust-remote-code --served-model-name qwen3-8b --tensor-parallel-size 1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
        name: vllm
        ports:
        - containerPort: 8000
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
          - mountPath: /models/qwen3-8b
            name: qwen3-8b
          - mountPath: /dev/shm
            name: dshm

2. Create PropagationPolicy.yaml

The PropagationPolicy distributes the Deployment across two child clusters.

Key fields:

| Field | Value | Description |
| --- | --- | --- |
| autoScaling.ecsProvision | true | Enables inventory-aware elastic scheduling. The scheduler queries real-time ECS inventory when placing workloads. |
| replicaSchedulingType | Divided | Splits replicas across clusters. Use Duplicated to deploy a full copy of the Deployment to each cluster instead. |
| dynamicWeight | AvailableReplicas | Allocates replicas in proportion to each cluster's schedulable capacity, including capacity from ECS inventory. A cluster with available inventory gets more replicas, even if it has no running GPU nodes yet. |

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-policy
spec:
  # Enables inventory-aware elastic scheduling
  autoScaling:
    ecsProvision: true
  preserveResourcesOnDeletion: false
  conflictResolution: Overwrite
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: qwen3-8b
    namespace: default
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      weightPreference:
        dynamicWeight: AvailableReplicas
    clusterAffinity:
      clusterNames:
      - ${cluster1-id}  # Replace with your actual child cluster ID
      - ${cluster2-id}  # Replace with your actual child cluster ID
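
To illustrate what a proportional split under `dynamicWeight: AvailableReplicas` could look like, here is a minimal Python sketch. It is not the scheduler's actual algorithm: the function name, the capacity numbers, and the largest-remainder rounding rule are all assumptions chosen for illustration.

```python
from typing import Dict


def divide_replicas(total: int, capacity: Dict[str, int]) -> Dict[str, int]:
    """Split `total` replicas in proportion to each cluster's schedulable capacity.

    Capacity acts only as a weight here; the sketch does not cap a cluster's share.
    """
    cap_sum = sum(capacity.values())
    if cap_sum == 0:
        return {name: 0 for name in capacity}
    # Take the floor of each proportional share, then hand the leftover
    # replicas to the clusters with the largest fractional remainders.
    shares = {name: total * cap / cap_sum for name, cap in capacity.items()}
    result = {name: int(share) for name, share in shares.items()}
    leftover = total - sum(result.values())
    by_remainder = sorted(capacity, key=lambda n: shares[n] - result[n], reverse=True)
    for name in by_remainder[:leftover]:
        result[name] += 1
    return result


# Hypothetical capacities (running GPUs plus ECS inventory) for two child clusters.
print(divide_replicas(4, {"cluster1": 6, "cluster2": 6}))  # {'cluster1': 2, 'cluster2': 2}
print(divide_replicas(4, {"cluster1": 9, "cluster2": 3}))  # {'cluster1': 3, 'cluster2': 1}
```

With equal capacity, the 4 replicas split 2 and 2, which matches the distribution shown in Step 3.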

3. Apply the manifests

kubectl apply -f deploy.yaml
kubectl apply -f PropagationPolicy.yaml

After a short time, GPU node pools in both child clusters begin scaling out automatically.

Step 3: Verify elastic scaling

Check workload scheduling status

kubectl get resourcebinding

Expected output:

NAME                  SCHEDULED   FULLYAPPLIED   OVERRIDDEN   ALLAVAILABLE   AGE
qwen3-8b-deployment   True        True           True         False          7m47s

| Field | Value | Meaning |
| --- | --- | --- |
| SCHEDULED | True | The fleet scheduler successfully placed the workload across child clusters. |
| ALLAVAILABLE | False | Nodes are still scaling out. This is a normal intermediate state, not an error. The value changes to True once all pods are running. |

Check pod distribution across clusters

After pods reach the Running state, run:

kubectl amc get deploy qwen3-8b -M

Expected output:

NAME       CLUSTER           READY   UP-TO-DATE   AVAILABLE   AGE     ADOPTION
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y

| Field | Value | Meaning |
| --- | --- | --- |
| READY | 2/2 | All replicas in the child cluster are running. |
| ADOPTION | Y | The child cluster has taken ownership of the workload. |

All 4 replicas are running across both clusters, even though the child clusters had zero GPU nodes before deployment.
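
If you want to check the total programmatically, the following Python sketch sums the ready replicas from table output like the sample above. The helper names are illustrative, and the parser assumes that no cell is empty or contains spaces; it is applied here to the sample text rather than to live command output.

```python
from typing import Dict, List


def parse_table(text: str) -> List[Dict[str, str]]:
    """Parse whitespace-separated, kubectl-style table output into one dict per row."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def ready_replicas(rows: List[Dict[str, str]]) -> int:
    """Sum the ready counts from READY values such as '2/2'."""
    return sum(int(row["READY"].split("/")[0]) for row in rows)


sample = """
NAME       CLUSTER           READY   UP-TO-DATE   AVAILABLE   AGE     ADOPTION
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y
qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y
"""
rows = parse_table(sample)
print(ready_replicas(rows))  # 4
```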

Verify scale-in behavior

In deploy.yaml, change spec.replicas from 4 to 2, then re-apply the manifest:

kubectl apply -f deploy.yaml

Note

Alternatively, delete the workload to simulate scaling to zero replicas.

After about ten minutes, or sooner if you shortened the scale-in trigger delay in Step 1, the GPU node pool in each child cluster scales in to one node. If you deleted the workload instead, each node pool scales in to zero nodes.
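
The node counts follow from simple arithmetic, sketched below in Python. The function is illustrative and rests on two assumptions: each node of the chosen instance type provides one GPU, and the replicas split evenly across the two clusters.

```python
import math


def expected_nodes(replicas: int, gpus_per_pod: int = 1, gpus_per_node: int = 1) -> int:
    """Nodes one cluster needs so that every pod gets the GPUs it requests."""
    if replicas == 0:
        return 0
    pods_per_node = gpus_per_node // gpus_per_pod
    if pods_per_node == 0:
        raise ValueError("a pod requests more GPUs than one node provides")
    return math.ceil(replicas / pods_per_node)


print(expected_nodes(2))  # 4 replicas over 2 clusters: 2 nodes per cluster
print(expected_nodes(1))  # 2 replicas over 2 clusters: 1 node per cluster
print(expected_nodes(0))  # workload deleted: 0 nodes
```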

This confirms the full workflow: the fleet schedules workloads based on real-time inventory, scales out GPU nodes to run the workload, and automatically scales back in once the workload is removed—eliminating idle GPU costs.

What's next