Container Service for Kubernetes:Inventory-aware elastic scheduling for a cross-region multi-cluster fleet

Last Updated: Dec 11, 2025

In multi-region application deployments, managing resource allocation across different regions can be challenging. To address this, ACK One provides an intelligent, inventory-aware scheduler for multi-cluster fleets. This topic describes how inventory-aware scheduling works and how to enable it for your fleet.

Overview

When serving GPU inference workloads, users face two primary challenges:

  • Dynamic GPU availability: The supply of GPU resources fluctuates across regions, making it difficult to guarantee real-time availability.

  • High cost of GPUs: Pre-provisioning GPU nodes to handle potential demand can lead to significant and unnecessary costs.

The inventory-aware scheduling mechanism, combined with instant scaling, effectively addresses these two challenges. When the existing resources in a fleet's member clusters are insufficient to schedule a new application, the scheduler intelligently places the workload onto a cluster located in a region where GPU inventory is available. That cluster's instant scaling feature then provisions the required nodes on-demand.

This capability maximizes the likelihood that applications that depend on scarce resources, such as GPUs, are scheduled successfully, while significantly reducing operational costs.
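
Inventory-aware scheduling is enabled per workload through a propagation policy. The snippet below shows only the enabling field; the full policy used in this topic appears in Step 3:

    apiVersion: policy.one.alibabacloud.com/v1alpha1
    kind: PropagationPolicy
    metadata:
      name: demo-policy
    spec:
      # This field enables inventory-aware elastic scheduling.
      autoScaling:
        ecsProvision: true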

Important

This feature is currently in preview. To use it, submit a ticket.

How it works

When an application is deployed to a fleet and a member cluster has insufficient resources, the following workflow is triggered:

  1. An application and its propagation policy are created in the fleet's control plane.

  2. The scheduler detects that the target member cluster lacks the necessary resources.

  3. The scheduler queries the member cluster's scaler (ACK GOATScaler) to check for available GPU inventory in its region.

  4. Based on the inventory report, the scheduler re-evaluates its placement decision and dispatches the application to a cluster with available inventory.

  5. Once the application is dispatched to the target cluster, the instant scaling feature provisions new nodes, and the application's pods are scheduled and start running.
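
You can observe the outcome of steps 2 to 4 from the fleet's control plane. A minimal sketch (the binding name follows the <deployment-name>-deployment pattern shown in Step 4):

    # List the scheduling decisions made by the fleet scheduler.
    kubectl get resourcebinding
    # Inspect which member clusters a workload was dispatched to.
    kubectl describe resourcebinding qwen3-8b-deployment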

Prerequisites

GPU-accelerated instance specification and estimated cost

During the inference phase, GPU memory is primarily occupied by the model parameters. You can estimate this usage by using the following formula:

GPU memory occupied by model weights = Number of parameters × Bytes per parameter

Take a 7B model with the default FP16 precision as an example: the model parameter count is 7 billion, and each parameter occupies 2 bytes (16 bits ÷ 8 bits per byte), so loading the weights alone requires about 14 GB of GPU memory.

In addition to the memory used to load the model, you also need to account for the key-value (KV) cache and the GPU memory utilization ratio, because a portion of memory is typically reserved for buffering. Therefore, we recommend using instance types that provide 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge. See GPU-accelerated compute-optimized instance families and Billing for Elastic GPU Service.
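
A quick back-of-envelope check of the arithmetic above (the headroom figure is an illustrative estimate, not a hard rule):

    # Model weights: parameters (in billions) x bytes per parameter.
    echo $(( 7 * 2 ))   # = 14 GB of weights for a 7B FP16 model
    # A 24 GiB GPU then leaves roughly 10 GiB for the KV cache and buffers.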

Step 1: Prepare the model data

In this step, you will prepare the Qwen3-8B model files and create corresponding Object Storage Service (OSS) persistent volumes (PVs) for them in each member cluster.

  1. Download the model.

    Note

    Check whether the git-lfs plug-in is installed. If it's not, run yum install git-lfs or apt-get install git-lfs to install it. For more information, see Install git-lfs.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
    cd Qwen3-8B
    git lfs pull
  2. Create a folder in OSS and upload the model to it.

    Note

    For detailed steps of how to install and use ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
    ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B
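
    To confirm the upload, list the objects in the path used above. The ossutil ls command is a standard ossutil command:

    ossutil ls oss://<your-bucket-name>/models/Qwen3-8B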
  3. Create a PV and a PersistentVolumeClaim (PVC) in each member cluster to mount the model files from OSS, then apply and verify the configuration as shown after the template. For more information, see Use an ossfs 1.0 statically provisioned volume.

    YAML template

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
    stringData:
      akId: <your-oss-ak> # The AccessKey ID used to access OSS.
      akSecret: <your-oss-sk> # The AccessKey secret used to access OSS.
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: qwen3-8b
      namespace: default
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
      selector:
        matchLabels:
          alicloud-pvname: qwen3-8b
      storageClassName: oss
      volumeMode: Filesystem
      volumeName: qwen3-8b
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      labels:
        alicloud-pvname: qwen3-8b
      name: qwen3-8b
    spec:
      accessModes:
        - ReadWriteMany
      capacity:
        storage: 20Gi
      csi:
        driver: ossplugin.csi.alibabacloud.com
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <your-bucket-name> # The name of the bucket.
          otherOpts: '-o allow_other -o umask=000'
          path: <your-model-path> # In this example, the path is /models/Qwen3-8B/.
          url: <your-bucket-endpoint> # The endpoint, such as oss-cn-hangzhou-internal.aliyuncs.com.
        volumeHandle: qwen3-8b
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      volumeMode: Filesystem
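
    Save the template to a file (the name oss-model-volume.yaml below is illustrative), apply it with each member cluster's kubeconfig, and confirm that the claim binds:

    kubectl apply -f oss-model-volume.yaml
    kubectl get pvc qwen3-8b # The STATUS column should display Bound.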

Step 2: Configure node pools in member clusters

In each member cluster, create or edit a node pool with the following settings:

  • Instance Type: ecs.gn7i-c8g1.2xlarge (or another suitable GPU instance type)

  • Scaling Mode: Auto

  • Expected Nodes: 0

For more operations and parameter configurations, see Create and manage node pools.

In the node scaling configuration, you can shorten the Defer Scale-in For parameter to reduce the waiting time during the scale-down step later in this topic.
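
Before you deploy the application, you can confirm that a member cluster starts with no GPU nodes. This sketch assumes the aliyun.accelerator/nvidia_name label that ACK adds to GPU-accelerated nodes:

    # Expected to return no nodes while the node pool's expected nodes is 0.
    kubectl get nodes -l aliyun.accelerator/nvidia_name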

Step 3: Create the application and propagation policy in the fleet cluster

  1. Create a file named deploy.yaml to define the inference service deployment.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: qwen3-8b
      name: qwen3-8b
      namespace: default
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: qwen3-8b
      template:
        metadata:
          labels:
            app: qwen3-8b
        spec:
          volumes:
            - name: qwen3-8b
              persistentVolumeClaim:
                claimName: qwen3-8b
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 20Gi
          containers:
          - command:
            - sh
            - -c
            - vllm serve /models/qwen3-8b --port 8000 --trust-remote-code --served-model-name qwen3-8b --tensor-parallel-size 1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
            image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
            name: vllm
            ports:
            - containerPort: 8000
            readinessProbe:
              tcpSocket:
                port: 8000
              initialDelaySeconds: 30
              periodSeconds: 30
            resources:
              limits:
                nvidia.com/gpu: "1"
            volumeMounts:
              - mountPath: /models/qwen3-8b
                name: qwen3-8b
              - mountPath: /dev/shm
                name: dshm
  2. Create a file named PropagationPolicy.yaml. The key field autoScaling.ecsProvision: true enables inventory-aware scheduling. In this policy, replicaSchedulingType: Divided with dynamicWeight: AvailableReplicas splits the replicas across the selected member clusters in proportion to the number of replicas each cluster can accommodate.

    apiVersion: policy.one.alibabacloud.com/v1alpha1
    kind: PropagationPolicy
    metadata:
      name: demo-policy
    spec:
      # This field enables inventory-aware elastic scheduling.
      autoScaling:
        ecsProvision: true
      preserveResourcesOnDeletion: false
      conflictResolution: Overwrite
      resourceSelectors:
      - apiVersion: apps/v1
        kind: Deployment
        name: qwen3-8b
        namespace: default
      placement:
        replicaScheduling:
          replicaSchedulingType: Divided
          weightPreference:
            dynamicWeight: AvailableReplicas
        clusterAffinity:
          clusterNames:
          - ${cluster1-id} # Replace with your member cluster ID.
          - ${cluster2-id} # Replace with your member cluster ID.
  3. Use the kubeconfig file of the fleet to deploy the application and its propagation policy.

    kubectl apply -f deploy.yaml
    kubectl apply -f PropagationPolicy.yaml

    After a few moments, the GPU node pools in your member clusters will begin to scale up automatically.
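
    To watch the scale-up as it happens, you can monitor nodes with a member cluster's kubeconfig:

    kubectl get nodes -w # New GPU nodes appear as instant scaling provisions them.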

Step 4: Validate the elastic scaling

  1. Check the scheduling status of the workload in the fleet.

    kubectl get resourcebinding

    Expected output:

    NAME                  SCHEDULED   FULLYAPPLIED   OVERRIDDEN   ALLAVAILABLE   AGE
    qwen3-8b-deployment   True        True           True         False          7m47s

    The output shows that SCHEDULED is True, indicating that the workload was successfully scheduled. ALLAVAILABLE remains False until the new nodes are provisioned and all pods become available.

  2. Once the pods are in the Running state, check their distribution across the member clusters.

    kubectl amc get deploy qwen3-8b -M

    Expected output:

    NAME       CLUSTER           READY   UP-TO-DATE   AVAILABLE   AGE     ADOPTION
    qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y
    qwen3-8b   cxxxxxxxxxxxxxx   2/2     2            2           3m22s   Y

    The output shows that all replicas are scheduled and running, even though the member clusters initially had no available GPU nodes.
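
    Optionally, send a test request to the inference service. This sketch assumes the standard OpenAI-compatible API served by the vLLM image, run with a member cluster's kubeconfig:

    # Forward the pod's port locally, then query the chat completions endpoint.
    kubectl port-forward deploy/qwen3-8b 8000:8000 &
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen3-8b", "messages": [{"role": "user", "content": "Hello"}]}'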

  3. Update the deploy.yaml file to scale down the number of replicas for qwen3-8b to 2, and re-apply it. Alternatively, delete the workload to simulate a scenario where the number of replicas is scaled down to 0.

    kubectl apply -f deploy.yaml
  4. After about 10 minutes, the GPU node pools in the member clusters will automatically scale down to release the unused nodes, thereby reducing costs.

    If you delete the workload, the number of nodes is scaled down to 0.
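
    You can re-check the replica distribution after the scale-down by using the same command as before:

    kubectl amc get deploy qwen3-8b -M # Each cluster now runs fewer replicas.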