GPU availability across regions is unpredictable. Pre-provisioning nodes to guarantee capacity is expensive. ACK One multi-cluster fleets address this with an inventory-aware scheduler: when all child clusters in a fleet lack available GPU nodes, the scheduler queries real-time ECS inventory, selects a cluster with available capacity, and triggers instant node elasticity to scale out nodes on demand. Workloads run without requiring idle GPU nodes to be standing by.
This feature is currently in invitational preview. To try it, submit a ticket.
How it works
Three components collaborate to deliver inventory-aware elastic scheduling:
| Component | Role |
|---|---|
| Fleet scheduler | Detects resource shortfalls in child clusters, queries inventory via ACK GOATScaler, and distributes replicas based on available capacity—including inventory not yet provisioned as running nodes. |
| ACK GOATScaler | Runs in each child cluster. Checks real-time ECS inventory and returns available instance counts to the fleet scheduler. |
| Child cluster node pools | Configured with instant node elasticity and zero desired nodes. Nodes scale out only when the scheduler assigns workloads, and scale back in when workloads are removed. |
When you create an application in a fleet and no child cluster has enough running resources, the following sequence runs:
- The scheduler detects that child clusters lack resources and cannot schedule the workload.
- The scheduler triggers ACK GOATScaler in each child cluster to check inventory.
- Based on the inventory result, the scheduler reschedules and distributes the application to the cluster with available capacity.
- The selected child cluster scales out nodes and runs the application.
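The sequence above can be modeled in a few lines of Python. This is only an illustrative sketch, not the fleet scheduler's actual implementation: `place_replicas`, the `query_inventory` callback, and the greedy "most capacity first" rule are all assumptions made for the example, and capacity is counted in whole GPUs with one GPU per replica.

```python
from typing import Callable, Dict


def place_replicas(
    replicas: int,
    free_gpus: Dict[str, int],
    query_inventory: Callable[[str], int],
) -> Dict[str, int]:
    """Assign replicas to child clusters (hypothetical model, 1 GPU per replica)."""
    # Step 1: detect whether GPUs on already-running nodes cover the request.
    if sum(free_gpus.values()) >= replicas:
        capacity = dict(free_gpus)
    else:
        # Step 2: ask each child cluster how many GPUs it could add from inventory.
        capacity = {
            name: free + query_inventory(name) for name, free in free_gpus.items()
        }

    # Step 3: assign replicas, preferring clusters with the most capacity
    # (an assumed rule chosen to keep the sketch short).
    assignment: Dict[str, int] = {}
    remaining = replicas
    for name, cap in sorted(capacity.items(), key=lambda kv: kv[1], reverse=True):
        take = min(cap, remaining)
        if take > 0:
            assignment[name] = take
            remaining -= take

    if remaining > 0:
        raise RuntimeError("not enough running or inventory capacity in the fleet")

    # Step 4 (not modeled): each selected cluster scales out nodes for its replicas.
    return assignment


# Hypothetical fleet: no running GPU nodes, inventory only available for cluster-b.
inventory = {"cluster-a": 0, "cluster-b": 6}
print(place_replicas(4, {"cluster-a": 0, "cluster-b": 0}, lambda name: inventory[name]))
# {'cluster-b': 4}
```

In this model the inventory query runs only when running nodes cannot host the workload, which mirrors the order of the steps above.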
Prerequisites
Before you begin, make sure you have:
- Multiple associated clusters in the fleet. The scheduler distributes workloads across these child clusters.
- Instant node elasticity enabled for each child cluster. This allows child clusters to scale out GPU nodes when the scheduler assigns workloads to them.
- The AMC command-line tool installed. It is used in Step 3 to verify pod distribution across clusters.
If node autoscaling is already enabled for a child cluster, switch to instant node elasticity before proceeding. See Enable instant node elasticity.
GPU instance specifications and cost estimation
Model parameters are the main consumer of GPU memory during inference. Use the following formula to estimate required GPU memory:
GPU memory = Number of parameters × Bytes per parameter
Example: 7B model at FP16 precision
| Factor | Value |
|---|---|
| Parameters | 7 × 10⁹ |
| Bytes per parameter (FP16) | 2 bytes |
| Model memory | 7 × 10⁹ × 2 bytes ≈ 13.04 GiB |
Beyond model loading, account for KV cache and computation buffers. For a 7B model at FP16, use a GPU instance with at least 24 GiB of GPU memory, such as ecs.gn7i-c8g1.2xlarge or ecs.gn7i-c16g1.4xlarge.
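The weight-only estimate can be reproduced with a short calculation. The sketch below applies the formula from this section; the helper name `model_memory_gib` is illustrative, and the result covers model weights only, not KV cache or computation buffers.

```python
GIB = 1024 ** 3  # bytes per GiB


def model_memory_gib(num_parameters: float, bytes_per_parameter: float) -> float:
    """GPU memory needed for model weights alone, in GiB."""
    return num_parameters * bytes_per_parameter / GIB


# 7B parameters at FP16 (2 bytes per parameter), as in the table above.
print(f"{model_memory_gib(7e9, 2):.2f} GiB")  # 13.04 GiB

# 8B parameters at FP16, the size class of the Qwen3-8B model used later.
print(f"{model_memory_gib(8e9, 2):.2f} GiB")  # 14.90 GiB
```

Both results leave headroom on a 24 GiB GPU for KV cache and buffers, which is why the weight-only number should be treated as a lower bound.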
For full instance type details and pricing, see GPU-accelerated compute optimized instance family and Elastic GPU Service billing.
Preparations
This section prepares the Qwen3-8B model files and creates the corresponding OSS PersistentVolumes in each child cluster.
1. Download the model
Make sure Git Large File Storage (LFS) is installed. If not, run yum install git-lfs or apt-get install git-lfs. For other installation methods, see Install Git Large File Storage.
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B
git lfs pull
2. Upload the model to OSS
For ossutil installation and usage, see Install ossutil.
ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B
3. Create a PersistentVolume and PersistentVolumeClaim in each child cluster
For detailed steps, see Use ossfs 1.0 static persistent volume.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # AccessKey ID for accessing OSS
  akSecret: <your-oss-sk> # AccessKey Secret for accessing OSS
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-8b
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: qwen3-8b
  storageClassName: oss
  volumeMode: Filesystem
  volumeName: qwen3-8b
---
apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    alicloud-pvname: qwen3-8b
  name: qwen3-8b
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 20Gi
  csi:
    driver: ossplugin.csi.alibabacloud.com
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # Bucket name
      otherOpts: '-o allow_other -o umask=000'
      path: <your-model-path> # Example: /models/Qwen3-8B/
      url: <your-bucket-endpoint> # Example: oss-cn-hangzhou-internal.aliyuncs.com
    volumeHandle: qwen3-8b
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
  volumeMode: Filesystem
Step 1: Configure node pools for child clusters
Create or edit a GPU node pool in each child cluster with the following settings:
| Setting | Value |
|---|---|
| Instance type | ecs.gn7i-c8g1.2xlarge |
| Scaling Mode | Auto |
| Desired nodes | 0 |
Starting at zero nodes eliminates idle GPU costs. The node pool scales out only when the fleet scheduler assigns workloads.
For detailed configuration steps, see Create and manage node pools.
To reduce wait time in Step 3, shorten the scale-in trigger delay in the node pool settings.
Step 2: Create an application and distribution policy in the fleet cluster
All resources in this step use the fleet cluster's kubeconfig.
1. Create deploy.yaml
The Deployment runs 4 replicas of the Qwen3-8B model using vLLM, with each pod requesting 1 GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3-8b
  name: qwen3-8b
  namespace: default
spec:
  replicas: 4
  selector:
    matchLabels:
      app: qwen3-8b
  template:
    metadata:
      labels:
        app: qwen3-8b
    spec:
      volumes:
        - name: qwen3-8b
          persistentVolumeClaim:
            claimName: qwen3-8b
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 20Gi
      containers:
        - command:
            - sh
            - -c
            - vllm serve /models/qwen3-8b --port 8000 --trust-remote-code --served-model-name qwen3-8b --tensor-parallel=1 --max-model-len 8192 --gpu-memory-utilization 0.95 --enforce-eager
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
          name: vllm
          ports:
            - containerPort: 8000
          readinessProbe:
            tcpSocket:
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 30
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - mountPath: /models/qwen3-8b
              name: qwen3-8b
            - mountPath: /dev/shm
              name: dshm
2. Create PropagationPolicy.yaml
The PropagationPolicy distributes the Deployment across two child clusters.
Key fields:
| Field | Value | Description |
|---|---|---|
| autoScaling.ecsProvision | true | Enables inventory-aware elastic scheduling. The scheduler queries real-time ECS inventory when placing workloads. |
| replicaSchedulingType | Divided | Splits replicas across clusters. Use Duplicated to deploy a full copy of the Deployment to each cluster instead. |
| dynamicWeight | AvailableReplicas | Allocates replicas in proportion to each cluster's schedulable capacity, including capacity from ECS inventory. A cluster with available inventory gets more replicas, even if it has no running GPU nodes yet. |
apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-policy
spec:
  # Enables inventory-aware elastic scheduling
  autoScaling:
    ecsProvision: true
  preserveResourcesOnDeletion: false
  conflictResolution: Overwrite
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: qwen3-8b
      namespace: default
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      weightPreference:
        dynamicWeight: AvailableReplicas
    clusterAffinity:
      clusterNames:
        - ${cluster1-id} # Replace with your actual child cluster ID
        - ${cluster2-id} # Replace with your actual child cluster ID
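To build intuition for how `Divided` with `dynamicWeight: AvailableReplicas` might split replicas, the following Python sketch divides a replica count in proportion to each cluster's schedulable capacity. The function name `divide_replicas`, the sample capacities, and the largest-remainder rounding are assumptions for illustration; the scheduler's exact rounding behavior is not described in this document.

```python
from typing import Dict


def divide_replicas(replicas: int, available: Dict[str, int]) -> Dict[str, int]:
    """Split replicas proportionally to each cluster's available replica count."""
    total = sum(available.values())
    if total == 0:
        raise ValueError("no cluster reports schedulable capacity")

    # Integer share for each cluster, rounded down.
    shares = {name: replicas * cap // total for name, cap in available.items()}

    # Hand out what rounding left over, largest fractional remainder first
    # (assumed tie-breaking rule: original cluster order).
    leftover = replicas - sum(shares.values())
    by_remainder = sorted(
        available, key=lambda name: (replicas * available[name]) % total, reverse=True
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Equal inventory in both clusters: the 4 replicas split evenly.
print(divide_replicas(4, {"cluster1": 3, "cluster2": 3}))  # {'cluster1': 2, 'cluster2': 2}

# Inventory only in cluster2: every replica goes there.
print(divide_replicas(4, {"cluster1": 0, "cluster2": 5}))  # {'cluster1': 0, 'cluster2': 4}
```

The available counts act as weights here, so a cluster with more reported capacity, running or from inventory, receives a larger share.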
3. Apply the manifests
kubectl apply -f deploy.yaml
kubectl apply -f PropagationPolicy.yaml
After a short time, GPU node pools in both child clusters begin scaling out automatically.
Step 3: Verify elastic scaling
Check workload scheduling status
kubectl get resourcebinding
Expected output:
NAME SCHEDULED FULLYAPPLIED OVERRIDDEN ALLAVAILABLE AGE
qwen3-8b-deployment True True True False 7m47s
| Field | Value | Meaning |
|---|---|---|
| SCHEDULED | True | The fleet scheduler successfully placed the workload across child clusters. |
| ALLAVAILABLE | False | Nodes are still scaling out. This is a normal intermediate state, not an error. The value changes to True once all pods are running. |
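If you want to check these fields from a script rather than by eye, a small parser over the command output is enough. The sketch below is a hypothetical helper, not part of any ACK One tooling: it assumes the whitespace-separated table layout shown above, with no empty cells, and uses the sample output from this section as input.

```python
from typing import Dict, List


def parse_table(text: str) -> List[Dict[str, str]]:
    """Turn kubectl-style table output into one dict per row, keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def is_ready(text: str, name: str) -> bool:
    """True once the named binding is both scheduled and fully available."""
    for row in parse_table(text):
        if row["NAME"] == name:
            return row["SCHEDULED"] == "True" and row["ALLAVAILABLE"] == "True"
    raise KeyError(name)


sample = """\
NAME                  SCHEDULED   FULLYAPPLIED   OVERRIDDEN   ALLAVAILABLE   AGE
qwen3-8b-deployment   True        True           True         False          7m47s
"""
print(is_ready(sample, "qwen3-8b-deployment"))  # False: nodes are still scaling out
```

In practice the text could come from `subprocess.run(["kubectl", "get", "resourcebinding"], capture_output=True, text=True).stdout`, polled until `is_ready` returns True.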
Check pod distribution across clusters
After pods reach the Running state, run:
kubectl amc get deploy qwen3-8b -M
Expected output:
NAME CLUSTER READY UP-TO-DATE AVAILABLE AGE ADOPTION
qwen3-8b cxxxxxxxxxxxxxx 2/2 2 2 3m22s Y
qwen3-8b cxxxxxxxxxxxxxx 2/2 2 2 3m22s Y
| Field | Value | Meaning |
|---|---|---|
| READY | 2/2 | All replicas in the child cluster are running. |
| ADOPTION | Y | The child cluster has taken ownership of the workload. |
All 4 replicas are running across both clusters, even though the child clusters had zero GPU nodes before deployment.
Verify scale-in behavior
Set spec.replicas to 2 in deploy.yaml, then re-apply the manifest:
kubectl apply -f deploy.yaml
Alternatively, delete the workload to simulate scaling to zero replicas.
After about ten minutes (sooner if you shortened the scale-in trigger delay in Step 1), the GPU node pool in each child cluster scales in to one node. If you deleted the workload, each node pool scales in to zero nodes.
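The expected node counts follow from simple arithmetic. The sketch below assumes an even split of replicas across the two child clusters, one GPU per pod (as in deploy.yaml), and one GPU per node for the chosen instance type; the helper name and the `gpus_per_node` default are assumptions you should adjust to your node pool.

```python
import math


def expected_nodes(replicas_in_cluster: int, gpus_per_pod: int = 1, gpus_per_node: int = 1) -> int:
    """GPU nodes a child cluster needs to host its share of replicas."""
    return math.ceil(replicas_in_cluster * gpus_per_pod / gpus_per_node)


CLUSTERS = 2  # two child clusters in this walkthrough
for total_replicas in (4, 2, 0):
    per_cluster = total_replicas // CLUSTERS  # assumes an even split
    print(total_replicas, "replicas ->", expected_nodes(per_cluster), "node(s) per cluster")
# 4 replicas -> 2 node(s) per cluster
# 2 replicas -> 1 node(s) per cluster
# 0 replicas -> 0 node(s) per cluster
```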
This confirms the full workflow: the fleet schedules workloads based on real-time inventory, scales out GPU nodes to run the workload, and automatically scales back in once the workload is removed—eliminating idle GPU costs.