PyTorch distributed training jobs require every pod in a job to run simultaneously; a partial start wastes resources and leaves the job hanging indefinitely. Gang scheduling enforces an all-or-nothing guarantee: either all pods in a job are scheduled together, or none are. This prevents resource deadlocks in multi-GPU, multi-machine training scenarios.
This topic shows how to configure Kube Queue on an ACK Fleet instance to queue PyTorchJobs, and how to apply gang scheduling so all pods land on the same member cluster atomically.
How it works
The Fleet instance coordinates PyTorchJob scheduling across member clusters using two components:
Kube Queue manages job queues and enforces elastic quota limits, holding jobs until enough resources are available in a member cluster.
ACK Scheduler applies gang scheduling semantics when the Fleet instance distributes pods to a member cluster, ensuring all replicas (Master and Workers) are placed atomically.
The scheduling flow works as follows:
1. A PyTorchJob is submitted to the Fleet instance with a PropagationPolicy that specifies customSchedulingType: Gang.
2. If queue management is enabled (suspension.scheduling: true), the job enters Kube Queue and waits until a quota slot is available.
3. The Fleet instance evaluates available resources across member clusters and selects a target cluster.
4. ACK Scheduler places all pods atomically on the selected cluster, maintaining gang semantics.
5. The Fleet instance monitors the job and syncs status back.
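Both switches in this flow live in the PropagationPolicy submitted in Step 2. The fragment below condenses the full examples from that step down to just those two fields; it is an outline, not a complete policy:

spec:
  suspension:
    scheduling: true               # Optional: hold the job in Kube Queue until a quota slot frees up.
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      customSchedulingType: Gang   # Place all replicas of the job atomically.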
Prerequisites
Before you begin, make sure you have:
The cloud-native AI suite installed in the member clusters, with only the Arena component deployed
The AliyunAdcpFullAccess RAM (Resource Access Management) policy attached to your RAM user. For details, see Grant permissions to RAM users
The AMC command-line tool installed. For details, see Use AMC
(Optional) Resource reservation enabled if you want the Fleet instance to guarantee scheduling consistency with the member cluster. Resource reservation requires Kubernetes 1.28 or later and ACK Scheduler 6.8.0 or later.
(Optional) Enable resource reservation
Without resource reservation, the Fleet instance estimates available capacity by summing remaining resources across all nodes in a member cluster. That aggregate can overstate what a gang actually fits: for example, a cluster can report eight free GPUs spread one per node even though no single node can host a pod that requests two GPUs. With resource reservation enabled, the Fleet instance holds actual capacity on the target cluster before committing, so the Fleet-level scheduling decision matches the member cluster result.
1. Log in to the ACK console and click Clusters in the left navigation pane.
2. Click the name of your cluster. In the left navigation pane, click Add-ons.
3. On the Add-ons page, find Kube Scheduler and click Configuration.
4. In the Kube Scheduler Parameters dialog box, set enableReservation to true and click OK.
Choose a scheduling mode
Two modes are available:
| Mode | When to use | Key configuration |
|---|---|---|
| Gang scheduling only | You want pods placed atomically without managing queues | Set customSchedulingType: Gang in PropagationPolicy |
| Gang scheduling + queue management | You have many jobs competing for limited resources and need orderly queuing with quota enforcement | Set customSchedulingType: Gang and suspension.scheduling: true |
Follow Step 1 if you need queue management; skip to Step 2 if you only need gang scheduling.
Step 1 (Optional): Set up job queues with Kube Queue
Use ElasticQuotaTree to define quota limits and control how many jobs can run concurrently across namespaces.
Submit an ElasticQuotaTree to the Fleet instance. The following example configures a quota for the default namespace that allows only one job to run at a time, with a maximum of 10,000 CPUs, 10,000 GiB of memory, and 10,000 GPUs.

apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree   # Only a single ElasticQuotaTree is supported.
  namespace: kube-system   # Must be created in the kube-system namespace.
spec:
  root:
    name: root
    max:
      cpu: 999900
      memory: 400000Gi
      kube-queue/max-jobs: 10000000000
      nvidia.com/gpu: 100000
    min:
      cpu: 999900
      memory: 400000Gi
      kube-queue/max-jobs: 10000000000
      nvidia.com/gpu: 100000
    children:
    - name: child-2
      max:
        kube-queue/max-jobs: 1   # Only one job can be dequeued at a time.
        cpu: 10000
        nvidia.com/gpu: 10000
        memory: 10000Gi
      namespaces:
      - default

Verify that Kube Queue created the corresponding queues:

kubectl get queue -n kube-queue

Expected output:

NAME                 AGE
root-child-2-v5zxz   15d
root-kdzw7           15d
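To queue jobs from more namespaces, extend the children list with additional leaves that follow the same schema. A sketch (the child-3 name, its limits, and the team-a namespace are hypothetical):

    children:
    - name: child-2
      max:
        kube-queue/max-jobs: 1
        cpu: 10000
        nvidia.com/gpu: 10000
        memory: 10000Gi
      namespaces:
      - default
    - name: child-3
      max:
        kube-queue/max-jobs: 2   # Hypothetical: this quota can dequeue two jobs at a time.
        cpu: 10000
        nvidia.com/gpu: 10000
        memory: 10000Gi
      namespaces:
      - team-a

Each leaf in the tree maps to its own queue, so a new child shows up as an additional entry in the kubectl get queue output above.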
Step 2: Submit a PyTorchJob for multi-cluster scheduling
Submit a PropagationPolicy
A PropagationPolicy tells the Fleet instance how to distribute the PyTorchJob across member clusters and which scheduling mode to apply.
Gang scheduling only
Set customSchedulingType: Gang to enable atomic pod placement without queuing.
apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy
  namespace: default
spec:
  propagateDeps: true
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 30
      purgeMode: Immediately
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      customSchedulingType: Gang
  resourceSelectors:
  - apiVersion: kubeflow.org/v1
    kind: PyTorchJob

Gang scheduling with queue management
Add suspension.scheduling: true so the Fleet instance holds the job in Kube Queue until a quota slot becomes available, then places all pods atomically.
apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy
  namespace: default
spec:
  suspension:
    scheduling: true
  propagateDeps: true
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 30
      purgeMode: Immediately
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      customSchedulingType: Gang
  resourceSelectors:
  - apiVersion: kubeflow.org/v1
    kind: PyTorchJob
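With the policy in place, submit the PyTorchJob itself to the Fleet instance. The original manifest is not reproduced in this topic; the following is a minimal sketch that matches the job name and replica counts verified in Step 3 (the image, command, and GPU request are placeholders):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-test
  namespace: default
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                                  # Kubeflow's default container name for PyTorchJob.
            image: registry.example.com/pytorch-train:v1   # Placeholder image.
            command: ["python", "/workspace/train.py"]     # Placeholder entrypoint.
            resources:
              limits:
                nvidia.com/gpu: 1                          # Placeholder GPU request.
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: registry.example.com/pytorch-train:v1   # Placeholder image.
            command: ["python", "/workspace/train.py"]     # Placeholder entrypoint.
            resources:
              limits:
                nvidia.com/gpu: 1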
Step 3: Verify the job status

Run these commands on the Fleet instance to confirm the job was scheduled and all pods are running.
Check the PyTorchJob state on the Fleet instance:
kubectl get pytorchjob

Expected output:

NAME           STATE     AGE
pytorch-test   Created   3m44s

Check which member cluster the job was scheduled to:

kubectl describe pytorchjob pytorch-test

Look for ScheduleBindingSucceed in the events. The result field shows the target cluster and replica counts:

Normal  ScheduleBindingSucceed  4m59s  default-scheduler  Binding has been scheduled successfully. Result: {cfxxxxxx:0,[{master 1} {worker 2}]}

cfxxxxxx is the member cluster ID where all pods will run.

Confirm the job is running in the member cluster:

kubectl amc get pytorchjob -M

Expected output:

NAME           CLUSTER    STATE     AGE     ADOPTION
pytorch-test   cfxxxxxx   Running   6m23s   Y

ADOPTION: Y means the Fleet instance has taken over scheduling for this job.

Confirm all pods are running:

kubectl amc get pod -M

Expected output:

NAME                    CLUSTER    READY   STATUS    RESTARTS   AGE
pytorch-test-master-0   cfxxxxxx   1/1     Running   0          7m16s
pytorch-test-worker-0   cfxxxxxx   1/1     Running   0          7m16s
pytorch-test-worker-1   cfxxxxxx   1/1     Running   0          7m16s

All three pods (one Master and two Workers) are running on the same cluster, confirming that gang scheduling placed them atomically.
To inspect the full PyTorchJob YAML in the member cluster, run:
kubectl amc get pytorchjob pytorch-test -m <member-cluster-id> -o yaml
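If pods remain Pending instead of Running, gang scheduling is intentionally holding the entire group back until every replica can be placed. Assuming your AMC version supports describe the same way it supports get (otherwise, point kubectl at the member cluster directly), inspect a pod's events with:

kubectl amc describe pod pytorch-test-master-0 -m <member-cluster-id>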