Container Service for Kubernetes: Multi-cluster PyTorchJob scheduling with priority queuing

Last Updated: Mar 26, 2026

PyTorch distributed training jobs require every pod in a job to run simultaneously — a partial start wastes resources and causes the job to hang indefinitely. Gang scheduling enforces an all-or-nothing guarantee: either all pods in a job are scheduled together, or none are. This prevents resource deadlocks in multi-GPU, multi-machine training scenarios.

This topic shows how to configure Kube Queue on an ACK Fleet instance to queue PyTorchJobs, and how to apply gang scheduling so all pods land on the same member cluster atomically.

How it works

The Fleet instance coordinates PyTorchJob scheduling across member clusters using two components:

  • Kube Queue manages job queues and enforces elastic quota limits, holding jobs until enough resources are available in a member cluster.

  • ACK Scheduler applies gang scheduling semantics when the Fleet instance distributes pods to a member cluster, ensuring all replicas (Master and Workers) are placed atomically.

The scheduling flow works as follows:

  1. A PyTorchJob is submitted to the Fleet instance with a PropagationPolicy that specifies customSchedulingType: Gang.

  2. If queue management is enabled (suspension.scheduling: true), the job enters Kube Queue and waits until a quota slot is available.

  3. The Fleet instance evaluates available resources across member clusters and selects a target cluster.

  4. ACK Scheduler places all pods atomically on the selected cluster, maintaining gang semantics.

  5. The Fleet instance monitors the job and syncs status back.
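
Both controls referenced in this flow are fields of the PropagationPolicy. Step 2 contains complete manifests; the relevant excerpt looks like this:

spec:
  suspension:
    scheduling: true  # Optional: hold the job in Kube Queue (step 2 of the flow).
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      customSchedulingType: Gang  # Place all replicas atomically (steps 1 and 4).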

Prerequisites

Before you begin, make sure you have:

  • Cloud-native AI suite installed in the member clusters — deploy only the Arena component

  • The AliyunAdcpFullAccess RAM (Resource Access Management) policy attached to your RAM user. For details, see Grant permissions to RAM users

  • The AMC command-line tool installed. For details, see Use AMC

  • (Optional) Resource reservation enabled if you want the Fleet instance to guarantee scheduling consistency with the member cluster. Resource reservation requires Kubernetes 1.28 or later and ACK Scheduler 6.8.0 or later.

(Optional) Enable resource reservation

Without resource reservation, the Fleet instance estimates available capacity by summing the remaining resources across all nodes in a member cluster. That sum can overstate what is actually schedulable: for example, eight nodes with 3 idle vCPUs each add up to 24 idle vCPUs, yet none of them can host a pod that requests 4 vCPUs. With resource reservation enabled, the Fleet instance reserves actual capacity on the target cluster before committing the job, so the Fleet-level scheduling decision matches the member cluster's own result.

  1. Log in to the ACK console and click Clusters in the left navigation pane.

  2. Click the name of your cluster. In the left navigation pane, click Add-ons.

  3. On the Add-ons page, find Kube Scheduler and click Configuration.

  4. In the Kube Scheduler Parameters dialog box, set enableReservation to true and click OK.

Choose a scheduling mode

Two modes are available:

Mode | When to use | Key configuration
Gang scheduling only | You want pods placed atomically without managing queues | Set customSchedulingType: Gang in the PropagationPolicy
Gang scheduling + queue management | You have many jobs competing for limited resources and need orderly queuing with quota enforcement | Set customSchedulingType: Gang and suspension.scheduling: true

Follow Step 1 if you need queue management; skip to Step 2 if you only need gang scheduling.

Step 1 (Optional): Set up job queues with Kube Queue

Use ElasticQuotaTree to define quota limits and control how many jobs can run concurrently across namespaces.

  1. Submit an ElasticQuotaTree to the Fleet instance. The following example configures a quota for the default namespace that allows only one job to run at a time, with a maximum of 10,000 CPUs, 10,000 GiB of memory, and 10,000 GPUs.

    apiVersion: scheduling.sigs.k8s.io/v1beta1
    kind: ElasticQuotaTree
    metadata:
      name: elasticquotatree  # Only a single ElasticQuotaTree is supported.
      namespace: kube-system   # Must be created in the kube-system namespace.
    spec:
      root:
        name: root
        max:
          cpu: 999900
          memory: 400000Gi
          kube-queue/max-jobs: 10000000000
          nvidia.com/gpu: 100000
        min:
          cpu: 999900
          memory: 400000Gi
          kube-queue/max-jobs: 10000000000
          nvidia.com/gpu: 100000
        children:
        - name: child-2
          max:
            kube-queue/max-jobs: 1  # Only one job can be dequeued at a time.
            cpu: 10000
            nvidia.com/gpu: 10000
            memory: 10000Gi
          namespaces:
            - default
  2. Verify that Kube Queue created the corresponding queues:

    kubectl get queue -n kube-queue

    Expected output:

    NAME                 AGE
    root-child-2-v5zxz   15d
    root-kdzw7           15d
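
To see which jobs are waiting in a queue, the upstream kubernetes-sigs/kube-queue project exposes a QueueUnit resource for each queued job. Assuming the Kube Queue build on your Fleet instance ships the same CRD (an assumption, not a documented guarantee of this topic), you can list the queue units:

kubectl get queueunit -n kube-queue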

Step 2: Submit a PyTorchJob for multi-cluster scheduling

Submit a PropagationPolicy

A PropagationPolicy tells the Fleet instance how to distribute the PyTorchJob across member clusters and which scheduling mode to apply.

Gang scheduling only

Set customSchedulingType: Gang to enable atomic pod placement without queuing.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy
  namespace: default
spec:
  propagateDeps: true  # Also propagate resources the job depends on, such as referenced ConfigMaps.
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 30  # Tolerate an unhealthy state for 30 seconds before failing over.
      purgeMode: Immediately   # Remove workloads from the failed cluster immediately.
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      customSchedulingType: Gang  # Place all replicas of the job atomically.
  resourceSelectors:
    - apiVersion: kubeflow.org/v1
      kind: PyTorchJob  # The policy applies to PyTorchJob resources in this namespace.

Gang scheduling with queue management

Add suspension.scheduling: true so the Fleet instance holds the job in Kube Queue until a quota slot becomes available, then places all pods atomically.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy
  namespace: default
spec:
  suspension:
    scheduling: true
  propagateDeps: true
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 30
      purgeMode: Immediately
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      customSchedulingType: Gang
  resourceSelectors:
    - apiVersion: kubeflow.org/v1
      kind: PyTorchJob
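
Whichever variant you choose, apply it to the Fleet instance and confirm that it was created. The following sketch assumes the manifest is saved as example-policy.yaml and that kubectl points to the Fleet instance:

kubectl apply -f example-policy.yaml
kubectl get propagationpolicy example-policy -n default

Because resourceSelectors specifies only an apiVersion and kind, the policy is not tied to a single job by name; it applies to PyTorchJob resources in the namespace, including the one submitted next.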

Submit a PyTorchJob

Submit the following PyTorchJob to the Fleet instance. It defines one Master pod and two Worker pods.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    app: pytorchjob
  name: pytorch-test
  namespace: default
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            app: pytorchjob
          name: pytorch-test
        spec:
          schedulerName: default-scheduler
          containers:
          - command:
            - sh
            - -c
            - sleep 1h
            env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: void  # Hide GPUs; this demo container does not request any.
            - name: gpus
              value: "0"
            - name: workers
              value: "8"
            image: registry-cn-hangzhou.ack.aliyuncs.com/acs/nginx  # Placeholder image for this example; replace with your training image.
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                cpu: "3"
              requests:
                cpu: "10m"
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            workingDir: /root
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 2Gi
            name: dshm
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: pytorchjob
          name: pytorch-test
        spec:
          containers:
          - command:
            - bash
            - -c
            - |
              echo "$WORKER_INDEX"
              sleep 1h
            env:
            - name: WORKER_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['pytorch-replica-index']
            - name: NVIDIA_VISIBLE_DEVICES
              value: void
            - name: gpus
              value: "0"
            - name: workers
              value: "8"
            image: registry-cn-hangzhou.ack.aliyuncs.com/acs/nginx
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                cpu: "2"
              requests:
                cpu: "2"
                memory: "2Gi"
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            workingDir: /root
          volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 2Gi
            name: dshm
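
A minimal submission sketch, assuming the manifest above is saved as pytorch-test.yaml and that kubectl points to the Fleet instance:

kubectl apply -f pytorch-test.yaml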

Step 3: Verify the job status

Run these commands on the Fleet instance to confirm the job was scheduled and all pods are running.

  1. Check the PyTorchJob state on the Fleet instance:

    kubectl get pytorchjob

    Expected output:

    NAME           STATE     AGE
    pytorch-test   Created   3m44s
  2. Check which member cluster the job was scheduled to:

    kubectl describe pytorchjob pytorch-test

    Look for the ScheduleBindingSucceed event. Its Result field shows the target cluster and replica counts:

    Normal   ScheduleBindingSucceed  4m59s   default-scheduler   Binding has been scheduled successfully. Result: {cfxxxxxx:0,[{master 1} {worker 2}]}

    cfxxxxxx is the member cluster ID where all pods will run.

  3. Confirm the job is running in the member cluster:

    kubectl amc get pytorchjob -M

    Expected output:

    NAME           CLUSTER    STATE     AGE     ADOPTION
    pytorch-test   cfxxxxxx   Running   6m23s   Y

    ADOPTION: Y means the Fleet instance has taken over scheduling for this job.

  4. Confirm all pods are running:

    kubectl amc get pod -M

    Expected output:

    NAME                    CLUSTER    READY   STATUS    RESTARTS   AGE
    pytorch-test-master-0   cfxxxxxx   1/1     Running   0          7m16s
    pytorch-test-worker-0   cfxxxxxx   1/1     Running   0          7m16s
    pytorch-test-worker-1   cfxxxxxx   1/1     Running   0          7m16s

    All three pods (one Master and two Workers) are running on the same cluster, confirming that gang scheduling placed them atomically.

  5. To inspect the full PyTorchJob YAML in the member cluster, run:

    kubectl amc get pytorchjob pytorch-test -m <member-cluster-id> -o yaml

    Replace <member-cluster-id> with the member cluster ID from the scheduling event (cfxxxxxx in the example output).
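
Clean up

When you finish testing, delete the example resources from the Fleet instance; deleting the templates on the Fleet instance should also remove the propagated copies from the member cluster. A cleanup sketch, assuming the resource names used in this topic:

kubectl delete pytorchjob pytorch-test -n default
kubectl delete propagationpolicy example-policy -n default

# If you created the quota in Step 1, remove it as well.
kubectl delete elasticquotatree elasticquotatree -n kube-system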