Container Service for Kubernetes: How to use Kube Queue on a Fleet instance and schedule PyTorchJob by using gang scheduling

Last Updated: Mar 05, 2025

PyTorch is a widely used machine learning framework that helps model developers implement multi-machine, multi-GPU distributed training jobs. In Kubernetes, you can use PyTorchJob to submit machine learning jobs that use the PyTorch framework. This topic describes how to use Kube Queue to manage jobs on a Fleet instance and how to use gang scheduling when the Fleet instance assigns resources.

Architecture

To implement multi-machine, multi-GPU distributed training and ensure that the pods of a training job can run as expected, scheduling must follow gang semantics: either all pods of the training job run, or none of them do. The Fleet instance schedules PyTorchJobs across multiple clusters by using Kube Queue and ACK Scheduler, which preserves the gang semantics declared in the PyTorchJob.

Prerequisites

  • The cloud-native AI suite is installed in the sub-clusters. Only the Arena component needs to be deployed.

  • The Resource Access Management (RAM) policy AliyunAdcpFullAccess is attached to a RAM user. For more information, see Grant permissions to RAM users.

  • The AMC command-line tool is installed. For more information, see Use AMC.

  • (Optional) If resource reservation is enabled for a sub-cluster, the Fleet instance uses resource reservation to ensure that its scheduling result matches the scheduling result of the sub-cluster. If resource reservation is disabled, the Fleet instance evaluates resources and schedules jobs by comparing the total amount of remaining resources on all nodes in the sub-cluster with the total amount of requested resources. The following steps describe how to enable resource reservation; a quick version check is shown after the steps.

    Note

    To enable resource reservation, the Kubernetes version of the cluster must be 1.28 or later and the scheduler version must be 6.8.0 or later.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Operations > Add-ons.

    3. On the Add-ons page, find the Kube Scheduler component, click Configuration to go to the Kube Scheduler Parameters page, set the enableReservation parameter to true, and then click OK.
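
    To quickly confirm that a cluster meets the Kubernetes version requirement, you can check the server version with kubectl while connected to that cluster (the scheduler version is shown next to the Kube Scheduler component on the Add-ons page):

    kubectl version | grep "Server Version"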

(Optional) Step 1: Use ack-kube-queue to queue jobs and manage quotas

In a multi-cluster Fleet, you can use the ElasticQuotaTree to manage quotas and configure queues. This helps you queue a large number of jobs in multi-user scenarios.

  1. The following sample code provides an example of how to submit an ElasticQuotaTree. A command to apply the manifest follows the sample.

    Cluster administrators can submit an ElasticQuotaTree on the Fleet instance to configure job queues. In this example, a quota is configured for the default namespace with a maximum of 10,000 CPUs, 10,000 GiB of memory, 10,000 GPUs, and one job dequeued at a time.

    apiVersion: scheduling.sigs.k8s.io/v1beta1
    kind: ElasticQuotaTree
    metadata:
      name: elasticquotatree # Only a single ElasticQuotaTree is supported. 
      namespace: kube-system # The elastic quota group takes effect only if you create the group in the kube-system namespace. 
    spec:
      root:
        name: root # The maximum amount of resources for the root must be equal to the minimum amount of resources for the root. 
        max:
          cpu: 999900
          memory: 400000Gi
          kube-queue/max-jobs: 10000000000
          nvidia.com/gpu: 100000
        min:
          cpu: 9999
          memory: 40Gi
        children:
        - name: child-2
          max:
            # Only one job can be dequeued at a time. 
            kube-queue/max-jobs: 1
            cpu: 10000
            nvidia.com/gpu: 10000
            memory: 10000Gi
          namespaces: # Configure the namespace. 
            - default
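
    Assuming the manifest above is saved as a local file, for example elastic-quota-tree.yaml (the file name is illustrative), apply it on the Fleet instance:

    kubectl apply -f elastic-quota-tree.yaml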
  2. Run the following command on the Fleet instance to view the queues that Kube Queue creates based on the ElasticQuotaTree (a follow-up command to inspect an individual queue is shown after the output):

    kubectl get queue -n kube-queue

    Expected output:

    NAME                 AGE
    root-child-2-v5zxz   15d
    root-kdzw7           15d
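
    To inspect an individual queue in more detail, view its full definition. The queue name below is taken from the output above:

    kubectl get queue root-child-2-v5zxz -n kube-queue -o yaml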

Step 2: Submit a PyTorchJob for multi-cluster scheduling

  1. Submit a PropagationPolicy on the Fleet instance and set Custom Policy to Gang.

    Use only the Gang scheduling policy

    When you use the gang scheduling feature, you must specify the customSchedulingType=Gang parameter in the submitted PropagationPolicy.

    apiVersion: policy.one.alibabacloud.com/v1alpha1
    kind: PropagationPolicy
    metadata:
      name: example-policy # The default namespace is `default`.
    spec:
      propagateDeps: true
      # Reschedule the job when the job fails to run. 
      failover:
        application:
          decisionConditions:
            tolerationSeconds: 30
          purgeMode: Immediately
      placement:
        replicaScheduling:
          replicaSchedulingType: Divided
          # Use gang scheduling. 
          customSchedulingType: Gang
      resourceSelectors:
        - apiVersion: kubeflow.org/v1
          kind: PyTorchJob

    Use ElasticQuotaTree and gang scheduling

    You must specify the customSchedulingType=Gang parameter in the submitted PropagationPolicy and set .spec.suspension.scheduling to true so that jobs are added to the queue before they are scheduled. A command to apply the policy follows the samples.

    apiVersion: policy.one.alibabacloud.com/v1alpha1
    kind: PropagationPolicy
    metadata:
      name: example-policy # The default namespace is `default`.
    spec:
      propagateDeps: true
      # Suspend scheduling so that the job is queued by Kube Queue before it is scheduled.
      suspension:
        scheduling: true
      # Reschedule the job when it fails to run. 
      failover:
        application:
          decisionConditions:
            tolerationSeconds: 30
          purgeMode: Immediately
      placement:
        replicaScheduling:
          replicaSchedulingType: Divided
          # Use gang scheduling. 
          customSchedulingType: Gang
      resourceSelectors:
        - apiVersion: kubeflow.org/v1
          kind: PyTorchJob
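
    In either case, assuming the policy above is saved as propagation-policy.yaml (the file name is illustrative), apply it on the Fleet instance:

    kubectl apply -f propagation-policy.yaml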
  2. The following sample code provides an example of how to submit a PyTorchJob on the Fleet instance. A command to apply the manifest follows the sample.

    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      labels:
        app: pytorchjob
      name: pytorch-test
      namespace: default
    spec:
      cleanPodPolicy: None
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              labels:
                app: pytorchjob
              name: pytorch-test
            spec:
              schedulerName: default-scheduler
              containers:
              - command:
                - sh
                - -c
                - sleep 1h
                env:
                - name: NVIDIA_VISIBLE_DEVICES
                  value: void
                - name: gpus
                  value: "0"
                - name: workers
                  value: "8"
                image: registry-cn-hangzhou.ack.aliyuncs.com/acs/nginx
                imagePullPolicy: Always
                name: pytorch
                resources:
                  limits:
                    cpu: "3"
                  requests:
                    cpu: "10m"
                volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                workingDir: /root
              volumes:
              - emptyDir:
                  medium: Memory
                  sizeLimit: 2Gi
                name: dshm
        Worker:
          replicas: 2
          restartPolicy: OnFailure
          # restartPolicy: Never
          template:
            metadata:
              labels:
                app: pytorchjob
              name: pytorch-test
            spec:
              containers:
              - command:
                - bash
                - -c
                - |
                  #!/bin/bash
                  #sleep 180
                  echo "$WORKER_INDEX"
                  #if [[ "$WORKER_INDEX" == "0" ]]
                  #then
                  #  exit -1
                  #fi
                  sleep 1h
                env:
                - name: WORKER_INDEX
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.labels['pytorch-replica-index']
                - name: NVIDIA_VISIBLE_DEVICES
                  value: void
                - name: gpus
                  value: "0"
                - name: workers
                  value: "8"
                image: registry-cn-hangzhou.ack.aliyuncs.com/acs/nginx
                imagePullPolicy: Always
                name: pytorch
                resources:
                  limits:
                    cpu: "2"
                  requests:
                    cpu: "2"
                    memory: "2Gi"
                volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                workingDir: /root
              volumes:
              - emptyDir:
                  medium: Memory
                  sizeLimit: 2Gi
                name: dshm
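
    Assuming the manifest above is saved as pytorch-test.yaml (the file name is illustrative), submit the PyTorchJob on the Fleet instance:

    kubectl apply -f pytorch-test.yaml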

Step 3: View the job status

  1. Run the following command on the Fleet instance to query the status of the PyTorchJob:

    kubectl get pytorchjob

    Expected output:

    NAME           STATE     AGE
    pytorch-test   Created   3m44s
  2. Run the following command on the Fleet instance to check which associated cluster the PyTorchJob is scheduled to:

    kubectl describe pytorchjob pytorch-test

    Expected output:

     Normal   ScheduleBindingSucceed  4m59s                   default-scheduler                   Binding has been scheduled successfully. Result: {cfxxxxxx:0,[{master 1} {worker 2}]}
  3. Run the following command on the Fleet instance to query the status of the PyTorchJob in the associated cluster:

    kubectl amc get pytorchjob -M

    Expected output:

    NAME           CLUSTER    STATE     AGE     ADOPTION
    pytorch-test   cfxxxxxx   Running   6m23s   Y
  4. Run the following command on the Fleet instance to query the status of the pod:

    kubectl amc get pod -M   

    Expected output:

    NAME                    CLUSTER    READY   STATUS      RESTARTS   AGE
    pytorch-test-master-0   cfxxxxxx   1/1     Running     0          7m16s
    pytorch-test-worker-0   cfxxxxxx   1/1     Running     0          7m16s
    pytorch-test-worker-1   cfxxxxxx   1/1     Running     0          7m16s
  5. Run the following command on the Fleet instance to view the details of the PyTorchJob in the associated cluster:

    kubectl amc get pytorchjob pytorch-test -m <member-cluster-id> -oyaml
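
    For example, with the member cluster ID shown in the earlier output (cfxxxxxx is a placeholder for your actual cluster ID):

    kubectl amc get pytorchjob pytorch-test -m cfxxxxxx -oyaml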