Container Service for Kubernetes: Work with gang scheduling

Last Updated: Sep 04, 2025

The gang scheduling feature provided by Container Service for Kubernetes (ACK) is developed on top of the new kube-scheduler framework. Gang scheduling ensures that a group of correlated pods is scheduled at the same time. If the scheduling requirements cannot be met, none of the pods in the group is scheduled. Gang scheduling provides a solution to job scheduling in all-or-nothing scenarios. It is suitable for distributed big data computing frameworks, such as Spark and Hadoop, whose jobs require resources to be scheduled for all of their tasks at the same time. This topic describes how to enable gang scheduling.

Usage notes

Make sure that the resource capacity of the elastic node pool that you use and the node labels meet the requirements for pod scheduling. Otherwise, pods may fail to be scheduled to the nodes in the node pool.

While Kubernetes excels at orchestrating online services, ACK aims to extend it into a unified platform for both online and offline workloads to improve cluster utilization and efficiency. However, the default scheduler's limitations can prevent certain offline jobs from running effectively, especially those requiring all-or-nothing execution.

In these scenarios, all tasks of a job must be scheduled simultaneously. If only a portion of the tasks are launched, they will sit idle while consuming resources, waiting for the rest to be scheduled. This can lead to resource wastage and, in worst-case scenarios, a cluster-wide deadlock where no jobs can proceed. This is where gang scheduling becomes essential.

Gang scheduling is a policy that ensures all pods in a predefined group are launched at the same time. Its all-or-nothing principle prevents the partial execution issues that cause deadlocks. A job will only start once enough resources are available for all its required members; otherwise, the entire group waits.

ACK implements gang scheduling through a concept called a PodGroup. When submitting a job, you define its membership in a PodGroup using labels and specify the minimum number of pods required for the job to start. The scheduler will only dispatch the entire group of pods once it can satisfy this minimum requirement. Until then, all pods in the group will remain in the pending state, preventing them from consuming resources prematurely.

How to enable gang scheduling

  1. To enable gang scheduling, add the name and min-available labels to the pods. When this method is used, kube-scheduler automatically creates a PodGroup named after the value of the pod-group.scheduling.sigs.k8s.io/name label. You must set the value of pod-group.scheduling.sigs.k8s.io/name to a valid subdomain name. For more information, see Object names and IDs. For a complete workload manifest that uses these labels, see the sketch that follows this list.

    labels:
        pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
        pod-group.scheduling.sigs.k8s.io/min-available: "3"
    • name: the name of a PodGroup.

    • min-available: the minimum number of pods that must be scheduled to run a job.

  2. Alternatively, in clusters that run Kubernetes 1.22 or later where the kube-scheduler version is later than 1.xx.xx-aliyun-4.0, you can use one of the following methods to enable gang scheduling.

    • Create a PodGroup and use the pod-group.scheduling.sigs.k8s.io or pod-group.scheduling.sigs.k8s.io/name label to specify the PodGroup to which your pods belong. The pods and the PodGroup must belong to the same namespace.

      Important

      Since version 1.31, ACK no longer supports the PodGroup resource of version scheduling.sigs.k8s.io/v1alpha1. It supports only the PodGroup resource of version scheduling.x-k8s.io/v1alpha1.

      # PodGroup CRD spec
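      # For clusters that run Kubernetes 1.31 or later, use apiVersion: scheduling.x-k8s.io/v1alpha1 instead.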
      apiVersion: scheduling.sigs.k8s.io/v1alpha1
      kind: PodGroup
      metadata:
        name: nginx
      spec:
        scheduleTimeoutSeconds: 10
        minMember: 3
      ---
      # Add the pod-group.scheduling.sigs.k8s.io/name label to the pods.
      labels:
        pod-group.scheduling.sigs.k8s.io/name: nginx
    • Add the gang.scheduling.koordinator.sh/name and gang.scheduling.koordinator.sh/min-available annotations to the configurations of the pods that you want to manage. The total-number and mode parameters in the Koordinator API are not supported.

      annotations:
        gang.scheduling.koordinator.sh/name: "gang-example"
        gang.scheduling.koordinator.sh/min-available: "2"
Note

Pods that belong to the same PodGroup must be assigned the same priority.
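
As a concrete illustration of the label-based method in step 1, the following sketch shows a hypothetical batch Job whose three pods must be scheduled in one batch. The Job name, image, command, and resource request are assumptions for illustration; kube-scheduler automatically creates a PodGroup named gang-demo for these pods.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: gang-demo
    spec:
      parallelism: 3
      completions: 3
      template:
        metadata:
          labels:
            # All three pods join the same PodGroup and are dispatched only when
            # at least three of them can be scheduled at the same time.
            pod-group.scheduling.sigs.k8s.io/name: gang-demo
            pod-group.scheduling.sigs.k8s.io/min-available: "3"
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: busybox                         # illustrative image
            command: ["sh", "-c", "sleep 30"]
            resources:
              requests:
                cpu: "1"

If fewer than three pods can be placed, all three pods remain in the Pending state instead of partially starting.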

Advanced gang scheduling configurations

Limits

To use advanced gang scheduling configurations in clusters that run Kubernetes 1.22 or later, the kube-scheduler version must be later than 1.xx.xx-aliyun-4.0.

Declare a GangGroup

When you use gang scheduling, a job may contain multiple roles that require different min-available values. For example, a PyTorch training job may use parameter servers and workers. If you use a single PodGroup to manage the pods of all roles, you cannot declare a separate min-available requirement for each role. If you create a separate PodGroup for each role, the pods of the different roles are no longer scheduled in one batch. To resolve this issue, we recommend that you use the GangGroup feature to manage multiple gangs as one group: the job can run only when the number of scheduled pods reaches the min-available value of every gang in the group. This ensures that the min-available requirement of each role is met. A sketch of a GangGroup that spans two PodGroups is provided after the following list.

  • If you use labels to enable gang scheduling, add the following label to the configurations of the pod:

    pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
  • If you use PodGroups to enable gang scheduling, add the following label to the configurations of the PodGroups:

    pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
  • If you use annotations to enable gang scheduling, add the following label to the configurations of the pod:

    gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
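
The following sketch shows what a GangGroup that spans two PodGroups might look like when you use the PodGroup method. The PodGroup names gang-example1 and gang-example2 come from the snippets above; the minMember values and the role split are illustrative assumptions. In clusters that run Kubernetes 1.31 or later, use apiVersion: scheduling.x-k8s.io/v1alpha1 instead.

    # Two PodGroups that form one GangGroup. The pods of both PodGroups are
    # dispatched only when the minMember requirement of every PodGroup in the
    # group can be satisfied.
    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: gang-example1
      namespace: default
      labels:
        pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
    spec:
      minMember: 1    # for example, the parameter server role
    ---
    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: gang-example2
      namespace: default
      labels:
        pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
    spec:
      minMember: 4    # for example, the worker role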

Declare a match policy

When you enable gang scheduling, you can declare a match policy to specify which pods, based on their state, a PodGroup counts toward its min-available requirement. For an example, see the sketch that follows the list of match methods.

  • If you use labels to enable gang scheduling, add the following label to the configurations of the pod:

    pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
  • If you use PodGroups to enable gang scheduling, add the following label to the configurations of the PodGroups:

    pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
  • If you use annotations to enable gang scheduling, only the once-satisfied match method is supported.

The following match methods are supported:

  • only-waiting: Only pods that have completed resource preallocation are matched.

  • waiting-and-running: Pods in the Running state and pods that have completed resource preallocation are matched.

  • waiting-running-succeed: Pods in the Succeeded state, pods in the Running state, and pods that have completed resource preallocation are matched.

  • once-satisfied: Only pods that have completed resource preallocation are matched. The PodGroup becomes invalid after the pods are matched.
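
For example, assuming that the match-policy label is set in the metadata.labels of the PodGroup, a PodGroup that also counts Running pods toward its minMember requirement might look like the following sketch. The PodGroup name nginx is reused from the earlier example.

    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: nginx
      labels:
        # Count Running pods in addition to pods that have completed resource preallocation.
        pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
    spec:
      minMember: 3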

Examples

In this example, a distributed TensorFlow job is used to demonstrate how to enable gang scheduling. The ACK cluster that is used in this example has four GPUs.

  1. Install Arena and deploy an environment in your cluster to run TensorFlow jobs. For more information, see Install Arena.

    Note

    Arena is a subproject of Kubeflow. Kubeflow is an open source project for Kubernetes-based machine learning. Arena allows you to manage the lifecycle of machine learning jobs by using a CLI or SDK. Lifecycle management includes environment setup, data preparation, model development, model training, and model prediction. This improves the working efficiency of data scientists.

  2. Use the following template to submit a distributed TensorFlow job to the ACK cluster. The job runs in one parameter server (PS) pod and four worker pods. Each worker pod requires two GPUs.


    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "tf-smoke-gpu"
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1
          template:
            metadata:
              creationTimestamp: null
              labels:
                pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
                pod-group.scheduling.sigs.k8s.io/min-available: "5"
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
                image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources:
                  limits:
                    cpu: '1'
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
        Worker:
          replicas: 4
          template:
            metadata:
              creationTimestamp: null
              labels:
                pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
                pod-group.scheduling.sigs.k8s.io/min-available: "5"
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
                image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources:
                  limits:
                    nvidia.com/gpu: 2
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
    • Submit the distributed TensorFlow job without enabling gang scheduling

      Run the following command to query the status of the pods that run the TensorFlow job:

      kubectl get pods

      The output shows that only two worker pods are running and the other worker pods are in the Pending state.

      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
      tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
      tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
      tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
      tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s

      Run the following command to query the log data of a running worker pod:

      kubectl logs -f tf-smoke-gpu-worker-0

      The returned log data indicates that the two running worker pods have started and are waiting for the pending worker pods. In the meantime, the GPU resources occupied by the running worker pods remain idle.

      INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
      INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
    • Submit the distributed TensorFlow job with gang scheduling enabled

      Run the following command to query the status of the pods that run the TensorFlow job:

      kubectl get pods

      The computing resources in the cluster are insufficient to schedule the minimum number of pods. Therefore, the PodGroup cannot be scheduled and all pods are in the Pending state.

      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       0/1     Pending   0          43s
      tf-smoke-gpu-worker-0   0/1     Pending   0          43s
      tf-smoke-gpu-worker-1   0/1     Pending   0          43s
      tf-smoke-gpu-worker-2   0/1     Pending   0          43s
      tf-smoke-gpu-worker-3   0/1     Pending   0          43s

      After four more GPUs are added to the cluster, the computing resources in the cluster are sufficient to schedule the minimum number of pods. The PodGroup is scheduled and the four worker pods start to run. Run the following command to query the status of the pods that run the TensorFlow job:

      kubectl get pods

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
      tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-3   1/1     Running   0          3m16s

      Run the following command to query the log data of a running worker pod:

      kubectl logs -f tf-smoke-gpu-worker-0

      The following output indicates that the training job has started.

      INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
      INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
      INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step  Img/sec loss
      INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1 images/sec: 31.6 +/- 0.0 (jitter = 0.0) 8.318
      INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10  images/sec: 31.1 +/- 0.4 (jitter = 0.7) 8.343
      INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20  images/sec: 31.5 +/- 0.3 (jitter = 0.7) 8.142

Error messages

  • Error message: "rejected by podgroup xxx".

  • Possible causes: If a cluster contains multiple PodGroups, the pods in a PodGroup may fail to be scheduled at the same time because of the backoff queue of kube-scheduler. In this case, pods that have completed resource preallocation may be rejected when the system schedules the pods of subsequent PodGroups. You can ignore the error if the situation lasts no more than 20 minutes. If it lasts longer than 20 minutes, submit a ticket.

References

  • For more information about release notes for kube-scheduler, see kube-scheduler.

  • Kubernetes uses the ResourceQuota object to allocate resources statically. Static allocation does not ensure high resource utilization in Kubernetes clusters. To improve the resource utilization of ACK clusters, Alibaba Cloud has developed the capacity scheduling feature based on the YARN capacity scheduler and the Kubernetes scheduling framework. This feature uses elastic quota groups to meet resource requests in an ACK cluster and allows idle resources to be shared, which improves resource utilization. For more information, see Work with capacity scheduling.