
Container Service for Kubernetes: Use gang scheduling

Last Updated: Mar 26, 2026

Distributed jobs, such as Apache Spark and Apache Hadoop batch jobs or TensorFlow training jobs, require all of their pods to run at the same time. Without gang scheduling, some pods may start and hold resources while they wait for the rest, which can cause a cluster-wide deadlock in which no job makes progress. Gang scheduling solves this with an all-or-nothing guarantee: either all required pods are scheduled together, or none are. This topic describes how to enable and configure gang scheduling in ACK.

How it works

ACK implements gang scheduling through a PodGroup resource. Each pod in a distributed job is assigned to a PodGroup and specifies a minimum pod count (min-available). The scheduler holds the entire group in Pending until it can satisfy the minimum. Once satisfied, all pods in the group are dispatched together.

ACK supports three methods to define a PodGroup:

| Method | How pods are grouped | Best for |
| --- | --- | --- |
| Labels | kube-scheduler auto-creates a PodGroup | Simple jobs that do not need a separate PodGroup object |
| PodGroup CRD | Explicit PodGroup resource with timeout control | Jobs that need scheduleTimeoutSeconds |
| Koordinator annotations | Annotation-based grouping | Clusters that use the Koordinator scheduling stack |

All three methods require the pods and their PodGroup to be in the same namespace. All pods in a PodGroup must share the same priority.

Prerequisites

Before you begin, ensure that you have:

  • An ACK managed Pro cluster running Kubernetes 1.16 or later. Upgrade the cluster if needed.

  • For advanced configurations (GangGroup and match policy): a cluster running Kubernetes 1.22 or later with kube-scheduler version later than 1.xx.xx-aliyun-4.0.

If you schedule pods to an elastic node pool, make sure that the resource capacity and node labels of the node pool meet the pods' scheduling requirements. Otherwise, the pods may fail to be scheduled to nodes in the node pool.

Enable gang scheduling

Method 1: Labels (recommended)

Add two labels to each pod. kube-scheduler automatically creates a PodGroup named after the pod-group.scheduling.sigs.k8s.io/name value.

labels:
  pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu       # PodGroup name (must be a valid DNS subdomain)
  pod-group.scheduling.sigs.k8s.io/min-available: "3"       # Minimum pods required to start the job

The value of pod-group.scheduling.sigs.k8s.io/name must be a valid DNS subdomain name. For naming rules, see Object names and IDs.
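For context, the following is a minimal sketch of a pod that joins a gang through the labels method. The pod name and image are placeholders, not part of the original example:

```yaml
# Hypothetical worker pod. kube-scheduler creates the PodGroup
# "tf-smoke-gpu" automatically when the first labeled pod is submitted.
apiVersion: v1
kind: Pod
metadata:
  name: worker-0                                          # placeholder name
  labels:
    pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
    pod-group.scheduling.sigs.k8s.io/min-available: "3"
spec:
  containers:
  - name: worker
    image: nginx:latest                                   # placeholder image
    resources:
      limits:
        cpu: "1"
```

The pod stays Pending until at least 3 pods carrying the same PodGroup name can be scheduled together.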

Method 2: PodGroup CRD

Create a PodGroup resource explicitly, then reference it from each pod using a label.

Important

Since ACK version 1.31, only the scheduling.x-k8s.io/v1alpha1 API version is supported. The scheduling.sigs.k8s.io/v1alpha1 version is no longer supported.

# PodGroup resource
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
spec:
  scheduleTimeoutSeconds: 10   # Seconds to wait before rejecting the group if min-available is not met
  minMember: 3                 # Minimum number of pods required to start the job
---
# Pod label — must match the PodGroup name and namespace
labels:
  pod-group.scheduling.sigs.k8s.io/name: nginx
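As an illustration, the following sketch shows a Deployment whose pods reference the PodGroup above. The app label and image are placeholder assumptions; replicas is set to at least minMember so the gang can be satisfied:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3                      # at least minMember (3) pods, so the gang can be satisfied
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        pod-group.scheduling.sigs.k8s.io/name: nginx   # must match the PodGroup name
    spec:
      containers:
      - name: nginx
        image: nginx:latest        # placeholder image
```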

Method 3: Koordinator annotations

Add annotations to each pod. This method does not support the total-number or mode parameters from the Koordinator API.

annotations:
  gang.scheduling.koordinator.sh/name: "gang-example"
  gang.scheduling.koordinator.sh/min-available: "2"
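A minimal pod sketch for the Koordinator method follows. The pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gang-example-pod-0                           # placeholder name
  annotations:
    gang.scheduling.koordinator.sh/name: "gang-example"
    gang.scheduling.koordinator.sh/min-available: "2"
spec:
  containers:
  - name: main
    image: nginx:latest                              # placeholder image
```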

Advanced configurations

Group multiple gangs (GangGroup)

Some jobs use different roles with separate min-available requirements — for example, PyTorch jobs with a parameter server and workers. A single PodGroup cannot express per-role minimums, and separate PodGroups cannot coordinate scheduling across roles.

GangGroup solves this by linking multiple PodGroups. The job only starts when every gang in the group satisfies its own min-available. Add the following label to each pod or PodGroup (use the annotation key for the Koordinator method):

| Method | Resource | Key |
| --- | --- | --- |
| Labels | Pod | pod-group.scheduling.sigs.k8s.io/groups |
| PodGroup CRD | PodGroup | pod-group.scheduling.sigs.k8s.io/groups |
| Koordinator annotations | Pod | gang.scheduling.koordinator.sh/groups |

Example value (JSON array of <namespace>/<gang-name> entries):

pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"

GangGroup requires Kubernetes 1.22 or later with a kube-scheduler version later than 1.xx.xx-aliyun-4.0.
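For example, two PodGroups for a hypothetical PS/worker job can be linked into one GangGroup by attaching the same groups label to both. The role assignments and minMember values here are illustrative assumptions:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example1                # e.g. the parameter server role
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
spec:
  minMember: 1
---
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example2                # e.g. the worker role
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
spec:
  minMember: 4
```

Neither gang is dispatched until both minMember requirements can be met at the same time.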

Configure match policy

By default, a PodGroup counts only pods that have completed resource preallocation (only-waiting). Use match-policy to include pods in other states toward the minimum count — for example, when some pods from a previous scheduling cycle are still running and should count toward the minimum.

Add the label to each pod (labels method) or to the PodGroup resource (PodGroup CRD method). The Koordinator annotations method only supports once-satisfied.

Match policy Pods counted toward min-available When to use
only-waiting Pods that completed resource preallocation Strictest — prevents any already-running pod from counting toward a new scheduling cycle. Use for stateless jobs with no carry-over from prior cycles.
waiting-and-running Pods in Running state + pods that completed preallocation Use when some pods from a previous cycle are still running and should count toward the minimum. Reduces the risk of idle resource hold caused by over-strict counting.
waiting-running-succeed Pods in Succeeded state + Running + completed preallocation Use for jobs that tolerate partial restarts — already-succeeded pods still count. Avoids re-scheduling pods that have already completed.
once-satisfied Pods that completed resource preallocation; PodGroup becomes invalid once satisfied Use for one-shot jobs. Once the gang is dispatched, the PodGroup is invalidated.

Labels-based example:

pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"

PodGroup CRD example (add the label to the PodGroup's metadata, not to the pod):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
  labels:
    pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"

Match policy configuration requires Kubernetes 1.22 or later with a kube-scheduler version later than 1.xx.xx-aliyun-4.0.

Example: distributed TensorFlow job

This example shows the difference between running a distributed TensorFlow job with and without gang scheduling. The cluster has 4 GPUs. The job runs 1 parameter server (PS) pod and 4 worker pods; each worker requires 2 GPUs, with a minimum of 5 pods.

Step 1: Install Arena and prepare the cluster to run TensorFlow jobs. For setup instructions, see Install Arena.

Arena is a Kubeflow subproject that manages the lifecycle of machine learning jobs — including environment setup, data preparation, model development, model training, and prediction — through a CLI or SDK.

Step 2: Submit the TensorFlow job using the following manifest. Both the PS and worker templates include the gang scheduling labels with min-available: "5".

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '1'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
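Save the manifest to a file and apply it. The filename below is arbitrary:

```shell
kubectl apply -f tf-smoke-gpu.yaml
```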

Without gang scheduling enabled:

Run the following command to check pod status:

kubectl get pods

With only 4 GPUs available, 2 worker pods start Running and claim all GPUs, while the remaining 2 workers stay Pending. The running workers are blocked waiting for the others:

NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s

Check a running worker's logs:

kubectl logs -f tf-smoke-gpu-worker-0

The log shows the workers are stalled waiting for the Pending pods — GPUs are held but no training runs:

INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2

With gang scheduling enabled:

All 5 pods remain Pending until the cluster has enough resources to satisfy min-available: 5:

NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       0/1     Pending   0          43s
tf-smoke-gpu-worker-0   0/1     Pending   0          43s
tf-smoke-gpu-worker-1   0/1     Pending   0          43s
tf-smoke-gpu-worker-2   0/1     Pending   0          43s
tf-smoke-gpu-worker-3   0/1     Pending   0          43s

After 4 GPUs are added to the cluster, the scheduler dispatches all 5 pods simultaneously:

kubectl get pods

Expected output:

NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
tf-smoke-gpu-worker-3   1/1     Running   0          3m16s

Check the worker log to confirm training has started:

kubectl logs -f tf-smoke-gpu-worker-0

Expected output:

INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step  Img/sec loss
INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1 images/sec: 31.6 +/- 0.0 (jitter = 0.0) 8.318
INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10  images/sec: 31.1 +/- 0.4 (jitter = 0.7) 8.343
INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20  images/sec: 31.5 +/- 0.3 (jitter = 0.7) 8.142

Troubleshooting

Error: "rejected by podgroup xxx"

When multiple PodGroups exist in a cluster, the kube-scheduler's backoff queue can cause pods that completed resource preallocation in one scheduling cycle to be rejected when later PodGroups are processed.

This is expected behavior. You can ignore the error if the situation lasts no more than 20 minutes. If the error persists beyond 20 minutes, submit a ticket.
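To see why a specific pod is being held or rejected by its gang, inspect the pod's events. The pod name below is taken from the example above; substitute your own:

```shell
kubectl describe pod tf-smoke-gpu-worker-0
```

The Events section at the end of the output shows scheduler messages, including gang rejection reasons.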

What's next