Distributed jobs like Apache Spark, Apache Hadoop, and TensorFlow training require all their pods to run simultaneously. Without gang scheduling, some pods start and hold resources while waiting for the rest, which can cause a cluster-wide deadlock where no job makes progress. Gang scheduling solves this with an all-or-nothing guarantee: either all required pods are dispatched together, or none are. This topic describes how to enable and configure gang scheduling on ACK.
How it works
ACK implements gang scheduling through a PodGroup resource. Each pod in a distributed job is assigned to a PodGroup and specifies a minimum pod count (min-available). The scheduler holds the entire group in Pending until it can satisfy the minimum. Once satisfied, all pods in the group are dispatched together.
ACK supports three methods to define a PodGroup:
| Method | How pods are grouped | Best for |
|---|---|---|
| Labels | kube-scheduler auto-creates a PodGroup | Simple jobs, no separate PodGroup object needed |
| PodGroup CRD | Explicit PodGroup resource with timeout control | Jobs that need scheduleTimeoutSeconds |
| Koordinator annotations | Annotation-based grouping | Clusters using the Koordinator scheduling stack |
All three methods require the pods and their PodGroup to be in the same namespace. All pods in a PodGroup must share the same priority.
Prerequisites
Before you begin, ensure that you have:
- An ACK managed Pro cluster running Kubernetes 1.16 or later. Upgrade the cluster if needed.
- For advanced configurations (GangGroup and match policy): a cluster running Kubernetes 1.22 or later with kube-scheduler version later than 1.xx.xx-aliyun-4.0.
- An elastic node pool whose resource capacity and node labels meet the pods' scheduling requirements. Otherwise, pods may fail to be scheduled to nodes in the node pool.
Enable gang scheduling
Method 1: Labels (recommended)
Add two labels to each pod. kube-scheduler automatically creates a PodGroup named after the pod-group.scheduling.sigs.k8s.io/name value.
```yaml
labels:
  pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu   # PodGroup name (must be a valid DNS subdomain)
  pod-group.scheduling.sigs.k8s.io/min-available: "3"   # Minimum pods required to start the job
```
The value of pod-group.scheduling.sigs.k8s.io/name must be a valid DNS subdomain name. For naming rules, see Object names and IDs.
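For context, a complete Pod carrying these labels might look like the following minimal sketch (the pod name, image, and resource request are placeholders, not values from this guide):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0                                       # illustrative name
  labels:
    pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu   # all pods in the job share this name
    pod-group.scheduling.sigs.k8s.io/min-available: "3"   # same value on every pod in the group
spec:
  containers:
  - name: worker
    image: nginx:1.25                                     # placeholder image
    resources:
      requests:
        cpu: "1"
```

Every pod in the job must carry the same name and min-available values; kube-scheduler groups them by the name label and holds them all until the minimum is met.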
Method 2: PodGroup CRD
Create a PodGroup resource explicitly, then reference it from each pod using a label.
Since ACK version 1.31, only the scheduling.x-k8s.io/v1alpha1 API version is supported. The scheduling.sigs.k8s.io/v1alpha1 version is no longer supported.
```yaml
# PodGroup resource
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
spec:
  scheduleTimeoutSeconds: 10   # Seconds to wait before rejecting the group if min-available is not met
  minMember: 3                 # Minimum number of pods required to start the job
---
# Pod label (must match the PodGroup name and namespace)
labels:
  pod-group.scheduling.sigs.k8s.io/name: nginx
```
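As a fuller sketch of how the pieces fit together, the PodGroup can be paired with a workload whose pod template carries the matching label. The Deployment name, image, and resource request below are illustrative:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
  namespace: default                # pods must be created in the same namespace
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        pod-group.scheduling.sigs.k8s.io/name: nginx   # must match the PodGroup name
    spec:
      containers:
      - name: nginx
        image: nginx:1.25            # placeholder image
        resources:
          requests:
            cpu: "500m"
```

With this configuration, none of the three replicas is dispatched until the scheduler can place all three at once.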
Method 3: Koordinator annotations
Add annotations to each pod. This method does not support the total-number or mode parameters from the Koordinator API.
```yaml
annotations:
  gang.scheduling.koordinator.sh/name: "gang-example"
  gang.scheduling.koordinator.sh/min-available: "2"
```
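A minimal Pod sketch using these annotations (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gang-example-pod-0                              # illustrative name
  annotations:
    gang.scheduling.koordinator.sh/name: "gang-example" # gang name shared by all pods in the job
    gang.scheduling.koordinator.sh/min-available: "2"   # same value on every pod in the gang
spec:
  containers:
  - name: main
    image: nginx:1.25                                   # placeholder image
```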
Advanced configurations
Group multiple gangs (GangGroup)
Some jobs use different roles with separate min-available requirements — for example, PyTorch jobs with a parameter server and workers. A single PodGroup cannot express per-role minimums, and separate PodGroups cannot coordinate scheduling across roles.
GangGroup solves this by linking multiple PodGroups. The job only starts when every gang in the group satisfies its own min-available. Add the following label to each pod or PodGroup (use the annotation key for the Koordinator method):
| Method | Resource | Key |
|---|---|---|
| Labels | Pod | pod-group.scheduling.sigs.k8s.io/groups |
| PodGroup CRD | PodGroup | pod-group.scheduling.sigs.k8s.io/groups |
| Koordinator annotations | Pod | gang.scheduling.koordinator.sh/groups |
Example value (JSON array of <namespace>/<gang-name> entries):
```yaml
pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
```
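As a sketch of the PS/worker scenario described above, two linked PodGroups could be declared as follows. The names and minMember values are illustrative:

```yaml
# Gang for the parameter-server role
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
spec:
  minMember: 1        # e.g. one parameter server
---
# Gang for the worker role
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
spec:
  minMember: 4        # e.g. four workers
```

Both PodGroups list the full group in their groups label, so neither gang is dispatched until both minimums can be satisfied simultaneously.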
GangGroup requires Kubernetes 1.22 or later with kube-scheduler version later than 1.xx.xx-aliyun-4.0.
Configure match policy
By default, a PodGroup counts only pods that have completed resource preallocation (only-waiting). Use match-policy to include pods in other states toward the minimum count — for example, when some pods from a previous scheduling cycle are still running and should count toward the minimum.
Add the label to each pod (labels method) or to the PodGroup resource (PodGroup CRD method). The Koordinator annotations method only supports once-satisfied.
| Match policy | Pods counted toward min-available | When to use |
|---|---|---|
| only-waiting | Pods that completed resource preallocation | Strictest: prevents any already-running pod from counting toward a new scheduling cycle. Use for stateless jobs with no carry-over from prior cycles. |
| waiting-and-running | Pods in Running state + pods that completed preallocation | Use when some pods from a previous cycle are still running and should count toward the minimum. Reduces the risk of idle resource hold caused by over-strict counting. |
| waiting-running-succeed | Pods in Succeeded state + Running + completed preallocation | Use for jobs that tolerate partial restarts; already-succeeded pods still count. Avoids re-scheduling pods that have already completed. |
| once-satisfied | Pods that completed resource preallocation; PodGroup becomes invalid once satisfied | Use for one-shot jobs. Once the gang is dispatched, the PodGroup is invalidated. |
Labels-based example:
```yaml
pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
```
PodGroup CRD example (add to the PodGroup, not the pod):
```yaml
pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
```
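In context, the match-policy label sits in the PodGroup's metadata.labels. A minimal sketch, reusing the illustrative nginx PodGroup values from the earlier example:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
  labels:
    pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"  # count Running pods too
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3
```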
Match policy configuration requires Kubernetes 1.22 or later with kube-scheduler version later than 1.xx.xx-aliyun-4.0.
Example: distributed TensorFlow job
This example shows the difference between running a distributed TensorFlow job with and without gang scheduling. The cluster has 4 GPUs. The job runs 1 parameter server (PS) pod and 4 worker pods; each worker requires 2 GPUs, with a minimum of 5 pods.
Step 1: Install Arena and prepare the cluster to run TensorFlow jobs. For setup instructions, see Install Arena.
Arena is a Kubeflow subproject that manages the lifecycle of machine learning jobs — including environment setup, data preparation, model development, model training, and prediction — through a CLI or SDK.
Step 2: Submit the TensorFlow job using the following manifest. Both the PS and worker templates include the gang scheduling labels with min-available: "5".
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '1'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
```
Without gang scheduling enabled:
Run the following command to check pod status:
```shell
kubectl get pods
```
With only 4 GPUs available, 2 worker pods start Running and claim all GPUs, while the remaining 2 workers stay Pending. The running workers are blocked waiting for the others:
```
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s
```
Check a running worker's logs:
```shell
kubectl logs -f tf-smoke-gpu-worker-0
```
The log shows the workers are stalled waiting for the Pending pods — GPUs are held but no training runs:
```
INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
```
With gang scheduling enabled:
All 5 pods remain Pending until the cluster has enough resources to satisfy min-available: 5:
```
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       0/1     Pending   0          43s
tf-smoke-gpu-worker-0   0/1     Pending   0          43s
tf-smoke-gpu-worker-1   0/1     Pending   0          43s
tf-smoke-gpu-worker-2   0/1     Pending   0          43s
tf-smoke-gpu-worker-3   0/1     Pending   0          43s
```
After 4 GPUs are added to the cluster, the scheduler dispatches all 5 pods simultaneously:
```shell
kubectl get pods
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
tf-smoke-gpu-worker-3   1/1     Running   0          3m16s
```
Check the worker log to confirm training has started:
```shell
kubectl logs -f tf-smoke-gpu-worker-0
```
Expected output:
```
INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step  Img/sec  loss
INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1     images/sec: 31.6 +/- 0.0 (jitter = 0.0)  8.318
INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10    images/sec: 31.1 +/- 0.4 (jitter = 0.7)  8.343
INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20    images/sec: 31.5 +/- 0.3 (jitter = 0.7)  8.142
```
Troubleshooting
Error: "rejected by podgroup xxx"
When multiple PodGroups exist in a cluster, the kube-scheduler's backoff queue can cause pods that completed resource preallocation in one scheduling cycle to be rejected when later PodGroups are processed.
This is expected behavior. You can ignore the error if the situation lasts no more than 20 minutes. If the error persists beyond 20 minutes, submit a ticket.
What's next
- Work with capacity scheduling: use elastic quota groups to improve cluster resource utilization.