Gang scheduling provides all-or-nothing scheduling for multi-pod jobs in Alibaba Cloud Container Compute Service (ACS). The scheduler holds all pods until it can place the minimum required number simultaneously, preventing resource deadlocks in distributed workloads such as AI training jobs, MPI tasks, and multi-role inference pipelines.
## How gang scheduling works
When a job creates multiple pods, all pods must start together. Gang scheduling ensures that resources are allocated to the entire group at once — if the minimum number of pods cannot be scheduled simultaneously, none of the pods are scheduled. This prevents resource deadlocks caused by jobs partially acquiring resources and blocking each other.
Gang scheduling in ACS is implemented using the PodGroup custom resource (podgroups.scheduling.sigs.k8s.io/v1alpha1). You create a PodGroup to define the group constraint, then associate job pods with it using a label.
All pods configured for gang scheduling must belong to the same compute class.
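Conceptually, the configuration is just two pieces linked by a label. The following is a minimal sketch, using the resource names from the steps later in this topic:

```yaml
# The PodGroup defines the all-or-nothing constraint.
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: demo-job-podgroup
spec:
  minMember: 3   # Schedule at least 3 pods together, or none at all.
---
# Pod template fragment: this label ties each pod to the PodGroup above.
metadata:
  labels:
    pod-group.scheduling.sigs.k8s.io: demo-job-podgroup
```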
## Prerequisites
- kube-scheduler is installed and its version meets the following requirements:

  | ACS cluster version | Scheduler component version |
  | --- | --- |
  | 1.31 | v1.31.0-aliyun-1.2.0 and later |
  | 1.30 | v1.30.3-aliyun-1.1.1 and later |
  | 1.28 | v1.28.9-aliyun-1.1.0 and later |

- Gang scheduling supports only the high-performance network GPU (gpu-hpn) compute type. For more information, see Definition of computing types.
- The Enable Custom Labels And Schedulers For GPU-HPN Nodes setting is disabled. For more information, see Component configuration.
## Configure gang scheduling
1. Create a PodGroup custom resource. The `minMember` field sets the minimum number of pods that must be scheduled simultaneously. The `scheduleTimeoutSeconds` field sets how long the scheduler waits before marking the scheduling attempt as failed.

   ```yaml
   apiVersion: scheduling.sigs.k8s.io/v1alpha1
   kind: PodGroup
   metadata:
     name: demo-job-podgroup
     namespace: default
   spec:
     scheduleTimeoutSeconds: 10
     minMember: 3   # Set the minimum number of running pods.
   ```

2. Create a Job and associate it with the PodGroup. Save the following content to `gang-job.yaml`. The label `pod-group.scheduling.sigs.k8s.io: demo-job-podgroup` on the pod template associates every pod with the named PodGroup.

   ```yaml
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: gang-job
     namespace: default
   spec:
     parallelism: 3   # The number of pods must be greater than or equal to minMember in the PodGroup object.
     template:
       metadata:
         labels:
           alibabacloud.com/compute-class: "gpu-hpn"             # Specify the compute class as gpu-hpn.
           alibabacloud.com/gpu-model-series: "example-model"    # A GPU model must be specified for the GPU compute class.
           pod-group.scheduling.sigs.k8s.io: demo-job-podgroup   # Associate with the demo-job-podgroup PodGroup instance.
       spec:
         containers:
           - name: demo-job
             image: registry.cn-hangzhou.aliyuncs.com/acs/stress:v1.0.4
             command:
               - sleep
             args:
               - 'infinity'
             resources:
               requests:
                 cpu: "1"
                 memory: "1Gi"
                 nvidia.com/gpu: "1"
               limits:
                 cpu: "1"
                 memory: "1Gi"
                 nvidia.com/gpu: "1"
         restartPolicy: Never
     backoffLimit: 4
   ```

3. Deploy the job to the cluster.

   ```shell
   kubectl apply -f gang-job.yaml
   ```

4. Verify that the pods are scheduled. When scheduling succeeds, all pods transition from the Pending state to the Running state simultaneously.

   ```shell
   kubectl get podgroup -n default
   kubectl get pods -n default -l pod-group.scheduling.sigs.k8s.io=demo-job-podgroup
   ```
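As an optional check, `kubectl wait` can block until every pod in the group is Ready. This sketch assumes the default namespace and the pod label used above; the command times out with a nonzero exit code if the group cannot be scheduled:

```shell
# Wait for all pods in the gang to become Ready, or fail after 120 seconds.
kubectl wait --for=condition=Ready pod \
  -l pod-group.scheduling.sigs.k8s.io=demo-job-podgroup \
  -n default --timeout=120s
```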
Make sure that the number of associated pods is greater than or equal to the `minMember` value configured for the PodGroup instance. Otherwise, the pods cannot be scheduled.
## Examples
This example demonstrates both successful and failed scheduling outcomes when you use gang scheduling for a job.
1. Run the following command to create the `test-gang` namespace.

   ```shell
   kubectl create ns test-gang
   ```

2. Run the following command to create a ResourceQuota in the `test-gang` namespace to demonstrate how gang scheduling behaves when resources are insufficient.

   ```shell
   cat << EOF | kubectl apply -f -
   apiVersion: v1
   kind: ResourceQuota
   metadata:
     name: object-counts
     namespace: test-gang
   spec:
     hard:
       pods: "2"
   EOF
   ```

3. Run the following command to create a PodGroup object. In the object, `minMember` is set to 3, which specifies that at least 3 associated pods must be scheduled successfully at the same time. If any pod in the group fails to be created or scheduled, all pods in the group remain in the Pending state.

   ```shell
   cat << EOF | kubectl apply -f -
   apiVersion: scheduling.sigs.k8s.io/v1alpha1
   kind: PodGroup
   metadata:
     name: demo-job-podgroup
     namespace: test-gang
   spec:
     minMember: 3   # Set the minimum number of running pods.
   EOF
   ```

4. Use the following YAML content to create a gang-job.yaml file. This file defines a Job object that creates four pod replicas and associates them with the PodGroup object.

   ```yaml
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: gang-job
     namespace: test-gang
   spec:
     parallelism: 4   # The number of pods must be greater than or equal to minMember in the PodGroup object.
     template:
       metadata:
         labels:
           alibabacloud.com/compute-class: "gpu-hpn"             # Specify the compute class as gpu-hpn.
           alibabacloud.com/gpu-model-series: "example-model"    # A GPU model must be specified for the GPU compute class.
           pod-group.scheduling.sigs.k8s.io: demo-job-podgroup   # Associate with the demo-job-podgroup PodGroup instance.
       spec:
         containers:
           - name: demo-job
             image: registry.cn-hangzhou.aliyuncs.com/acs/stress:v1.0.4
             command:
               - sleep
             args:
               - 'infinity'
             resources:
               requests:
                 cpu: "1"
                 memory: "1Gi"
                 nvidia.com/gpu: "1"
               limits:
                 cpu: "1"
                 memory: "1Gi"
                 nvidia.com/gpu: "1"
         restartPolicy: Never
     backoffLimit: 4
   ```

5. Run the following command to deploy the gang-job job to the cluster.

   ```shell
   kubectl apply -f gang-job.yaml
   ```

6. Run the following command to view the pod status.

   ```shell
   kubectl get pod -n test-gang
   ```

   Expected output:

   ```
   NAME             READY   STATUS    RESTARTS   AGE
   gang-job-hrnc6   0/1     Pending   0          23s
   gang-job-wthnq   0/1     Pending   0          23s
   ```

   The ResourceQuota limits the namespace to two pods, so only two of the job's four pods are created. Because two is less than the `minMember` value specified in the PodGroup, both pods remain in the Pending state and are not scheduled.

7. Run the following command to delete the ResourceQuota and remove the limit on the number of pods.

   ```shell
   kubectl delete resourcequota -n test-gang object-counts
   ```

8. Run the following command to view the pod status again.

   ```shell
   kubectl get pod -n test-gang
   ```

   Expected output:

   ```
   NAME             READY   STATUS    RESTARTS   AGE
   gang-job-24cz9   1/1     Running   0          96s
   gang-job-mmkxl   1/1     Running   0          96s
   gang-job-msr8v   1/1     Running   0          96s
   gang-job-qnclz   1/1     Running   0          96s
   ```

   All four pods are now in the Running state, which indicates that the group was scheduled successfully.
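When a group stays in the Pending state, as in the ResourceQuota step above, the scheduler records the reason as pod events. You can inspect them with `kubectl describe`; the exact event wording depends on the scheduler version, so no sample output is shown here:

```shell
# Show details and recent events for all pods associated with the PodGroup.
kubectl describe pod -n test-gang \
  -l pod-group.scheduling.sigs.k8s.io=demo-job-podgroup
```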