Distributed jobs like Apache Spark, Apache Hadoop, and TensorFlow training require all their pods to run simultaneously. Without gang scheduling, some pods start and hold resources while waiting for the rest, which can cause a cluster-wide deadlock where no job makes progress. Gang scheduling solves this with an all-or-nothing guarantee: either all required pods are dispatched together, or none are. This topic describes how to enable and configure gang scheduling on ACK.
How it works
ACK implements gang scheduling through a PodGroup resource. Each pod in a distributed job is assigned to a PodGroup and specifies a minimum pod count (min-available). The scheduler holds the entire group in Pending until it can satisfy the minimum. Once satisfied, all pods in the group are dispatched together.
ACK supports three methods to define a PodGroup:
| Method | How pods are grouped | Best for |
|---|---|---|
| Labels | kube-scheduler auto-creates a PodGroup | Simple jobs, no separate PodGroup object needed |
| PodGroup CRD | Explicit PodGroup resource with timeout control | Jobs that need scheduleTimeoutSeconds |
| Koordinator annotations | Annotation-based grouping | Clusters using the Koordinator scheduling stack |
All three methods require the pods and their PodGroup to be in the same namespace. All pods in a PodGroup must share the same priority.
Prerequisites
Before you begin, ensure that you have:
- An ACK managed Pro cluster running Kubernetes 1.16 or later. Upgrade the cluster if needed.
- For advanced configurations (GangGroup and match policy): a cluster running Kubernetes 1.22 or later with kube-scheduler version later than 1.xx.xx-aliyun-4.0.
- An elastic node pool whose resource capacity and node labels meet the pods' scheduling requirements. Otherwise, pods may fail to be scheduled to nodes in the node pool.
Enable gang scheduling
Method 1: Labels (recommended)
Add two labels to each pod. kube-scheduler automatically creates a PodGroup named after the pod-group.scheduling.sigs.k8s.io/name value.
```yaml
labels:
  pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu   # PodGroup name (must be a valid DNS subdomain)
  pod-group.scheduling.sigs.k8s.io/min-available: "3"   # Minimum pods required to start the job
```
The value of pod-group.scheduling.sigs.k8s.io/name must be a valid DNS subdomain name. For naming rules, see Object names and IDs.
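For context, a complete Pod carrying these labels might look like the following minimal sketch (the pod name, image, and resource request are placeholders, not values from this guide):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0                                       # illustrative name
  labels:
    pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu   # all pods in the job share this name
    pod-group.scheduling.sigs.k8s.io/min-available: "3"   # same value on every pod in the group
spec:
  containers:
  - name: worker
    image: nginx:1.25                                     # placeholder image
    resources:
      requests:
        cpu: "1"
```

Every pod in the job must carry the same name and min-available values; kube-scheduler groups them by the name label and holds them all until the minimum is met.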
Method 2: PodGroup CRD
Create a PodGroup resource explicitly, then reference it from each pod using a label.
Since ACK version 1.31, only the scheduling.x-k8s.io/v1alpha1 API version is supported. The scheduling.sigs.k8s.io/v1alpha1 version is no longer supported.
```yaml
# PodGroup resource
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
spec:
  scheduleTimeoutSeconds: 10   # Seconds to wait before rejecting the group if min-available is not met
  minMember: 3                 # Minimum number of pods required to start the job
---
# Pod label (must match the PodGroup name and namespace)
labels:
  pod-group.scheduling.sigs.k8s.io/name: nginx
```
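As a fuller sketch of how the pieces fit together, the PodGroup can be paired with a workload whose pod template carries the matching label. The Deployment name, image, and resource request below are illustrative:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
  namespace: default                # pods must be created in the same namespace
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        pod-group.scheduling.sigs.k8s.io/name: nginx   # must match the PodGroup name
    spec:
      containers:
      - name: nginx
        image: nginx:1.25            # placeholder image
        resources:
          requests:
            cpu: "500m"
```

With this configuration, none of the three replicas is dispatched until the scheduler can place all three at once.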
Method 3: Koordinator annotations
Add annotations to each pod. This method does not support the total-number or mode parameters from the Koordinator API.
```yaml
annotations:
  gang.scheduling.koordinator.sh/name: "gang-example"
  gang.scheduling.koordinator.sh/min-available: "2"
```
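A minimal Pod sketch using these annotations (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gang-example-pod-0                              # illustrative name
  annotations:
    gang.scheduling.koordinator.sh/name: "gang-example" # gang name shared by all pods in the job
    gang.scheduling.koordinator.sh/min-available: "2"   # same value on every pod in the gang
spec:
  containers:
  - name: main
    image: nginx:1.25                                   # placeholder image
```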
Advanced configurations
Group multiple gangs (GangGroup)
Some jobs use different roles with separate min-available requirements — for example, PyTorch jobs with a parameter server and workers. A single PodGroup cannot express per-role minimums, and separate PodGroups cannot coordinate scheduling across roles.
GangGroup solves this by linking multiple PodGroups. The job only starts when every gang in the group satisfies its own min-available. Add the following label to each pod or PodGroup (use the annotation key for the Koordinator method):
| Method | Resource | Key |
|---|---|---|
| Labels | Pod | pod-group.scheduling.sigs.k8s.io/groups |
| PodGroup CRD | PodGroup | pod-group.scheduling.sigs.k8s.io/groups |
| Koordinator annotations | Pod | gang.scheduling.koordinator.sh/groups |
Example value (JSON array of <namespace>/<gang-name> entries):
```yaml
pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
```
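As a sketch of the PS/worker scenario described above, two linked PodGroups could be declared as follows. The names and minMember values are illustrative:

```yaml
# Gang for the parameter-server role
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
spec:
  minMember: 1        # e.g. one parameter server
---
# Gang for the worker role
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io/groups: '["default/gang-example1", "default/gang-example2"]'
spec:
  minMember: 4        # e.g. four workers
```

Both PodGroups list the full group in their groups label, so neither gang is dispatched until both minimums can be satisfied simultaneously.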
GangGroup requires Kubernetes 1.22 or later with kube-scheduler version later than 1.xx.xx-aliyun-4.0.
Configure match policy
By default, a PodGroup counts only pods that have completed resource preallocation (only-waiting). Use match-policy to include pods in other states toward the minimum count — for example, when some pods from a previous scheduling cycle are still running and should count toward the minimum.
Add the label to each pod (labels method) or to the PodGroup resource (PodGroup CRD method). The Koordinator annotations method only supports once-satisfied.
| Match policy | Pods counted toward min-available | When to use |
|---|---|---|
| only-waiting | Pods that completed resource preallocation | Strictest: prevents any already-running pod from counting toward a new scheduling cycle. Use for stateless jobs with no carry-over from prior cycles. |
| waiting-and-running | Pods in Running state + pods that completed preallocation | Use when some pods from a previous cycle are still running and should count toward the minimum. Reduces the risk of idle resource hold caused by over-strict counting. |
| waiting-running-succeed | Pods in Succeeded state + Running + completed preallocation | Use for jobs that tolerate partial restarts; already-succeeded pods still count. Avoids re-scheduling pods that have already completed. |
| once-satisfied | Pods that completed resource preallocation; PodGroup becomes invalid once satisfied | Use for one-shot jobs. Once the gang is dispatched, the PodGroup is invalidated. |
Labels-based example:
```yaml
pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
```
PodGroup CRD example (add to the PodGroup, not the pod):
```yaml
pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
```
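In context, the match-policy label sits in the PodGroup's metadata.labels. A minimal sketch, reusing the illustrative nginx PodGroup values from the earlier example:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
  labels:
    pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"  # count Running pods too
spec:
  scheduleTimeoutSeconds: 10
  minMember: 3
```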
Match policy configuration requires Kubernetes 1.22 or later with kube-scheduler version later than 1.xx.xx-aliyun-4.0.
Example: distributed TensorFlow job
This example shows the difference between running a distributed TensorFlow job with and without gang scheduling. The cluster has 4 GPUs. The job runs 1 parameter server (PS) pod and 4 worker pods; each worker requires 2 GPUs, with a minimum of 5 pods.
Step 1: Install Arena and prepare the cluster to run TensorFlow jobs. For setup instructions, see Install Arena.
Arena is a Kubeflow subproject that manages the lifecycle of machine learning jobs — including environment setup, data preparation, model development, model training, and prediction — through a CLI or SDK.
Step 2: Submit the TensorFlow job using the following manifest. Both the PS and worker templates include the gang scheduling labels with min-available: "5".
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '1'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
```
Without gang scheduling enabled:
Run the following command to check pod status:
```shell
kubectl get pods
```
With only 4 GPUs available, 2 worker pods start Running and claim all GPUs, while the remaining 2 workers stay Pending. The running workers are blocked waiting for the others:
```
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s
```
Check a running worker's logs:
```shell
kubectl logs -f tf-smoke-gpu-worker-0
```
The log shows the workers are stalled waiting for the Pending pods — GPUs are held but no training runs:
```
INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
```
With gang scheduling enabled:
All 5 pods remain Pending until the cluster has enough resources to satisfy min-available: 5:
```
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       0/1     Pending   0          43s
tf-smoke-gpu-worker-0   0/1     Pending   0          43s
tf-smoke-gpu-worker-1   0/1     Pending   0          43s
tf-smoke-gpu-worker-2   0/1     Pending   0          43s
tf-smoke-gpu-worker-3   0/1     Pending   0          43s
```
After 4 GPUs are added to the cluster, the scheduler dispatches all 5 pods simultaneously:
```shell
kubectl get pods
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
tf-smoke-gpu-worker-3   1/1     Running   0          3m16s
```
Check the worker log to confirm training has started:
```shell
kubectl logs -f tf-smoke-gpu-worker-0
```
Expected output:
```
INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step  Img/sec  loss
INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1     images/sec: 31.6 +/- 0.0 (jitter = 0.0)  8.318
INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10    images/sec: 31.1 +/- 0.4 (jitter = 0.7)  8.343
INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20    images/sec: 31.5 +/- 0.3 (jitter = 0.7)  8.142
```
Troubleshooting
Error: "rejected by podgroup xxx"
When multiple PodGroups exist in a cluster, the kube-scheduler's backoff queue can cause pods that completed resource preallocation in one scheduling cycle to be rejected when later PodGroups are processed.
This is expected behavior. You can ignore the error if the situation lasts no more than 20 minutes. If the error persists beyond 20 minutes, submit a ticket.
What's next
- Work with capacity scheduling: use elastic quota groups to improve cluster resource utilization.