The gang scheduling feature provided by Container Service for Kubernetes (ACK) is developed on top of the new kube-scheduler framework. This feature provides a solution to job scheduling in all-or-nothing scenarios. This topic describes how to enable gang scheduling.

Prerequisites

  • An ACK Pro cluster is created. For more information, see Create an ACK Pro cluster.
    Important Gang scheduling is available only for ACK Pro clusters. To enable gang scheduling for ACK dedicated clusters, submit a ticket to apply to be added to the whitelist.
  • The following table describes the versions of the system components that are required.
    Component            Required version
    Kubernetes           1.16 or later
    Helm                 3.0 or later
    Docker               19.03.5
    Operating system     CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, or Alibaba Cloud Linux 2

Background information

Gang scheduling is a scheduling algorithm that schedules multiple correlated processes to different processors in a parallel system and starts these processes at the same time. The goal is to start all correlated processes together so that the process group is not blocked because the system starts only some of its processes. For example, if you submit a batch job that contains multiple tasks, either all of the tasks are scheduled or none of them is scheduled. Task scheduling in these all-or-nothing scenarios is known as gang scheduling.

Kubernetes is widely used in online service orchestration. ACK aims to use Kubernetes as a unified platform for managing both online services and offline jobs, which improves the resource utilization and performance of clusters. However, the default kube-scheduler cannot meet the requirements of specific offline workloads that are migrated to Kubernetes clusters. For example, if a job requires all-or-nothing scheduling, all tasks of the job must be scheduled at the same time. If only some of the tasks are started, the started tasks must wait until all the remaining tasks are scheduled, and the resources that they occupy stay idle. If each submitted job has tasks that cannot be scheduled, all submitted jobs remain in the Pending state and the cluster is deadlocked. To avoid this situation, you must enable gang scheduling for kube-scheduler.

Feature description

In ACK, a pod group is a group of pods that need to be scheduled at the same time. When you submit a job that requires all-or-nothing scheduling, you add labels to its pods. The labels specify the name of the pod group to which the pods belong and the minimum number of pods that must be scheduled for the job to run. kube-scheduler schedules the pod group based on this minimum number: the pods are scheduled only when the cluster resources are sufficient to schedule at least the required number of pods. Otherwise, the job remains in the Pending state.

How to enable gang scheduling

To enable gang scheduling, set the name and min-available values by adding the following labels to the pods:

labels:
    pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
    pod-group.scheduling.sigs.k8s.io/min-available: "3"
  • name: the name of a pod group.
  • min-available: the minimum number of pods that must be scheduled to run a job. Pods are scheduled only when the computing resources are sufficient to schedule the required number of pods.
Note Pods in the same pod group must be assigned the same priority.
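
The labels are added to the pod template of the workload that you submit. The following manifest is a minimal sketch under assumed values: a hypothetical Kubernetes Job named demo-gang whose three pods form one pod group, so the pods are scheduled only when all three can be placed. The image, command, and resource values are placeholders for illustration.

apiVersion: batch/v1
kind: Job
metadata:
  name: demo-gang
spec:
  parallelism: 3
  completions: 3
  template:
    metadata:
      labels:
        # All three pods join the same pod group and are scheduled together.
        pod-group.scheduling.sigs.k8s.io/name: demo-gang
        pod-group.scheduling.sigs.k8s.io/min-available: "3"
    spec:
      containers:
      - name: worker
        image: busybox                      # placeholder image for illustration
        command: ["sh", "-c", "sleep 60"]   # placeholder workload
        resources:
          limits:
            cpu: "1"
      restartPolicy: Never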

Examples

In this example, a distributed TensorFlow job is used to demonstrate how to enable gang scheduling. The ACK cluster that is used in this example has four GPUs.

  1. Install Arena and deploy an environment in your cluster to run TensorFlow jobs. For more information, see Install Arena.
    Note Arena is a subproject of Kubeflow. Kubeflow is an open source project for Kubernetes-based machine learning. Arena allows you to manage the lifecycle of machine learning jobs by using a CLI or SDK. Lifecycle management includes environment setup, data preparation, model development, model training, and model prediction. This improves the working efficiency of data scientists.
  2. Use the following template to submit a distributed TensorFlow job to the ACK cluster. The job runs on one parameter server (PS) pod and four worker pods. Each worker pod requires two GPUs.
    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "tf-smoke-gpu"
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1
          template:
            metadata:
              creationTimestamp: null
              labels:
                pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
                pod-group.scheduling.sigs.k8s.io/min-available: "5"
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
                image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources:
                  limits:
                    cpu: '1'
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
        Worker:
          replicas: 4
          template:
            metadata:
              creationTimestamp: null
              labels:
                pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
                pod-group.scheduling.sigs.k8s.io/min-available: "5"
            spec:
              containers:
              - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
                image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
                name: tensorflow
                ports:
                - containerPort: 2222
                  name: tfjob-port
                resources:
                  limits:
                    nvidia.com/gpu: 2
                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
              restartPolicy: OnFailure
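     To submit the job with this template, save the template to a file and apply it by using kubectl. The following command is a sketch that assumes the template is saved as tf-smoke-gpu.yaml; the file name is an assumption for illustration.
     kubectl apply -f tf-smoke-gpu.yaml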
    • Submit the distributed TensorFlow job without enabling gang scheduling
      Run the following command to query the status of the pods that run the TensorFlow job. The output shows that only two worker pods are running and the other two worker pods are in the Pending state.
      kubectl get pods
      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
      tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
      tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
      tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
      tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s
      Run the following command to query the log data of the running worker pods. The returned log data indicates that the running worker pods are waiting for the system to start the pending worker pods. In the meantime, the GPU resources that the running worker pods occupy remain idle.
      kubectl logs -f tf-smoke-gpu-worker-0
      INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
      INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
    • Submit the distributed TensorFlow job with gang scheduling enabled
      Run the following command to query the status of the pods that run the TensorFlow job. The computing resources in the cluster are insufficient to schedule the minimum number of pods. Therefore, the pod group cannot be scheduled and all pods are in the Pending state.
      kubectl get pods
      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       0/1     Pending   0          43s
      tf-smoke-gpu-worker-0   0/1     Pending   0          43s
      tf-smoke-gpu-worker-1   0/1     Pending   0          43s
      tf-smoke-gpu-worker-2   0/1     Pending   0          43s
      tf-smoke-gpu-worker-3   0/1     Pending   0          43s
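      To check why a pod in the pod group remains in the Pending state, you can describe one of the pods. This is a generic kubectl check rather than a gang scheduling-specific command; the pod name below matches the preceding output. The Events section of the returned information reports why the scheduler cannot place the pod.
      kubectl describe pod tf-smoke-gpu-worker-0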
      After another four GPUs are added to the cluster, the computing resources in the cluster are sufficient to schedule the minimum number of pods. The pod group is scheduled and the four worker pods start to run. Run the following command to query the status of the pods that run the TensorFlow job:
      kubectl get pods
      NAME                    READY   STATUS    RESTARTS   AGE
      tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
      tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
      tf-smoke-gpu-worker-3   1/1     Running   0          3m16s
      Run the following command to query the log data of a running worker pod. The following output indicates that the tasks have been started.
      kubectl logs -f tf-smoke-gpu-worker-0
      INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
      INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
      INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step  Img/sec loss
      INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1 images/sec: 31.6 +/- 0.0 (jitter = 0.0) 8.318
      INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10  images/sec: 31.1 +/- 0.4 (jitter = 0.7) 8.343
      INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20  images/sec: 31.5 +/- 0.3 (jitter = 0.7) 8.142