The Burgeoning Kubernetes Scheduling System – Part 2: Coscheduling and Gang Scheduling That Support Batch Jobs

By Wang Qingcan (Li Fan) and Zhang Kai

Preface

With years of experience in supporting Kubernetes products and customers, the Alibaba Cloud Container Service for Kubernetes Team has significantly optimized and extended Kube-scheduler to stably and efficiently schedule various complex workloads in different scenarios. This series of articles, entitled “The Burgeoning Kubernetes Scheduling System,” provides a comprehensive summary of our experiences, technical thinking, and specific implementation methods for Kubernetes users and developers. We hope these articles help you better understand the powerful capabilities and future trends of the Kubernetes scheduling system.

Introduction

First, let's take a look at the definitions of coscheduling and gang scheduling. According to Wikipedia, coscheduling is the scheduling of related processes to run on different processors at the same time in concurrent systems. In coscheduling scenarios, the main principle is to ensure that all related processes can be started at the same time. This prevents exceptions in some processes from blocking the entire process group. An abnormal process that blocks a group is called a fragment.

During implementation, coscheduling can be classified into explicit coscheduling, local coscheduling, and implicit coscheduling based on whether fragments are allowed. Among them, explicit coscheduling is known as gang scheduling. Gang scheduling allows no fragments, which means "all or nothing."

By mapping the preceding concepts to Kubernetes, you can understand why the Kubernetes scheduling system supports coscheduling for batch jobs. A batch job (equivalent to a related process group) contains N pods (equivalent to processes). The Kubernetes scheduler schedules these N pods to run simultaneously on M nodes (equivalent to processors). Assume this batch job can run as long as a certain number of pods are started simultaneously. We define the minimum number of pods that must be started simultaneously as min-available. When min-available is equal to N, the batch job requires gang scheduling.
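The relationship between N, min-available, and gang scheduling can be written down in a few lines. The following Go sketch is only an illustration of the definitions above. The BatchJob type and its method names are hypothetical and are not part of Kubernetes.

```go
package main

// BatchJob models a batch job as described above: N pods, of which at
// least MinAvailable must be started at the same time.
// The type and field names are illustrative only.
type BatchJob struct {
	Pods         int // N: total number of pods in the job
	MinAvailable int // minimum number of pods that must start together
}

// RequiresGang reports whether the job needs strict gang scheduling,
// that is, whether min-available equals N ("all or nothing").
func (j BatchJob) RequiresGang() bool {
	return j.MinAvailable == j.Pods
}

// CanStart reports whether the job may start, given how many of its
// pods can be placed at the same time.
func (j BatchJob) CanStart(placeable int) bool {
	return placeable >= j.MinAvailable
}
```

For example, BatchJob{Pods: 4, MinAvailable: 4} requires gang scheduling and cannot start with only three placeable pods, whereas BatchJob{Pods: 4, MinAvailable: 2} can.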

Why Does the Kubernetes Scheduling System Require Coscheduling?

Kubernetes is widely used in online service orchestration. To improve the utilization and operating efficiency of clusters, we hope to use Kubernetes as a unified management platform to manage online services and offline jobs. The default scheduler schedules pods serially without considering the relationship between pods. However, many offline jobs that involve data computing require combined scheduling. Combined scheduling means that all tasks must be created before the overall job can run properly. If some tasks are started but other tasks are not, the started tasks wait for the scheduler to schedule the remaining tasks. This is a gang scheduling scenario.

As shown in the following figure, JobA can only run properly when four pods are started at the same time. Kube-scheduler sequentially schedules and creates three pods. However, cluster resources are insufficient for Kube-scheduler to schedule the fourth pod. As a result, the first three pods of JobA can only wait for the fourth pod while they continue to occupy resources. If the fourth pod cannot be started in time, JobA cannot run at all, and worse still, the occupied cluster resources are wasted.

In a worse case, as shown in the following figure, other cluster resources are occupied by the first three pods of JobB and Kube-scheduler is also waiting to create the fourth pod for JobB. As a result, a deadlock occurs, rendering the entire cluster inoperable.
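The deadlock described above can be reproduced with a toy model. The following Go sketch assumes that every pod needs exactly one unit of capacity and that the cluster offers six units in total. The Job type and the scheduleSerially function are hypothetical names used only for this illustration, not part of Kube-scheduler.

```go
package main

// Job is a hypothetical model of a batch job whose pods each need one
// unit of cluster capacity and that only makes progress once all of
// its pods are started.
type Job struct {
	Name    string
	Pods    int // pods the job needs running at the same time
	Started int // pods that have been scheduled so far
}

// runnable reports whether the job has all of its pods started.
func (j *Job) runnable() bool { return j.Started == j.Pods }

// scheduleSerially mimics a scheduler that places pods one at a time,
// in arrival order, with no notion of which job a pod belongs to.
// arrival lists, for each queued pod, the index (into jobs) of the job
// that owns it. The function returns the capacity that is left over.
func scheduleSerially(jobs []*Job, arrival []int, capacity int) int {
	for _, owner := range arrival {
		if capacity == 0 {
			break // the remaining pods stay unscheduled
		}
		jobs[owner].Started++
		capacity--
	}
	return capacity
}
```

With two four-pod jobs whose pods arrive alternately and a capacity of six, each job ends up with three started pods, no capacity is left, and neither job is runnable.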

Solutions Provided by the Community

To overcome the preceding pain points, the community provides the Kube-batch project and the Volcano project derived from the Kube-batch project. Specifically, the community developed a new scheduler to schedule PodGroups instead of pods during scheduling. In other words, the scheduler schedules pods by group. In these projects, the new scheduler schedules pods that require the coscheduling feature, and Kube-scheduler schedules other pods, such as those that run online services.

These projects can resolve coscheduling problems but create new problems. A scheduler needs a centralized view of the resources in a cluster. If two schedulers coexist in the same cluster, decision conflicts may occur. For example, one unit of resources may be separately allocated to two different pods. As a result, a pod scheduled to a node may fail to be created due to insufficient resources. The only solutions are to forcibly divide nodes using labels or to deploy multiple clusters. Either way, online services and offline jobs can no longer share the same pool of resources, which inevitably wastes cluster resources and increases O&M costs. Furthermore, to run the Volcano project, you must start the custom MutatingAdmissionWebhook and ValidatingAdmissionWebhook. These webhooks introduce single point of failure risks. If any webhook fails, all the pods in the cluster may fail to be created. Running an additional scheduler also increases the complexity of maintenance and compromises compatibility with the upstream Kube-scheduler API.
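The decision conflict between two schedulers can be illustrated with a small model. The following Go sketch assumes that each scheduler decides from its own cached view of a node and that the node rejects a pod when its real free capacity is insufficient. All type and method names are hypothetical and are not taken from Kube-scheduler, Kube-batch, or Volcano.

```go
package main

// Node is a hypothetical node with a fixed amount of free capacity.
type Node struct{ Free int }

// Admit models the node actually starting a pod. It fails when the
// real free capacity is insufficient.
func (n *Node) Admit(request int) bool {
	if request > n.Free {
		return false
	}
	n.Free -= request
	return true
}

// Scheduler is a hypothetical scheduler that decides from its own
// cached view of the node, as two independent schedulers would.
type Scheduler struct{ cachedFree int }

// Snapshot refreshes the scheduler's private view of the node.
func (s *Scheduler) Snapshot(n *Node) { s.cachedFree = n.Free }

// Decide returns true if the pod fits according to the cached view.
// It reserves the capacity in that private view only.
func (s *Scheduler) Decide(request int) bool {
	if request > s.cachedFree {
		return false
	}
	s.cachedFree -= request
	return true
}
```

If a node has one free unit and both schedulers take a snapshot before either pod starts, both Decide(1) calls succeed, but only the first Admit(1) on the node does.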

Scheduling Framework-Based Solution

In the first article in this series, we introduced the architectural principles and development method of the Kubernetes Scheduling Framework. On this basis, we can extend and implement a coscheduling plug-in that enables the native Kubernetes scheduler to schedule batch jobs while avoiding the problems of the preceding solutions. For a detailed description of the Scheduling Framework, see the previous article.

To better manage scheduling plug-ins in different scenarios, the sig-scheduling team, which is responsible for Kube-scheduler in Kubernetes, created a project named scheduler-plugins. The coscheduling plug-in implemented based on the Scheduling Framework became the first official plug-in of this project. In the following sections, we describe the implementation and usage of the coscheduling plug-in in detail.

Technical Solution

Overall Architecture

PodGroup

We define PodGroups using labels. Pods with the same label belong to the same PodGroup. In addition, min-available is used to indicate the minimum number of replicas that a job of the PodGroup requires to run properly.

labels:
     pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
     pod-group.scheduling.sigs.k8s.io/min-available: "2"

Note: Pods in the same PodGroup must have the same priority.
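As a rough illustration of how a plug-in could read these two labels, consider the following Go sketch. The label keys are taken from the example above. The helper name podGroupFromLabels and its error handling are assumptions made for this sketch and are not the plug-in's actual code. Because Kubernetes label values are always strings, min-available is written as a quoted number and has to be parsed.

```go
package main

import (
	"fmt"
	"strconv"
)

// Label keys taken from the example above.
const (
	podGroupNameLabel         = "pod-group.scheduling.sigs.k8s.io/name"
	podGroupMinAvailableLabel = "pod-group.scheduling.sigs.k8s.io/min-available"
)

// podGroupFromLabels is a hypothetical helper that extracts the PodGroup
// name and min-available from a pod's labels. A pod without the name
// label is treated as a regular pod: the name is empty and err is nil.
func podGroupFromLabels(labels map[string]string) (name string, minAvailable int, err error) {
	name = labels[podGroupNameLabel]
	if name == "" {
		return "", 0, nil // regular pod, not part of any PodGroup
	}
	raw, ok := labels[podGroupMinAvailableLabel]
	if !ok {
		return "", 0, fmt.Errorf("pod group %q: label %s is missing", name, podGroupMinAvailableLabel)
	}
	// Label values are strings, so "2" must be converted to a number.
	minAvailable, err = strconv.Atoi(raw)
	if err != nil {
		return "", 0, fmt.Errorf("pod group %q: invalid min-available %q: %w", name, raw, err)
	}
	if minAvailable < 1 {
		return "", 0, fmt.Errorf("pod group %q: min-available must be at least 1", name)
	}
	return name, minAvailable, nil
}
```

For the labels shown above, this helper would return the name tf-smoke-gpu and a min-available of 2.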

Permit

The Permit plug-in of the Scheduling Framework provides the delayed binding feature. Specifically, for a pod that enters the Permit phase, you can customize a condition to allow the pod to pass the phase, deny the pod from passing the phase, or keep the pod pending. When you keep the pod pending, you can specify a timeout period for the pod. The delayed binding feature of the Permit phase allows pods belonging to the same PodGroup to wait after they have been assigned to nodes. When the required number of pods has been assigned, the scheduler releases all waiting pods of the PodGroup so that they are bound and created together.

Assume that JobA can run properly only when four pods for the job are started at the same time, but the current cluster resources only allow you to create three pods. Unlike the default scheduler that schedules and creates three pods first, the Permit plug-in of the Scheduling Framework keeps all the pods pending.

Then, when idle resources are released in the cluster, and all of the resources required by the pods for JobA are available, the scheduler schedules and creates all four pods of JobA and runs JobA.
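The waiting behavior of the Permit phase can be sketched as a simple gate per PodGroup. The following Go code is a minimal model under stated assumptions: it ignores timeouts and node assignment, and the Gate type and its Permit method are hypothetical names, not the Scheduling Framework's API.

```go
package main

// Gate is a hypothetical stand-in for the Permit step of one PodGroup.
type Gate struct {
	MinAvailable int      // min-available of the PodGroup
	bound        int      // pods already released for binding
	waiting      []string // pods that passed scheduling and are parked
}

// Permit is called when a pod of the group has been assigned a node.
// While fewer than MinAvailable pods have arrived, the pod is parked
// and nil is returned. Once the quorum is reached, every parked pod
// (including the current one) is returned so that all of them can be
// bound together.
func (g *Gate) Permit(pod string) []string {
	g.waiting = append(g.waiting, pod)
	if g.bound+len(g.waiting) < g.MinAvailable {
		return nil // keep waiting; a real plug-in would also set a timeout
	}
	released := g.waiting
	g.bound += len(released)
	g.waiting = nil
	return released
}
```

With MinAvailable set to 4, the first three calls return nil and the fourth call returns all four pods at once, which corresponds to JobA in the example above.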

QueueSort

The queue of the default scheduler does not perceive PodGroup information. Therefore, pods in PodGroups are not in order when they dequeue. As shown in the following figure, pods a and b are from different PodGroups. When pods of the two PodGroups enter a queue, the pods are staggered in the queue due to the staggered creation time.

After a pod is created and added to the queue, it is not adjacent to other pods of the same PodGroup. Instead, it is queued with other pods in a staggered manner.

As a result, while pods of PodGroupA are waiting in the Permit phase, pods of PodGroupB that are scheduled next also end up waiting in the Permit phase. The resources occupied by the two groups prevent both PodGroupA and PodGroupB from being scheduled completely. In this case, the deadlock simply moves from the nodes to the Permit phase, and the preceding problem is not resolved.

To address the preceding problem, we implemented the QueueSort plug-in to ensure that pods of the same PodGroup are adjacent to each other in the queue. We define the Less method for the QueueSort plug-in to determine the order of pods in the queue:

func Less(podA *PodInfo, podB *PodInfo) bool

First, the plug-in inherits the default priority-based comparison method, ensuring that pods with higher priorities precede pods of lower priorities.

Then, we define a new queuing logic to support the sorting of pods in a PodGroup in the case where pods have the same priority.

If both pods are regularPods, the pod that was created first precedes the other pod in the queue.
If one of the pods is a regularPod and the other pod is a pgPod, which indicates a pod belonging to a certain PodGroup, the QueueSort plug-in compares the creation time of the regularPod with that of the PodGroup to which the pgPod belongs. The pod with the earlier creation time precedes the other pod in the queue.
If both pods are pgPods, the QueueSort plug-in compares the creation time of the two PodGroups to which the pgPods belong. The pod whose PodGroup was created earlier precedes the other pod in the queue. In addition, when both PodGroups are created at the same time, we use auto-incrementing IDs, which distinguish different PodGroups, so that the pod with the lower PodGroup ID precedes the other pod in the queue.

These queuing policies ensure that pods in the same PodGroup are adjacent to each other in the queue.

After a pod is created and added to the queue, the pod will be adjacent to other pods belonging to the same PodGroup.
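The ordering rules above can be sketched as follows. This is an illustration, not the plug-in's source code: PodInfo here is a simplified, hypothetical struct rather than the framework's type, timestamps are plain integers, and the tie-breaking between a regularPod and a pgPod with identical timestamps is an assumption that the text does not specify.

```go
package main

// PodInfo is a simplified, hypothetical stand-in for the scheduler's
// queued pod information. Timestamps are plain integers (for example,
// Unix nanoseconds) to keep the sketch dependency-free.
type PodInfo struct {
	Name            string
	Priority        int32
	Created         int64  // creation time of the pod itself
	PodGroup        string // empty for a regularPod
	PodGroupCreated int64  // creation time of the PodGroup (pgPods only)
	PodGroupID      int64  // auto-incrementing PodGroup ID (pgPods only)
}

// queueTime returns the timestamp used for ordering: the pod's own
// creation time for a regularPod, the PodGroup's creation time for a pgPod.
func queueTime(p *PodInfo) int64 {
	if p.PodGroup == "" {
		return p.Created
	}
	return p.PodGroupCreated
}

// Less reports whether podA should be dequeued before podB.
func Less(podA *PodInfo, podB *PodInfo) bool {
	// 1. Higher priority first, as in the default queue.
	if podA.Priority != podB.Priority {
		return podA.Priority > podB.Priority
	}
	// 2. Earlier pod or PodGroup creation time first.
	ta, tb := queueTime(podA), queueTime(podB)
	if ta != tb {
		return ta < tb
	}
	// 3. Same timestamp: the lower PodGroup ID first. regularPods have
	// ID 0 in this sketch, which is an assumption.
	return podA.PodGroupID < podB.PodGroupID
}
```

Because every pgPod is ordered by the creation time and ID of its PodGroup rather than by its own creation time, a pod of PodGroupA that is created after a pod of PodGroupB still sorts next to the other pods of PodGroupA.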

PreFilter

To reduce ineffective scheduling operations and improve scheduling performance, we add a filtering condition in the PreFilter phase. Before scheduling a pod, the scheduler counts the total number of pods, including running pods, in the same PodGroup as the pod. If the total is less than min-available, the min-available requirement cannot be met. In this case, the scheduler rejects the pod in the PreFilter phase and the pod does not enter the main scheduling process.
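This early check can be sketched as follows. PodGroupStatus and the function below are hypothetical and model only the count comparison described above. How an unknown PodGroup is treated is an assumption of this sketch.

```go
package main

// PodGroupStatus is hypothetical bookkeeping for one PodGroup.
type PodGroupStatus struct {
	MinAvailable int // min-available of the PodGroup
	TotalPods    int // pods of the group known to the scheduler, including running pods
}

// preFilter reports whether a pod of the given group may enter the main
// scheduling process. A regular pod (empty group name) always passes.
func preFilter(groups map[string]PodGroupStatus, group string) bool {
	if group == "" {
		return true // regular pod: nothing to check
	}
	st, known := groups[group]
	if !known {
		return false // assumption: no bookkeeping yet means not enough pods
	}
	// Scheduling is pointless while fewer than min-available pods exist.
	return st.TotalPods >= st.MinAvailable
}
```

For a PodGroup with a min-available of 5, a pod is rejected while only three pods of the group exist and passes once all five have been created.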

UnReserve

If a pod times out in the Permit phase, the pod enters the UnReserve phase. The scheduler then rejects all other waiting pods that belong to the same PodGroup, so that the remaining pods do not wait for a long time in vain.
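The group-wide rejection can be sketched as follows. WaitingPod and unreserve are hypothetical names for this illustration. The real plug-in works with the framework's own handles for waiting pods.

```go
package main

// WaitingPod is a hypothetical record of a pod parked in the Permit phase.
type WaitingPod struct {
	Name     string
	PodGroup string
	Rejected bool
}

// unreserve is called for the PodGroup of a pod that timed out in the
// Permit phase. It rejects every pod of that group that is still
// waiting and returns the names of the pods it rejected.
func unreserve(timedOutGroup string, waiting []*WaitingPod) []string {
	if timedOutGroup == "" {
		return nil // regular pods have no group members to reject
	}
	var rejected []string
	for _, w := range waiting {
		if w.PodGroup == timedOutGroup && !w.Rejected {
			w.Rejected = true
			rejected = append(rejected, w.Name)
		}
	}
	return rejected
}
```

Pods of other PodGroups are left untouched, so a timeout in one group does not affect another group that is still waiting.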

Try Out Coscheduling

Installation and Deployment

You can try out coscheduling in a self-built Kubernetes cluster or any dedicated Kubernetes service provided by a public cloud. Note: The cluster version must be 1.16 or later and you must have permission to update the primary nodes of the cluster.

This article uses the Kubernetes cluster provided by Alibaba Cloud Container Service for Kubernetes (ACK) to test the coscheduling feature.

Before You Begin

Make sure you are using Kubernetes 1.16 or later
Create a dedicated cluster provided by ACK
Ensure that nodes of the cluster can access the public network
Ensure that the Helm version installed on the primary nodes by default is the V3 version. If Helm V2 is installed, upgrade it to Helm V3. For more information about how to install Helm V3, see Install Helm V3 (https://helm.sh/docs/intro/install/).

Deploy Coscheduling

We have already built the code of the coscheduling plug-in and the native scheduler into new container images and provided a Helm Chart package named ack-coscheduling for automatic installation. The package starts a job to automatically replace the native scheduler installed on the cluster with the coscheduling scheduler and modify the Config file related to the scheduler so the Scheduling Framework can load the coscheduling plug-in. After the trial, you can restore the default scheduler and related configurations of the cluster using the uninstall feature described in the following section.

Download the Helm Chart package and run the following commands to install the Helm Chart:

$ wget http://kubeflow.oss-cn-beijing.aliyuncs.com/ack-coscheduling.tar.gz
$ tar zxvf ack-coscheduling.tar.gz
$ helm install ack-coscheduling -n kube-system ./ack-coscheduling
NAME: ack-coscheduling
LAST DEPLOYED: Mon Apr 13 16:03:57 2020
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Verify Coscheduling

On the primary node, run the following helm command to verify that the coscheduling plug-in is installed:

$ helm get manifest ack-coscheduling -n kube-system | kubectl get -n kube-system -f -
NAME                           COMPLETIONS   DURATION   AGE
scheduler-update-clusterrole   1/1           8s         35s
scheduler-update               3/1 of 3      8s         35s

Uninstall Coscheduling

Run the following helm command to uninstall the coscheduling plug-in and roll back the version and configurations of kube-scheduler in the cluster to the default state.

$ helm uninstall ack-coscheduling -n kube-system

Instructions

To use coscheduling, you only need to configure the following labels in the YAML file that you use to create the job: pod-group.scheduling.sigs.k8s.io/name and pod-group.scheduling.sigs.k8s.io/min-available.

labels:
    pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
    pod-group.scheduling.sigs.k8s.io/min-available: "3"

pod-group.scheduling.sigs.k8s.io/name: the name of the PodGroup.

pod-group.scheduling.sigs.k8s.io/min-available: indicates that the job can be scheduled as a whole only when the resources of the current cluster are sufficient to start the min-available pods.

Note: Pods in the same PodGroup must have the same priority.

Demo

In the following section, we will demonstrate the coscheduling results by running a distributed TensorFlow training job (TFJob). The test cluster has four graphics processing units (GPUs).

1. Deploy a runtime environment for a TFJob in the existing Kubernetes cluster using Kubeflow's Arena.

Arena is one of the subprojects of Kubeflow, an open-source community for Kubernetes-based machine learning systems. Arena allows you to manage machine learning jobs using command lines and SDKs in the following phases of the lifecycle: environment installation, data preparation, model development, model training, and model prediction. Arena effectively improves the productivity of data scientists.

git clone https://github.com/kubeflow/arena.git
kubectl create ns arena-system
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml

Check whether the runtime environment is deployed.

$ kubectl get pods -n arena-system
NAME                                READY   STATUS    RESTARTS   AGE
tf-job-dashboard-56cf48874f-gwlhv   1/1     Running   0          54s
tf-job-operator-66494d88fd-snm9m    1/1     Running   0          54s

2. Submit a TFJob to the cluster. In the following example, the TFJob consists of one parameter server (PS) pod and four worker pods, and each worker requires two GPUs. Because min-available of the PodGroup is set to 5, the job can run only when all five pods of the PodGroup can be started.

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5" 
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '1'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure

3. To see how the job behaves without the coscheduling feature, do the following:

Delete the pod-group.scheduling.sigs.k8s.io/name and pod-group.scheduling.sigs.k8s.io/min-available labels from the TFJob YAML file so that coscheduling is not used for the job. After you create the job, the cluster resources are sufficient to start only two workers. The other two workers remain in the pending state.

$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s

Check the logs of the running workers. You will find that both workers are waiting for the other two workers to start. In this case, all four GPUs are occupied, but the job makes no progress.

$ kubectl logs -f tf-smoke-gpu-worker-0
INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
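The numbers in this step follow from simple arithmetic: the test cluster has four GPUs and each worker requests two, so only two of the four workers fit. The following Go sketch makes this explicit. The function name is hypothetical, and the sketch assumes that GPUs can be handed out freely across nodes, which ignores per-node fragmentation.

```go
package main

// workersThatFit returns how many workers can hold their GPUs at the
// same time, given the total number of GPUs in the cluster, the GPUs
// requested per worker, and the number of workers the job wants.
// It assumes GPUs are not fragmented across nodes.
func workersThatFit(totalGPUs, gpusPerWorker, wantWorkers int) int {
	if gpusPerWorker <= 0 {
		return 0 // guard against an invalid request
	}
	fit := totalGPUs / gpusPerWorker
	if fit > wantWorkers {
		fit = wantWorkers
	}
	return fit
}
```

Here workersThatFit(4, 2, 4) yields 2, which matches the two running workers above. All four workers would need 4 × 2 = 8 GPUs.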

4. To see how the job behaves with the coscheduling feature, do the following:

Add the PodGroup labels back and create the job. The cluster resources cannot meet the min-available requirement. As a result, the PodGroup cannot be scheduled and all the pods remain in the pending state.

$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       0/1     Pending   0          43s
tf-smoke-gpu-worker-0   0/1     Pending   0          43s
tf-smoke-gpu-worker-1   0/1     Pending   0          43s
tf-smoke-gpu-worker-2   0/1     Pending   0          43s
tf-smoke-gpu-worker-3   0/1     Pending   0          43s

Now, if you scale out the cluster by adding four GPUs, the resources can meet the min-available requirements, the PodGroup can be scheduled, and all four workers will start to run.

$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
tf-smoke-gpu-worker-3   1/1     Running   0          3m16s

View the log of one of the workers. You will find that the training job has started.

$ kubectl logs -f tf-smoke-gpu-worker-0
INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step    Img/sec    loss
INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1    images/sec: 31.6 +/- 0.0 (jitter = 0.0)    8.318
INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10    images/sec: 31.1 +/- 0.4 (jitter = 0.7)    8.343
INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20    images/sec: 31.5 +/- 0.3 (jitter = 0.7)    8.142

Going Forward

Coscheduling is implemented based on the mechanism of the Kubernetes Scheduling Framework. It meets the requirements for combined scheduling in artificial intelligence (AI) and data computing batch jobs, reduces resource waste, and improves the overall resource utilization of clusters.

In subsequent articles in this series, we will provide more information about scheduling policies for batch jobs, including capacity scheduling and multi-queue management features. We will also describe the design and implementation of the scheduling policies in the Scheduling Framework. Stay tuned for more!
