Supporting Coscheduling/Gang Scheduling for Batch Tasks

Foreword

First, let's understand what Coscheduling and Gang scheduling are. Wikipedia defines Coscheduling as "a strategy for scheduling multiple associated processes to run concurrently on different processors in a concurrent system". The main principle of Coscheduling is to ensure that all associated processes can start at the same time, preventing abnormal processes from blocking the entire process group. The abnormal processes that cause such blocking are called "fragments".
In concrete implementations of Coscheduling, depending on whether "fragments" are allowed, it can be subdivided into Explicit Coscheduling, Local Coscheduling, and Implicit Coscheduling. Explicit Coscheduling is the Gang Scheduling that we often hear about: Gang Scheduling allows no "fragments" at all, that is, "All or Nothing".

Mapping the concepts above to Kubernetes clarifies what it means for the Kubernetes scheduling system to support Coscheduling of batch tasks. A batch task (the associated process group) consists of N Pods (the processes), and the Kubernetes scheduler is responsible for scheduling these N Pods onto M nodes (the processors) to run simultaneously. If the batch task requires a certain number of Pods to start together, we call that minimum number min-available. In particular, when min-available = N, the batch task requires Gang Scheduling.
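The gang condition described above can be expressed as a simple predicate. This is a minimal illustrative sketch (the function name and shape are ours, not the plugin's API):

```python
def gang_ready(scheduled_pods: int, min_available: int) -> bool:
    """A batch task may start only once at least min_available of its Pods
    have been scheduled; min_available == N is strict Gang Scheduling."""
    return scheduled_pods >= min_available

# JobA: N = 4 Pods with min-available = 4 (strict Gang Scheduling)
print(gang_ready(3, 4))  # 3 of 4 scheduled: the job must keep waiting
print(gang_ready(4, 4))  # all 4 scheduled: the job can start
```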

Why does the Kubernetes scheduling system need Coscheduling?

Kubernetes is widely used for orchestrating online services. To improve cluster utilization and operational efficiency, we want to use Kubernetes as a unified platform that manages both online services and offline jobs. The default scheduler treats each Pod as an independent scheduling unit and schedules them sequentially, without considering relationships between Pods. However, many offline data-computing jobs require combined scheduling: the job can run normally only after all of its subtasks are created successfully. If only some subtasks start, they will wait indefinitely for the remaining subtasks to be scheduled. This is exactly the scenario for Gang Scheduling.

As shown in the figure below, JobA needs four Pods to start at the same time in order to run. Kube-scheduler schedules and creates three of them successfully, but when the fourth Pod arrives, cluster resources are insufficient. The three Pods of JobA then sit idle while still occupying resources. If the fourth Pod cannot start in time, the entire JobA cannot run, and worse, cluster resources are wasted.

An even worse situation is shown in the figure below: the remaining cluster resources happen to be occupied by three Pods of JobB, which are likewise waiting for JobB's fourth Pod to be created. At this point, the entire cluster is deadlocked.

Related community solutions

The community currently has two projects that address these pain points: Kube-batch and Volcano, which is derived from Kube-batch. Their approach is to develop a new scheduler that changes the scheduling unit from a Pod to a PodGroup and schedules Pods in groups. Pods that need Coscheduling use the new scheduler, while other Pods, such as online services, continue to use Kube-scheduler.

Although these solutions solve the Coscheduling problem, they introduce new ones. For a given pool of cluster resources, scheduling decisions need to be made by a single scheduler; running two schedulers at the same time can lead to conflicting decisions, for example allocating the same resource to two different Pods, so that a Pod scheduled to a node then fails to be created for lack of resources. The only workarounds are to partition nodes with labels or to deploy multiple clusters, but running online services and offline jobs this way inevitably wastes overall cluster resources and increases operation and maintenance costs. Furthermore, Volcano needs to run custom MutatingAdmissionWebhook and ValidatingAdmissionWebhook servers. These webhooks are a single point of failure: once they fail, the creation of all Pods in the cluster is affected. Finally, running an additional scheduler adds maintenance complexity and uncertainty about compatibility with the upstream Kube-scheduler interfaces.

Scheme based on Scheduling Framework

The first article in this series, "Attack on the Kubernetes Scheduling System (1): Scheduling Framework", introduced the architecture and development model of the Kubernetes Scheduling Framework. On that basis, we implemented a Coscheduling scheduling plugin that lets the native Kubernetes scheduler support batch job scheduling while avoiding the problems of the solutions above. The Scheduling Framework itself is covered in detail in the previous article, which we encourage you to read.

To better manage scheduling-related plugins, SIG Scheduling, the group responsible for Kube-scheduler in Kubernetes, created a new project, scheduler-plugins, to host plugins for different scenarios. The Coscheduling plugin we implemented on the Scheduling Framework became the first official plugin of that project. Below, I will introduce its implementation and usage in detail.

Technical solutions

Overall architecture

We define the concept of a PodGroup using labels: Pods carrying the same label belong to the same PodGroup, and min-available identifies the minimum number of replicas the PodGroup needs before its job can actually run.

Note: Pods belonging to the same PodGroup must have the same priority.
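As an illustrative sketch, a Pod's group membership and min-available can be read from its labels. The label keys below follow the ones used by the scheduler-plugins coscheduling implementation, but verify them against the version you deploy; the helper function itself is hypothetical:

```python
# Label keys as used by the scheduler-plugins coscheduling plugin
# (an assumption here; confirm against your deployed version).
PG_NAME_LABEL = "pod-group.scheduling.sigs.k8s.io/name"
PG_MIN_AVAILABLE_LABEL = "pod-group.scheduling.sigs.k8s.io/min-available"

def pod_group_info(pod_labels: dict):
    """Return (group_name, min_available) for a Pod,
    or (None, 0) for a regular Pod outside any PodGroup."""
    name = pod_labels.get(PG_NAME_LABEL)
    if name is None:
        return None, 0
    return name, int(pod_labels.get(PG_MIN_AVAILABLE_LABEL, "0"))

labels = {PG_NAME_LABEL: "job-a", PG_MIN_AVAILABLE_LABEL: "4"}
print(pod_group_info(labels))  # ('job-a', 4)
print(pod_group_info({}))      # (None, 0)
```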

Permit

The Permit plugin of the Framework provides delayed binding: when a Pod enters the Permit phase, user-defined conditions decide whether the Pod is allowed, rejected, or made to wait (with a configurable timeout). This delayed binding lets Pods belonging to the same PodGroup wait after being scheduled to a node; once enough Pods of the group have accumulated, all of them are bound and created together.

As a concrete example, JobA needs four Pods to start at the same time, but the cluster currently has room to create only three. Unlike the default scheduler, the plugin does not immediately bind and create those three Pods; instead, it makes them wait through the Framework's Permit mechanism.

Later, when idle resources in the cluster are released, the resource requirements of JobA's Pods can be met.

The four Pods of JobA are then scheduled and created together, and the job runs normally.
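The Permit decision described above can be sketched as follows. This is a simplified illustration under our own naming; the real plugin is written in Go and operates on the scheduler's waiting-Pod registry rather than a plain dictionary:

```python
def permit(pod_group: str, waiting_counts: dict, min_available: int) -> str:
    """Sketch of the Permit decision for one Pod of a PodGroup:
    return 'allow' once enough Pods of the group have reached Permit,
    otherwise ask the Pod to 'wait' (subject to a timeout in the real plugin)."""
    waiting_counts[pod_group] = waiting_counts.get(pod_group, 0) + 1
    if waiting_counts[pod_group] >= min_available:
        # Release this Pod and every waiting Pod of the same group.
        return "allow"
    return "wait"

counts = {}
print(permit("job-a", counts, 4))  # wait  (1st Pod holds at Permit)
print(permit("job-a", counts, 4))  # wait
print(permit("job-a", counts, 4))  # wait
print(permit("job-a", counts, 4))  # allow (4th Pod unblocks the group)
```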

QueueSort

Since the default scheduler's queue is unaware of PodGroups, Pods are dequeued in an order that is arbitrary with respect to their PodGroup. As shown in the figure below, a and b denote two different PodGroups; because their Pods were created at staggered times, they end up interleaved in the queue.

When a newly created Pod enters the queue, it cannot be placed next to the Pods of its own PodGroup; the interleaving only continues.

This disorder means that if PodGroupA is waiting in the Permit phase, the Pods of PodGroupB will also end up waiting after being scheduled. The two groups hold resources that each other needs, so neither can be scheduled normally; the deadlock has merely moved from the nodes to the Permit phase, and the problem described above remains unsolved.

To address this, we implemented a QueueSort plugin to ensure that Pods belonging to the same PodGroup are placed together in the queue. The Less method we define for QueueSort determines the order of Pods after they enter the queue:

First, we inherit the default priority-based comparison: high-priority Pods are placed before low-priority Pods.

Then, if two Pods have the same priority, new ordering logic supports sorting by PodGroup:

If both Pods are regular Pods (Pods not in any PodGroup), the one created earlier is placed first in the queue.

If one Pod is a regular Pod and the other is a pgPod (a Pod belonging to a PodGroup), we compare the creation time of the regular Pod with the creation time of the PodGroup the pgPod belongs to; the earlier one is placed first.

If both Pods are pgPods, we compare the creation times of their two PodGroups; the earlier one is placed first. Since two PodGroups may share the same creation time, we also introduce an auto-incrementing ID and place the PodGroup with the smaller ID first (the purpose is simply to distinguish different PodGroups).
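The three rules above can be sketched as a single comparison function. This is an illustrative Python rendering under our own type names; the real Less method lives in the Go plugin:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueuedPod:
    priority: int
    creation_time: float                       # Pod creation timestamp
    pg_creation_time: Optional[float] = None   # PodGroup creation time, if any
    pg_id: int = 0                             # auto-incrementing PodGroup ID

def less(a: QueuedPod, b: QueuedPod) -> bool:
    """True if a should be dequeued before b (sketch of the QueueSort Less)."""
    # 1. Keep the default behaviour: higher priority first.
    if a.priority != b.priority:
        return a.priority > b.priority
    # 2. Compare by PodGroup creation time where available; a regular Pod
    #    falls back to its own creation time.
    t_a = a.pg_creation_time if a.pg_creation_time is not None else a.creation_time
    t_b = b.pg_creation_time if b.pg_creation_time is not None else b.creation_time
    if t_a != t_b:
        return t_a < t_b
    # 3. Tie-break PodGroups created at the same instant by their ID,
    #    so Pods of the same group always end up adjacent.
    return a.pg_id < b.pg_id
```

Because every Pod of a group compares using the group's creation time, all of its Pods sort to the same position and therefore sit together in the queue.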

With this ordering strategy, Pods belonging to the same PodGroup are queued together.

When a newly created Pod enters the queue, it is placed next to the other Pods of its PodGroup.

PreFilter

To reduce invalid scheduling work and improve scheduling performance, we add a check in the PreFilter phase. When a Pod is scheduled, we compute the total number of Pods (including Running ones) in its PodGroup. If that total is less than min-available, the min-available requirement can never be met, so the Pod is rejected directly in PreFilter and never enters the main scheduling flow.
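The check amounts to a single comparison, sketched below under a hypothetical function name:

```python
def pre_filter_passes(pg_total_pods: int, min_available: int) -> bool:
    """Sketch of the PreFilter check: reject the Pod early if its PodGroup
    can never reach min-available. pg_total_pods counts every Pod of the
    group, including those already Running."""
    return pg_total_pods >= min_available

# min-available = 4, but only 3 Pods of the group exist at all:
print(pre_filter_passes(3, 4))  # False: scheduling this Pod would be wasted work
print(pre_filter_passes(4, 4))  # True: the group can potentially start
```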

UnReserve

If a Pod times out while waiting in the Permit phase, it enters the UnReserve phase, and we directly reject all Pods belonging to the same PodGroup as that Pod, so the remaining Pods do not wait pointlessly for a long time.
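The group-wide rejection can be sketched as filtering the set of waiting Pods. Again this is an illustration with hypothetical names, not the plugin's actual Go code:

```python
def reject_same_group(waiting_pods: list, timed_out_group: str) -> list:
    """When one Pod of a group times out in Permit, reject every waiting Pod
    of the same group so they stop holding resources; the rejected Pods
    return to the scheduling queue for a later retry."""
    return [p for p in waiting_pods if p["group"] != timed_out_group]

waiting = [
    {"name": "a1", "group": "job-a"},
    {"name": "b1", "group": "job-b"},
    {"name": "a2", "group": "job-a"},
]
print(reject_same_group(waiting, "job-a"))  # only job-b's Pod keeps waiting
```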

Coscheduling trial

Installation and deployment

Users can try Coscheduling either in their own Kubernetes cluster or with any standard Kubernetes service offered by a public cloud. Note that the cluster version must be 1.16 or later and you must have permission to update the cluster master.
This article uses a Kubernetes cluster provided by Alibaba Cloud Container Service (ACK) for testing.

Prerequisites

Kubernetes 1.16 or above

A standard dedicated cluster (Dedicated cluster) created on ACK

Cluster nodes that can access the public network

helm v3: ACK installs helm on the master node by default. Confirm that it is helm v3; if it is helm v2, please upgrade to helm v3 (refer to the helm v3 installation documentation).

Deploy Coscheduling

We have built the Coscheduling plugin and the native scheduler code into a new container image, and we provide a helm chart, ack-coscheduling, for automated installation. The chart starts a job that replaces the cluster's default native scheduler with the Coscheduling-enabled scheduler and modifies the scheduler's Config files so the Scheduling Framework loads the Coscheduling plugin correctly. After the trial, users can restore the cluster's default scheduler and configuration through the chart's uninstall function.

Follow-up

Coscheduling is implemented using the Kubernetes Scheduling Framework. It solves the combined-scheduling problem for AI and data-computing batch tasks and reduces resource waste, thereby improving the overall resource utilization of the cluster.
