Cloud Native Knowledge: Scheduling Algorithms (Scheduler)

Technical Field: Cloud Native

| Noun Definition |

The Scheduler is one of the core components of Kubernetes: according to a scheduling policy, it places each Pod awaiting scheduling onto the node best suited to run it. Three objects are involved: the Pods to be scheduled, the scheduling policy, and the queue of candidate nodes. (Figure: the Scheduler and its associated components.)
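In outline, every scheduling cycle runs the same two phases described later in this article: filter, then score. Below is a minimal, hypothetical sketch of that loop; the function names and the representation of pods and nodes as opaque values are illustrative assumptions, not the real Kubernetes implementation.

```python
# Minimal sketch of a two-phase scheduler: filter nodes with predicates,
# score the survivors with weighted priorities, pick the best node.
# All names here are illustrative, not the real Kubernetes API.

def schedule(pod, nodes, predicates, priorities):
    # Phase 1: pre-selection -- keep only nodes that pass every predicate.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None  # pod stays pending until some node qualifies

    # Phase 2: optimization -- weighted sum of priority scores, best node wins.
    def total_score(node):
        return sum(weight * prio(pod, node) for prio, weight in priorities)

    return max(feasible, key=total_score)
```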

| Development History | A Brief History of Docker and Kubernetes

2013: the Docker project is open-sourced

In 2013, open-source PaaS projects, represented by Cloud Foundry, were a breath of fresh air in a cloud-computing field then dominated by AWS and OpenStack. PaaS offered application hosting: virtual machines and cloud computing were already commonplace, and the mainstream usage was to rent a batch of AWS or OpenStack virtual machines and then deploy a PaaS project such as Cloud Foundry onto them with scripts or by hand. The core component of such a PaaS is its application packaging and distribution mechanism: the system invokes the operating system's Cgroups and Namespace facilities to create an isolated sandbox for each application, then runs the application's processes inside that sandbox. This sandbox is the so-called container.

Docker, at that time still called dotCloud, was also a member of the PaaS boom, but next to big players such as Heroku, Pivotal, and Red Hat, dotCloud looked insignificant: its main product was out of step with the mainstream Cloud Foundry community. On the verge of going under, dotCloud decided to open-source its own container project, Docker. The container was in fact nothing new and was not invented by Docker; in Cloud Foundry, the hottest PaaS project of the day, the container was merely the bottom-most and least-regarded layer. Yet in just a few months Docker rose so rapidly that Cloud Foundry and the other PaaS communities were sidelined before they could even become rivals. Most of the Docker project's container functionality and implementation principles were the same as Cloud Foundry's; what differed was the Docker image. By packaging the entire operating system an application needs to run, images solved the problem of environment packaging and made local and cloud environments consistent, sparing users the painful trial-and-error of reconciling differences between environments. This is the essence of Docker. The definition of PaaS thus shifted into a containerization vision with Docker containers as the technical core and Docker images as the packaging standard. At the end of 2013, dotCloud officially renamed itself Docker.

2014: Docker releases Docker Swarm

With Docker Swarm, the Docker project gained cluster management as a built-in capability. Its biggest highlight was that it used the Docker project's original container-management API to manage the whole cluster: `docker run "container"` merely became `docker run -H "swarm cluster API address" "container"`. Users kept creating containers with the docker commands they already knew; Swarm intercepted the request and, via its scheduling algorithm, selected a suitable Docker Daemon to run the container. This was very friendly to developers already familiar with the Docker CLI.

Docker acquires Fig and renames it Compose

Docker acquired the Fig project and renamed it Compose. Fig was the first project to bring "container orchestration" to cloud computing. "Orchestration" originally referred to how users employ tools or configuration to define, configure, create, and delete a group of virtual machines and their associated resources, with the cloud platform then executing the process according to that specified logic. In the container era, orchestration means the definition, configuration, and creation of a set of Docker containers.

Competition between Docker and Mesosphere

Outside the Docker ecosystem, Mesos and the company behind it, Mesosphere, were also very popular. Mesos was the most popular resource-management project in big data and a powerful competitor to YARN. Big data centers on compute-intensive offline jobs; unlike web services, it has little need for container-style hosting and scaling and no strong need for application packaging, so projects such as Hadoop and Spark never invested much in container technology. Mesos, however, thanks to its inherent two-level scheduling mechanism, could easily step beyond big data and independently support a wider range of PaaS workloads. Mesosphere therefore released the Marathon project, which became a strong competitor to Docker Swarm. Although it could not expose the Docker API the way Swarm did, the Mesos community held a very strong card: experience managing ultra-large clusters. The Mesos + Marathon combination evolved into a PaaS with excellent scheduling that also supported big data workloads.

Docker and CoreOS. CoreOS was an infrastructure startup whose core product was a customized operating system. With it, users could manage all nodes running the system as a distributed cluster and deploy applications across the cluster as easily as on a single machine. After the Docker project was released, CoreOS quickly realized it could fold the container concept into its own solution and thereby offer users a higher-level PaaS capability, so CoreOS was long a contributor to the Docker project. The cooperation ended in 2014, however, and CoreOS launched its own container runtime, Rocket (rkt), which was thoroughly outcompeted by Docker.

The OCI standard. CoreOS, Google, Red Hat, and other companies jointly announced a container standardization effort. Docker donated Libcontainer, its container runtime library, renamed RunC, to be managed by a completely neutral foundation, and on the basis of RunC a set of standard specifications for containers and images, OCI (the Open Container Initiative), was jointly drafted. These specifications were meant to decouple container runtime and image implementations from the Docker project, both to curb Docker's dominance and to make it possible to build platform-layer capabilities without depending on the Docker project. In practice, though, Docker's dominance of the container field went unchanged.

Kubernetes

In June 2014, Google, the leader in infrastructure, officially announced the birth of Kubernetes, the open-source descendant of Borg. Much as Docker's emergence had, it changed the container market once again. Microsoft, Red Hat, IBM, and Docker joined the Kubernetes community.

2015: the CNCF Foundation is established

To gain a decisive advantage in container orchestration and compete with Swarm and Mesos, Google, Red Hat, and other open-source infrastructure companies jointly launched the CNCF Foundation. The goal: with Kubernetes as the cornerstone, build a community led by vendors from the open-source infrastructure field and operated as an independent foundation, to counterbalance the container business ecosystem centered on Docker. Put simply, it was to dig a "moat" around the Kubernetes project. Docker excelled at seamless integration with its own ecosystem; Mesos excelled at scheduling and managing large clusters; Kubernetes chose as its entry point features and patterns such as Pods and sidecars, drawn mostly from the internals of Google's Borg and Omega systems. The Kubernetes team was small and its engineering capacity limited, so when Red Hat allied with Google, the "Three Kingdoms" era of container orchestration formally began.

Kubernetes distilled Google's years of practical experience with containerized infrastructure, and its metrics on GitHub soared, leaving the Swarm project far behind. In the same year the community released the Helm package manager, the kubeadm installation tool, Minikube, and other updates, while CNCF quickly admitted Prometheus, Fluentd, OpenTracing, CNI, and a series of other well-known projects in the container ecosystem. Companies and startup teams in large numbers turned their attention to CNCF rather than to Docker.

2016: Docker abandons Swarm. Facing CNCF's competitive advantage, Docker announced it would give up the standalone Swarm project and fold container orchestration and cluster management into the Docker project itself, but the technical complexity and maintenance burden this brought placed the Docker project at a distinct disadvantage. Kubernetes supported OpenAPI, giving developers far greater freedom to customize. Unlike Docker, Kubernetes pursued a "democratized" architecture: every layer, from the API down to the container runtime, exposed an extensible plugin mechanism and encouraged users to intervene at every stage through code. This reform proved very effective, and a large wave of second-generation products built on the Kubernetes API and its extension interfaces soon emerged across the container community:

  1. Istio, the highly popular microservice-governance tool;
  2. Operator, the application deployment framework;
  3. Rook, an open-source storage project that packaged the heavyweight Ceph into an easy-to-use container storage plugin.

With the rise and growth of the Kubernetes community, Docker was defeated.

2017: Docker donates containerd to CNCF. Docker donated containerd, the container runtime at the core of Docker, to the CNCF community, marking the Docker project's repositioning as a PaaS platform. Docker announced that the Docker project would be renamed Moby and handed to the community to maintain, while Docker Inc.'s commercial products kept the registered Docker trademark. In October of the same year, Docker announced it would ship Kubernetes inside its main product, Docker Enterprise Edition, and the two-year battle over container orchestration finally came to an end. In 2018, Red Hat announced the acquisition of CoreOS for $250 million, Docker CTO Solomon Hykes resigned, and the container technology landscape has been settled ever since.

| Technical Features | Common Scheduling Algorithms

The essence of scheduling is resource allocation. Different systems with different goals adopt different scheduling algorithms; whatever suits the system is the best choice.

1. First-come, first-served (FCFS)

The simplest scheduling algorithm: it dispatches strictly in order of arrival and can be used for both job scheduling and process scheduling. CPU time is allocated in the order in which jobs are submitted or processes become ready; a new job obtains the CPU only when the current job or process finishes or blocks. An awakened job or process does not resume execution immediately; it waits for the current one to give up the CPU (FCFS is therefore non-preemptive by default). The algorithm is unfavorable to short jobs.
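As a concrete illustration, here is a minimal FCFS simulation; the representation of a job as a hypothetical (name, arrival, service) triple is an assumption for illustration. It prints per-job waiting, turnaround, and weighted turnaround times:

```python
# First-come, first-served: run jobs strictly in arrival order, non-preemptively.
def fcfs(jobs):
    """jobs: list of (name, arrival, service) tuples, sorted by arrival."""
    clock = 0
    for name, arrival, service in jobs:
        start = max(clock, arrival)        # CPU may sit idle until the job arrives
        finish = start + service
        turnaround = finish - arrival
        print(f"{name}: wait={start - arrival}, turnaround={turnaround}, "
              f"weighted={turnaround / service:.2f}")
        clock = finish

fcfs([("A", 0, 4), ("B", 1, 3), ("C", 2, 1)])  # the short job C waits behind A and B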

2. Shortest job first (SJF/SPF): priority is computed from job length; the shorter the job (process), the higher its priority.

Compared with FCFS, the SJF/SPF algorithm markedly improves average turnaround time and average weighted turnaround time: it effectively reduces the average waiting time of jobs and raises system throughput (a minimal simulation follows the list below). Disadvantages of SJF/SPF:

  1. It favors short jobs, but by the same token is unfavorable to long jobs.
  2. Because job length estimates contain subjective factors, genuinely short jobs are not guaranteed to be prioritized.

  3. It takes no account of urgency, so timely handling of urgent jobs (processes) cannot be guaranteed.
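A minimal non-preemptive SJF simulation, under the same hypothetical (name, arrival, service) job triples as above:

```python
# Shortest-job-first: among the jobs that have already arrived, always pick
# the one with the smallest service time (non-preemptive variant).
def sjf(jobs):
    """jobs: list of (name, arrival, service) tuples."""
    pending = sorted(jobs, key=lambda j: j[1])
    clock = 0
    while pending:
        # Jobs already arrived; if none has, jump to the earliest arrival.
        ready = [j for j in pending if j[1] <= clock] or [pending[0]]
        job = min(ready, key=lambda j: j[2])   # pick the shortest service time
        pending.remove(job)
        name, arrival, service = job
        start = max(clock, arrival)
        clock = start + service
        print(f"{name}: wait={start - arrival}, turnaround={clock - arrival}")

sjf([("A", 0, 7), ("B", 1, 2), ("C", 2, 4)])  # after A, the shorter B runs before C
```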

3. Highest priority first (HPF): introduced so that urgent jobs are taken care of and processed first. It is a common job-scheduling algorithm in batch systems and a common process-scheduling algorithm in many operating systems.

There are two variants: the non-preemptive priority algorithm and the preemptive priority algorithm; the key difference is what happens when a new, higher-priority job arrives. Priorities themselves come in two kinds. A static priority is fixed when the process is created and remains unchanged for the whole run; it is usually expressed as an integer within some range, also called the priority number. A dynamic priority is assigned at creation but may change as the process progresses or as its waiting time grows. On what basis is a process's priority determined? 1) Process type: system processes generally rank above user processes. 2) The process's resource demands: processes with smaller estimated run time and memory requirements receive higher priority. 3) User requirements: priority follows the urgency of the user's process and the fees the user has paid.
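A minimal sketch of the non-preemptive, static-priority variant, using a heap as the ready queue; by the convention chosen here (an assumption, not mandated by the algorithm), a smaller number means higher priority, and all processes are assumed ready at t = 0:

```python
import heapq

# Non-preemptive static-priority scheduling: always dispatch the ready
# process with the highest priority (lowest number, by convention here).
def priority_schedule(processes):
    """processes: list of (priority, name, service); all ready at t=0."""
    heap = list(processes)
    heapq.heapify(heap)                    # ready queue ordered by priority
    clock = 0
    while heap:
        prio, name, service = heapq.heappop(heap)
        clock += service                   # run the dispatched process to completion
        print(f"{name} (priority {prio}) finishes at t={clock}")

priority_schedule([(3, "user-job", 5), (1, "system-job", 2), (2, "urgent", 1)])
```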

4. Highest response ratio next (HRRN, Highest Response Ratio Next)

The shortest-job-first algorithm is a fairly good one (equivalent to a static-priority scheme keyed to job length) and suits batch systems with many short jobs; its main drawback is that long jobs are not guaranteed to run. HRRN therefore gives each job a dynamic priority that grows with its waiting time:

priority = (waiting time + required service time) / required service time = response time / required service time

1. Among jobs with the same waiting time t, the shorter the required service time, the higher the priority, which favors short jobs; for a long job, priority rises as it waits, and once it has waited long enough it obtains the processor, which takes care of long jobs. 2. Among jobs with the same required service time, priority is decided by waiting time, i.e., first come, first served.
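A minimal HRRN sketch, recomputing the response ratio for every waiting job each time the CPU frees up (same hypothetical job triples as in the earlier sketches):

```python
# Highest-response-ratio-next: each time the CPU is free, pick the waiting
# job with the largest (waiting + service) / service ratio.
def hrrn(jobs):
    """jobs: list of (name, arrival, service) tuples."""
    pending = list(jobs)
    clock = 0
    while pending:
        ready = [j for j in pending if j[1] <= clock]
        if not ready:                      # CPU idle: jump to the next arrival
            clock = min(j[1] for j in pending)
            continue
        # Response ratio = (waiting + service) / service, recomputed each pick.
        def ratio(j):
            return (clock - j[1] + j[2]) / j[2]
        job = max(ready, key=ratio)
        r = ratio(job)                     # ratio at dispatch time
        pending.remove(job)
        clock += job[2]
        print(f"{job[0]}: ratio {r:.2f} at dispatch, finishes at t={clock}")

hrrn([("A", 0, 8), ("B", 1, 4), ("C", 2, 2)])  # long job A still runs first; C overtakes B
```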

5. Time-slice round-robin (RR): time-sharing systems added a new requirement, responding to user requests promptly, and adopted the time-slice-based round-robin process scheduling algorithm.

Early time-sharing systems used simple time-slice round-robin; since the 1990s, the multilevel feedback queue algorithm has been widely used. The two methods are described in turn below and their performance compared.

(1) Time-slice round-robin. 1. All ready processes in the system are arranged in one queue by the FCFS principle. 2. At each scheduling decision, the CPU is assigned to the process at the head of the queue for one time slice; slice lengths range from a few ms to several hundred ms. 3. A clock interrupt is raised when the time slice expires. 4. The scheduler then suspends the current process, sends it to the tail of the ready queue, and runs the new head process via a context switch; a switch also occurs when the running process blocks before its slice is used up.
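A minimal round-robin sketch, assuming all processes are ready at t = 0 and a fixed quantum:

```python
from collections import deque

# Round-robin: give the head of the ready queue one time slice, then move it
# to the tail; repeat until every process has used up its service time.
def round_robin(processes, quantum):
    """processes: list of (name, service) pairs, all ready at t=0."""
    queue = deque(processes)               # FCFS ready queue
    clock = 0
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        clock += run                       # clock interrupt ends the slice
        if remaining - run > 0:
            queue.append((name, remaining - run))  # back to the tail
        else:
            print(f"{name} finishes at t={clock}")

round_robin([("A", 5), ("B", 3), ("C", 1)], quantum=2)
```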

(2) Multilevel feedback queue (FB). Features: multiple ready queues, cyclic feedback, dynamic priority, and round-robin within each queue. Several ready queues are set up with priorities decreasing from the first queue downward, and each queue's time slice differs: the higher the priority, the shorter the slice.

(3) The scheduling process when a new process enters memory: 1. It is placed at the tail of the first queue and waits to be scheduled by the FCFS principle. 2. If it completes within that queue's time slice, it exits the system. 3. If not, the scheduler moves it to the tail of the second queue to await scheduling there. 4. Only when the first queue is empty does the system schedule the second queue by FCFS; a process that still does not finish within the second queue's longer slice is moved down to the third queue, and so on. 5. After dropping to the n-th queue, a process runs there under time-slice round-robin.

Performance of the multilevel feedback queue algorithm: it performs well and satisfies the needs of diverse users. Terminal users mostly submit small interactive jobs, which are satisfied as long as they complete within the first queue's time slice. Short batch jobs still see short turnaround, finishing in the second or third queue at most. Long batch jobs rotate through queues 1 to n in turn and need not worry about going unprocessed indefinitely.
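A minimal multilevel feedback queue sketch with three levels; the doubling slice sizes and the assumption that all jobs are ready at t = 0 are illustrative choices:

```python
from collections import deque

# Multilevel feedback queue: higher queues have higher priority but shorter
# slices; a process that exhausts its slice drops to the next queue, and the
# bottom queue runs plain round-robin.
def mlfq(processes, slices=(1, 2, 4)):
    """processes: list of (name, service); slices[i] is queue i's quantum."""
    queues = [deque() for _ in slices]
    for p in processes:
        queues[0].append(p)                # new arrivals enter queue 0
    clock = 0
    while any(queues):
        level = next(i for i, q in enumerate(queues) if q)  # highest non-empty queue
        name, remaining = queues[level].popleft()
        run = min(slices[level], remaining)
        clock += run
        if remaining - run > 0:
            drop = min(level + 1, len(queues) - 1)          # demote, or stay at bottom
            queues[drop].append((name, remaining - run))
        else:
            print(f"{name} finishes at t={clock} in queue {level}")

mlfq([("short", 1), ("medium", 3), ("long", 8)])  # short job finishes in queue 0
```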

| Related Words | Scheduling Policies

Kubernetes scheduling policies are divided into Predicates (pre-selection policies) and Priorities (optimization policies), and the whole scheduling process takes two steps:

1. Pre-selection. Predicates are mandatory rules. The scheduler traverses all Nodes and filters out the list of nodes that satisfy every pre-selection policy; if no Node satisfies the Predicates, the Pod stays suspended until some Node can meet them.
2. Optimization. On the nodes that survive the first step, the Priorities score and rank the candidates to select the best one.

1.1 Pre-selection policies. Kubernetes has supported more Predicates as versions evolve: v1.0 had only 4, while v1.7 supports 15. The Predicates available in v1.7 are as follows:

MatchNodeSelector: checks whether the Node's labels satisfy the Pod's NodeSelector.
PodFitsResources: checks whether the host's resources satisfy the Pod's requirements, scheduling against the resources already allocated (Limit) rather than against the amount actually in use.
PodFitsHostPorts: checks whether any HostPort required by the Pod's containers is already occupied by another container; if a required HostPort cannot be satisfied, the Pod cannot be scheduled to this host.
HostName: checks whether the host name equals the NodeName specified by the Pod.
NoDiskConflict: checks for volume conflicts on this host; if a volume is already mounted there, other Pods that use this volume cannot be scheduled to this host (the specific rules vary by storage backend).
NoVolumeZoneConflict: given the zone restrictions, checks whether Pods deployed on this host have volume conflicts.
PodToleratesNodeTaints: ensures the tolerations defined by the Pod can accept the taints defined on the node.
CheckNodeMemoryPressure: checks whether the Pod may be scheduled onto a node that has reported memory pressure.
CheckNodeDiskPressure: checks whether the Pod may be scheduled onto a node that has reported disk pressure.
MaxEBSVolumeCount: ensures the number of attached EBS volumes does not exceed the maximum, 39 by default.
MaxGCEPDVolumeCount: ensures the number of attached GCE PD volumes does not exceed the maximum, 16 by default.
MaxAzureDiskVolumeCount: ensures the number of attached Azure disks does not exceed the maximum, 16 by default.
MatchInterPodAffinity: checks whether the Pod satisfies the affinity rules with respect to other Pods.
GeneralPredicates: checks whether the Pod matches the kubernetes components on the host.
NoVolumeNodeConflict: checks whether Pods deployed on this host have volume conflicts.

Predicates that are registered but not loaded by default: PodFitsHostPorts, PodFitsResources, HostName, MatchNodeSelector. PS: there was also a PodFitsPorts policy (scheduled for removal), replaced by PodFitsHostPorts.
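To make the pre-selection step concrete, here is a minimal sketch of two illustrative predicates (a resource-fit check and a node-selector match) and the filtering pass that applies them; the dict field names are assumptions for illustration, not the real Kubernetes objects:

```python
# Two illustrative predicates and the filtering pass that applies them.
# Pods and nodes are plain dicts here, not real Kubernetes objects.

def pod_fits_resources(pod, node):
    # Compare against allocated (requested) resources, not actual usage.
    return (node["cpu_capacity"] - node["cpu_requested"] >= pod["cpu_request"]
            and node["mem_capacity"] - node["mem_requested"] >= pod["mem_request"])

def match_node_selector(pod, node):
    # Every label in the pod's nodeSelector must appear on the node.
    return all(node["labels"].get(k) == v
               for k, v in pod.get("node_selector", {}).items())

def predicates_pass(pod, nodes):
    checks = [pod_fits_resources, match_node_selector]
    return [n for n in nodes if all(check(pod, n) for check in checks)]
```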
1.2 Optimization policies. Likewise, the Priorities have diversified with version evolution: v1.0 supported only 3 policies, while v1.7 supports 10. Each policy carries a weight, and a node's total score is computed from those weights. The Priorities available in v1.7 are as follows:

EqualPriority: all nodes receive the same priority; no practical effect.
ImageLocalityPriority: scores a host by whether it already holds the Pod's runtime images; if a required image is absent the host scores 0, and if it is present, the larger the image, the higher the score.
LeastRequestedPriority: computes the percentage of the node's resources left free given the Pod's CPU and memory requests; the node with the smallest requested share is best. Score = (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2.
BalancedResourceAllocation: the node with the most balanced usage of CPU and memory is best. Score = 10 - abs(totalCpu / cpuNodeCapacity - totalMemory / memoryNodeCapacity) * 10.
SelectorSpreadPriority: for Pods belonging to the same Service or ReplicaSet, counts how many already run on each Node; the smaller the count, the higher the score.
NodePreferAvoidPodsPriority: honors the node annotation alpha.kubernetes.io/preferAvoidPods; its weight is set to 10000 so that it overrides the other policies.
NodeAffinityPriority: the node-affinity selection policy, supporting two selectors: requiredDuringSchedulingIgnoredDuringExecution (the selected host must satisfy all of the Pod's rules for hosts) and preferredDuringSchedulingIgnoredDuringExecution (the scheduler tries to satisfy, but does not guarantee, all NodeSelector requirements).
TaintTolerationPriority: similar to the PodToleratesNodeTaints predicate; nodes whose taints the Pod tolerates are scheduled preferentially.
InterPodAffinityPriority: the pod-affinity selection policy. Analogous to NodeAffinityPriority, it supports the same two selectors and two sub-policies, podAffinity and podAntiAffinity.
MostRequestedPriority: suited to dynamically scaled clusters; it schedules Pods preferentially onto the most-utilized nodes so that idle machines can be released and shut down.

Priorities that are registered but not loaded by default: EqualPriority, ImageLocalityPriority, MostRequestedPriority. PS: there was also a ServiceSpreadingPriority policy (scheduled for removal), replaced by SelectorSpreadPriority.
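And a minimal sketch of the two scoring formulas quoted above, LeastRequestedPriority and BalancedResourceAllocation, using the same plain-dict node shape as the predicate sketch (field names again assumed for illustration):

```python
# The two scoring formulas quoted above, on plain-dict nodes.

def least_requested(node, pod):
    # cpu((capacity - sum(requested)) * 10 / capacity) averaged with memory.
    cpu_free = node["cpu_capacity"] - node["cpu_requested"] - pod["cpu_request"]
    mem_free = node["mem_capacity"] - node["mem_requested"] - pod["mem_request"]
    cpu_score = cpu_free * 10 / node["cpu_capacity"]
    mem_score = mem_free * 10 / node["mem_capacity"]
    return (cpu_score + mem_score) / 2

def balanced_resource_allocation(node, pod):
    # 10 - abs(cpuFraction - memFraction) * 10: reward balanced usage.
    cpu_frac = (node["cpu_requested"] + pod["cpu_request"]) / node["cpu_capacity"]
    mem_frac = (node["mem_requested"] + pod["mem_request"]) / node["mem_capacity"]
    return 10 - abs(cpu_frac - mem_frac) * 10
```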

Data sources:

  1. Definition: CSDN, https://blog.csdn.net/qmw19910301/article/details/87304406
  2. Development history: Cnblogs, https://www.cnblogs.com/chenqionghe/p/11454248.html
  3. Technical features: Jianshu, https://www.jianshu.com/p/6a3612154183
  4. Related words: Tencent Cloud, https://cloud.tencent.com/developer/article/1450308