By Huang Tao and Wang Menghai
As a form of cloud computing, cloud-native is becoming a new technical standard in the cloud era. By reshaping the entire software lifecycle, cloud-native has become the easiest way to realize the value of the cloud.
Adopting cloud-native infrastructure as the unified internal architecture of an enterprise is an inevitable trend. However, integrating various basic platforms brings unavoidable compatibility problems. The larger the enterprise and the longer its development history, the more obvious this "technical debt" becomes.
The experience shared in this article comes from Alibaba's production practice in hybrid scheduling over the past few years. Starting with basic concepts, this article focuses on the scheduler. It describes how the Alibaba Serverless Infrastructure (ASI) scheduler manages complex and busy resource scheduling tasks in large-scale container scheduling scenarios designed for cloud-native applications. Specific cases are included to aid understanding and to serve as a reference for readers facing similar problems, so that readers can systematically understand hybrid resource scheduling in Alibaba's complex task scenarios.
Leading the overall implementation of container cloud migration within the Alibaba Group, ASI takes charge of the evolution of the internal lightweight container architecture and the cloud-native O&M system. ASI further accelerates the implementation of emerging technologies, such as Mesh, Serverless, and FaaS in Alibaba. It also supports almost all Alibaba businesses, including Taobao, Tmall, Youku, AMAP, Ele.me, UC, and Koala, and various scenarios and ecosystems of Alibaba Cloud products.
ASI provides complete support for cloud-native technology stacks with a core based on Kubernetes. ASI has also successfully integrated with Alibaba Cloud Container Service for Kubernetes (ACK), which retains various cloud-based capabilities and successfully copes with the complex business environment in Alibaba.
ASI Scheduler is one of the core components of ASI and has always played an important role in its cloud-native development. Alibaba's large online e-commerce transaction containers, such as shopping carts, orders, and Taobao details, are allocated and scheduled by the scheduler. The scheduler also decides how each container is distributed, including container orchestration and the computing and memory resources on a single machine. Even a few container orchestration errors may pose a fatal threat to the business, especially during the peak of the Double 11 Global Shopping Festival. ASI Scheduler is therefore responsible for the computing quality of each container during peak hours.
ASI Scheduler originated from online e-commerce transaction container scheduling in 2015. The earliest scheduler at that time only covered online transaction T4 (Alibaba's early customized container technology based on LXC and Linux Kernel) and Alidocker scenarios. It handled the peak traffic during the Double 11 Global Shopping Festival in 2015.
The evolution of ASI Scheduler follows the development of cloud-native: from the earliest online transaction container scheduler, through Sigma Scheduler and Cerebulum Scheduler, to ASI Scheduler. Today, the next-generation Unified-Scheduler is under construction. It will absorb and integrate the experience that Alibaba Open Data Processing Service (ODPS), Hippo, and online scheduling have accumulated in their respective fields over the past few years. The following figure shows the evolution process of ASI Scheduler:
During the evolution of ASI Scheduler, there are many challenges to be solved, mainly in the following aspects:
This article also introduces how ASI Scheduler manages Alibaba's complex and busy computing resource scheduling tasks.
The scheduler plays a central role among the many ASI components. It is one of the core components in the scheduling system of the cloud-native container platform, the basis for resource delivery, and the brain of ASI cloud-native. Its value is reflected below:
Generally, Scheduler aims to offer:
The location of the Scheduler in the ASI cloud-native system is shown in the following figure (the box marked in red):
Most of the time, "scheduler" refers to the central scheduler, such as the community's Kubernetes kube-scheduler. However, real scheduling scenarios are complex, and each scheduling decision requires flexible collaboration. When a task is submitted, it is coordinated by the central scheduler, single-machine scheduling, and kernel scheduling, and it is executed with the cooperation of Kubernetes components, such as kubelet and controllers. For online scheduling scenarios, there is also a batch orchestration scheduler. Finally, repeated rounds of rescheduling keep the cluster close to its optimal state.
The ASI generalized scheduler refers to the combination of central scheduler, single-machine scheduling, kernel scheduling, rescheduling, large-scale orchestration scheduling, and multi-layer scheduling.
The central scheduler calculates the resource orchestration for each task (or a batch of tasks) so that each scheduling decision is optimal. It determines where a specific task runs, such as the cluster, region, and execution node (host machine), and refines the allocation of CPU, storage, and network resources on the node.
In cooperation with components of the Kubernetes ecosystem, central scheduler manages the lifecycle of most tasks.
In the development of ASI cloud-native, central scheduler refers to Sigma Scheduler, Cerebulum Scheduler, and ASI Scheduler described above.
It is responsible for:
(1) The coordination of optimal multi-POD operation in a single machine. After receiving the node selection command from the central scheduler, ASI schedules the task to a specific node for execution. Then, single-machine scheduling starts to work like this:
(2) The information collection, reporting, and aggregation of single-machine resources, which helps central scheduling make decisions. In ASI, single-machine scheduling components mainly refer to some enhanced capabilities of SLO-Agent and Kubelet. For Unified-Scheduler, the components refer to the SLO-Agent, Task-Agent, and Kubelet.
Single-machine scheduling guarantees optimal multi-POD operation from the resource perspective, but the running state of a task is controlled by the kernel. Thus, kernel scheduling is needed.
The central scheduler ensures the optimal scheduling of each task, but it schedules each task only once. It therefore cannot achieve a globally optimal cluster. Thus, rescheduling is needed.
Large-scale orchestration scheduling is a scenario unique to Alibaba's large-scale online scheduling. Since its development began in 2017, it has become very mature and is still being enhanced.
Large-scale orchestration allows tens of thousands (or even hundreds of thousands) of containers to be scheduled at one time. It also ensures the globally optimal orchestration of all containers in the cluster in a single pass. It makes up for the disadvantages of one-time central scheduling and avoids the complexity of repeated rescheduling in the large-scale site building scenario.
The following sections will introduce kernel scheduling, rescheduling, and large-scale orchestration scheduling in detail.
Multi-layer scheduling includes layer-1, layer-2, and layer-3 scheduling. Sigma Scheduler introduces the concept of layer-0 scheduling in offline hybrid deployment scenarios. Different scheduling systems may have different understandings and definitions of multi-layer scheduling and develop their own concepts. For example, in the earlier Sigma system, scheduling was divided into Layer-0, Layer-1, and Layer-2 scheduling:
The fatal drawback of the Sigma multi-layer scheduling system is that the technical capabilities and investment behind the different Layer-2 schedulers vary a lot. For example, the Layer-2 scheduling system for advertisements is excellent, but not every Layer-2 scheduler serves its business equally well. Learning from this, ASI has moved many capabilities into ASI itself and further standardized the upper-layer PaaS. This both simplifies the upper layer and enhances its capabilities.
The next-generation scheduler being developed today is also divided into several layers. They include the computing load layer (mainly Workload scheduling management), the computing scheduling layer (such as DAG scheduling and MR scheduling), and the business layer (the same as the Layer-2 scheduling of Sigma).
Take the ongoing Unified-Scheduler as an example. Three resource types are available for scheduling: Product resources, Batch resources, and Best Effort (BE) computing resources.
Different schedulers define the leveled resource types differently, but the types are essentially the same. The ASI Scheduler will be explained in detail in subsequent sections.
Product (online) resources have Quota budgets, and the scheduler needs to ensure the highest level of resource availability. The long-lifecycle POD instance for the online e-commerce core transactions is a typical Product resource. The most classic cases are the transaction core business PODs, such as shopping cart (Cart2) and order (tradeplatform2), on the core procedure during Double 11. These resources have high requirements for computing power, priority, and real-time performance, and they need low response latency. Moreover, they must not be interfered with.
For example, PODs with long lifecycles for online transactions can exist for days, months, or even years. Most applications submitted by application developers need, once built, several long-lifecycle instances, which are Product resources. Most POD or container instances applied for by developers from Taobao, Tmall, Juhuasuan, AMAP, Umeng, Heyi, Cainiao, Internationalization, and Xianyu are Product resources.
Product resources refer to PODs with long online lifecycles and resources that meet the requirements above. However, not all long-lifecycle PODs are Product resources. For example, the PODs used by Alibaba's internal Aone Lab for executing CI construction tasks are not Product resources. They have long lifecycles but can be evicted and preempted at a low cost.
The Gap between Allocate and Usage of the Product resources used by online services is relatively stable over a period of time. This Gap, together with the unallocated resources of Prod, is regarded as Batch resources. They are sold to businesses that are less sensitive to latency but have certain requirements on resource stability. Batch resources are supported by Quota budgets but only ensure resource availability with a probability, such as 90%, over a period, such as 10 minutes.
That is to say, Product (online) resources that have been applied for appear to be occupied, while in fact much of their computing power may go unused in terms of load utilization. In this case, the differentiated SLO multi-layer scheduling capabilities of the scheduler come into play. The unoccupied resources are treated as oversold resources and sold as Batch resources so that they are fully used.
Best Effort (BE) resources are not supported by Quota budgets and do not ensure resource availability. They can be controlled and preempted at any time. When the Usage amount allocated on a node drops below the resource threshold, the scheduler considers the Gap as an "unstable or non-accounting" resource, namely, BE resources.
Product and Batch resources cover the bulk of resource demand, while BE resources cover what remains. For example, in daily development work, R&D needs to run many unit test (UT) tasks that do not have high requirements on the quality of computing resources. Their tolerance for delay is relatively high, and their quota budget is hard to evaluate. Therefore, it is not cost-effective to purchase a large number of Product or Batch resources for such scenarios, while the cheapest BE resources are a sensible choice. In this case, BE resources are the resources that Product and Batch workloads leave unused while running.
With this multi-layer resource scheduling capability, the Unified-Scheduler can technically maximize the use of resources on a physical node.
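The accounting behind the three resource tiers can be illustrated with a small sketch. It is only an illustration under stated assumptions: the names `NodeSample`, `batch_capacity`, and `be_capacity`, the percentile-style estimate, and all numbers are hypothetical and are not taken from ASI or Unified-Scheduler.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSample:
    capacity: float          # total CPU cores on the node
    prod_allocated: float    # cores requested by Product PODs
    prod_usage: List[float]  # recent Product usage samples, in cores (non-empty)


def batch_capacity(node: NodeSample, availability_pct: int = 90) -> float:
    """Cores that could be sold as Batch at the given availability level.

    Assumption: "available with probability p" is approximated by the usage
    level that p percent of the recent samples stay at or below.
    """
    samples = sorted(node.prod_usage)
    idx = max(0, math.ceil(availability_pct * len(samples) / 100) - 1)
    stable_usage = samples[idx]
    gap = max(0.0, node.prod_allocated - stable_usage)  # allocated but unused
    unallocated = node.capacity - node.prod_allocated   # never allocated
    return gap + unallocated


def be_capacity(node: NodeSample, prod_now: float, batch_now: float) -> float:
    """Cores idle right now; offered as Best Effort with no guarantee."""
    return max(0.0, node.capacity - prod_now - batch_now)


node = NodeSample(capacity=96, prod_allocated=64,
                  prod_usage=[20, 22, 25, 30, 28, 24, 26, 27, 23, 40])
print(batch_capacity(node))                         # Batch budget in cores
print(be_capacity(node, prod_now=25, batch_now=50)) # BE cores at this moment
```

In this sketch, Batch capacity is budgeted ahead of time from a stable estimate, whereas BE capacity is only whatever is idle at the moment and can disappear as soon as Product or Batch usage rises.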
The figure above shows the responsibilities that ASI should take based on the generalized scheduling. These responsibilities correspond to scheduling capabilities based on requirements for different resource levels and various service scenarios. This figure shows the complete technical framework of the ASI scheduler.
On the ASI cloud-native container platform, the scheduling scenarios of dozens of business units are presented, including transactions, shopping guides, livestreaming, videos, local life, Cainiao, AMAP, Heyi, Umeng, and overseas business. The highest level of "Product resources" has the largest scheduling proportion.
Compared with offline scheduling and JOB-type scheduling, online business scheduling has its own typical features. Offline scheduling has its own merits as well, but only online scenarios are described here.
Compared with typical short-lifecycle tasks (such as FaaS computing), long-lifecycle tasks differ in both characteristics and technical challenges. For function computing scenarios with relatively short lifecycles, the challenges are extreme scheduling efficiency, execution within hundreds of milliseconds, high scheduling throughput, and POD runtime performance. For long-lifecycle PODs, globally optimal scheduling depends on rescheduling for continuous iteration and optimization, and optimal runtime scheduling depends on single-machine rescheduling for continuous optimization. In the non-cloud-native era, many businesses could not be migrated, which made scheduling extremely difficult. The scheduler therefore faces not only technical problems in scheduling capability but also the extremely difficult governance of existing workloads. Besides, the long startup time of online applications reduces the flexibility of rescheduling and makes it more complex.
Online container runtime is sensitive to business and computing power, posing a great challenge to the scheduling quality.
The following table describes the most common scheduling capabilities for online scheduling:
Because the number of cluster nodes is limited, applications that may interfere with each other have to coexist on the same node. Thus, inter-application orchestration strategies are needed to ensure the optimal operation of each host node and POD.
In production scheduling practice, business stability always comes first. However, resources are limited, which makes it difficult to balance resource cost against business stability. In most cases, inter-application orchestration strategies achieve this balance well. Coexistence strategies are defined between applications according to characteristics such as CPU-intensive, network-intensive, or I/O-intensive behavior and peak patterns, so that applications are spread fully across the cluster. Applications that coexist on the same node are adequately constrained and protected, which minimizes the probability of interference between different PODs.
Furthermore, the scheduler adopts more technical means at runtime, such as network priority control and refined CPU orchestration control, to reduce potential runtime interference between applications.
There are some challenges brought by inter-application orchestration strategies. For example, in addition to building inter-application orchestration capabilities, the scheduler needs to fully understand the operation characteristics of each business.
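A coexistence strategy of this kind can be sketched as a node filter followed by a spreading rule. This is a minimal sketch, not ASI's implementation: the profile labels, the per-node limits in `MAX_PER_NODE`, and the function names are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical rule set: at most this many PODs of each profile per node.
MAX_PER_NODE: Dict[str, int] = {
    "cpu-intensive": 2,
    "network-intensive": 1,
    "io-intensive": 1,
}


def can_coexist(node_profiles: List[str], new_profile: str) -> bool:
    """Return True if a POD with new_profile may join the PODs on a node."""
    limit = MAX_PER_NODE.get(new_profile)
    # Profiles without a rule are unconstrained in this sketch.
    return limit is None or node_profiles.count(new_profile) < limit


def pick_node(nodes: Dict[str, List[str]], new_profile: str) -> Optional[str]:
    """Filter out nodes that violate the rule, then spread: prefer the node
    that hosts the fewest PODs of the same profile (ties broken by name)."""
    feasible = [name for name, profiles in nodes.items()
                if can_coexist(profiles, new_profile)]
    if not feasible:
        return None
    return min(feasible, key=lambda name: (nodes[name].count(new_profile), name))


nodes = {
    "node-a": ["cpu-intensive", "cpu-intensive"],
    "node-b": ["cpu-intensive"],
    "node-c": ["io-intensive"],
}
print(pick_node(nodes, "cpu-intensive"))  # node-a is full for this profile
print(pick_node(nodes, "io-intensive"))   # node-c already hosts an I/O-heavy POD
```

The rule table is where knowledge of each business's operating characteristics would have to be encoded, which is why the strategy depends on understanding the workloads and not only on the placement mechanism.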
Refined CPU orchestration is an interesting topic in the online scheduling field. It includes CPUSet scheduling and CPUShare scheduling. In other scheduling scenarios, such as offline scheduling, it matters less and is harder to appreciate. For online transaction scenarios, however, theoretical inference, laboratory experiments, and data from stress testing during big promotions have all shown that accurate CPU scheduling is of great importance.
In short, refined CPU orchestration allocates CPU cores so that they are used as fully and as stably as possible.
Refined CPU orchestration is so important that ASI has studied and applied it thoroughly over the past few years, as shown in the following table (CPUSet refined orchestration scheduling only).
Description: Take an x86-based physical machine or an X-dragon server with 96 cores (96 logical cores) as an example. It has 2 sockets with 48 logical cores each, that is, 24 physical cores per socket with two logical cores per physical core. ARM architecture is different from x86 architecture.
Due to the layered cache design in the CPU architecture, the optimal allocation works like this: On the same physical core, one logical core is allocated to a core online transaction application, such as Carts2 (shopping cart business), and the other logical core is allocated to a non-core application that is not busy. This way, Carts2 can use more of the physical core both in daily operation and during Double 11 peak hours, which has proved effective every time in production and stress testing.
If both logical cores are allocated to Carts2, the maximum resource utilization drops significantly because both logical cores peak at the same time. This is especially true when both belong to the same POD.
Theoretically, two core transaction applications, such as Carts2 (shopping cart business) and tradePlatform2 (orders), should also avoid sharing the two logical cores of one physical core. However, at the micro-level, the peaks of Carts2 and tradePlatform2 do not coincide, so the impact is low. Although such CPU allocation is not optimal, it has to be accepted because physical resources are limited.
When NUMA-aware scheduling is available, users may want to use the L3 Cache as much as possible to improve computing performance. For that purpose, the cores of one POD should not be spread across sockets.
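The CPUSet rules described above (one logical core per physical core for the same application, and no cross-socket allocation for one POD) can be sketched as follows. This is a toy model under stated assumptions: the `LogicalCpu` layout, the function names, and the tiny 2-socket, 4-core topology are hypothetical, and a real allocator would read the topology from the operating system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogicalCpu:
    socket: int
    core: int    # physical core index within the socket
    thread: int  # hyperthread index on that physical core


def build_topology(sockets: int, cores_per_socket: int,
                   threads: int = 2) -> List[LogicalCpu]:
    return [LogicalCpu(s, c, t)
            for s in range(sockets)
            for c in range(cores_per_socket)
            for t in range(threads)]


def allocate_cpuset(topology: List[LogicalCpu], owners: Dict[LogicalCpu, str],
                    app: str, n: int) -> Optional[List[LogicalCpu]]:
    """Pick n free logical CPUs for app on a single socket, using at most one
    logical CPU per physical core and never the sibling of a CPU that the
    same app already owns. Returns None if no socket can satisfy the request."""
    for socket in sorted({cpu.socket for cpu in topology}):
        chosen: List[LogicalCpu] = []
        used_cores = set()
        for cpu in topology:
            if cpu.socket != socket or cpu in owners or cpu.core in used_cores:
                continue
            sibling_owners = {owners.get(o) for o in topology
                              if o.socket == socket and o.core == cpu.core
                              and o != cpu}
            if app in sibling_owners:
                continue  # both hyperthreads would peak together
            chosen.append(cpu)
            used_cores.add(cpu.core)
            if len(chosen) == n:
                for c in chosen:
                    owners[c] = app
                return chosen
    return None


topology = build_topology(sockets=2, cores_per_socket=4)
owners: Dict[LogicalCpu, str] = {}
carts = allocate_cpuset(topology, owners, "Carts2", 4)
other = allocate_cpuset(topology, owners, "non-core-app", 4)
print(carts)  # one hyperthread on each physical core of socket 0
print(other)  # the sibling hyperthreads of those same cores
```

The sketch deliberately fails a request instead of splitting it across sockets. Whether to relax that rule when resources are tight is exactly the kind of trade-off the text describes for Carts2 and tradePlatform2.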
When CPUShare is used, the allocation of Request and Limit also matters a lot. When CPUSet and CPUShare coexist, scheduling is more complicated. For example, scaling out a CPUSet container, taking it offline, or changing its requirements can involve CPU rescheduling of all PODs on the entire machine. In the emerging GPU heterogeneous scheduling scenarios, similar techniques also apply to allocating CPUs and GPUs together.
Large-scale orchestration scheduling is mainly applied to site construction, site migration, or large-scale migration scenarios. Alibaba's frequent site construction during big promotions and its super-large-scale site migrations are typical examples. To control costs, hundreds of thousands of PODs must be created in the shortest possible time with minimal labor.
When multiple tasks send requests one after another, the order of the requests is random and unpredictable, which is the weakness of centralized scheduling at large scale. Without large-scale orchestration, Alibaba's large-scale site construction often required a complex process, from self-scaling of services to repeated rescheduling, that consumed massive manpower and took several weeks. With large-scale orchestration scheduling, a resource allocation rate of more than 99% is achieved with hour-level large-scale delivery.
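Why seeing all requests at once helps can be shown with a toy bin-packing comparison. The sketch below uses first-fit decreasing as a stand-in for batch orchestration. The source does not describe the algorithm ASI actually uses, so the functions, node sizes, and requests here are illustrative assumptions only.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, int]  # (POD name, CPU cores requested)


def first_fit(requests: List[Request], capacities: List[int]) -> Dict[str, int]:
    """Sequential scheduling: place each request, in the given order, on the
    first node that still has room. Requests that fit nowhere are skipped."""
    free = list(capacities)
    placement: Dict[str, int] = {}
    for name, size in requests:
        for node, remaining in enumerate(free):
            if remaining >= size:
                free[node] -= size
                placement[name] = node
                break
    return placement


def batch_orchestrate(requests: List[Request],
                      capacities: List[int]) -> Dict[str, int]:
    """Batch scheduling: with every request known up front, place the
    largest ones first (first-fit decreasing)."""
    ordered = sorted(requests, key=lambda r: r[1], reverse=True)
    return first_fit(ordered, capacities)


def allocation_rate(requests: List[Request], placement: Dict[str, int],
                    capacities: List[int]) -> float:
    sizes = dict(requests)
    return sum(sizes[name] for name in placement) / sum(capacities)


requests = [("a", 3), ("b", 3), ("c", 7), ("d", 7)]  # arrival order
capacities = [10, 10]

sequential = first_fit(requests, capacities)
batched = batch_orchestrate(requests, capacities)
print(allocation_rate(requests, sequential, capacities))  # 0.65, "d" does not fit
print(allocation_rate(requests, batched, capacities))     # 1.0, everything fits
```

In the sequential case, the two small requests land on the same node and fragment the remaining space. Repairing that after the fact is what rescheduling does, and the batch approach avoids the fragmentation in the first place.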
The central scheduler achieves optimal one-time scheduling, but this is quite different from globally optimal scheduling across the whole cluster. Rescheduling includes global central rescheduling and single-machine rescheduling.
Why is central rescheduling a necessary supplement for one-time scheduling? Please see the following examples:
The algorithms and implementation of central rescheduling are often very complicated. They require a deep understanding and full coverage of the various rescheduling scenarios, clearly defined rescheduling DAGs, and dynamic performance.
Many scenarios also require single-machine rescheduling, including SLO optimization of refined CPU orchestration and single-machine rescheduling optimization driven by QoS data.
It should be emphasized that single-machine rescheduling must solve the problem of risk control so that the blast radius of a faulty operation stays bounded. Because risk control capability is limited on the single-machine side, we recommend a centrally triggered mode under strict protection and control rather than the node autonomy mode. Kubernetes has many unavoidable node autonomy scenarios. For example, when the pod YAML file is changed, kubelet performs the corresponding changes. ASI has spent several years continuously hardening each potential risk control point. It also iteratively built a Defender system for hierarchical risk control management, with levels including core button, high risk, and medium risk. For potentially risky operations, the single-machine side must interact with the central Defender before the operation is executed, so that disasters are avoided through prevention and control. The scheduler should implement the same strict security defense. Otherwise, autonomous node operation is not allowed.
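The idea of asking a central gate before a node acts can be sketched as follows. This is not the Defender implementation, whose interface is not described in this article. The class name `CentralDefender`, the risk labels, and the per-level concurrency limits are hypothetical and only show how a central check bounds the blast radius.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class CentralDefender:
    # Hypothetical policy: how many nodes may run an operation of each
    # risk level at the same time.
    max_concurrent: Dict[str, int]
    in_flight: Dict[str, Set[str]] = field(default_factory=dict)

    def request(self, node: str, risk: str) -> bool:
        limit = self.max_concurrent.get(risk, 0)  # unknown levels are denied
        active = self.in_flight.setdefault(risk, set())
        if node in active:
            return True
        if len(active) >= limit:
            return False
        active.add(node)
        return True

    def release(self, node: str, risk: str) -> None:
        self.in_flight.get(risk, set()).discard(node)


def run_on_node(defender: CentralDefender, node: str, risk: str,
                operation: Callable[[], None]) -> bool:
    """Execute operation on the node only if the central gate approves."""
    if not defender.request(node, risk):
        return False
    try:
        operation()
    finally:
        defender.release(node, risk)
    return True


defender = CentralDefender(max_concurrent={"high-risk": 1, "medium-risk": 5})
done = run_on_node(defender, "node-1", "high-risk",
                   lambda: print("evicting POD on node-1"))
print(done)
```

Because the limit is enforced centrally, a bug that triggers the same risky operation on every node at once is throttled to a bounded number of nodes, which a purely autonomous node agent cannot guarantee.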
A busy host that runs many tasks in parallel cannot avoid kernel-state resource competition among those tasks. This is inevitable even though central scheduling and single-machine scheduling have jointly ensured optimal resource allocation, such as CPU allocation and I/O dispersion. The competition is especially fierce in the well-known offline hybrid deployment scenarios. This requires collaboration between central scheduling, single-machine scheduling, and kernel scheduling. For example, the scheduler coordinates the priority of the various resources of a task and then hands over to kernel scheduling for execution.
This corresponds to multiple kernel isolation technologies, including CPU (scheduling priority BVT and Noise Clean mechanism), memory (memory collection and OOM priority), and network (network priority and I/O).
Today, the isolation mechanism between Guest Kernel and Host Kernel based on the secure containers enables us to avoid the competition of kernel-state resources easily.
The logic of elastic scheduling and time-sharing scheduling enables better resource reuse in different dimensions.
The ASI scheduler, together with Alibaba Cloud infrastructure, uses the strong elasticity provided by ECS. In Ele.me scenarios, it allows resources to return to the cloud during off-peak hours and applies for resources again during peak hours.
There are two sources of elasticity. One is the built-in elastic Buffer of the large ASI resource pool. The other is the elastic technology of the Alibaba Cloud IaaS layer, since the host resources of the ASI resource pool come from Alibaba Cloud. How to balance the two is still a matter of debate.
Time-sharing scheduling in ASI reuses resources and reduces costs considerably. Online transaction POD instances are taken offline on a large scale every night. The released resources are used for ODPS offline tasks and are returned to online applications every morning. This classic scenario maximizes the value of the offline hybrid deployment technology.
The core of time-sharing scheduling is resource reuse, which depends on building and managing a large resource pool. It combines resource operation with scheduling technologies. It also requires the scheduler to accumulate a large number of jobs and tasks in different forms; the more, the better.
Vertical scaling scheduling delivers resources within seconds and partially solves the burst traffic problem. It is also the key to controlling the risk of peak pressure at midnight during big promotions. By vertically adjusting the resources of existing PODs and scheduling CPUs accurately and reliably, shuffling algorithms can deliver computing resources in seconds. Vertical scaling scheduling is related to VPA technology and is one of the VPA scenarios.
In a sense, "X+1" horizontal scaling scheduling can be considered one of the HPA scenarios, except that it is triggered manually. "X+1" emphasizes the ultimate efficiency of resource delivery, which requires great improvements in R&D efficiency. An online POD can start and provide services within "X" minutes. All operations, except application startup, must be completed in "1" minute.
Vertical scaling scheduling and "X+1" complement each other and jointly safeguard the system through various peaks.
ASI is also implementing more VPA and HPA scenarios. For example, VPA technology can provide more free computing power for red envelope activities during Spring Festival. It can save a lot of costs.
The implementation of scheduling technologies, such as VPA and HPA, in more scenarios will improve continuously in the future.
Differentiated SLO scheduling is one of the core parts of the scheduler. This section is similar to the Scheduling Resource Type section above. Given the complexity of differentiated SLO, it will be introduced in the last section of this chapter.
The ASI scheduler also accurately defines Service Level Objectives (SLO), Quality of Service (QoS), and Priority.
SLO refers to service level objectives. ASI provides differentiated SLO through different QoS and Priority, and the pricing of different SLO varies. Users can choose the type of SLO-guaranteed resources that suits their business characteristics. For example, offline data analysis tasks can choose low-level SLO for a lower price, while important business scenarios can choose high-level SLO that costs more.
QoS is responsible for resource quality assurance. The QoS classes defined in the community are Guaranteed, Burstable, and Best Effort, and they are derived entirely from Request and Limit. QoS in ASI does not map one-to-one to the community classes. ASI defines QoS in another dimension, with the classes LSE, LSR, LS, and BE, to describe Alibaba's scenarios (such as CPUShare and hybrid deployment) clearly. The levels of resource assurance are clearly separated, and businesses can choose a QoS level according to their latency sensitivity.
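For reference, the community rule that derives a QoS class from Request and Limit can be sketched as below. This is a simplified model of the Kubernetes behavior, not Kubernetes code: it looks only at CPU and memory of regular containers and compares quantity strings literally (so "500m" and "0.5" would wrongly count as different). The ASI classes LSE, LSR, LS, and BE are not modeled because their rules are not given in this article.

```python
from typing import Dict, List

Container = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Container]) -> str:
    """Classify a POD as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **container.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource in requests:
                any_set = True
            if resource not in limits or requests[resource] != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "2", "memory": "4Gi"},
                  "limits": {"cpu": "2", "memory": "4Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "1"}}]))                 # Burstable
print(qos_class([{}]))                                         # BestEffort
```

Because the community class is a pure function of Request and Limit, it cannot express distinctions such as CPUSet versus CPUShare, which is one reason a platform may define QoS along a separate dimension.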
PriorityClass and QoS are two different concepts. PriorityClass focuses on the importance of the task.
The resource allocation strategies and the importance of the task, namely, QoS and PriorityClass, can be combined in different ways. They still require a certain correspondence. For example, a PriorityClass named Preemptible can correspond to Best Effort QoS for most tasks.
Different scheduling systems have different definitions for PriorityClass. For example:
The scheduling simulator is similar to Alibaba's full-procedure stress testing system. It verifies new scheduling capabilities in a simulated environment through real online traffic replaying or simulated traffic replaying. Thus, it continuously optimizes various scheduling algorithms and metrics.
Another common use of the scheduling simulator is to reproduce difficult online problems offline, so that they can be located efficiently and without affecting production.
To some extent, the scheduling simulator is the basis for globally optimal scheduling. It allows repeated optimizations on various algorithms, technical frameworks, and technical procedures in the simulation environment. By doing so, it optimizes global metrics, such as the global distribution ratio, scheduling performance in different scenarios, and scheduling stability.
To achieve globally optimal scheduling, ASI built a brand-new ESP for the scheduler. ESP aims to establish a scheduler-centered, all-in-one, and closed-loop scheduling efficiency system, based on scheduling data guidance, core scheduling capabilities, and product-based scheduling operations.
Many similar modules have been developed, such as scheduling SLO inspection, scheduling tools, and two-layer scheduling platforms, for different scenarios. ESP, together with more two-layer scheduling capabilities, can provide globally optimal scheduling. They also focus on enhancing business stability, reducing resource costs, and improving user experience to provide more considerate services.
This article systematically introduces the basic concepts, principles, and various scenarios of ASI Scheduler. Unfortunately, many details of the scheduler cannot be covered thoroughly here. Many in-depth scheduling capabilities, such as heterogeneous machine scheduling, scheduling profiling, fair scheduling, priority scheduling, shifting scheduling, preemptive scheduling, disk scheduling, Quota, CPU normalization, GANG Scheduling, scheduling tracing, and scheduling diagnosis, were not introduced in this article. ASI's powerful scheduling framework structure and optimizations, scheduling performance optimizations, and other technologies were also not covered.
By 2019, ASI had scaled a single native Kubernetes cluster to tens of thousands of nodes. Thanks to ACK's powerful Kubernetes O&M system, it continues to maintain many large-scale computing clusters within Alibaba Group. ASI has also accumulated industry-leading production practice in running multiple Kubernetes clusters. In these large-scale Kubernetes clusters, ASI continuously provides computing power for complex tasks through its enhanced container scheduling technology.
Over the past few years, as part of Alibaba Group's comprehensive cloud migration, the scheduling field has migrated and evolved from ASI control to ACK. In the future, more cloud scheduling capabilities will continue to be provided and enhanced in Alibaba's complex, rich, and large-scale business scenarios.