Unveiling Alibaba's Hybrid Scheduling Technology for Complex Task Resources

This article introduces hybrid resource scheduling in complex task scenarios and discusses how ASI Scheduler manages Alibaba's computing resource scheduling tasks.

By Huang Tao and Wang Menghai

As a form of cloud computing, cloud-native is becoming a new technical standard in the cloud era. By reshaping the entire software lifecycle, cloud-native has become the easiest way to realize the value of the cloud.

Making cloud-native infrastructure the unified internal architecture of an enterprise is an inevitable trend. However, integrating various basic platforms brings unavoidable compatibility problems, and the resulting "technical debt" is most obvious in larger enterprises with a longer development history.

The experience shared in this article is derived from Alibaba's production practices on hybrid scheduling over the past few years and is of significant practical value. Starting with basic introductions, this article focuses on the scheduler. It describes how the Alibaba Serverless Infrastructure (ASI) scheduler manages complex and busy resource scheduling tasks in large-scale container scheduling scenarios designed for cloud-native applications. It also introduces some specific cases as references for readers facing similar problems, so that readers can systematically understand hybrid resource scheduling in Alibaba's complex task scenarios.

Scheduler Overview

Leading the overall implementation of container cloud migration within the Alibaba Group, ASI takes charge of the evolution of the internal lightweight container architecture and the cloud-native O&M system. ASI further accelerates the implementation of emerging technologies, such as Mesh, Serverless, and FaaS in Alibaba. It also supports almost all Alibaba businesses, including Taobao, Tmall, Youku, AMAP, Ele.me, UC, and Koala, and various scenarios and ecosystems of Alibaba Cloud products.

ASI provides complete support for cloud-native technology stacks with a core based on Kubernetes. ASI has also successfully integrated with Alibaba Cloud Container Service for Kubernetes (ACK), which retains various cloud-based capabilities and successfully copes with the complex business environment in Alibaba.

ASI Scheduler is one of the core components of ASI cloud-native and has always played an important role in its development. Alibaba's large online e-commerce transaction containers, such as shopping carts, orders, and Taobao details, are allocated and scheduled by the scheduler. The scheduler also controls how each container is placed, including container orchestration and the allocation of single-machine computing and memory resources. Even a few container orchestration errors may pose a fatal threat to the business, especially during the peak time of the Double 11 Global Shopping Festival. ASI Scheduler is therefore responsible for controlling the computing quality of each container during peak hours.

ASI Scheduler originated from online e-commerce transaction container scheduling in 2015. The earliest scheduler at that time only covered online transaction T4 (Alibaba's early customized container technology based on LXC and Linux Kernel) and Alidocker scenarios. It handled the peak traffic during the Double 11 Global Shopping Festival in 2015.

The evolution of ASI Scheduler follows the development of cloud-native, from the earliest online transaction container scheduler, Sigma Scheduler, Cerebulum Scheduler, to ASI Scheduler. Today, the next-generation Unified-Scheduler is under construction. It will absorb and integrate the advanced experience of Alibaba Open Data Processing Service (ODPS), Hippo, and online scheduling in various fields over the past few years. The following figure shows the evolution process of ASI Scheduler:

During the evolution of ASI Scheduler, there are many challenges to be solved, mainly in the following aspects:

Various Task Types: There are different SLO-level tasks, such as massive online long-lifecycle container and POD instances, Batch tasks, and Best Effort tasks. There are also tasks of different resource types, such as computing, storage, network, and heterogeneous tasks. Requirements and scenarios for different tasks vary a lot.
Different Host Resources: The scheduling system manages a large number of host resources in Alibaba Group, including various non-cloud stock physical machines, cloud X-Dragon instances, ECS, and heterogeneous models, such as GPU and FPGA.
Abundant Scheduler Service Scenarios: For example, the most typical pan transaction scenario, the most complex middleware scenario, emerging computing scenarios, such as FaaS, Serverless, Mesh, and Job and new ecological scenarios, such as Ele.me, Koala, and Shenma. There are also scheduling demands with multi-tenant isolation on the public cloud and the world-challenging unified scheduling scenario of ODPS, Hippo, Ant Financial, and ASI.
Numerous Responsibilities at the Infrastructure Layer: Schedulers are partly responsible for defining infrastructure models, integrating computing and storage network resources, converging hardware forms, and making infrastructure transparent.

This article also introduces how ASI Scheduler manages Alibaba's complex and busy computing resource scheduling tasks.

Preliminary Study on Scheduler

1. What Is Scheduler?

Scheduler plays a central role among the many ASI components. It is one of the core components in the scheduling system of the cloud-native container platform, the basis for resource delivery, and the brain of ASI cloud-native. Its value is reflected below:

Powerful capabilities and scenario-rich resource delivery (computing and storage)
Resource delivery with the optimal cost
Stable and optimal resource delivery during business operation

Generally, Scheduler aims to offer:

Optimal Task Scheduling: Scheduler selects the most suitable host in the cluster and utilizes the resources appropriately. It minimizes mutual interference, such as CPU distribution and I/O competition, to run computing tasks submitted by users.
Optimal Global Scheduling for Clusters: Scheduler ensures optimal global resource orchestration (such as fragments), the most stable resource operation, and the lowest global cost.
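
The task-level goal above (pick the most suitable host while limiting interference) is commonly implemented as a filter step followed by a scoring step. The following is a minimal sketch under stated assumptions, not the ASI implementation: the Node and Task fields, the io_load metric, and the scoring weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_free: float   # free CPU cores
    mem_free: float   # free memory in GiB
    io_load: float    # hypothetical metric: 0.0 (idle) to 1.0 (saturated)


@dataclass
class Task:
    cpu: float
    mem: float


def pick_node(task, nodes):
    """Filter nodes that can hold the task, then return the best-scoring one."""
    feasible = [n for n in nodes
                if n.cpu_free >= task.cpu and n.mem_free >= task.mem]
    if not feasible:
        return None

    def score(node):
        # Hypothetical score (lower is better): prefer a tight CPU fit to
        # limit fragmentation, and penalize nodes with heavy I/O contention.
        leftover = node.cpu_free - task.cpu
        return leftover + 10.0 * node.io_load

    return min(feasible, key=score)


nodes = [
    Node("node-a", cpu_free=8, mem_free=32, io_load=0.9),
    Node("node-b", cpu_free=6, mem_free=16, io_load=0.1),
    Node("node-c", cpu_free=2, mem_free=64, io_load=0.0),
]
chosen = pick_node(Task(cpu=4, mem=8), nodes)
print(chosen.name)
```

In this example node-c is filtered out for lack of CPU, and node-b is preferred over node-a because it leaves fewer idle cores and has lower I/O load.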

The location of the Scheduler in the ASI cloud-native system is shown in the following figure (the box marked in red):

2. Generalized Scheduler

Most of the time, scheduler refers to the central scheduler, such as the Kubernetes kube-scheduler in the community. However, real scheduling scenarios are complex, and each scheduling decision requires flexible collaboration. When a task is submitted, it needs to be coordinated by the central scheduler, single-machine scheduling, and kernel scheduling. The task is executed with the cooperation of Kubernetes components, such as kubelet and controller. For online scheduling scenarios, there is also a batch orchestration scheduler. Together with rescheduling, these multiple forms of scheduling keep the cluster performing well.

The ASI generalized scheduler refers to the combination of central scheduler, single-machine scheduling, kernel scheduling, rescheduling, large-scale orchestration scheduling, and multi-layer scheduling.

1) Central Scheduler

Central scheduler calculates the resource orchestration for each task (or a batch of tasks), ensuring optimal scheduling. It determines information, such as clusters, regions, and execution nodes (host machines), for the specific task computing. It also refines the allocation of CPU, storage, and network resources on the node.

In cooperation with components of the Kubernetes ecosystem, central scheduler manages the lifecycle of most tasks.

In the development of ASI cloud-native, central scheduler refers to Sigma Scheduler, Cerebulum Scheduler, and ASI Scheduler described above.

2) Single-Machine Scheduling

It is responsible for:

(1) The coordination of optimal multi-POD operation in a single machine. After receiving the node selection command from the central scheduler, ASI schedules the task to a specific node for execution. Then, single-machine scheduling starts to work like this:

Single-machine scheduling ensures optimal multi-POD performance immediately, periodically, or in an O&M manner. This means resources are coordinated in a single machine, such as adjusting the CPU core allocation of each POD.
According to real-time POD metrics, such as load and queries per second (QPS), single-machine scheduling performs VPA scaling in a single machine for some running resources. It can also perform eviction for inferior tasks. For example, the CPU capacity of a POD can be scaled dynamically.

(2) The information collection, reporting, and aggregation of single-machine resources, which helps central scheduling make decisions. In ASI, single-machine scheduling components mainly refer to some enhanced capabilities of SLO-Agent and Kubelet. For Unified-Scheduler, the components refer to the SLO-Agent, Task-Agent, and Kubelet.
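
As an illustration of the single-machine eviction described in (1), the sketch below evicts the lowest-priority PODs on a node until CPU usage falls back under a threshold. It is a simplified assumption for illustration: the priority values, the 90% threshold, and the function name are hypothetical and are not taken from SLO-Agent or Kubelet.

```python
def pods_to_evict(pods, node_capacity, threshold=0.9):
    """Return the names of PODs to evict so usage <= threshold * capacity.

    pods: list of (name, priority, cpu_usage); a higher priority value
    means a more important POD. Lowest-priority PODs are evicted first.
    """
    usage = sum(cpu for _name, _prio, cpu in pods)
    limit = threshold * node_capacity
    evicted = []
    for name, _prio, cpu in sorted(pods, key=lambda p: p[1]):
        if usage <= limit:
            break
        evicted.append(name)
        usage -= cpu
    return evicted


pods = [("cart", 100, 40.0), ("batch-job", 10, 30.0), ("ut-test", 1, 25.0)]
print(pods_to_evict(pods, node_capacity=96))
```

Here the node is using 95 of 96 cores, so the lowest-priority POD is evicted and the rest are left untouched.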

3) Kernel Scheduling

The single-machine scheduling guarantees the optimal multi-POD operation from the resource perspective, but the task running state is controlled by the kernel. Thus, kernel scheduling is expected.

4) Rescheduling

The central scheduler ensures the optimal scheduling of each task, which is one-time scheduling. However, the central scheduler cannot realize the global optimal cluster. Thus, rescheduling is expected.

5) Large-Scale Orchestration Scheduling

Large-scale orchestration scheduling is a unique scenario for Alibaba's large-scale online scheduling. Since the development in 2017, it has become very mature and is still being enhanced.

Large-scale orchestration allows the scheduling of tens of thousands (or even hundreds of thousands of) containers at one time. It also ensures the global optimal orchestration of all containers in the cluster at one time. It skillfully makes up the disadvantages of one-time central scheduling and avoids the complexity of repeated rescheduling in the large-scale site building scenario.

The following sections will introduce kernel scheduling, rescheduling, and large-scale orchestration scheduling in detail.

6) Multi-Layer Scheduling

Multi-layer scheduling includes layer-1, layer-2, and layer-3 scheduling. Sigma Scheduler introduces the concept of layer-0 scheduling in offline hybrid deployment scenarios. Different scheduling systems may have different understandings and definitions of multi-layer scheduling and develop their own concepts. For example, in the earlier Sigma system, scheduling was divided into Layer-0, Layer-1, and Layer-2 scheduling:

Layer-0 scheduling is responsible for the global resource view and management, as well as scheduling, arbitration, and specific execution among the Layer-1 schedulers. Layer-1 scheduling mainly corresponds to Sigma Scheduler, ODPS, and others.
In the Sigma system, the Sigma scheduler, as Layer-1 scheduling, is responsible for the allocation at the resource layer.
Layer-2 scheduling is implemented by different access services, such as e-commerce transactions, advertising Captain, and database AliDB. Each Layer-2 scheduler is closely tied to its own business and fully understands it. It conducts optimizations based on the overall business to enhance scheduling capabilities, such as business eviction and automatic O&M for stateful application failures.

The fatal drawback of the Sigma multi-layer scheduling system is that the technical capabilities and investments of the Layer-2 schedulers vary a lot. For example, the Layer-2 scheduling system of advertisements is excellent, but not every Layer-2 scheduler serves its business equally well. Learning from this, ASI has integrated many capabilities inside ASI and further standardized the upper-layer PaaS. Thus, it both simplifies the upper layer and enhances the upper-layer capabilities.

The next-generation scheduler being developed today is also divided into several layers. They include the computing load layer (mainly Workload scheduling management), computing scheduling layer (like DAG scheduling and MR scheduling), and the business layer (the same as the Layer-2 scheduling of Sigma.)

3. Scheduling Resource Type

Let's take the ongoing Unified-Scheduler as an example. Three resource types are available for scheduling: Product resources, Batch resources, and Best Effort (BE) computing resources.

Different schedulers have different definitions on the leveled resource types, but they are essentially the same. The ASI Scheduler will be explained in detail in subsequent sections.

1) Product (Online) Resources

Product (online) resources have Quota budgets, and the scheduler needs to ensure the highest level of resource availability. The long-lifecycle POD instance for the online e-commerce core transactions is a typical product resource. The most classic cases are the transaction core business PODs, such as shopping cart (Cart2) and order (tradeplatform2), on the core procedure during Double 11. These resources have high requirements for computing power, high priority, real-time performance, and low response latency. Moreover, you cannot interfere with them.

For example, POD with long lifecycles for online transactions can exist for days, months, or even years. For most applications submitted by application developers, after construction, they need to apply for several long-lifecycle instances, which are Product resources. Most POD or container instances applied by developers from Taobao, Tmall, Juhuasuan, AMAP, Umeng, Heyi, Cainiao, Internationalization, and Xianyu are Product resources.

Product resources refer to PODs with long online lifecycles and resources that meet the requirements above. However, not all long-lifecycle PODs are Product resources. For example, POD used by Alibaba's internal Aone Lab for executing CI construction tasks is not a Product resource. It has a long lifecycle but can be evicted and preempted at a low cost.

2) Batch Resources

The Gap between Allocate and Usage of Product resources used by online services is relatively stable over a period of time. This Gap, together with the unallocated resources of Prod, is regarded as Batch resources. They are sold to businesses that are less sensitive to latency but have certain requirements on resource stability. Batch resources are supported by Quota budgets but only ensure resource availability with a probability, such as 90%, over a period, such as 10 minutes.

That is to say, online services apply for Product resources that look fully occupied on paper, while in fact, judging by the load utilization rate, much of that computing power may go unused. In this case, the differentiated SLO multi-layer scheduling capabilities of the scheduler are utilized: the unoccupied resources are treated as oversold capacity and sold as Batch resources.
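
The Batch capacity described in this section can be estimated from recent usage samples. The sketch below is one possible reading, not the Unified-Scheduler's actual formula: it takes the usage level that Prod stays under for a given share of samples (nearest-rank quantile) and offers the remaining Gap plus the unallocated CPU as Batch. The function name, parameters, and sample values are hypothetical.

```python
import math


def batch_capacity(node_cpu, prod_allocated, prod_usage_samples,
                   probability_pct=90):
    """Estimate the CPU cores that could be offered as Batch resources.

    prod_usage_samples: recent Prod CPU usage readings on the node, for
    example one per minute over a 10-minute window (an assumption).
    """
    ordered = sorted(prod_usage_samples)
    # Nearest-rank quantile: the usage level not exceeded by
    # probability_pct percent of the samples.
    rank = max(1, math.ceil(probability_pct * len(ordered) / 100))
    usage_at_quantile = ordered[rank - 1]
    unallocated = node_cpu - prod_allocated
    gap = max(prod_allocated - usage_at_quantile, 0)
    return unallocated + gap


samples = [20, 22, 25, 21, 30, 28, 24, 26, 23, 40]
print(batch_capacity(node_cpu=96, prod_allocated=80,
                     prod_usage_samples=samples))
```

With these sample values, Prod stays at or below 30 cores in 90% of the samples, so 50 allocated-but-idle cores plus 16 unallocated cores are available to Batch with roughly that probability.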

3) Best Effort (BE) Resources

Best Effort (BE) resources are not supported by Quota budgets and do not ensure resource availability. They can be controlled and preempted at any time. When the Usage amount allocated on a node drops below the resource threshold, the scheduler considers the Gap as an "unstable or non-accounting" resource, namely, BE resources.

Product and Batch resources cover the bulk of resource demand, while BE resources cover the remainder. For example, in daily development work, R&D needs to run many UT test tasks that do not have high requirements on the quality of computing resources. Their tolerance for delay is relatively high, and their quota budget is hard to evaluate. Therefore, it is not cost-effective to purchase a large number of Product or Batch resources for such scenarios, while the cheapest BE resources are a sensible choice. In this case, BE resources are the resources left unused during Product or Batch operation.

With this multi-layer resource scheduling capability, the Unified-Scheduler can technically maximize the use of resources on a physical node.
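
Because BE resources can be preempted at any time and Batch resources carry only a probabilistic guarantee, a natural reclaim order when Product workloads need capacity back is BE first, then Batch. The sketch below illustrates that ordering only; the tier names as strings, the task tuples, and the rule that Product is never preempted are assumptions for illustration.

```python
PREEMPT_ORDER = ["BE", "Batch"]  # assumption: Product tasks are never preempted


def reclaim(running, needed_cpu):
    """Choose tasks to preempt until needed_cpu cores are freed.

    running: list of (name, tier, cpu). Returns the list of victim names,
    or None if preempting every BE and Batch task is still not enough.
    """
    victims = []
    freed = 0.0
    for tier in PREEMPT_ORDER:
        for name, task_tier, cpu in running:
            if freed >= needed_cpu:
                return victims
            if task_tier == tier:
                victims.append(name)
                freed += cpu
    return victims if freed >= needed_cpu else None


running = [("cart", "Product", 32), ("etl", "Batch", 16),
           ("ut-1", "BE", 4), ("ut-2", "BE", 4)]
print(reclaim(running, needed_cpu=10))
```

In this example both BE tasks are preempted before the Batch task is touched, and a request that cannot be satisfied returns None instead of a partial plan.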

Scheduler Capabilities Overview

The figure above shows the responsibilities that ASI should take based on the generalized scheduling. These responsibilities correspond to scheduling capabilities based on requirements for different resource levels and various service scenarios. This figure shows the complete technical framework of the ASI scheduler.

Typical Online Scheduling Capability

1. Business Requirements for Online Scheduling

On the ASI cloud-native container platform, the scheduling scenarios of dozens of business units are presented, including transactions, shopping guides, livestreaming, videos, local life, Cainiao, AMAP, Heyi, Umeng, and overseas business. The highest level of "Product resources" has the largest scheduling proportion.

Compared with offline scheduling and JOB-type scheduling, online business scheduling has its own typical features. Offline scheduling is equally interesting, but this section describes only online scenarios.

1) Lifecycle

Long Running: The container lifecycle of online applications is generally long, at least for a few days, mostly for months, some even for several years.
Long Startup Time: Large application images, long image download times, and memory preheating at service startup lead to startup times ranging from several seconds to tens of minutes.

Compared with some typical short-lifecycle tasks (such as FaaS computing), long-lifecycle tasks are different in terms of task characteristics and technical challenges. For example, for function computing scenarios with relatively short lifecycles, the challenges are extreme scheduling efficiency, hundreds of milliseconds of execution efficiency, fast scheduling throughput, and POD runtime performance. For long-lifecycle pods, the global optimal scheduling depends on rescheduling for continuous iteration and optimization. Moreover, the optimal runtime scheduling must depend on the single-machine rescheduling for continuous optimization. In the non-cloud-native era, many businesses could not be migrated, making scheduling extremely difficult. This means the scheduler faces technical problems in the scheduling capability and extremely difficult stock business governance. Besides, the long startup time of online applications also reduces the rescheduling flexibility and makes it more complex.

2) Container Runtime

The Container Runtime must support business demands, such as real-time interaction, fast response, and low service response time (RT). Most systems interact with each other in real-time and are extremely sensitive to latency when online containers are running. Slightly larger latency results in a poor business experience.
Resource characteristics, such as network consumption, I/O consumption, and computing consumption: When instances with the same characteristics coexist, they are prone to compete for resources.

Online container runtime is sensitive to business and computing power, posing a great challenge to the scheduling quality.

3) Complex Business Models Specific to Alibaba Online Applications

High and Low Traffic Peaks: All online business services have high and low peaks. For example, Ele.me's peaks are at noon and in the evening.
Burst Traffic: The traffic may not show regular patterns due to the complexity of the business. For example, livestreaming businesses may experience a sharp increase in traffic due to an emergency. Elasticity is the technical requirement behind traffic bursts. One of the most typical cases was the elasticity demands for DingTalk during COVID-19 in 2020.
Resource Redundancy: Once developed, online services define redundant resources for disaster recovery purposes. However, from the perspective of Alibaba's global business, many long-tail applications are not sensitive to costs and utilization due to their small scale. As a result, massive computing power is wasted.

4) Unique Large-Scale O&M Demands

Complex Model Deployment: For example, application unit deployment, multi-DC disaster recovery, and complex low traffic scheduling, gray release, and formal multi-environment deployment need to be supported.
Large-Scale Peak Features during Big Promotions and Flash Sales: Alibaba's big promotions, such as Double 11, Double 12, and red envelope activities during Spring Festival, take place each year. The application pressure and resource consumption of the entire procedure will multiply along with the increase of peak traffic during the big promotions. This requires the powerful large-scale scheduling capability of the scheduler.
Site Construction during Big Promotions: The time for big promotions is planned in advance. The retention time of cloud resources must be reduced as much as possible to save on the procurement costs of cloud resources. The scheduler needs to build the website before big promotions as quickly as possible and return resources quickly to Alibaba Cloud after the big promotions. This requires high-efficiency, large-scale scheduling, and more time is reserved for the business.

2. One-Time Scheduling: Basic Scheduling Capability

The following table describes the most common scheduling capabilities for online scheduling:

Basic application requirements refer to those corresponding to application scale-out, including POD specifications and OS: In the ASI scheduler, it is abstracted as the common label matching scheduling.
Disaster Recovery and Dispersion Require Locality Scheduling: ASI has obtained a lot of detailed information through various means, such as network core and ASW shown above.
Advanced Strategy: ASI will standardize and universalize business requirements as much as possible, but some services still have specific requirements for resources and runtime. These requirements include specific infrastructure environments, such as hardware and specific requirements for containers (like HostConfig and kernel parameters.)
Scheduling Rule Center: The specific business requirements for strategies require a corresponding powerful scheduling rule center. The center guides the scheduler to apply the correct scheduling rules. The data in the scheduling rule center comes from learning or expert O&M experience. The scheduler applies these rules to the scaling allocation of each POD.
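
The label matching and rule center ideas in this list can be sketched as follows. This is an illustrative assumption, not ASI's data model: the RULE_CENTER dictionary, the label keys, and the application names are hypothetical. The per-application rule is merged into the POD's own selector, and only nodes whose labels satisfy every key are kept.

```python
# Hypothetical rule center: extra placement rules keyed by application name.
RULE_CENTER = {
    "cart": {"disk": "ssd"},
}


def candidate_nodes(app, pod_selector, nodes):
    """Return the nodes whose labels match the POD selector plus app rules.

    nodes: {node_name: {label_key: label_value}}
    """
    selector = {**pod_selector, **RULE_CENTER.get(app, {})}
    return [name for name, labels in nodes.items()
            if all(labels.get(key) == value
                   for key, value in selector.items())]


nodes = {
    "n1": {"os": "linux", "disk": "ssd"},
    "n2": {"os": "linux", "disk": "hdd"},
    "n3": {"os": "linux", "disk": "ssd", "gpu": "true"},
}
print(candidate_nodes("cart", {"os": "linux"}, nodes))
```

An application without an entry in the rule center is matched on its own selector alone, so all three nodes would qualify for it in this example.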

3. Inter-Application Orchestration Strategy

Due to the limited number of cluster nodes, applications that potentially interfere with each other have to coexist on the same node. Thus, inter-application orchestration strategies are expected to ensure the optimal operation of each host node and POD.

In scheduling practices in production, business stability always comes first. However, resources are also limited, making it difficult to balance optimal resource cost and business stability. In most cases, the balance can be perfectly achieved by inter-application orchestration strategies. By defining inter-application strategies for coexistence, such as CPU intensive, network intensive, and I/O intensive applications, and peak model characteristics, the cluster is fully scattered. Adequate constraints and protection are available for the application coexistence on the same node, so the interference probability between different pods can be minimized.

Furthermore, the scheduler adopts more technical means at runtime, such as network priority control and refined CPU orchestration control. Thus, the potential impact of inter-application runtime is avoided.

There are some challenges brought by inter-application orchestration strategies. For example, in addition to building inter-application orchestration capabilities, the scheduler needs to fully understand the operation characteristics of each business.
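
One simple form of an inter-application coexistence strategy is a cap on how many PODs with the same resource profile may share a node. The sketch below assumes each application has already been classified as CPU, network, or I/O intensive; the profile names and the limits are hypothetical.

```python
# Hypothetical per-node limits for PODs sharing the same resource profile.
MAX_SAME_PROFILE = {"cpu": 2, "network": 1, "io": 1}


def can_coexist(new_profile, node_profiles):
    """Return True if a POD with new_profile may be placed on the node.

    node_profiles: profiles of the PODs already running on the node.
    Unknown profiles default to a limit of one POD per node.
    """
    same = sum(1 for profile in node_profiles if profile == new_profile)
    return same < MAX_SAME_PROFILE.get(new_profile, 1)


print(can_coexist("network", ["cpu", "network"]))  # second network-heavy POD
print(can_coexist("cpu", ["cpu", "network"]))      # second CPU-heavy POD
```

A check like this would run as an additional filter during node selection; choosing the profiles and limits is where the business understanding mentioned above comes in.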

4. Refined CPU Orchestration

Refined CPU orchestration is an interesting topic in the online scheduling field, including CPUSet scheduling and CPUShare scheduling. In other scheduling scenarios, such as offline scheduling, it is less important and may even seem hard to understand. However, online transaction scenarios, theoretical inference, laboratory scenarios, and data obtained from stress testing during big promotions have all proved that accurate CPU scheduling is of great importance.

In short, refined CPU orchestration is a core adjustment to ensure the maximum and most stable usage of the CPU core.

Refined CPU orchestration is so important that ASI has fully understood and applied it in the past few years, as is shown in the following table (only including CPUSet refined orchestration scheduling.)

Description: Take an x86-based physical machine or an X-Dragon server with 96 cores (96 logical cores) as an example. It has 2 sockets, each with 24 physical cores, and each physical core has two logical cores (48 logical cores per socket). ARM architecture is different from x86 architecture.

Due to the layered cache design in the CPU architecture, the optimal allocation works like this: For the same physical core, one logical core is allocated to a core online transaction application, such as Carts2 (shopping cart business). The other logical core is allocated to a non-core application that is not busy. By doing so, during daily operation or Double 11 peak hours, Carts2 can take full advantage of the physical core, which has proved effective every time in production and stress testing.

If both logical cores are allocated to Carts2 at the same time, the maximum utilization of resources will be reduced significantly due to the same service peak, especially for the same POD.

Theoretically, two core transaction applications, such as Carts2 (shopping cart business) and tradePlatform2 (orders), should avoid sharing the two logical cores of one physical core. However, at the micro-level, the peaks of Carts2 and tradePlatform2 are different, so the impact is low. Although such CPU allocation does not seem optimal, it still needs to be maintained because of limited physical resources.

When NUMA awareness is available, users may want to use the L3 cache as much as possible to improve computing performance. For that purpose, the cores of the same POD should not be spread across sockets.
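
The CPUSet placement rules above can be sketched on a toy topology. The sketch assumes a 96-logical-core machine with 2 sockets and two logical cores per physical core, which gives 24 physical cores per socket. The logical CPU numbering is hypothetical (real machines enumerate CPUs differently), and the function names are illustrative.

```python
SOCKETS = 2
PHYSICAL_CORES_PER_SOCKET = 24  # assumption: 96 logical / 2 sockets / 2 threads
THREADS_PER_CORE = 2


def build_topology():
    """Map (socket, physical core index) to its two logical CPU ids."""
    topology = {}
    cpu_id = 0
    for socket in range(SOCKETS):
        for core in range(PHYSICAL_CORES_PER_SOCKET):
            topology[(socket, core)] = [cpu_id, cpu_id + 1]
            cpu_id += THREADS_PER_CORE
    return topology


def allocate_cpuset(topology, socket, count):
    """Give a core application one logical CPU on each of `count` physical
    cores of a single socket, and return the sibling logical CPUs that can
    go to non-core applications that are not busy."""
    cores = [cpus for (s, _core), cpus in topology.items() if s == socket]
    if count > len(cores):
        raise ValueError("request does not fit on one socket")
    primary = [cpus[0] for cpus in cores[:count]]
    siblings = [cpus[1] for cpus in cores[:count]]
    return primary, siblings


topology = build_topology()
core_app, non_core = allocate_cpuset(topology, socket=0, count=4)
print(core_app, non_core)
```

Keeping the whole request on one socket reflects the L3 cache consideration, and refusing requests larger than one socket is a deliberate simplification of this sketch.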

When CPUShare is used, the allocation of Request and Limit also matters a lot. When CPUSet and CPUShare coexist, scheduling is more complicated. For example, the scale-out or decommissioning of CPUSet containers, and their potential requirements, may involve CPU rescheduling of all PODs on the entire machine. In the emerging GPU heterogeneous scheduling scenarios, some techniques are also available for allocating CPUs and GPUs together.

5. Large-Scale Orchestration Scheduling

Large-scale orchestration scheduling is mainly applied to site construction, site migration, or large-scale migration scenarios. Alibaba's frequent site construction during big promotions and super-large-scale site migration are typical examples. Considering costs, it's expected to quickly create hundreds of thousands of PODs in the shortest possible time with minimal labor costs.

The randomness and unpredictability of the sequential requests from multiple tasks result in the shortcomings of centralized scheduling in large-scale fields. Without large-scale orchestration, Alibaba's large-scale site construction often requires a complex process from self-scaling of services to repeated rescheduling. This consumes massive manpower and takes several weeks. Fortunately, large-scale orchestration scheduling ensures a resource allocation rate of more than 99% while achieving hour-level large-scale delivery.
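
A small bin-packing example shows why seeing all requests at once helps. The sketch below uses first-fit placement, a standard heuristic chosen here for illustration rather than ASI's algorithm, and compares placing requests in arrival order with placing the same requests sorted from largest to smallest. The node capacity and request sizes are made up.

```python
def first_fit(requests, node_capacity, node_count):
    """Place each request on the first node with room; return unplaced ones."""
    free = [node_capacity] * node_count
    unplaced = []
    for request in requests:
        for i in range(node_count):
            if free[i] >= request:
                free[i] -= request
                break
        else:
            # No node had enough free capacity for this request.
            unplaced.append(request)
    return unplaced


arrival_order = [4, 4, 6, 6]  # CPU cores requested, in arrival order

# One-by-one scheduling: small requests fragment the first node.
sequential = first_fit(arrival_order, node_capacity=10, node_count=2)

# Batch orchestration: all requests are known, so place the largest first.
batched = first_fit(sorted(arrival_order, reverse=True),
                    node_capacity=10, node_count=2)

print(sequential, batched)
```

In arrival order, one 6-core request cannot be placed even though 6 cores are free in total; with the whole batch visible, every request fits and both nodes are fully allocated.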

General Scheduling Capability

1. Rescheduling

The central scheduler achieves one-time optimal scheduling, but it is completely different from the expected cluster-dimension global optimal scheduling. Rescheduling also includes global central rescheduling and single-machine rescheduling.

Why is central rescheduling a necessary supplement for one-time scheduling? Please see the following examples:

Many POD instances with long lifecycles exist in ASI scheduling clusters. As time goes by, problems, such as numerous resource fragments and uneven CPU utilization, inevitably occur.
The allocation of large-core POD requires dynamic scheduling capabilities, such as real-time eviction of small-core pods to release resources. Pre-planned global rescheduling may also be required to spare a few large cores on many nodes.
Considering the insufficient resource supply, the one-time scheduling on a POD cannot be optimal, with certain defects or imperfections. However, cluster resources change dynamically, thus dynamic migration, namely, rescheduling, is feasible at a later time. This will bring a better business runtime experience.
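
The large-core case in these examples can be sketched as a search for a single migration that makes room. This is a deliberately small illustration, not the central rescheduling algorithm: the node dictionary layout and names are hypothetical, and only one move is considered.

```python
def plan_migration(nodes, big_cpu):
    """Find one small POD whose migration frees big_cpu cores on its node.

    nodes: {node_name: {"free": cores, "pods": {pod_name: cores}}}
    Returns (pod, source, destination) or None if no single move works.
    """
    for src, src_state in nodes.items():
        for pod, cpu in src_state["pods"].items():
            if src_state["free"] + cpu < big_cpu:
                continue  # moving this POD would not free enough cores
            for dst, dst_state in nodes.items():
                if dst != src and dst_state["free"] >= cpu:
                    return (pod, src, dst)
    return None


nodes = {
    "n1": {"free": 6, "pods": {"small-a": 2}},
    "n2": {"free": 5, "pods": {"small-b": 4}},
}
# An 8-core POD fits on neither node, although 11 cores are free in total.
print(plan_migration(nodes, big_cpu=8))
```

A production planner would also have to weigh migration cost, startup time, and ordering across many moves, which is where the complexity discussed in this section comes from.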

The algorithms and implementation of central rescheduling are often very complicated. It requires a deep understanding and full coverage of various rescheduling scenarios, rescheduled DAG graphs with clear definition, and dynamic performance.

Many scenarios also require single-machine rescheduling, including SLO optimization of refined CPU orchestration and single-machine rescheduling optimization driven by QoS data.

It should be emphasized that the implementation of single-machine rescheduling must solve the problem of safety risk control to avoid an uncontrollable blast radius. Considering the lack of risk control capability on the single-machine side, we recommend choosing a central unified trigger under strict protection and control rather than the node autonomy mode. There are many inevitable node autonomy scenarios in Kubernetes. For example, when the pod yaml file is changed, kubelet will perform the corresponding changes. Over the past several years, ASI has continuously hardened each potential risk control point. ASI also iteratively built a Defender system for hierarchical risk control management, with levels including core button, high risk, and medium risk. For potentially risky operations, the single-machine side must interact with the central Defender before executing them, so that disaster events are avoided through security prevention and control. The scheduler should implement the same strict security defense. Otherwise, autonomous node operation is not allowed.
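
The "ask the center before acting" pattern can be sketched as a small gate. Everything below is an assumption for illustration: the Defender class here is a toy stand-in, the risk level keys mirror the levels named above, and the per-window quotas are invented, not taken from the real Defender system.

```python
# Hypothetical quotas: automatic approvals allowed per window and risk level.
# A quota of 0 means the operation is never approved automatically.
RISK_POLICY = {"medium": 50, "high": 5, "core": 0}


class Defender:
    """Toy central gate that rate-limits risky single-machine operations."""

    def __init__(self):
        self.approved = {}

    def approve(self, risk_level):
        used = self.approved.get(risk_level, 0)
        if used >= RISK_POLICY[risk_level]:
            return False
        self.approved[risk_level] = used + 1
        return True


def node_operation(defender, risk_level, action):
    """Run action on the node only if the central Defender approves it."""
    if not defender.approve(risk_level):
        return "blocked"
    action()
    return "executed"


defender = Defender()
results = [node_operation(defender, "high", lambda: None) for _ in range(6)]
print(results)
```

The point of the design is that the limit is enforced centrally, so a misbehaving node cannot repeat a risky operation across the cluster without being stopped.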

2. Kernel Scheduling

A busy host that runs many tasks in parallel cannot avoid kernel-state resource competition among the running tasks. This is inevitable even though central scheduling and single-machine scheduling have jointly ensured the optimal resource allocation, such as CPU allocation and I/O dispersion. The competition is especially fierce in the well-known offline hybrid deployment scenarios. This requires collaboration between central scheduling, single-machine scheduling, and kernel scheduling. For example, the scheduler coordinates the priority of the various resources of a task and then hands it over to kernel scheduling for execution.

This corresponds to multiple kernel isolation technologies, including CPU (scheduling priority BVT and Noise Clean mechanism), memory (memory collection and OOM priority), and network (network priority and I/O).

Today, the isolation mechanism between Guest Kernel and Host Kernel based on the secure containers enables us to avoid the competition of kernel-state resources easily.

3. Elastic Scheduling and Time-Sharing Scheduling

The logic of elastic scheduling and time-sharing scheduling enables better resource reuse in different dimensions.

The ASI scheduler, together with Alibaba Cloud infrastructure, uses the strong elasticity provided by ECS. In Ele.me scenarios, it allows resources to return to the cloud during off-peak hours and applies for resources again during peak hours.

Alternatively, the built-in elastic Buffer of the large ASI resource pool can be used. The host resources of the ASI resource pool come from Alibaba Cloud resources, so the elastic technology of the Alibaba Cloud IaaS layer can also be used. How to balance the two is a much-debated topic.

The time-sharing scheduling of ASI performs optimal resource reuse with great cost decrease. Online transaction POD instances are disabled on a large scale every night. The released resources are used for ODPS offline tasks and then used for online applications again every morning. This classic scenario maximizes the value of the offline hybrid deployment technology.

The core of time-sharing scheduling is resource reuse, which depends on the construction and management of a large resource pool. It is a combination of resource operation and scheduling technologies. It requires the scheduler to accumulate jobs in many different forms and a large number of tasks; the more, the better.
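The nightly handover between online and offline workloads can be pictured with a small sketch. The `Pool` fields, the `split_capacity` helper, and the 01:00 to 07:00 night window are hypothetical values chosen for illustration; the article does not state the actual schedule or pool sizes.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    total_cores: int
    online_day_cores: int    # cores reserved for online services in the daytime
    online_night_cores: int  # cores kept for online services at night


def split_capacity(pool, hour, night_start=1, night_end=7):
    """Return (online_cores, offline_cores) for a given hour of the day."""
    is_night = night_start <= hour < night_end
    online = pool.online_night_cores if is_night else pool.online_day_cores
    # Whatever the online side does not hold is lent to offline tasks.
    return online, pool.total_cores - online


pool = Pool(total_cores=10000, online_day_cores=8000, online_night_cores=2000)
night_split = split_capacity(pool, hour=3)   # most cores go to offline tasks
day_split = split_capacity(pool, hour=12)    # most cores return to online apps
```

The larger the pool and the more varied the jobs, the easier it is to find offline work that fits the capacity freed at night.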

4. Vertical Scaling Scheduling/X+1/VPA/HPA

Vertical scaling scheduling is a second-level delivery technology that partially solves the burst traffic problem. It is also the key to controlling the risk of peak pressure at midnight during big promotions. By vertically adjusting the resources of existing PODs and scheduling CPUs accurately and reliably, computing resources can be delivered in seconds through shuffling algorithms. Vertical scaling scheduling is related to VPA technology and can be regarded as one of the VPA scenarios.
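As a rough illustration of adjusting existing PODs in place, the sketch below raises each POD's CPU toward a target using only the free cores on its node. The function `scale_up_in_place` and its first-come ordering are assumptions made for this example; the shuffling algorithms mentioned above are not described in this article and are not modeled here.

```python
def scale_up_in_place(node_capacity, pod_cpus, targets):
    """Raise each pod's CPU toward its target without exceeding node capacity.

    pod_cpus: current cores per pod, e.g. {"a": 8, "b": 8}
    targets:  desired cores per pod during the peak
    Returns the new per-pod allocation; no pod is moved or restarted.
    """
    free = node_capacity - sum(pod_cpus.values())
    new_alloc = dict(pod_cpus)
    for pod, target in targets.items():
        # Grant what the pod still needs, capped by what the node has left.
        grant = min(max(target - new_alloc[pod], 0), free)
        new_alloc[pod] += grant
        free -= grant
    return new_alloc


result = scale_up_in_place(32, {"a": 8, "b": 8}, {"a": 16, "b": 20})
# "a" reaches its target of 16; "b" gets the remaining 8 free cores and stops at 16.
```

Because no new POD has to be created or started, the extra capacity is available as soon as the allocation is updated, which is why this approach suits burst traffic.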

In a sense, "X+1" horizontal scaling scheduling can be considered as one of the HPA scenarios, but it is manually triggered. "X+1" emphasizes the ultimate efficiency of resource delivery, which requires great improvements in R&D efficiency. An online POD can start and provide services within "X" minutes. All operations, except application startup, must be completed in "1" minute.

Vertical scaling scheduling and "X+1" complement each other and jointly safeguard services through various traffic peaks.

ASI is also implementing more VPA and HPA scenarios. For example, VPA technology can provide more free computing power for red envelope activities during the Spring Festival, which saves a lot of costs.

The implementation of scheduling technologies, such as VPA and HPA, in more scenarios will improve continuously in the future.

5. Hierarchical [Differentiated SLO] Resource Scheduling

Differentiated SLO scheduling is one of the core parts of the scheduler. This section is similar to the Scheduling Resource Type section above. Given the complexity of differentiated SLO, it will be introduced in the last section of this chapter.

The ASI scheduler also accurately defines Service Level Objectives (SLO), Quality of Service (QoS), and Priority.

1) SLO

SLO refers to service level objectives. ASI provides differentiated SLOs through different QoS and Priority settings, and different SLOs are priced differently. Users can choose the type of SLO-guaranteed resources that suits their business characteristics. For example, offline data analysis tasks can choose a low-level SLO for a lower price, while important business scenarios can choose a high-level SLO that costs more.

2) QoS

QoS is responsible for resource quality assurance. The QoS classes defined in the community are Guaranteed, Burstable, and BestEffort, and they are derived entirely from the Request and Limit values. The QoS defined in ASI does not map one-to-one to the community classes. ASI defines QoS in another dimension, including LSE, LSR, LS, and BE, to describe Alibaba's scenarios clearly (such as CPUShare and hybrid deployment). Different levels of resource assurance are clearly separated. Different businesses can choose different levels of QoS according to their latency sensitivity.
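For reference, the community mapping from Request and Limit to a QoS class can be sketched as follows. This is a simplified reading of the Kubernetes rules that only looks at CPU and memory and assumes quantities are already normalized to comparable values; the function name `community_qos_class` and the flattened container dictionaries are hypothetical. ASI's LSE, LSR, LS, and BE classes follow different rules that this article does not spell out, so they are not modeled.

```python
def community_qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    containers: list of dicts such as
        {"requests": {"cpu": 1, "memory": 1024}, "limits": {"cpu": 1, "memory": 1024}}
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req, lim = requests.get(resource), limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit for every resource, and the request
            # (which defaults to the limit when omitted) must equal it.
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


latency_sensitive = community_qos_class(
    [{"requests": {"cpu": 4, "memory": 8192}, "limits": {"cpu": 4, "memory": 8192}}]
)
offline_batch = community_qos_class([{}])
```

Here `latency_sensitive` evaluates to "Guaranteed" and `offline_batch` to "BestEffort". The community classification has only these three outcomes, which is one reason a scheduler might define a finer-grained scale of its own.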

3) PriorityClass

PriorityClass and QoS are two different concepts. PriorityClass focuses on the importance of the task.

The resource allocation strategies and the importance of the task, namely, QoS and PriorityClass, can be combined in different ways. They still require a certain correspondence. For example, a PriorityClass named Preemptible can correspond to Best Effort QoS for most tasks.

Different scheduling systems have different definitions for PriorityClass. For example:

ASI defines the System, Production, and Preemptible PriorityClass.
In Hippo, there are System, ServiceHigh, ServiceMedium, ServiceLow, JobHigh, JobMedium, and JobLow PriorityClass.

Globally Optimal Scheduling

1. Scheduling Simulator

The scheduling simulator is similar to Alibaba's full-procedure stress testing system. It verifies new scheduling capabilities in a simulated environment through real online traffic replaying or simulated traffic replaying. Thus, it continuously optimizes various scheduling algorithms and metrics.

Another common use of the scheduling simulator is to reproduce difficult online problems offline, so that they can be located efficiently without affecting production.

To some extent, the scheduling simulator is the basis for globally optimal scheduling. It allows repeated optimizations on various algorithms, technical frameworks, and technical procedures in the simulation environment. By doing so, it optimizes global metrics, such as the global distribution ratio, scheduling performance in different scenarios, and scheduling stability.
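The idea of replaying traffic against interchangeable algorithms can be shown with a toy simulator. This is a sketch only: `simulate`, `best_fit`, and `worst_fit` are hypothetical names, the trace contains CPU requests only, and the real simulator replays far richer production data.

```python
def simulate(node_capacities, pod_requests, choose_node):
    """Replay pod CPU requests in order against simulated nodes.

    Returns (allocation_ratio, unscheduled_count) for the given strategy.
    """
    free = list(node_capacities)
    unscheduled = 0
    for request in pod_requests:
        candidates = [i for i, f in enumerate(free) if f >= request]
        if not candidates:
            unscheduled += 1
            continue
        free[choose_node(candidates, free, request)] -= request
    total = sum(node_capacities)
    return (total - sum(free)) / total, unscheduled


def best_fit(candidates, free, request):
    # Pick the feasible node that is left with the least free CPU (packs tightly).
    return min(candidates, key=lambda i: free[i] - request)


def worst_fit(candidates, free, request):
    # Pick the feasible node with the most free CPU (spreads load).
    return max(candidates, key=lambda i: free[i])


trace = [4, 4, 8]  # replayed pod CPU requests
packed = simulate([8, 8], trace, best_fit)    # (1.0, 0)
spread = simulate([8, 8], trace, worst_fit)   # (0.5, 1)
```

On this trace, packing keeps a whole node free for the 8-core request, while spreading fragments both nodes and leaves that request unscheduled. Comparing such metrics across strategies before touching production is the kind of experiment a simulator makes cheap.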

2. Elastic Scheduling Platform (ESP)

To achieve globally optimal scheduling, ASI built a brand-new ESP for the scheduler. ESP aims to establish a scheduler-centered, all-in-one, and closed-loop scheduling efficiency system, based on scheduling data guidance, core scheduling capabilities, and product-based scheduling operations.

Many similar modules have been developed, such as scheduling SLO inspection, scheduling tools, and two-layer scheduling platforms, for different scenarios. ESP, together with more two-layer scheduling capabilities, can provide globally optimal scheduling. They also focus on enhancing business stability, reducing resource costs, and improving user experience to provide more considerate services.

More Scheduling Capabilities

This article systematically introduces the basic concepts, principles, and various scenarios of ASI Scheduler. Unfortunately, many details of the scheduler cannot be covered thoroughly. Many in-depth scheduling capabilities were not introduced in this article, such as heterogeneous machine scheduling, scheduling profiling, fair scheduling, priority scheduling, shifting scheduling, preemptive scheduling, disk scheduling, Quota, CPU normalization, GANG Scheduling, scheduling tracing, and scheduling diagnosis. ASI's powerful scheduling framework structure and optimizations, scheduling performance optimizations, and other technologies were also not covered.

By 2019, ASI had scaled the original Kubernetes single cluster to tens of thousands of nodes. It continues to maintain a large number of large-scale computing clusters within Alibaba Group, thanks to ACK's powerful Kubernetes O&M system. ASI has also accumulated industry-leading production practices for running multiple Kubernetes clusters. In these large-scale Kubernetes clusters, ASI continuously provides computing power for complex tasks through its enhanced container scheduling technology.

Over the past few years, as part of its comprehensive cloud migration, Alibaba Group has migrated and evolved its scheduling from ASI control to ACK. In the future, more cloud scheduling capabilities will be continuously provided and enhanced in Alibaba's complex, rich, and large-scale business scenarios.