Unified Scheduling System First Implemented on a Large Scale While Supporting Alibaba’s Businesses During Double 11

This short article discusses how the Unified Scheduling System impacted Double 11 in 2021.

By Aliware

1. Background

Unified Scheduling 1.0 successfully supported the Double 11 Global Shopping Festival in 2021. The project upgraded and optimized the whole process, from container scheduling to fast scale-in and scale-out. Its 100 core team members took it through approval, proof of concept (POC), project review and design, closed development and testing, and promotion before it launched successfully.

It is one of the core projects of Alibaba. The Alibaba Cloud Container Team, Big Data Team, Resource Efficiency Team, and the Ant Group Container Orchestration Team have been developing this project for over a year. Finally, they fully upgraded the hybrid deployment technology to today's unified scheduling technology.

Today, unified scheduling covers Alibaba's e-commerce, search and promotion, MaxCompute, and Ant businesses. It has unified pod scheduling and high-performance task scheduling, achieved a single resource view with coordinated scheduling, and enabled hybrid deployment and higher utilization across various complex business forms. It can fully support large-scale resource scheduling in dozens of data centers with millions of containers and tens of millions of CPU cores worldwide.


2. Comprehensive Upgrade of Unified Scheduling Technology

The essence of cloud computing is to pool small computing fragments into larger resource pools and to exploit peak-load shifting to achieve the best possible energy efficiency ratio. Alibaba's exploration of this technology is ongoing, driven by our desire to protect the environment, advance technological development, and improve efficiency. Engineers at Alibaba hope to turn the computing power of data centers into out-of-the-box infrastructure, like water or electricity.

In the past, Alibaba used hybrid deployment technology to eliminate the split between multiple resource pools and maximize the complementary benefits of peak-load shifting between businesses. Multiple scheduling brains from different computing fields shared the same resources. Hybrid deployment unified resources and improved utilization, but its reliance on multiple schedulers became its bottleneck.
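The benefit of peak-load shifting can be illustrated with a little arithmetic. The following is a minimal sketch, not Alibaba's model: the hourly demand figures and the function names are hypothetical and chosen only to show why one shared pool can need less capacity than separate pools.

```python
# Hypothetical CPU demand (in cores) over eight time slots for two workloads.
# The numbers are illustrative only and are not taken from the article.
online_demand = [20, 20, 30, 60, 90, 100, 80, 40]   # peaks in the daytime
offline_demand = [90, 100, 80, 40, 20, 10, 30, 70]  # peaks at night


def separate_pool_capacity(*demands):
    """Capacity needed when each workload owns its own pool: the sum of peaks."""
    return sum(max(d) for d in demands)


def shared_pool_capacity(*demands):
    """Capacity needed when workloads share one pool: the peak of the summed load."""
    return max(sum(slot) for slot in zip(*demands))


separate = separate_pool_capacity(online_demand, offline_demand)
shared = shared_pool_capacity(online_demand, offline_demand)
print(f"separate pools: {separate} cores, shared pool: {shared} cores")
```

Because the two workloads peak at different times, the shared pool in this toy example needs far fewer cores than the two separate pools combined.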

Alibaba continues to pursue the construction of a new generation of scheduling technology that can support more complex tasks with undifferentiated hybrid deployment, extreme elasticity, and complementarity. This way, Alibaba can achieve global optimal scheduling and provide computing power with higher quality. This year, we reached a new stage in technology. We worked together and led many teams to launch a new unified scheduling project based on Alibaba Cloud Container Service for Kubernetes (ACK).


The large-scale unified scheduling that made its debut during Double 11 this year centrally managed computing, storage, and network resources through a single set of scheduling protocols and one system architecture. It marked a breakthrough in the industry, featuring ultra-large-scale resources, high efficiency, and auto scaling. Through offline hybrid deployment, online-offline hybrid deployment, and fast scale-in and scale-out, unified scheduling has avoided the procurement of tens of thousands of servers. It has also reduced our costs by hundreds of millions of yuan and improved efficiency.

This year, large-scale data intelligence was introduced to further enrich scheduling capabilities. It provides real-time load awareness, automatic specification recommendation (using the Vertical Pod Autoscaler, VPA), differentiated SLO workload scheduling, CPU normalization, time-division multiplexing, and a Horizontal Pod Autoscaler (HPA) that supports periodic prediction. It also provides more dimensions of cost optimization technology and high-reliability assurance in the container runtime.
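To make "real-time load awareness" more concrete, here is a minimal sketch of a placement rule that looks at measured node usage instead of only at declared requests. It is an assumption-laden illustration, not the scheduler described in this article: the `Node` class, the `pick_node` function, and the 0.8 utilization ceiling are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity_cores: float
    used_cores: float  # real-time usage, e.g. from a hypothetical metrics agent

    def utilization_after(self, request_cores: float) -> float:
        """Projected CPU utilization if the request were placed on this node."""
        return (self.used_cores + request_cores) / self.capacity_cores


def pick_node(nodes, request_cores, max_utilization=0.8):
    """Return the feasible node with the lowest projected utilization, or None.

    max_utilization is an arbitrary illustrative ceiling, not a value from
    the article.
    """
    feasible = [
        n for n in nodes if n.utilization_after(request_cores) <= max_utilization
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda n: n.utilization_after(request_cores))


nodes = [
    Node("a", capacity_cores=32, used_cores=24),
    Node("b", capacity_cores=32, used_cores=10),
    Node("c", capacity_cores=64, used_cores=40),
]
chosen = pick_node(nodes, request_cores=4)
print(chosen.name if chosen else "no feasible node")
```

Preferring the least-loaded feasible node is one simple way to avoid creating hot spots; a production scheduler would weigh many more signals, such as memory, SLO class, and predicted load.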

With the new-generation unified scheduling, Alibaba e-commerce, search, big data, and many other platforms, as well as different types of complex computing resources, request resources in a consistent way. Unified quota management and resource planning provided by unified scheduling make it possible to borrow hundreds of thousands of core resources within seconds. Based on unified scheduling, Alibaba Cloud and Ant Group have also realized the integration of scheduling technologies, and the technological ecosystem of Ant Group has been upgraded to unified scheduling. The scheduling platform brings more possibilities to the future. For example, we can use numerous means, such as price leverage, to make Alibaba's internal businesses use the resources of each data center more rationally. This will ensure that the resource usage in the data center is balanced as much as possible to improve the energy efficiency ratio of the data center.

ACK provides enhanced support for standard Kubernetes, with higher performance, higher throughput, and lower response latency, to build a stable and reliable ultra-large-scale single cluster. It has steadily supported ultra-large-scale clusters with over one million CPU cores and 12,000 nodes, laying a solid foundation for unified scheduling of production and operations in large resource pools. Many types of complex Alibaba resources have also been integrated and upgraded based on ACK.

In addition to the classic scenarios of Alibaba, such as e-commerce, search, and big data, unified scheduling has also significantly empowered new technological innovations. In livestreaming e-commerce, decision-making places high demands on real-time computing. For example, a live studio during Double 11 may have more than 90 million online viewers, who generate real-time data such as browsing and trading records. This data must be analyzed within seconds. This year, Alibaba upgraded the real-time computing engine Blink to a new-generation engine based on unified scheduling, which reduced cost and significantly improved performance, stability, and user experience. Compared with YARN, the performance of operations on large-scale tasks increased by 40%, and the error recovery efficiency improved by 100%. Unified scheduling saved hundreds of thousands of CPUs for Alibaba during Double 11. It can keep the cluster entirely free of hot spots even when the cluster's CPU resource usage exceeds 65%. This ensures the timeliness of stream pushing during each livestream.

In terms of Serverless, the function service was implemented on a large scale in Alibaba for the first time. Serverless supported more than ten business scenarios, such as Taobao search recommendation, data processing, and frontend SSR during Double 11. Function Compute (FC) can make full use of large-scale fragmented resources of the cluster using the unified scheduling technology. This solves the cost problem of idle resources during off-peak hours in Serverless scenarios. Based on the on-demand loading of ACK images and network stack optimization, the cold start time of function instances is less than 150 milliseconds, and the pool technology ensures that the cold start rate of Function Compute containers is less than 5%. This is the key to ensuring the success of Double 11.
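The effect of a warm pool on the cold start rate can be sketched in a few lines. This is a toy model under stated assumptions, not Function Compute's implementation: the `WarmPool` class and its methods are hypothetical, and a request counts as a cold start whenever no idle instance is available.

```python
from collections import deque


class WarmPool:
    """Toy pool of pre-started instances (hypothetical, for illustration)."""

    def __init__(self, warm_instances: int):
        self.idle = deque(f"warm-{i}" for i in range(warm_instances))
        self.cold_starts = 0
        self.requests = 0

    def acquire(self) -> str:
        """Serve a request from an idle instance, or cold start a new one."""
        self.requests += 1
        if self.idle:
            return self.idle.popleft()
        self.cold_starts += 1
        return f"cold-{self.cold_starts}"

    def release(self, instance: str) -> None:
        """Return an instance to the pool so later requests can reuse it."""
        self.idle.append(instance)

    def cold_start_rate(self) -> float:
        return self.cold_starts / self.requests if self.requests else 0.0


pool = WarmPool(warm_instances=2)
first_burst = [pool.acquire() for _ in range(3)]   # third request cold starts
for instance in first_burst:
    pool.release(instance)
second_burst = [pool.acquire() for _ in range(3)]  # all served from the pool
print(f"cold start rate: {pool.cold_start_rate():.2%}")
```

In this model, only the request that arrives when the pool is empty pays the cold start cost, and the instance it creates is reused afterward, which is why a well-sized pool keeps the cold start rate low.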


3. Prospect

In the future, ACK will spread Alibaba's experience of unified scheduling to the entire industry to support more new computing load ecosystems and the architecture evolution of new technology forms. By that time, we will use cloud computing everywhere to fully empower more enterprises and bring more low-carbon dividends.
