Introduction to Alibaba's large-scale Flink cluster operation and maintenance system

1. Evolution history and O&M challenges

Alibaba's real-time computing has developed rapidly over the past 10 years and can be roughly divided into three eras:

1.0 era: from 2013 to 2017, three real-time computing engines coexisted. The familiar JStorm and Blink were among them, and the field was generally referred to as stream computing at the time.

2.0 era: in 2017, the Group consolidated its three major real-time computing engines. Blink, by virtue of its excellent performance and high throughput, became the sole real-time computing engine, unifying the system. Over the following four years, all of the Group's real-time computing business was migrated to Blink. Alibaba's real-time computing business grew at its fastest pace, and the platform scale expanded from the thousand level to the ten-thousand level; all real-time computing ran on Blink.

3.0 era: with Alibaba's acquisition of Flink's parent company in Germany two years ago, the Alibaba China team and the German team jointly built the new VVP platform on a new cloud-native base, powered by the new open-source Flink engine. On Double 11 in 2021, the new VVP platform supported Double 11 stably and with substantial performance improvements, marking the entry of Alibaba's real-time computing into the new 3.0 era.

Today, Alibaba's real-time computing has compute capacity at the million level, tens of thousands of physical machines, and tens of thousands of jobs, truly forming a super-large-scale real-time computing platform. Moreover, amid rapid business growth, the platform's overall architecture is undergoing a large-scale evolution from off-cloud Flink on Hadoop to cloud-native Flink on Kubernetes.

Facing such a behemoth of real-time computing, O&M also faces different challenges as the times change:

The first stage is platform O&M. The core is to help SRE handle platform O&M at ultra-large scale, that is, the problem of Flink Cluster O&M;
The second stage is application O&M. The core is to help the large number of real-time computing users on the cluster solve the complex problem of Flink job O&M on the application side;
The third stage comes with the 3.0 era: the cluster base is fully cloud-native, and global data is standardized along with it, so how to quickly evolve and improve our O&M capabilities toward cloud-native and intelligent operation has become our new challenge.

2. Cluster O&M: Flink Cluster

On the one hand, a very typical business runs on the Flink platform: the GMV flip-card on the media big screen on the day of the Double 11 promotion, the well-known transaction-volume big screen. This business has extremely high stability requirements. Beyond the GMV flip-card, Flink also hosts all of Alibaba's important real-time computing business, including core e-commerce real-time scenarios such as Alimama, advertising metering and billing, search and recommendation, and the machine learning platform. These real-time scenarios are both important and latency-sensitive, so stability is the number one challenge.

On the other hand, because of the platform's huge scale, involving tens of thousands of dedicated machines deployed across multiple regions, platform growth keeps increasing deployment complexity, and local anomalies become the norm. This is the second major challenge to stability.

Facing the dual challenges of important, latency-sensitive business and a large, structurally complex platform, how to keep the cluster stable is a hard problem.

In the beginning, the Flink cluster used the number of failures to measure stability, but this granularity was too coarse: many stability anomalies did not meet the failure-duration threshold and were never reflected in the failure count, leaving blind spots in stability. Later, we built a set of SLA availability metrics based on minute-level availability to measure the stability of the entire cluster.

The SLI is the golden indicator used to calculate the SLA; it represents the availability of the Flink Cluster. Because a cluster is a virtual logical concept, we define the SLI in terms of Flink job status. Flink job status itself is very complicated, but it can be abstracted into three states: scheduling, running normally, and running abnormally. Each job's state is computed and then aggregated to the cluster level to form the proportion of abnormal jobs; once that proportion exceeds a certain threshold, the cluster is considered unavailable. The SLI is measured this way, and the unavailable time over the whole year is then derived from it.

The final SLA measurement can be expressed as a simple formula: SLA unavailable time = number of SLA exceptions × average duration of each SLA exception, with availability obtained by subtracting the unavailable time from the total time window. This achieves minute-level, fine-grained measurement of cluster stability.
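As a rough illustration of how such a metric can be computed, here is a minimal sketch in Python. It assumes per-minute snapshots of job states and an illustrative 5% abnormal-job threshold; the data model, threshold, and helper names are assumptions, not the platform's actual implementation.

```python
# Minimal sketch (assumed data model): derive a cluster-level SLI from per-job
# states and compute SLA availability over a measurement window.
from dataclasses import dataclass
from enum import Enum
from typing import List

class JobState(Enum):
    SCHEDULING = "scheduling"
    RUNNING_OK = "running_ok"
    RUNNING_ABNORMAL = "running_abnormal"

@dataclass
class MinuteSample:
    minute: int                     # minute offset within the measurement window
    job_states: List[JobState]      # states of all jobs in the cluster at that minute

UNHEALTHY_RATIO_THRESHOLD = 0.05    # assumed: >5% abnormal jobs => cluster unavailable

def minute_is_available(sample: MinuteSample) -> bool:
    """Cluster-level SLI for one minute: available unless too many jobs are abnormal."""
    if not sample.job_states:
        return True
    abnormal = sum(1 for s in sample.job_states if s is JobState.RUNNING_ABNORMAL)
    return abnormal / len(sample.job_states) <= UNHEALTHY_RATIO_THRESHOLD

def sla_availability(samples: List[MinuteSample]) -> float:
    """Availability = 1 - (unavailable minutes / total minutes in the window)."""
    if not samples:
        return 1.0
    unavailable = sum(1 for s in samples if not minute_is_available(s))
    return 1.0 - unavailable / len(samples)
```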

With fine-grained quantification in place, the next step is improvement, which again starts from the formula above by optimizing its two factors: preventing SLA exceptions to reduce their number, and recovering quickly once an SLA exception occurs to shorten its duration, which together improve overall availability.

The first part is SLA exception prevention. The key idea is to do thorough cluster inspections, proactively discover hidden risks, and eliminate them in time, thereby reducing the number of SLA exceptions.

What kinds of hidden risks lead to SLA exceptions? For example, a batch of very large jobs starts suddenly, pushing the load of hundreds of machines in the cluster too high or filling their disks and causing a large number of job heartbeat timeouts; or a particular Flink version has a major stability defect that affects nearly a thousand online jobs. These seemingly rare failure scenarios actually occur almost every day in a super-large cluster with rich business scenarios; they are an inevitable challenge once a platform reaches a certain scale. Moreover, the larger the cluster, the more easily a butterfly effect occurs and the larger the impact tends to be. On top of that, locating each cluster anomaly is complex and time-consuming. How do we eliminate these SLA exceptions?

Our approach is to build a Flink Cluster exception self-healing service. It regularly scans the behavioral data of all online jobs, such as job delay, failover, and backpressure, and then performs anomaly analysis and decision-making on this massive data to find hidden risks (a simplified classification sketch follows the list below). In general, there are two types of exceptions:

One type is caused by the user's own job behavior; the user is notified to change the corresponding job, for example OOM caused by unreasonable resource configuration, or delay caused by job backpressure.
The other type is caused by problematic versions on the platform side; the platform carries out large-scale proactive upgrades to eliminate these problematic versions.
Finally, both the platform side and the user side form a closed loop of SLA exception self-healing, thereby reducing the number of SLA exceptions.
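The sketch below, assuming a simplified rule engine, shows how scanned anomalies might be routed to a user-side notification or a platform-side upgrade; the field names, blacklisted version, and action labels are illustrative assumptions, not the production rules.

```python
# Minimal sketch: classify scanned job anomalies into user-side and platform-side
# issues and route each to the matching closed-loop action.
from dataclasses import dataclass

@dataclass
class JobAnomaly:
    job_id: str
    kind: str              # e.g. "oom", "backpressure_delay", "failover"
    engine_version: str

# assumed: versions known to carry a stability defect
BLACKLISTED_VERSIONS = {"flink-internal-build-with-known-defect"}

def decide_action(anomaly: JobAnomaly) -> str:
    """Return the self-healing action for one detected anomaly."""
    if anomaly.engine_version in BLACKLISTED_VERSIONS:
        # platform-side: defective version -> schedule a proactive, transparent upgrade
        return "platform_upgrade"
    if anomaly.kind == "oom":
        # user-side: unreasonable resource configuration -> notify owner to adjust memory
        return "notify_user_adjust_resources"
    if anomaly.kind == "backpressure_delay":
        # user-side: backpressure-induced delay -> notify owner to tune the job
        return "notify_user_tune_job"
    return "open_ticket_for_sre"
```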

Within the exception self-healing service, the most complicated part is identifying the underlying rules and making decisions. Through long accumulation, we have collected dozens of the most frequent exception rules and their business-side governance solutions, so that previously "invisible" hidden risks can be identified and eliminated fully automatically, truly achieving preventive stability.

According to the SLA exception formula, besides prevention to reduce the number of exceptions, the other lever is to shorten the duration of an exception after it occurs.

The challenge is that a single online cluster runs nearly 10,000 jobs, and any cluster-level fault is hard to locate and slow to recover from. In addition, the clusters are numerous and widely distributed, which raises the probability of failure. With these two factors combined, several failures a year have almost become the norm, and stability work stays very passive. We need to turn passive into active: if we can switch traffic quickly to achieve cluster-level disaster recovery in failure scenarios, SLA exception recovery not only becomes shorter but also becomes more deterministic.

The disaster recovery system is mainly divided into three parts:

First, where to switch to. Real-time computing requires millisecond-level network latency, and tens of milliseconds across cities definitely cannot meet the real-time requirements. Therefore, in the platform-side deployment architecture, compute is deployed in two data centers in the same city, paired for disaster recovery with a mutual active-standby traffic-switching layout, so that in a failure scenario there is always somewhere to switch to.

Second, resource capacity is limited. A platform of this size cannot budget full disaster-recovery resources for everything, so trade-offs are required. How do we distinguish high-priority from low-priority business? The platform has established a set of priority standards for Flink jobs based on business scenarios, along with an automated management system covering the whole process from application to governance to rectification and downgrade. Priorities are refined on the business side so that truly high-priority business is guaranteed in both quality and quantity; with limited resources, the focus is on protecting high-priority business.

The last part is the most complicated: how to switch jobs transparently. The core idea is to reuse storage and make the compute switch transparent, so the business is completely unaware.

Flink jobs are all long-running and carry state, the intermediate computation results. First, in the cluster deployment architecture, compute and storage clusters are physically separated. When a compute cluster fails, for example due to an infrastructure anomaly, all Flink jobs can be relocated to the disaster-recovery cluster through traffic switching, while their state storage still points to the original storage cluster, so they can be restored from the original state point. This achieves a truly transparent migration that users do not perceive.
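As a minimal sketch of this idea, assuming compute/storage separation and a hypothetical job model, the following shows a job failing over to a standby compute cluster while resuming from state that stays on the original storage cluster; the cluster names, checkpoint helper, and submission steps are assumptions, not the real platform interface.

```python
# Minimal sketch: move a job to a standby compute cluster and restore it from the
# checkpoint that still lives on the shared storage cluster.
from dataclasses import dataclass

@dataclass
class FlinkJob:
    job_id: str
    compute_cluster: str
    checkpoint_uri: str    # e.g. "oss://state-bucket/checkpoints/<job_id>/chk-42"

def latest_checkpoint(job: FlinkJob) -> str:
    # assumed helper: look up the newest completed checkpoint on the storage cluster
    return job.checkpoint_uri

def failover_to_standby(job: FlinkJob, standby_cluster: str) -> FlinkJob:
    """Resubmit the job on the standby cluster, restoring from the original state path."""
    restore_path = latest_checkpoint(job)
    # 1. cancel on the failed compute cluster (best effort; it may already be down)
    # 2. resubmit on the standby cluster, restoring from restore_path, so the
    #    migration is transparent to the business
    return FlinkJob(job.job_id, compute_cluster=standby_cluster,
                    checkpoint_uri=restore_path)
```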

Beyond daily stability, Double 11 is the big test of stability. Flink's special protection for Double 11 can be summarized in four items (eight characters in Chinese): stress testing, degradation, rate limiting, and hotspots. Behind each item we have accumulated a mature guarantee system.

The first item, stress testing, refers to the stress-testing platform. It first gives users the ability to clone production jobs into shadow jobs with one click; it then provides large-scale, precise capabilities for generating, controlling, and stabilizing load, plus automated job performance tuning; the final step is a fully automated, one-stop stress-testing solution launched with one click.
The second item is degradation, referring to the degradation platform: at the zero-o'clock peak of the big promotion, low-priority business must be degraded quickly to keep the cluster water level under reasonable control.

The third item is rate limiting. Some medium- or high-priority business cannot be degraded during the big promotion but can tolerate a short delay. The platform therefore implements isolation and limiting of job Pod resources based on the Linux kernel's cgroups, achieving precise, job-granularity limiting of compute.
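As a rough illustration of cgroup-based CPU limiting, here is a minimal sketch assuming cgroup v2 and direct access to the node's cgroup filesystem; the paths and quota values are illustrative, and on a real Kubernetes cluster this is usually expressed through Pod resource limits instead.

```python
# Minimal sketch: throttle the CPU of one job's Pod by writing a cpu.max quota to
# its cgroup, so the job is rate-limited rather than degraded.
from pathlib import Path

def throttle_pod_cpu(pod_cgroup: str, cpu_cores: float, period_us: int = 100_000) -> None:
    """Write a cgroup v2 cpu.max quota, e.g. 2.5 cores -> '250000 100000'."""
    quota_us = int(cpu_cores * period_us)
    cpu_max = Path("/sys/fs/cgroup") / pod_cgroup / "cpu.max"
    cpu_max.write_text(f"{quota_us} {period_us}\n")

# Example (hypothetical cgroup path for one job's Pod):
# throttle_pod_cpu("kubepods.slice/kubepods-podexample.slice", cpu_cores=2.5)
```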

The fourth item is hotspot machines, also the most complicated part of the big promotion. From the cluster's perspective, there is a gap between the resources the cluster has allocated and the resources jobs actually use. For example, a Flink job may request 10 CPUs but actually use 5, and peaks and troughs in usage lead to unbalanced water levels at the cluster level.

The first figure above shows that at the cluster scheduling level the resource level of all machines is very even, with CPU and memory almost on the same line. However, the actual physical water levels of the machines running in the cluster are uneven, because the scheduler is not aware of physical usage. As the cluster water level keeps rising, for example when the zero-o'clock peak of the big promotion arrives, the hotspot machines rise with it: some machines hit a performance bottleneck in one dimension, such as CPU usage of 95% or higher, and become hotspot machines.

In a distributed system, all the services running on these machines are stateful and interrelated. Local hotspot machines not only affect cluster stability but also become the bottleneck for raising cluster utilization and cause cost waste. In other words, hotspot machines are the short board of both cluster stability and water-level improvement.

Resolving hotspot machines is a very hard problem and generally goes through four steps (a minimal detection sketch follows the list):

The first step is to discover hotspot machines, covering CPU, memory, network, and disk. The difficulty is that the hotspot thresholds come from the rich online experience of SRE.
The second step is analysis. We built a series of machine diagnostic tools to locate hotspot processes, including mapping CPU and IO usage back to processes. The difficulty is that this requires a deep understanding and analysis of how the whole Linux system works.
The third step is business decision-making and strategy: associate the hotspot processes with business data and make decisions, since different priorities can accept different strategies.
The last step is actually resolving the hotspot machines: low-priority jobs are degraded or rebalanced, while medium- and high-priority jobs relieve the hotspot by diverting or limiting their load.
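To illustrate the first step only, here is a minimal detection sketch assuming node metrics are already collected; the thresholds and data model are illustrative assumptions standing in for the SRE experience described above.

```python
# Minimal sketch: flag hotspot machines by comparing each resource dimension
# against SRE-defined thresholds (step one of the four-step process).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class NodeMetrics:
    host: str
    cpu_util: float       # 0.0 - 1.0
    mem_util: float
    disk_util: float
    net_util: float

# assumed thresholds distilled from SRE experience
HOTSPOT_THRESHOLDS = {"cpu_util": 0.95, "mem_util": 0.90, "disk_util": 0.85, "net_util": 0.80}

def find_hotspots(nodes: List[NodeMetrics]) -> List[Tuple[str, str, float]]:
    """Return (host, dimension, value) for every machine exceeding a threshold."""
    hits = []
    for node in nodes:
        for dim, limit in HOTSPOT_THRESHOLDS.items():
            value = getattr(node, dim)
            if value >= limit:
                hits.append((node.host, dim, value))
    return hits
```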

What lies behind this process includes an understanding of the business (priorities, resources, and configuration profiles), an understanding of scheduling (resource allocation and scheduling strategies), in-depth investigation and analysis of the system kernel, and business experience and strategy, such as whether to limit or degrade. Defining and analyzing this entire chain is a very complicated technical problem.


What we are doing is consolidating the complete hotspot-machine solution and building a Flink Cluster AutoPilot based on cloud-native Kubernetes to achieve fully automatic self-healing of hotspot machines.

In terms of deployment form, the AutoPilot service is fully hosted on K8s, deployed lightly per cluster, and managed and operated conveniently through configuration files; in the execution phase, K8s guarantees the desired final state and eventual consistency. In terms of technical capability, AutoPilot abstracts the whole hotspot-machine analysis process into six stages: definition, perception, analysis, decision, execution, and observability of the entire process. This achieves fully automated self-healing of hotspot machines with high observability, improving cluster stability and reducing cost.
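One way to picture the six-stage abstraction is as a small stage pipeline; the sketch below is an assumed modeling exercise only, with stage names taken from the text and stub bodies that are purely illustrative.

```python
# Minimal sketch: the six AutoPilot stages chained as a pipeline, with a trace kept
# for observability of each run.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

def run_pipeline(context: Dict, stages: List[Stage]) -> Dict:
    """Run the hotspot self-healing stages in order, recording each stage name."""
    for stage in stages:
        context = stage(context)
        context.setdefault("trace", []).append(stage.__name__)
    return context

# hypothetical stage bodies; a real implementation would fill these in
def define(ctx):   ctx["thresholds"] = {"cpu_util": 0.95}; return ctx   # hotspot definition
def perceive(ctx): ctx["hotspots"] = []; return ctx                     # collect node metrics
def analyze(ctx):  ctx["root_causes"] = []; return ctx                  # map hotspots to processes/jobs
def decide(ctx):   ctx["actions"] = []; return ctx                      # choose degrade/limit/rebalance
def execute(ctx):  return ctx                                           # apply actions via K8s
def observe(ctx):  return ctx                                           # emit metrics and events

result = run_pipeline({}, [define, perceive, analyze, decide, execute, observe])
```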

Over the past few years, around the three core O&M values of stability, cost, and efficiency, SRE has accumulated a large number of O&M capabilities and better O&M platforms for ultra-large-scale Flink Cluster operation. With the big wave of cloud-nativization, however, how O&M capabilities can become more standardized on a cloud-native base, and how to establish more unified standards for the interaction interface, operation mode, execution mode, and observability of O&M processes, will be our key directions going forward. Flink Cluster AutoPilot will be the carrier of these new cloud-native technologies and carry the continuous evolution and upgrade of the O&M system.

3. Application O&M: Flink Job

With the general trend toward real-time computing, the number of Flink users and jobs has grown rapidly, and the platform now runs tens of thousands of jobs. As is well known, however, Flink job O&M is a very complicated problem. Here are some of the most frequent questions from daily users: why does my job start slowly, why did it fail over, why is it backpressured, why is it delayed, and how do I adjust resource configuration to reduce cost? These seemingly simple questions are actually very complex.

The difficulty of Flink job O&M has two aspects. On the one hand, the distributed system has many full-link components with very complicated dependencies; on the other hand, Flink itself, especially at the runtime level, is very complex in principle. We therefore aim to transform our rich O&M knowledge, including the full-link call flow of the system, a deep understanding of how each component works, rich troubleshooting experience from daily operations and Double 11 promotions, and good troubleshooting methodology, into data and rule-based algorithms, and to consolidate them into O&M product features.

This product has two main functions: one is Flink Job Adviser, which discovers and diagnoses job exceptions; the other is Flink Job Operator, which repairs job exceptions. Together they solve the problem of Flink job O&M.

The figure above shows the final effect of Flink Job Adviser as presented to users. A user only needs to enter the job name or link and @ a bot, and the Adviser service is invoked.

For example, in Case 1 a job cannot start because of insufficient resources. The Adviser gives the diagnosis, insufficient resources for that job, and attaches an improvement suggestion telling the user to go to the console and expand the corresponding resources.

In Case 2, the user's job failed over and they want to know why. By correlating global data, the Adviser concludes that the cause was a machine being taken offline or hardware-failure self-healing on the platform side, and advises that the user does not need to do anything but simply wait for automatic recovery.

In Case 3, a job's memory configuration is unreasonable, so it OOMs frequently and keeps failing over. The Adviser advises the user to adjust the memory configuration of the corresponding compute nodes to avoid new failovers.

Behind Flink Job Adviser are dozens of anomaly-diagnosis capabilities for complex scenarios, forming a huge experience-based decision tree. It can not only locate anomalies that are happening but also prevent anomalies. It mainly consists of three parts (a rule-evaluation sketch follows the list):

The before-the-fact part makes forecasts based on job runtime metrics and global system events, discovering hidden risks in advance to achieve prevention; for example, failovers or problematic versions detected in a job can be flagged before they escalate.
The during-the-fact part diagnoses the whole lifecycle of a running job, including start/stop problems such as startup errors, slow startup, and stop errors, as well as insufficient runtime performance, delay, runtime errors, and data consistency and accuracy issues.
The after-the-fact part lets users do a full backtrace of historical runs, for example to see why a job failed over at midnight last night.
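To make the rule-driven part of the decision tree concrete, here is a minimal sketch assuming diagnosis is expressed as an ordered list of rules over collected job signals; the signal names, rules, and messages mirror the three cases above but are otherwise illustrative assumptions.

```python
# Minimal sketch: evaluate a job's signals against an ordered rule list (a flattened
# slice of the Adviser decision tree) and return the first matching diagnosis.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Diagnosis:
    cause: str
    advice: str

@dataclass
class Rule:
    name: str
    matches: Callable[[Dict], bool]
    diagnosis: Diagnosis

RULES: List[Rule] = [
    Rule("insufficient_resources",
         lambda s: s.get("pending_resources", 0) > 0,
         Diagnosis("Job cannot start: insufficient cluster resources",
                   "Expand the corresponding resources on the console")),
    Rule("platform_machine_offline",
         lambda s: s.get("failover_reason") == "machine_offline",
         Diagnosis("Failover caused by platform-side machine offline/self-healing",
                   "No action needed; the job will recover automatically")),
    Rule("heap_oom",
         lambda s: "OutOfMemoryError" in s.get("last_exception", ""),
         Diagnosis("Frequent OOM due to unreasonable memory configuration",
                   "Increase the memory of the affected compute node")),
]

def diagnose(signals: Dict) -> Optional[Diagnosis]:
    for rule in RULES:
        if rule.matches(signals):
            return rule.diagnosis
    return None
```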


In the concrete implementation of the decision tree, a few typical and complex nodes are worth highlighting.

The first is checking the state of a job across its whole lifecycle. A job goes from submission on the console, to resource allocation, to the runtime environment and dependency download, to topology creation, to loading upstream and downstream systems, and finally to data processing; the whole chain is a very complicated process. Adviser collects and analyzes the timings and all events of the key nodes in a unified way, and can then diagnose and locate anomalies in any state of the job.
The second is job running status and performance. It mainly detects anomalies on various real-time monitoring metrics, or discovers and analyzes anomalies using empirical values and thresholds. For example, if a job is delayed, Adviser traces from the delayed node to the node where the backpressure originates, then to the TaskManager and the machine it runs on, and analyzes that machine for anomalies, perhaps finding that its load is high. In this way a full evidence chain is derived, and correlated drill-down analysis locates the real root cause.
The third, and the most frequent problem, is errors reported while a job is running. The core idea is to collect the logs of all components, such as submission logs, scheduling logs, failover logs, and JobManager and TaskManager logs, and run these massive exception logs through log-clustering algorithms, including natural language processing and feature extraction, to turn unstructured logs into structured data, then merge and compress similar items, and finally have SRE and R&D annotate root causes and suggestions, forming a complete body of expert experience (a small clustering sketch follows this list).
The earliest implementation of the decision tree used static rules, but as scenarios grew more complex, especially with the explosion of data and the emergence of personalized scenarios, static rules could no longer meet our needs; for example, every job's delay profile is individual, and error reports can no longer be maintained through regular-expression matching. We are actively introducing AI techniques to solve these personalized problems.
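The sketch below illustrates the flavor of such log clustering at a toy scale, assuming simple regex-based template normalization; it is not the production algorithm, and the normalization rules are assumptions.

```python
# Minimal sketch: normalize raw exception lines into templates (strip numbers, hex
# IDs, paths) and group identical templates, so each cluster can be annotated once
# with a root cause and suggestion.
import re
from collections import defaultdict
from typing import Dict, List

def to_template(log_line: str) -> str:
    """Turn an unstructured exception line into a structured template."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", log_line)
    line = re.sub(r"(/[\w.\-]+)+", "<PATH>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line.strip()

def cluster_logs(lines: List[str]) -> Dict[str, List[str]]:
    """Group raw log lines by template; each cluster gets one expert annotation."""
    clusters = defaultdict(list)
    for line in lines:
        clusters[to_template(line)].append(line)
    return clusters

# Example:
# cluster_logs(["Task 42 failed on host-17", "Task 43 failed on host-09"])
# -> {"Task <NUM> failed on host-<NUM>": [...both lines...]}
```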

After locating an exception with Flink Job Adviser, Flink Job Operator is needed to fix the exception and close the loop.

Operator capabilities are mainly composed of 4 parts:

The first capability is upgrade: transparently upgrading jobs off problematic versions and hot-updating configuration, to resolve hidden stability risks and anomalies in code and configuration.
The second capability is optimization: tuning job configuration and performance based on Alibaba's internal Autopilot, helping users solve performance and cost problems.
The third capability is migration: transparently migrating jobs across clusters, mainly helping users manage jobs efficiently at large scale.
The last is self-healing repair: based on the various risks and rules diagnosed by Adviser, it provides one-click repair capabilities (a minimal mapping sketch follows this list).
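A minimal sketch of the closed loop, under the assumption that each diagnosed cause maps to one of the four Operator capabilities; the cause labels and the mapping are illustrative assumptions.

```python
# Minimal sketch: route Adviser diagnoses to Operator actions so detection and
# remediation form one closed loop.
ACTION_BY_CAUSE = {
    "defective_engine_version": "upgrade",    # transparent version upgrade / config hot update
    "under_or_over_provisioned": "optimize",  # Autopilot-based resource tuning
    "cluster_decommission":      "migrate",   # transparent cross-cluster migration
    "known_risk_rule_hit":       "repair",    # one-click self-healing fix
}

def remediate(diagnosed_cause: str) -> str:
    """Pick the Operator capability for a diagnosed cause; escalate if unknown."""
    return ACTION_BY_CAUSE.get(diagnosed_cause, "escalate_to_sre")
```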

With the development of real-time computing, O&M has also evolved and upgraded from manual work, to tools, to platforms, and on to intelligent and cloud-native approaches, and we have built real-time computing management-and-control products to solve the problem of ultra-large-scale real-time computing O&M.

In the whole system, the two O&M objects, cluster and application, sit in the middle. Around them, the goals and value of O&M have always revolved around stability, cost, and efficiency. The carrier of the O&M system, technology, and products is real-time computing management and control, which serves the upper-layer real-time computing users, engineering and research teams, SRE, and ourselves. At the same time, the technical core of this management and control is evolving with full effort toward intelligence and cloud nativization.

To sum up in one sentence: with intelligence and cloud native as the technical core, we build real-time computing management-and-control products to solve the three major problems of stability, cost, and efficiency encountered in ultra-large-scale Flink cluster O&M and application O&M.
