Continuous release and evolution of high-availability big data computing platforms
Posted: Apr 24, 2017 16:36
Abstract: Alibaba's big data computing platform runs continuously on tens of thousands of machines across multiple clusters every day. It carries Alibaba's core analysis and computation workloads and is subject to strict reliability and SLA requirements. At the same time, we must constantly improve system performance, reduce costs, and provide more functionality to meet growing business demands, which requires continuously upgrading a system that is in service.
Lin Wei, chief architect of Alibaba's big data computing platform, delivered a speech titled "Continuous Release and Evolution of High-availability Big Data Computing Platforms" at SDCC 2016 (Software Developer Conference China), held November 18-20, 2016. This article focuses on how big data systems carry out system iteration, and on online testing and its prerequisites, because it is impractical to establish an equivalent testing environment for a system of this scale.
Alibaba's big data computing platform runs continuously on tens of thousands of machines across multiple clusters every day. It carries Alibaba's core analysis and computation workloads and is subject to strict reliability and SLA requirements. At the same time, we must constantly improve system performance, reduce costs, and provide more functionality to meet growing business demands, which requires continuously upgrading the in-service system. Ensuring high availability during these iterations is a huge challenge for Alibaba's big data computing platform. Here we mainly share the challenges of releasing and iterating on a large-scale computing platform, and Alibaba's solutions in the MaxCompute system.
MaxCompute (Big Data Computing Service) is a fast, fully managed data warehouse solution for petabytes to exabytes of data. It can scale out to tens of thousands of servers and supports cross-region disaster recovery. As Alibaba's internal core big data platform, MaxCompute supports millions of jobs daily. It provides users with a comprehensive data import scheme and a variety of classic distributed computing models, helping users solve massive-data computing problems more quickly, reduce enterprise costs effectively, and keep data secure.
The whole system stores hundreds of exabytes of data, handles millions of jobs every day, and supports single clusters of tens of thousands of servers as well as multi-cluster, cross-region deployment. More than 8,000 data development engineers work on this platform internally. System performance is roughly double that of comparable community open-source systems, at about one third of Amazon's cost.
The overall architecture of MaxCompute is shown in the figure. The bottom layers are the distributed storage system and the scheduling system, which manage the resources of all clusters in a unified way, including CPU, memory, and disks. On top of them sits an execution engine layer that supports different computing models. We provide a unified language so that data engineers can seamlessly combine a variety of computing models, and we also provide open-source-compatible interfaces to connect to existing external ecosystems.
We want to make MaxCompute a data computing service, not a solution. A service, first of all, needs to provide a unified data warehouse that connects the data of different applications and users and breaks down data barriers between departments within Alibaba, so that all data is concentrated and available for cross-region and cross-department access. In this way, related data can be associated to spark new insights, allowing us to mine the value behind the data. Moreover, a shared big data computing service that is highly reliable and available all year round enables fine-grained unified resource scheduling, so that different businesses can complement each other's resource usage, cutting costs and improving utilization. Finally, the service form should free users from O&M and monitoring, handing those tasks over to the computing system and greatly lowering the threshold for using big data computing.

By contrast, a solution only provides the installer for a big data computing system: you must find and provision the resources, build the O&M and monitoring systems on your own, and manage platform upgrades independently. Such user-built clusters (or clusters composed of virtual machines) remain isolated and cannot aggregate data from many users to enable computing at a larger scale.
Challenges for continuous improvements and releases of MaxCompute
The MaxCompute service must run uninterrupted. From a high-availability perspective, we would prefer that the system never be updated, because every update carries risk; that way MaxCompute could serve customers continuously and provide 99.99% or even 99.99999999% reliability for computing tasks. However, the business keeps growing, and new requirements arrive for the computing platform every day, so the platform must evolve accordingly. At the same time, business grows far faster than we can simply add purchased machines, which forces us to constantly improve core system performance to keep pace. These two pressures mean the computing platform must keep changing.

What makes this harder is that the computing platform differs from typical online services, whose core nodes are largely independent so that some traffic can be shifted to new machines for verification via load balancing. The computing platform runs distributed computations: a single task may span thousands or tens of thousands of machines, and its computing nodes are tightly coupled, so traditional load balancing cannot be used to verify new versions. Moreover, because the platform manages tens of thousands of machines, one bad change can cause enormous damage. How, then, do we balance stability and change? Controlling the risk of change is vital to the success of a computing platform.
By analogy, we are like an aircraft flying at high speed: in flight we must stay safe and stable while replacing the engine with a more powerful one, so the aircraft can fly higher and faster. What are the challenges in big data computing? Below we discuss them, along with Alibaba's solutions.
1. MaxCompute processes millions of jobs every day. How can it ensure that new features will not cause an online fault?
2. MaxCompute has been emphasizing security from Day 1. How does it deal with the contradiction between testability and security?
Inside Alibaba, the vast majority of computing tasks are batch jobs. Batch processing basically follows this procedure: users write queries over distributed relational algebra in our language; the front end submits the code to the backend; the compiler and optimizer generate a physical execution plan; the plan is bound to the backend execution engine; and the plan is then scheduled onto clusters of tens of thousands of servers for computation.
What is the compiler playback tool? We run millions of jobs every day, and they change every day. When new features are introduced, we must enhance our language, for example by supporting iterative computing syntax or providing new UDFs. We also have a strong internal driving demand to keep improving the optimizer's expressiveness and optimization quality to meet the needs of fast-growing businesses. How can we ensure there are no big regressions during an upgrade? Verifying the new computing engine against user tasks one by one manually would take nearly four years, even for experts who can identify a risk within two minutes per query. What should we do? Our solution is to use the parallel computing power of our own large-scale platform to verify and test compatibility: we compile each query inside a MaxCompute UDF, then execute a parallel DAG that performs compilation and optimization for millions of queries in parallel. The potential risks of a new feature then surface in the aggregated analysis results.
So how can we achieve self-verification? There is one prerequisite: the compiler must be restructured accordingly.
The compiler follows an AST-based visitor model: the query is compiled into an AST, and we traverse the tree in multiple passes, binding information to nodes or transforming them. A normal compilation process performs syntax analysis, type binding, semantic analysis, and binding of metadata statistics, then generates a logical plan for the optimizer to optimize. This is a very standard compiler model, and it lets us add custom plug-ins conveniently, so that we can collect information during compilation for subsequent statistical analysis.
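The visitor model with pluggable passes can be sketched as follows. This is an illustrative toy, assuming simplified node kinds and a `FuncUsageCollector` pass; the names are hypothetical, not MaxCompute's actual compiler internals:

```python
# Hypothetical sketch of an AST visitor with a plug-in collection pass.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # e.g. "Select", "Join", "FuncCall"
    children: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)   # info bound by compiler passes

class Visitor:
    """One compiler pass: walks the tree, binding or transforming nodes."""
    def visit(self, node: Node):
        self.enter(node)
        for child in node.children:
            self.visit(child)

    def enter(self, node: Node):
        pass

class FuncUsageCollector(Visitor):
    """Plug-in pass: records which functions a query calls, so millions of
    queries can later be analyzed in aggregate."""
    def __init__(self):
        self.used = {}

    def enter(self, node: Node):
        if node.kind == "FuncCall":
            name = node.attrs.get("name", "?")
            self.used[name] = self.used.get(name, 0) + 1

# usage: collect UDF usage from one (toy) query AST
ast = Node("Select", [
    Node("FuncCall", attrs={"name": "my_udf"}),
    Node("Join", [Node("FuncCall", attrs={"name": "my_udf"})]),
])
collector = FuncUsageCollector()
collector.visit(ast)
print(collector.used)  # {'my_udf': 2}
```

Because each analysis is just another visitor pass, new statistics can be gathered without touching the core compilation pipeline.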
The figure above shows the specific self-verification process.
• Use the flexible MaxCompute data processing language to construct analysis tasks;
• Utilize the super-large-scale computing capability of MaxCompute to analyze massive numbers of user tasks in parallel;
• Utilize MaxCompute's flexible UDF support and sound isolation schemes to run the compiler under test inside a UDF, then perform detailed analysis of the results;
• Run the whole process under the protection of MaxCompute's security system, so users' intellectual property is never disclosed to developers.
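The steps above can be sketched as a playback driver: run the compiler-under-test over many historical queries in parallel and aggregate the failures. In production this runs as a MaxCompute UDF inside a parallel DAG; here a thread pool stands in, and `compile_query` is a stand-in for the real compiler, with a simulated regression:

```python
# Hedged sketch of the playback idea; compile_query is illustrative only.
from concurrent.futures import ThreadPoolExecutor

def compile_query(sql: str) -> dict:
    """Stand-in for invoking the new compiler/optimizer on one query."""
    try:
        if "BAD_SYNTAX" in sql:                 # simulated regression
            raise ValueError("parse error")
        return {"query": sql, "ok": True}
    except Exception as e:
        return {"query": sql, "ok": False, "error": str(e)}

def playback(queries):
    """Compile every query in parallel and aggregate the failures."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(compile_query, queries))
    failures = [r for r in results if not r["ok"]]
    return {"total": len(results), "failed": len(failures), "failures": failures}

report = playback(["SELECT 1", "SELECT BAD_SYNTAX", "SELECT 2"])
print(report["total"], report["failed"])  # 3 1
```

The same driver can compare old-compiler and new-compiler results side by side, turning a four-year manual effort into one parallel job.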
What do we do in our daily work?
It is very simple. We verify the new version against jobs that involve compilation, but we also precisely locate the queries that trigger a new optimization rule and verify that the optimization behaves as expected. Finally, we run aggregate analysis at the semantic layer. For example, suppose we want to remove an obsolete API from the online environment: how do we know how many users still depend on it? We analyze all queries, issue warnings to the affected users, and encourage them to migrate their tasks to a newer, better API. We can also analyze queries comprehensively to identify future development focus and to evaluate how much a given optimization improves queries. With this system, all our development and verification at this scale itself leverages the big data analysis capabilities at our disposal.
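The deprecated-API scan described above can be sketched as a small aggregation over per-query usage records, such as those a playback pass might emit. The record fields and API names here are hypothetical:

```python
# Illustrative semantic-layer analysis: which users still call an API
# slated for removal, and how often?
from collections import defaultdict

def users_of_api(usage_records, deprecated_api):
    """usage_records: iterable of {"user": str, "apis": [str, ...]}.
    Returns {user: call_count} for users of the deprecated API."""
    hits = defaultdict(int)
    for rec in usage_records:
        n = rec["apis"].count(deprecated_api)
        if n:
            hits[rec["user"]] += n
    return dict(hits)

records = [
    {"user": "alice", "apis": ["old_udf", "sum"]},
    {"user": "bob",   "apis": ["sum"]},
    {"user": "alice", "apis": ["old_udf", "old_udf"]},
]
print(users_of_api(records, "old_udf"))  # {'alice': 3}
```

In practice the records would come from the compiler plug-ins run over millions of queries, so the warning list is exact rather than guessed.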
MaxCompute Playback addresses quick verification and comparison of compilation and optimization across a large number of jobs. But how can we ensure that the MaxCompute optimizer and runtime execute correctly, and how can we avoid correctness issues during fast iteration, so as to prevent major incidents while keeping data secure?
The traditional solution is to build an equivalent offline testing cluster for verification, but this approach is not feasible in our big data scenario. First, the scale: testing cluster scheduling or scalability would require a testing cluster of the same size as production, which wastes enormous resources. Second, a testing cluster lacks the production workload. It can run test cases, but it cannot raise the utilization waterline to production levels, for example by reproducing tasks that span tens of thousands of servers; the resource consumption would be prohibitive, and an equivalent load simply cannot be reproduced. We could only examine tasks one by one and could not construct a representative testing scenario. Finally, data security: pulling production data into a testing cluster requires desensitization, which is prone to human error and can leak data. At the same time, desensitized data is not equal to real user data and may violate the assumptions of user applications, causing crashes and defeating the testing goal. On top of all this, the overall process of copying data, building the environment, and running tests is lengthy, which seriously hurts development efficiency. It is not a satisfactory solution.
Instead, we adopt online verification tests: the servers originally reserved for testing are merged into the production cluster, and about 99% of resources run online jobs. Leveraging the scheduler, the remaining 1% of server resources is made available to engineers, who can submit a private build and run test tasks on that 1%.
Then what are the prerequisites for online testing and verification?
1. Resource isolation. We must isolate resources well, because online tests must not impair the reliability of online production jobs. We have done a lot of work on system resource isolation. For CPU and memory, we enhanced the cgroup implementation to support finer and more flexible resource control. For disks, we manage storage in a unified way and provide a storage priority control mechanism. For the network, we strengthened traffic control, and so on. The 1% is actually elastic: if there are no testing tasks, that 1% of resources can be used by online tasks, making full use of our capacity.
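The elastic 99%/1% split can be illustrated with a toy quota function: the test quota is a cap, not a reservation, so when no test tasks are queued, production borrows the idle slots. The numbers and policy here are illustrative, not MaxCompute's actual scheduler logic:

```python
# Toy sketch of elastic quota sharing between production and test tasks.
def split_quota(total_slots, test_demand, test_share=0.01):
    """Return (production_slots, test_slots) under elastic sharing.
    test_share caps the test tier; unused test capacity goes to production."""
    test_cap = int(total_slots * test_share)
    test_slots = min(test_cap, test_demand)   # tests never exceed their cap
    prod_slots = total_slots - test_slots     # production absorbs the rest
    return prod_slots, test_slots

print(split_quota(10000, test_demand=0))    # (10000, 0): no tests, prod uses all
print(split_quota(10000, test_demand=500))  # (9900, 100): tests capped at 1%
```

Real enforcement happens below this level, in cgroups, disk priorities, and network traffic control, so a runaway test task cannot exceed its slots' physical resources either.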
2. Security. We must improve multi-tenant support and user data security management, so that the system developers who submit test tasks cannot access user data. In addition, the results of test tasks are written to disk and compared against production results via MD5 checksums or other automated verification to confirm task correctness.
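The result-comparison step can be sketched with standard hashing: digest the test run's output and the production run's output and compare digests, so correctness is checked without a human ever reading the sensitive data itself. This is a minimal sketch, not the platform's actual verification pipeline:

```python
# Compare two job outputs by checksum rather than by inspecting the rows.
import hashlib

def digest(rows) -> str:
    """Order-sensitive MD5 over an iterable of output rows."""
    h = hashlib.md5()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

def outputs_match(prod_rows, test_rows) -> bool:
    return digest(prod_rows) == digest(test_rows)

print(outputs_match([(1, "a"), (2, "b")], [(1, "a"), (2, "b")]))  # True
print(outputs_match([(1, "a")], [(1, "x")]))                      # False
```

For jobs whose output order is not deterministic, a real pipeline would sort or use an order-insensitive aggregate before hashing; the order-sensitive version above is the simplest case.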
Compared with the traditional approach, this lets us protect user data better, because no human intervention is required for desensitization, eliminating that source of error. It also reproduces the real-world scenario as closely as possible, so execution performance can be analyzed reliably: because all the background noise is identical, the effects of our improvements in scheduling and scale can be verified well.
Online debugging of new scheduling algorithms may change the environment, making assessment difficult. We usually have the following experiences to share in distributed scenarios:
• Immeasurability. If we change an algorithm online and schedule a new task into the online environment, the load changes. The new scheduling algorithm alters the load balance relative to the old environment, which triggers a cascade of effects. In the end, the attempt to observe something has itself changed the thing being observed. This is the immeasurability of scheduling.
• The halo of the minority. We often find that a new scheduler looks good while it handles only a small share of traffic, but its performance advantage shrinks as it gradually becomes mainstream. This is usually because, while in the minority, the new scheduler competes unfairly for resources against the old one.
How, then, can we verify a new scheduling algorithm? One option is to tune it in simulation, recording the online workload and replaying it in a simulator. But the recorded workload carries precedence relationships from the old environment: once the new algorithm runs, earlier tasks shift relative to later ones, the error gap keeps widening, and sometimes even the general trend cannot be read. That is why we use the flighting method for online verification instead. To avoid the halo of the minority, we first adjust the new scheduler's parameters so that the new and old schedulers compete for resources fairly, then carry out the verification. The remaining parameters are tuned after the new scheduler becomes mainstream.
Gated launch and fine-grained rollback of MaxCompute
In the sections above we discussed how to use parallel analysis to verify the compiler and optimizer, how to use the flighting tool to verify online execution, how to ensure correct results during execution, and how to verify scheduler algorithms. After these steps, we are ready to release. We support very fine-grained gated launches to control online risk, so that a release can be rolled back quickly, without human intervention, as soon as danger is detected. We first classify tasks by importance and apply the new version proportionally; if any issue occurs during the process, the release is rolled back quickly.
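A gated launch with automatic rollback can be sketched as a staged controller: ramp the new version over task classes ordered by increasing importance, and roll back if the observed error rate at any stage exceeds a threshold. The stage names, fractions, and threshold below are made up for illustration:

```python
# Illustrative gated-launch controller with automatic rollback.
STAGES = [("low-importance", 0.01), ("medium", 0.10), ("high", 1.00)]

def gated_launch(error_rate_for, max_error_rate=0.001):
    """error_rate_for(stage_name, fraction) -> observed error rate at that stage.
    Returns ("released", stages_passed) or ("rolled_back", failed_stage)."""
    passed = []
    for name, fraction in STAGES:
        rate = error_rate_for(name, fraction)
        if rate > max_error_rate:
            return ("rolled_back", name)   # automatic, no human in the loop
        passed.append(name)
    return ("released", passed)

# usage: a healthy release passes every stage...
print(gated_launch(lambda name, frac: 0.0))
# ...while a fault surfacing on medium-importance tasks triggers rollback
print(gated_launch(lambda name, frac: 0.01 if name == "medium" else 0.0))
```

Because the stages are ordered by importance, a fault is caught on the least critical traffic first, limiting the blast radius before the version ever touches high-importance jobs.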
With these technologies, the overall development process divides into development, regression, and online release. Every development engineer can carry out online verification independently, submitting a private build without affecting the online version; flighting runs on 1% of online resources. After verification, regression testing starts, and we adopt gated launch in the release process to control risk. Developers can begin the next version while the previous one goes through regression and release. In this way, quick iteration is achieved and the distributed big data platform can be continuously released and evolved.