abstract: In the Big Data & AI session of the 2019 Yunqi Conference, Xu Shenglai and Guan Tao, researcher and senior experts of the Alibaba Cloud Intelligent Computing Platform Division, shared a technology deep dive into the Alibaba Cloud Apsara big data platform empowered by AI. This article covers three parts. First, how original technology optimization and system integration broke the linear relationship between data growth and cost growth. Second, the move from a cloud-native big data platform to a global cloud data warehouse, as Alibaba extends from a native-system model to a global-system model. Third, big data and AI as twin systems: how the platform better supports AI, and how AI in turn optimizes the big data system.
The following is a summary of the presentation.
Any discussion of Alibaba's big data has to start with the Apsara big data platform, initiated by Dr. Wang Jian 10 years ago: ten years spent sharpening one sword. Today, the Apsara big data platform embodies Alibaba's 10 years of best practices in large-platform construction and is the cornerstone of Alibaba's big data production. It is used every day by tens of thousands of data and algorithm engineers across Alibaba Group and carries 99% of Alibaba's data business. It has also been widely applied to big data construction in fields such as city brain, digital government, electric power, finance, new retail, intelligent manufacturing, and intelligent agriculture.

In 2015, we began to notice that massive data growth was placing ever higher demands on the system. As demand for deep learning grew, the capacity to store and process data became the key constraint on the development of artificial intelligence. We kept discussing one very real problem with our customers, a problem facing every CIO/CTO: if my data grows 10x, what do I do? The figures in the chart make the point clearly: the Pailitao visual search system alone is backed by petabytes of data.
The AliMe customer service system holds 20 PB of data, and the personalized recommendation system everyone uses on Taobao every day requires more than 100 PB of data behind the scenes to support its decisions. Data growth of 10x to 100x is very common. So what does 10x data growth usually mean? First, 10x cost growth; and since growth is not uniform, with peaks and valleys, it may require 30x elasticity. Second, with the rise of artificial intelligence, the steadily growing two-dimensional structured relational data is joined by unstructured data; roughly half of the continued growth comes from unstructured data, so beyond processing two-dimensional tables, how do we compute well over a fusion of different kinds of data? Third, Alibaba has a large data mid-end team. If our data grows 10x, should the team also grow 10x? If the data grows 10x and the complexity of the data relationships grows by more than 10x, does the labor cost grow by more than 10x? These are the three key problems the Apsara platform has focused on since 2015.
When Alibaba's big data reached the scale of 100,000 servers, we entered a technological no man's land. Most companies may never face such challenges, but at Alibaba's size this challenge is always in front of us. After the system took shape in 2015 we ran a series of benchmarks: the 100 TB Sorting benchmark in 2015, CloudSort in 2016 to measure cost-performance, and from 2017 BigBench. The figure shows the latest data we released: in 2017, 2018, and 2019 performance doubled every year, and at the 30 TB scale we delivered twice the performance of the second-ranked product at half the cost. This is the trend of continuously rising computing power through optimization.

So how is this continuous upgrade of computing power achieved? The figure shows the triangle model of system upgrades that we often use. At the bottom are the efficient operator layer and the storage layer, the most fundamental low-level optimizations; above that you need to find the optimal execution plan, that is, the combination of operators; and then there is a new direction, dynamic adjustment and self-learning optimization.

Let's first look at the extreme optimization of single operators and the engine framework. We use a framework that is hard to write and hard to maintain, but because it is close to the physical hardware it delivers the most extreme performance. For many systems a 5% performance gain may not matter much, but for the Apsara platform a 5% gain is on the scale of 5,000 servers, roughly 200 to 300 million RMB in cost. The figure gives a small example of pushing a single operator to its limit: in the shuffle sub-scenario, Non-temporal Store is used to optimize the cache strategy during shuffling, improving performance by 30%.
Besides the computing module there is the storage module, which we divide into four quadrants. Quadrants one and four concern the storage of the data itself: the most direct cost of data growth is rising storage cost, so how do we do better compression, encoding, and indexing? Quadrants two and three concern performance. Our storage layer is based on the open-source ORC format, to which we have made many improvements and optimizations, including many changes inside the white box of the standard. Our read performance is 50% faster than open-source Java ORC, and over the past two years we have been the largest contributor to the ORC community, contributing more than 20,000 lines of code. That is our optimization at the operator and storage layers, the underlying architecture.

At another level, however, a single operator or a fixed combination of operators can hardly satisfy every scenario, which is why we emphasize flexible operator combination. For example, we have four join modes, three shuffle modes, three job running modes, and support for multiple kinds of hardware and multiple storage media. The right side of the figure shows how to choose the join mode dynamically so the job runs more efficiently. This dynamic operator combination is the second dimension of our optimization.

From engine optimization to self-learning tuning is where we spent more of our time over the past year: how to use artificial intelligence and self-learning to build big data systems. Think of learning to ride a bike. At first you cannot ride well, you are slow, and sometimes you fall; as you practice, you get better and better. Can a system do the same? When a new job is submitted, its optimization is relatively conservative.
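The dynamic join-mode choice mentioned above can be sketched as a run-time decision over observed input sizes rather than a fixed plan-time choice. This is a minimal illustrative sketch; the threshold, mode names, and function are assumptions for illustration, not MaxCompute internals.

```python
# Hypothetical sketch: pick a join strategy from input sizes at run time.
# BROADCAST_LIMIT and the mode names are illustrative assumptions.

BROADCAST_LIMIT = 64 * 2**20   # assumed: a side under 64 MB fits in worker memory

def choose_join_mode(left_bytes: int, right_bytes: int,
                     left_sorted: bool = False, right_sorted: bool = False) -> str:
    small = min(left_bytes, right_bytes)
    if small <= BROADCAST_LIMIT:
        return "broadcast_hash_join"      # ship the small side to every worker
    if left_sorted and right_sorted:
        return "sort_merge_join"          # inputs already ordered: merge directly
    return "shuffle_hash_join"            # general case: repartition both sides

print(choose_join_mode(10 * 2**20, 5 * 2**30))      # broadcast_hash_join
print(choose_join_mode(2**40, 2**40, True, True))   # sort_merge_join
```

A real optimizer would also weigh skew, memory pressure, and network cost, but the shape of the decision is the same: defer it until sizes are known.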
For example, we grant a little more in resources and choose a relatively conservative execution plan so that the job at least runs to completion; we then collect information and experience from that run and feed the experience back to optimize later runs. We therefore propose self-learning regression optimization based on historical information; the underlying architecture is shown in the figure. We put historical information into an offline system for various statistical analyses, and when a job arrives we feed that information back so the system learns by itself. Generally, after a similar job has run about 3 to 4 times, it settles into a fairly good state, in terms of both job running time and system resource savings. This system went live inside Alibaba three years ago, and through it we raised Alibaba's cluster utilization (the water level) from 40% to more than 70%.

The right side of the figure shows another self-learning example: how do we tell hot data from cold data? Previously you set it yourself with a generic configuration; later we found that a dynamic approach based on the access pattern works better. This technology went live last year and saved Alibaba more than 100 million RMB in its first year. The examples above are performance optimizations at the engine and storage levels, which also reduce user costs: on September 1, 2019, the overall storage cost of the Apsara big data platform was reduced by 30%, and at the same time we released new cloud-native computing specifications that can save up to 70% of cost.

All of the above are engine-level optimizations. With AI becoming ubiquitous, there are more and more AI developers, many of whom cannot even code. Alibaba has 100,000 employees, and more than 10,000 of them develop on the Apsara big data platform every day.
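The history-based self-tuning loop described above can be sketched very simply: start a recurring job with a conservative resource grant, record its actual peak usage on every run, and tighten the grant once enough history exists. All numbers, names, and the 20% headroom are illustrative assumptions, not the production algorithm.

```python
# Minimal sketch of self-learning resource tuning from run history.
# CONSERVATIVE_MB, HEADROOM, and the job signature are assumed for illustration.

CONSERVATIVE_MB = 8192   # assumed default grant for a job with no history
HEADROOM = 1.2           # keep a 20% safety margin over observed peaks

history: dict[str, list[int]] = {}

def memory_grant(job_signature: str) -> int:
    runs = history.get(job_signature, [])
    if len(runs) < 3:                       # too little experience: stay conservative
        return CONSERVATIVE_MB
    return int(max(runs[-3:]) * HEADROOM)   # learn from recent peak usage

def record_run(job_signature: str, peak_mb: int) -> None:
    history.setdefault(job_signature, []).append(peak_mb)

# A job that really needs ~1.5 GB converges after a few runs:
for _ in range(4):
    grant = memory_grant("daily_report_sql")
    record_run("daily_report_sql", 1500)
print(grant)   # 1800 on the 4th run (1500 * 1.2), down from 8192
```

This mirrors the "3 to 4 runs to settle" behavior from the talk: the first runs are deliberately over-provisioned so they complete, and the feedback loop then reclaims the slack.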
From this perspective, not only is system optimization important; platform and development-platform optimization is crucial too. The computing engine is invisible to its users, so it must be usable in the simplest possible way. Take the MaxCompute computing engine first. We need users, so how do users use it? Resource isolation is required: each user has an account on the system, the account maps to permissions, and the whole chain is connected, so that today we can answer how our users are using it and which parts they use. That is the first part. The second part is development: an IDE for writing code, and once the code is written it is submitted. Submission raises a scheduling problem: in what order do so many tasks get resources, and whether a task should be preempted is up to the scheduling system. Our tasks may run in different places, and data integration can pull data across regions so it can run anywhere on the platform. After tasks run, we need monitoring, our operations need to be automated (O&M), and then come data analysis and BI reports; nor can we forget that machine learning is also integrated on the platform. Finally, and most importantly, data security. Together these form the outer ring around the big data engine plus the engine itself: a complete big data system that we call the single-engine system, which we had in 2017.

What did we do in 2018? In 2018 we connected multiple engines on top of the single engine. The whole development chain needs to be a closed loop: data integration can move data between different data sources, and after development the traditional way is to drag the data out with the data engine.
What we do instead is turn the data into a cloud service: the service gives users the data they want directly, without dragging all the data away, because transferring data consumes storage, network, and consistency, all at the user's cost. We hope users can get what they want through data services. If there are custom applications on top of the data services, you would otherwise need to build a data center and a web service to reach the data, which is also troublesome; so we provide a cloud development platform for hosted web applications, letting users build directly on all the data services. On this basis any data intelligence solution can be built.

By 2019 we extended the concept to another level. The top is the user interaction layer, but interaction is not only development, so we divide users into two categories. Some are data producers: they write tasks, write scheduling, and do O&M. And whom do data producers serve? Data consumers. Our data is distributed everywhere, and at the governance-oriented interaction layer everything serves data consumers. In this way we reinterpret the Apsara big data platform from a new perspective: besides engines and storage, we have global data integration to move data, unified scheduling that lets different engines work together, and unified metadata management; on top of that we support data producers and data consumers respectively. That is the overall product architecture of the big data platform.
Our entire platform is cloud native. What does cloud native mean here? The Apsara big data platform committed to cloud native 10 years ago, and it means three things. First, out-of-the-box use with no up-front cost, very different from the traditional model of buying hardware. Second, second-level adaptive elastic scaling: you pay only for what you use. Third, because it is a framework on the cloud, much of the O&M and security work is done automatically by the cloud, so it is secure and O&M-free. Architecturally, Apsara big data includes traditional CPU clusters, GPU clusters, and Pingtouge (T-Head) chip clusters, then our Fuxi intelligent scheduling system and the metadata system, and on top of those a variety of computing capabilities. Our most important goal is to make 100,000 servers physically distributed across regions feel like one computer through cloud-native design. Today we have met the design requirements set 10 years ago, with service scaling capacity to support data growth for the next 5 to 10 years. We fully exploit the cloud-native design philosophy to support the fast, large-scale elastic loads of big data and machine learning, with elastic scaling from 0 to 100x. Since last year, 60% of the data processing capacity during the Double 11 shopping festival has come from the big data platform's elastic capacity; at the Double 11 peak we release big data resources back to the online systems to handle the load. Seen from the other side, with this elasticity we can save 80% of cost compared with the physical IDC model; with per-job billing we provide second-level auto scaling with no idle charges, and the overall cost is only 1/5 of a self-built IDC.
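The "0 to 100x, pay nothing when idle" elasticity above boils down to a simple sizing decision. This is a toy sketch of such a decision function; the base reservation, per-node throughput, and the 100x ceiling taken from the talk are all illustrative assumptions.

```python
# Toy sketch of an elastic-scaling decision: size the pool from pending work,
# clamped between zero and a burst ceiling. All figures are assumptions.
import math

BASE_NODES = 10                 # assumed steady-state reservation
MAX_SCALE = 100                 # "0 to 100x" burst ceiling from the talk
NODE_THROUGHPUT_GB = 50         # assumed GB each node processes per interval

def target_nodes(pending_gb: float) -> int:
    needed = math.ceil(pending_gb / NODE_THROUGHPUT_GB)
    return max(0, min(needed, BASE_NODES * MAX_SCALE))   # clamp to [0, 100x]

print(target_nodes(0))        # 0 -> pay nothing when idle
print(target_nodes(120_000))  # 1000 -> full 100x burst at a Double 11 peak
```

Real schedulers like Fuxi also account for placement, preemption, and job priority, but the cost story in the text comes from exactly this shape: capacity tracks load instead of the worst-case peak.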
Beyond staying cloud native, we have recently found that with the development of artificial intelligence, audio, image, and video data keep growing and the ability to process them needs strengthening: we need to move from a two-dimensional big data platform to a global data platform. As the figure shows, there is a popular concept in the industry called the data lake: bring a customer's many kinds of data together for unified query and management. But in real enterprise-grade practice we see problems. Customers' data sources are uncontrollable and diverse, and to a large extent not all data can be managed in one unified system and engine. What greater capability is needed then? The ability to perform unified computing, query, analysis, and management across different data sources. We therefore propose an updated concept, the logical data lake: users do not need to physically migrate their data, yet we can still perform federated computing and queries. That is the core idea of the logical data lake. To support it, we have a unified metadata management system and a scheduling system that let different computing engines work together, converging all this work into global data governance, plus a programming platform where data developers can directly produce data or build their own applications. In this way we extend the original single-dimensional big data platform, which did big data processing, into global data governance: the data may be plain big data, but it may also be databases or even OSS files, all processed on one platform. The following figure shows the product architecture of Apsara Big Data, with storage and computing engines at the bottom.
As you can see, besides the storage that comes with the computing engines, there are open OSS storage, data collected from IoT, and database data, all brought in through global data integration. After integration, unified metadata management and unified hybrid task scheduling are applied, so that the development layer and the comprehensive data management layer can together govern the entire body of big data.
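The logical-data-lake idea above, querying data in place via unified metadata, can be sketched as a catalog that maps each table to its physical source, so a federated query reads each table where it lives instead of migrating it. The sources and table names are invented for illustration; real engines also push computation down to each source.

```python
# Toy sketch of a unified catalog for a "logical data lake": route each table
# in a query to the engine that owns it. Sources and tables are hypothetical.

catalog = {
    "orders":   {"source": "maxcompute", "project": "sales"},
    "users":    {"source": "mysql",      "db": "crm"},
    "raw_logs": {"source": "oss",        "bucket": "logs-bucket"},
}

def plan_federated_query(tables: list[str]) -> dict[str, list[str]]:
    """Group the tables a query touches by the engine that should scan them."""
    plan: dict[str, list[str]] = {}
    for t in tables:
        meta = catalog[t]                      # unified metadata lookup
        plan.setdefault(meta["source"], []).append(t)
    return plan

print(plan_federated_query(["orders", "users"]))
# {'maxcompute': ['orders'], 'mysql': ['users']}
```

The point of the design is in the lookup step: because metadata is unified, no user-visible copy of the data ever has to move.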
When it comes to big data, we inevitably think of AI. AI and big data are twins: AI needs big data to empower it, that is, big data for AI. The following demo shows how. AI engineers usually develop in an interactive notebook because it is intuitive; but how do you develop big data interactively and bind it to AI? Look at this simple example on the DSW platform. With a magic command we connect directly to an existing MaxCompute cluster; after selecting a project we can type SQL directly, with intelligent assistance. We then run the feature query, and when the result comes out we can analyze the features: change the axes to draw different charts, or even edit the generated results directly in the web page as in Excel, pull the processed data back, and switch to GPU or CPU for deep learning training. After training, we turn the code into a model, import the model to the right place, and expose it as a web service, which is our online inference service. Once the whole flow is complete we can even connect data applications built on the hosted web platform. This is the big data platform providing AI with data and computing power.

Big data and AI are twin systems. AI is a tool layer that can optimize everything, and we hope the Apsara big data platform can empower AI. At the very beginning we wanted to build a system that is available, and it remains available under the elastic load of Double 11. Through years of effort we have pursued extreme performance, and we can break the linear relationship between data growth and cost growth.
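The notebook flow above, SQL feature engineering followed by training on the result, can be imitated in a self-contained way. Here sqlite3 stands in for MaxCompute purely so the sketch runs anywhere, and the "model" is just a global-mean baseline; in the real DSW flow the same SQL would be issued through a notebook magic against the cluster. All table and column names are invented.

```python
# Stand-in for the notebook workflow: SQL builds features, then a trivial
# baseline "model" is fit on the result. sqlite3 substitutes for MaxCompute.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clicks (user_id INT, clicked INT);
    INSERT INTO clicks VALUES (1, 1), (1, 0), (2, 1), (2, 1);
""")

# Step 1: SQL feature engineering -- click-through rate per user.
rows = conn.execute(
    "SELECT user_id, AVG(clicked) AS ctr FROM clicks GROUP BY user_id"
).fetchall()

# Step 2: "training" -- a global mean used as a baseline predictor.
baseline = sum(ctr for _, ctr in rows) / len(rows)

print(rows)      # [(1, 0.5), (2, 1.0)]
print(baseline)  # 0.75
```

The value of doing this in one notebook is exactly what the talk stresses: the feature table, the analysis, and the training step share one interactive context instead of three disconnected systems.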
We also hope it is intelligent: we want it to support more data development engineers without requiring ever more manpower to understand its growing complexity, and we want stronger big data systems that optimize big data systems themselves. We propose a concept called Auto Data Warehouse to make big data more intelligent. It has three phases:
- The first phase is at the computing and efficiency level. We try to find the first principles of computing: among millions of jobs, which are similar enough to be merged to save cost; among tens of millions of tables, which are the best candidates for global indexes; and how to do hot/cold data tiering and adaptive encoding.
- The second phase is resource planning. AI and the Auto Data Warehouse can help us plan resources better, including choosing among the three execution modes, which can itself be learned, as well as job runtime prediction and automatic pre-alarming. This system safeguards Alibaba's key jobs, visible or not. For example, Zhima (Sesame) Credit scores are refreshed periodically, and at nine o'clock every morning Alibaba's merchant system settles accounts with downstream systems and the central bank. These baselines are chains of hundreds of jobs that start in the early hours and must finish by eight every morning. All sorts of things can go wrong in the system, for instance machine failures, so we built an automatic prediction system to forecast whether the baseline will complete by its key points in time; if not, more resources are added to guarantee the key tasks. These systems safeguard the circulation of key data we never see in daily life, as well as critical resource elasticity such as Double 11.
- The third phase is intelligent modeling. How much does new data overlap with existing data? How are these data related? With hundreds of tables a DBA can tune by hand, but Alibaba's internal systems now have more than ten million tables, and no developer, however good, can grasp the complete logical relationships among them. Automatic tuning and modeling can assist us in these areas.
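The baseline prediction and pre-alarm described in the second phase can be sketched as follows: estimate when a chain of jobs will finish from each job's historical runtimes, and raise an alarm early if the predicted finish slips past the deadline. Times are hours after midnight, and every figure and function name is invented for illustration.

```python
# Illustrative sketch of baseline completion prediction with pre-alarming.
# Per-job histories and the pessimistic max-of-history estimate are assumptions.

def predict_finish(start: float, job_histories: list[list[float]]) -> float:
    # Use each job's recent worst case as a pessimistic per-job estimate.
    return start + sum(max(h) for h in job_histories)

def needs_alarm(start: float, job_histories: list[list[float]], deadline: float) -> bool:
    return predict_finish(start, job_histories) > deadline

# A 3-job baseline starting at 1:00 AM with an 8:00 AM deadline:
histories = [[1.0, 1.2], [2.0, 2.5], [1.5, 1.4]]
print(predict_finish(1.0, histories))   # 6.2 -> predicted done by 6:12 AM
print(needs_alarm(1.0, histories, 8.0)) # False: no extra resources needed yet
```

The production system described in the talk reacts to the alarm by adding resources mid-run; the sketch only shows the forecasting side of that loop.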
This is the system architecture of the Auto Data Warehouse: from multi-cluster load balancing to automatic cold storage, invisible job optimization in the middle, and automatic identification of private data at the upper layer, a technology developed together with Ant Financial. When private data is about to be shown on screen, the system automatically detects and masks it. We have published papers on three of these technologies: automatic privacy protection, automatic merging of duplicate subqueries, and automatic multi-cluster disaster recovery; if you are interested, you can read the papers on our website.
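The automatic detect-and-mask behavior above can be illustrated with a toy detector. The single regex for mainland-China mobile numbers and the masking format are illustrative assumptions; the production system combines many detectors, including learned classifiers, rather than one pattern.

```python
# Toy sketch of automatic privacy masking: find likely private fields in
# output text and mask them before display. The pattern is an assumption.
import re

PHONE = re.compile(r"\b1[3-9]\d{9}\b")   # assumed pattern: 11-digit CN mobile

def mask_private(text: str) -> str:
    # Keep the first 3 and last 2 digits, mask the middle 6.
    return PHONE.sub(lambda m: m.group()[:3] + "******" + m.group()[-2:], text)

print(mask_private("user 13812345678 ordered item 42"))
# user 138******78 ordered item 42
```

Masking at display time, as described in the talk, means the raw value never leaves the platform even when a query legitimately touches the column.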