By Guan Tao, an Alibaba Cloud Intelligent Computing Platform Division Researcher
Released by DataWorks Team
The concept of the middle platform has been a hot topic since its birth in 2016. It has had a profound impact on the digital transformation of the Internet and financial industries.
As the inventor and forerunner of the middle platform concept, Alibaba has explored the capacity building and data application of the middle platform with 12 years of practice. In the process of continuous upgrading and reconstruction, Alibaba's middle platform construction has experienced scattered data analysis, middle platform capability integration, and global data intelligence.
In the growing wave of middle platform construction in the financial industry, many financial institutions are still confused about the direction of middle platform construction and data asset management. Alibaba's experience with middle platform construction should be a reference for financial institutions.
Recently, at the 2021 Alibaba Cloud Financial Data Intelligence Summit, Guan Tao, an Alibaba Cloud Intelligent Computing Platform Division Researcher, spoke about platform technology, one of the three core elements of how Alibaba builds the data middle platform. He introduced four typical stages of data platform development, four technical challenges in supporting the middle platform business, and four technical trends for data platforms.
In the successful practices of Alibaba's middle platform, methodology, organization, and platform capability are the three core elements of the data middle platform. Building the platform capability is the most critical and difficult task among them. Alibaba has been actively exploring how to build a powerful base for the data middle platform and is constantly consolidating that base to build future-oriented capabilities.
A powerful data platform is essential as the base to build a data middle platform.
The four stages of the development of Alibaba's data platform largely mirror the four stages of the development of its data middle platform. Across these stages, you can see how Alibaba extracted the commercial value of its data and consolidated data systems that were originally split apart. You can also see new ideas for turning computing data into assets and applying data efficiently, as well as the organizational changes faced in the process of data platform governance.
From 2009 to 2012, Alibaba's e-commerce business entered the boom period. During that time, many famous business teams emerged, such as Taobao, 1688, AliExpress, and Etao. Each business is a full-scenario data-driven business with strong data requirements.
Alibaba began to take a hard look at the importance of building a next-generation data platform in 2010. At the same time, Alibaba started two parallel projects. One project was Yunti 1, based on the open-source Hadoop technology system. Multiple Hadoop clusters were built by multiple service teams, with a cluster scale of 4,000 servers.
The other project was Yunti 2 (ODPS, now called MaxCompute), which Alibaba developed in-house, with a cluster size of about 1,200 servers. Muyangquan, Ant's loan business for small and medium-sized enterprises, was the first business to apply this project. The process of launching Yunti 2 was described as manual cloud computing and step-by-step trial calculation. In 2018, Academician Wang Jian read from Into Thin Air on the CCTV program The Reader to describe the situation and the conviction behind developing a data platform in-house at that time.
The two projects competed and cooperated with each other and explored the development track of Alibaba's data platform in parallel. During this period, almost all business data was built vertically, with each business moving forward quickly in its own small, independent closed loop.
From 2012 to 2015, more new businesses emerged while Alibaba's e-commerce business was developing rapidly. In 2013, Cainiao was founded, and the all-in wireless strategy was launched. In 2014, the Group invested in Amap, founded a joint venture company with Yintai, and founded Ali Travel. In 2015, it launched DingTalk and LST, founded Koubei, and took a stake in Ali Health.
During this period, Alibaba's business flourished, forming 12 business departments and nine different platform systems, each of which had a different architecture. The process of data digitization involved multiple sets of data systems across multiple business units (BUs).
The construction of a unified data platform was imminent because of the increasingly serious data island issue and increasing data costs. This was the starting point of Alibaba's data middle platform.
At the same time, Yunti 1 and Yunti 2 were also undergoing major changes.
On March 28, 2013, Yunzheng, the architect of the Technical Support Department of Alibaba Group, sent an email to the Senior Management Team of the group. The email said, "According to the data increment and future business growth, the storage and computing capabilities of Yunti 1 and Yunti 2 will reach a bottleneck on June 21 this year." By then, many businesses would be unable to expand due to technical limitations.
This meant that the data platform could no longer advance the Yunti 1 and Yunti 2 projects in parallel. One had to be selected. If Yunti 1 were selected, how could the company break the limit of 5,000 nodes in Hadoop? When financial business is involved, how does the open-source system ensure the security and availability of big data? How could the company solve the problem if no cross-IDC solution was available in the industry? If business interaction is frequent, how could the company ensure stable data interaction across IDCs?
A series of technical difficulties gradually pushed the data platform onto the road of in-house development.
In the end, multiple technical departments within Alibaba Group combined their ideas and chose Yunti 2 to challenge the 5K peak. In just a few months, Yunti 2 scaled from 1,500 to 5,000 servers and broke the limit of a single physical data center. It passed the 10-fold stress test and supported cross-cluster computing and high availability. It laid a solid technical foundation for Alibaba's big data development for years to come.
After the technical breakthrough of the 5K project was completed, new pressure came one after another. Rapid business development led to a sharp increase in data volume. Issues, such as unified data management, centralized data security assurance, and unified and open capabilities, became the core concerns of data platforms.
To this end, Alibaba launched a well-known project in the group to synchronize data from all business departments to a unified big data platform for unified management. This project took two years to complete and involved all Alibaba business departments. During the process, Alibaba promoted the productization of general data platform capabilities and gradually provided financial-grade platform capabilities.
At that time, building the data platform meant comprehensively unifying data while building, and migrating to, the first ultra-large-scale data middle platform in China.
From 2015 to 2018, the methodology of Alibaba's data middle platform was established. This raised the curtain on the construction of the data middle platform. In 2015, Alibaba Group announced the launch of the middle platform strategy and began to build a more flexible organizational and business mechanism featuring a large middle platform and small frontend in the DT era. Each operator in Alibaba could develop data-based operation strategies that cover users' lifecycles. Business consultants began to explore the implementation of data-based businesses, and more businesses began to operate in real time.
However, the rapid growth of data and computing and the high consumption of resources brought about the problem of data governance. Alibaba's teams began to think about how to implement the methodology of the data middle platform at the platform layer so that the data platform could support the construction of the data middle platform.
Data governance and data capitalization are the core of these problems. To address these problems, we need a platform system to implement the relevant methodologies to truly realize unification. On the data platform side, DataWorks provides all-in-one capabilities for collaborative development and governance of large-scale data. MaxCompute supports clusters with 100,000 servers, covering all BUs of Alibaba Group and daily operations of more than 200,000 employees. Together, they support the sustainable development of each business.
Since 2018, the entire data platform system of Alibaba has matured, and the platform and the business sides have cooperated very well. The business side recognizes the value of the data platform. Business departments and technical departments rely on each other. The data middle platform serves business in a positive cycle, which marks the successful construction of the data middle platform.
Alibaba has migrated all of its internal systems to the cloud since 2018 and implemented the association between the on-cloud data middle platform and business. During Double 11, core systems were migrated 100% to the cloud and became fully cloud-native. Alibaba Cloud withstood the world's largest traffic peak with 538,000 transactions per second. The data middle platform covers all the BUs of Alibaba Group. Operators detect and analyze problems in time to make real-time operational decisions. New businesses, such as short video and live broadcasts, continue to emerge.
As you can see, the construction of Alibaba's data middle platform has been successful and is still developing at high speed.
The intelligent data warehouse of MaxCompute makes Double 11 seem like a daily event. Lake House has gradually become the next-generation architecture for big data platforms. The data middle platform based on DataWorks serves the business and supports hundreds of data applications within Alibaba Group. Through full-procedure data governance, the businesses of Alibaba Group can grow rapidly while the cost barely increases.
One core indicator of the success of data middle platform construction is data efficiency rather than system efficiency or platform efficiency.
Alibaba measures data efficiency in four aspects: scale and elasticity, data cost, data correctness and maintainability, and data utilization.
Measured against this core indicator, methodology, organization, and platform capability are the three core elements for the success of the data middle platform. If you want to build a data platform, what methods are available, and what are the difficulties? This article describes four challenges on the business side (data assets, data quality, data security, and data governance) and does not cover storage and computing engine challenges.
For data assets, the first question is: what are the data assets of an enterprise? Each BU of Alibaba has its own data asset panorama. We use one panorama to manage 99.9% of the computing data assets of Alibaba. This way, the storage and computing costs of each department can be fully quantified and shown to managers directly.
The second question is: how can we view assets? For an enterprise, are assets merely cost figures? By taking a data asset perspective, Alibaba allows managers to know where the data comes from, who the data serves, and who is the best partner. At the same time, it can also meet the needs of data flow audit.
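The questions of where data comes from and who it serves can be pictured as traversals over a lineage graph. The following is a minimal sketch under stated assumptions, not Alibaba's or DataWorks' implementation: the table names, the edge list, and the `build_graph` and `reachable` helpers are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream_table, downstream_table).
EDGES = [
    ("ods_orders", "dwd_orders"),
    ("ods_users", "dwd_users"),
    ("dwd_orders", "ads_sales_report"),
    ("dwd_users", "ads_sales_report"),
    ("dwd_orders", "ads_risk_model"),
]


def build_graph(edges):
    """Index the edges in both directions so either question is one lookup away."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def reachable(start, graph):
    """Return every table reachable from `start` in `graph` (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_graph(EDGES)
# "Where does the data come from?" for one report.
sources = reachable("ads_sales_report", upstream)
# "Who does the data serve?" for one raw table.
consumers = reachable("ods_orders", downstream)
```

The same two traversals also give a starting point for a data flow audit, since they list every table a given dataset has flowed into.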
The third question is: how can we achieve scale operations on assets? How can this asset system be replicated quickly facing the merge, acquisition, and innovation of new business? A modeling tool for the data middle platform is provided in tools, such as DataWorks. It provides standard drawings for data middle platform construction, divides business by domains, and builds models intelligently. This allows new businesses to reuse mature data architectures quickly to achieve scale operations on assets.
For data quality, the first question is: how can we define the quality in advance? Reconciliation is a common concept in the financial industry, and Alibaba's data needs to be reconciled as well. We proposed the concept of quality rules to reconcile data tables containing more than 10 million entries. With more than 7,000,000 quality rules and over 10,000 new rules every day, how can we match them manually? Alibaba has developed 37 rule templates. After matching rules based on intelligent recommendations, the adoption rate reached 75%.
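To make the idea of rule templates concrete, here is a minimal sketch in which a small set of parameterized templates is turned into many concrete rules. It assumes rows are plain dictionaries; the three templates, their names, and the `instantiate` and `run_rules` helpers are hypothetical and are not the 37 templates mentioned above.

```python
def not_null(column):
    """Template: every row must have a non-null value in `column`."""
    return lambda rows: all(r.get(column) is not None for r in rows)


def unique(column):
    """Template: values in `column` must not repeat."""
    def check(rows):
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))
    return check


def row_count_at_least(minimum):
    """Template: the table must contain at least `minimum` rows."""
    return lambda rows: len(rows) >= minimum


# Hypothetical template registry: one template yields many concrete rules.
TEMPLATES = {
    "not_null": not_null,
    "unique": unique,
    "row_count_at_least": row_count_at_least,
}


def instantiate(template_name, **params):
    """Bind a template to concrete parameters, producing a runnable rule."""
    return TEMPLATES[template_name](**params)


def run_rules(rows, rules):
    """Evaluate each named rule against the rows and report pass/fail."""
    return {label: check(rows) for label, check in rules.items()}


orders = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
]
rules = {
    "order_id is set": instantiate("not_null", column="order_id"),
    "amount is set": instantiate("not_null", column="amount"),
    "order_id is unique": instantiate("unique", column="order_id"),
    "table is not empty": instantiate("row_count_at_least", minimum=1),
}
report = run_rules(orders, rules)
```

With templates, a recommendation system only has to pick a template and its parameters for each table, rather than author each rule from scratch.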
The second question is: how can we ensure quality in the process? More than 7,000,000 quality rules consume a lot of computing resources. How can we reduce that cost? We have built a data quality scheduling engine and an ETL engine using intelligence technology. When data changes, quality monitoring is triggered in real time, and priority policies are used to run checks on idle resources.
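One way to picture change-triggered, priority-based checking is a priority queue that is drained only as far as the available capacity allows. This is a sketch under assumptions, not the scheduling engine described above: the `QualityCheckScheduler` class, the numeric priorities, and the `budget` parameter are hypothetical.

```python
import heapq
import itertools


class QualityCheckScheduler:
    """Queue quality checks when tables change; run the most urgent first."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal priorities run in arrival order.
        self._counter = itertools.count()

    def on_table_changed(self, table, priority):
        """Enqueue a check for `table`. Lower `priority` means more urgent."""
        heapq.heappush(self._heap, (priority, next(self._counter), table))

    def run(self, budget):
        """Run up to `budget` checks; the rest wait for idle capacity."""
        ran = []
        while self._heap and len(ran) < budget:
            _, _, table = heapq.heappop(self._heap)
            ran.append(table)  # a real engine would execute the rules here
        return ran

    def pending(self):
        """Number of checks still waiting to run."""
        return len(self._heap)


scheduler = QualityCheckScheduler()
scheduler.on_table_changed("ads_report", priority=2)
scheduler.on_table_changed("dwd_payments", priority=0)
scheduler.on_table_changed("tmp_debug", priority=9)
ran_now = scheduler.run(budget=2)
```

Here the low-priority `tmp_debug` check stays queued until a later call to `run` has spare budget, which is one simple reading of deferring work to idle time.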
The third question is: how can we achieve quality automation after data changes? The rules are fixed, but the data is alive. What should we do if we encounter periodic fluctuations and changes? We integrate many AI technologies when constructing data quality. We use machine learning to learn how data is generated and predict dynamic thresholds intelligently to match periodic fluctuations through algorithms.
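The effect of a dynamic threshold can be illustrated with a much simpler stand-in than machine learning: per-weekday bounds computed from history as mean plus or minus k standard deviations. This swaps in a basic statistical technique purely for illustration; the `fit_thresholds` and `is_anomalous` helpers, the value of `k`, and the sample data are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_thresholds(history, k=3.0):
    """Compute (low, high) bounds per weekday from (weekday, value) pairs."""
    by_day = defaultdict(list)
    for weekday, value in history:
        by_day[weekday].append(value)
    bounds = {}
    for weekday, values in by_day.items():
        mu = mean(values)
        sigma = pstdev(values)
        bounds[weekday] = (mu - k * sigma, mu + k * sigma)
    return bounds


def is_anomalous(weekday, value, bounds):
    """Flag a value that falls outside the learned band for its weekday."""
    low, high = bounds[weekday]
    return not (low <= value <= high)


# Hypothetical daily row counts: Mondays (0) are quiet, Saturdays (5) are busy.
history = [
    (0, 100), (0, 102), (0, 98), (0, 100),
    (5, 300), (5, 310), (5, 290), (5, 300),
]
bounds = fit_thresholds(history)

saturday_ok = not is_anomalous(5, 300, bounds)   # normal for a Saturday
monday_spike = is_anomalous(0, 300, bounds)      # same value is abnormal on a Monday
```

A single fixed threshold could not accept 300 on Saturday while rejecting it on Monday, which is the periodic-fluctuation problem the paragraph above describes.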
For data security, we need to focus on issues, such as reducing usage costs, improving usability, covering the entire data lifecycle, controlling permissions, masking data, and identifying sensitive data for data tracking. Alibaba has accumulated more than 20 different security management rules that will eventually help the platform meet individual compliance requirements when the business grows rapidly.
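Two of the items above, identifying sensitive data and masking it, can be sketched with pattern matching. This is only an illustration under assumptions: the two regular expressions, the 11-digit phone format, and the `classify` and `mask` helpers are hypothetical and far simpler than a production classifier.

```python
import re

# Hypothetical detectors for two kinds of sensitive values.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{11}\b"),
}


def classify(value):
    """Return the sorted names of the sensitive types found in `value`."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(value))


def _keep_edges(match):
    """Replace all but the first and last two characters with asterisks."""
    text = match.group(0)
    if len(text) <= 4:
        return "*" * len(text)
    return text[:2] + "*" * (len(text) - 4) + text[-2:]


def mask(value):
    """Mask every detected sensitive substring in `value`."""
    for pat in PATTERNS.values():
        value = pat.sub(_keep_edges, value)
    return value


labels = classify("contact: alice@example.com")
masked = mask("call 13800001111")
```

Keeping a few leading and trailing characters is one common trade-off between usability and protection; stricter policies could mask the whole value.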
When data governance enters deep water, how can we keep the growth rate of data costs from exceeding that of the business? How can we mobilize the governance enthusiasm of all employees and cultivate cost awareness? At Alibaba, data governance is a collaboration among the engine, the platform, and people. The engine is held to high requirements on computing power and cost, and it continuously breaks the linear relationship between fast-growing data computing and cost growth. The platform provides storage and computing health scores, which have become the core indicator of the data governance campaign for all teams within the group. People are encouraged to carry out data governance and management and to build a technical operation system for data governance using the full-procedure tools on the platform. The cost and value at the platform layer are displayed clearly in cost reports.
During the 12 years of data platform construction, Alibaba has developed productization capabilities of the data middle platform in several aspects, such as data assets, quality, security, and governance.
In the future, as the base of the middle platform, the data platform will transform from data intelligence to intelligent data. Lake House can meet the needs of flexible architecture upgrades, Intelligent Data Warehouse can solve the management challenges of ultra-large-scale data, and Smart Query significantly lowers the threshold for data analysis. Cloud-native, large-scale, standardized, and inclusive AI technology is the ultimate choice for big data and accelerates the integration of big data and AI.
As the next-generation data platform architecture, Lake House can flexibly upgrade architectures under complex conditions. A data warehouse is designed to process well-defined enterprise-level data economically and efficiently. Enterprises can build their own data middle platforms with a set of methodologies and supporting tools for engine optimization and data management. However, the barrier to entry is high, the cost is high, and the warehouse is demanding to use. The data lake originated from the open-source system. It is flexible and easy to use, with lower costs, and enterprises can build their own data lakes easily. However, beyond unified data storage, enterprises need to refine management further to ensure data can be governed, managed, and maintained at low costs.
How can we integrate the flexibility of data lakes and enterprise-level capabilities of data warehouses at the architecture layer? Alibaba has developed a Lake House architecture that unifies storage and metadata. It connects data systems and uses intelligent data warehouse technology to classify, store, and process different data and obligations automatically.
The traditional DBA model cannot solve the management challenges posed by ultra-large-scale data. Alibaba has more than 10 million tables. Many core data development engineers are each responsible for tens of thousands of tables, so they cannot perform fine-grained governance and modeling. Such a system cannot scale through manual effort. Therefore, in the future, more AI technologies will be integrated into big data systems, bringing them into the autonomous driving era.
Alibaba is trying to build an ultra-large-scale knowledge graph based on data. The knowledge graph is used to translate data into semantics. Then, it uses technologies, such as natural language processing (NLP), to integrate with users and form a bridge. For example, a user can get a set of data automatically if they enter a question: How many people are using the Internet in Beijing? Alibaba is trying to apply intelligent queries based on natural languages to a large amount of data on a large scale. With this feature, more non-professional data workers can perform data analysis tasks independently.
Data needs intelligent acceleration, and AI is the best choice for big data. We know it is very difficult to use AI well. From data generation and data extraction through model training and model tuning to model deployment and serving, the entire procedure is very long. If 50,000 people can use data directly, the number of people who can use AI correctly may not exceed 5,000. The issue of how to empower the business side with AI technology alongside data is called AI engineering.
The preceding content only mentioned four typical stages of the base construction of Alibaba's data middle platform, four major technical challenges, and four technical trends of the data platform in general. These are just part of Alibaba's data middle platform. Over the past 12 years, Alibaba has accumulated a lot of technologies in data platform construction. These platform capabilities also promote the evolution of the data middle platform to intelligence and will keep evolving to serve Alibaba and society at large.