Cloud native accelerates the data lake into the 3.0 era - Apsara 2022

Date: Oct 1, 2022

The evolution of the data lake

Data Lake 1.0 before 2019

• Storage: Storage and computing are separated and hot and cold data are tiered, mainly on the Hadoop ecosystem
• Management: There is no official management service; users handle tasks such as scaling out and in and disk operation and maintenance themselves
• Computing: Cloud-native computing is initially realized, but computing lacks elasticity and diversity

The concept of the data lake must be familiar to everyone. When it was first discussed, before 2019, it rested largely on the simple idea of separating storage and computing, so that storage capacity can scale flexibly and computing resources can be configured according to computing needs. At that time, storage could basically be standardized as a service, and computing could be planned separately from storage. What was still relatively lacking was better management of upper-layer data and computing flexibility.

Data Lake 2.0 2019~2021
• Storage: Centered on object storage; unified storage carries production workloads at large scale and with high performance
• Management: Vertical lake management systems are provided for OSS, EMR, and other products, but linkage between products is lacking
• Computing: Elastic computing; users can scale computing according to load

Building on Data Lake 1.0, we added many further capabilities. In particular, once storage was standardized, Alibaba Cloud Object Storage Service (OSS) became a standard underlying storage solution for data lakes; its stability, scale, and performance give the data lake a solid base. You can run single clusters on top of it, for example by spinning up an EMR cluster to manage and control some of the data, but this is still a fairly preliminary state: as long as a computing cluster exists, it can reference data in the lake and manage the metadata itself. At the same time, cloud-native approaches make more elastic computing possible. Of the three indicators (storage, computing, and management), storage advanced the fastest, computing became more diverse, and management was gradually being built.

Data Lake 3.0 2021
• Storage: Centered on object storage; enterprise-level, fully compatible, multi-protocol, with unified metadata
• Management: One-stop lake construction and management across lake storage and computing, achieving intelligent "lake construction" and "lake management"
• Computing: Computing is not only cloud-native and elastic, but also real-time, AI-oriented, and ecosystem-integrated

In Data Lake 3.0, the basic thinking is that all three indicators, storage, computing, and management, develop further. Storage needs broader compatibility, better consistency, and better durability. More importantly, in terms of management, the data lake should not be just a pile of data that flows in from hundreds of rivers and is dumped there; it should be managed in an orderly manner. What data is stored in the lake, how it is used, how often it is used, and what its quality is: these questions, long considered in the traditional data warehouse field, apply to the data lake too. The lake should have a complete and mature management system, just like a warehouse. As for computing, the goal is not only elastic capacity but also diversification. In the past we mostly did ETL; now we are starting to do real-time computing and AI computing, and to combine many ecosystem computing engines with the lake. These are some of the core issues that Data Lake 3.0 needs to solve.

Storage upgrade from "cost center" to "value center"
• Smooth migration to the cloud: 100% compatible with HDFS, so existing data migrates smoothly
• Easier operation and maintenance: delivered fully as a service
• Ultimate cost-effectiveness: hot and cold tiering, trillions of files in a single bucket, and costs reduced by 90%
• Accelerated AI innovation: data flows on demand, computing wait time is greatly reduced, and data is managed efficiently

With object storage such as OSS as the underlying layer, we have achieved a very smooth migration to the cloud and reduced the difficulty of operation, maintenance, and management. A unified, standard storage layer lets many technologies mature on top of it. With hot and cold tiering, for example, OSS automatically places data in cold or hot storage without the user having to care, thereby reducing storage costs. In the AI field, many people may be unfamiliar with other storage forms and prefer a traditional file system such as CPFS. The connection between CPFS and OSS adds many new storage functions and can ease users' migration troubles.
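The hot and cold tiering idea can be illustrated with a minimal sketch. It assumes a simple rule, "move an object to cold storage once it has been idle for a fixed number of days"; the tier names, the `ObjectInfo` type, and the 30-day threshold are hypothetical and are not OSS storage classes, defaults, or APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical tier labels for illustration only (not OSS storage classes).
HOT, COLD = "hot", "cold"


@dataclass
class ObjectInfo:
    key: str               # object name in the bucket
    last_access: datetime  # last time the object was read


def choose_tier(obj, now, cold_after_days=30):
    """Return COLD if the object has been idle for cold_after_days or more, else HOT."""
    idle = now - obj.last_access
    return COLD if idle >= timedelta(days=cold_after_days) else HOT


def plan_tiering(objects, now, cold_after_days=30):
    """Map each object key to the tier this policy would place it in."""
    return {o.key: choose_tier(o, now, cold_after_days) for o in objects}


if __name__ == "__main__":
    now = datetime(2022, 10, 1)
    objects = [
        ObjectInfo("logs/2022-01.parquet", datetime(2022, 1, 31)),
        ObjectInfo("orders/latest.parquet", datetime(2022, 9, 30)),
    ]
    print(plan_tiering(objects, now))
```

In a managed service the user would only declare such a rule; the storage layer applies it, which is what "without the user having to care" amounts to.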

Intelligent upgrade of "Building Lake", "Managing Lake" and "Governing Lake"
• Intelligent data ingestion into the lake
One-click ingestion from multiple data sources, supporting offline and real-time ingestion
• Metadata service for data computing
A managed metadata service that handles single tables with millions of partitions
• Unified data permission management
Integrates with multiple engines and supports fine-grained access control at the database, table, and column level
• Integrated data governance for lake and warehouse
Unified data development and end-to-end data governance across the data lake and data warehouse

We spent more than a year building a new product, Alibaba Cloud Data Lake Formation (DLF), to better support building, managing, and governing the lake. The first focus is how to get data into the lake in a more standardized and systematic way: not just by writing a pile of scripts, but by managing ingestion properly and gathering diverse data into the lake more easily.

The second is the metadata service. In a data warehouse, metadata is built together with the warehouse. In a data lake, the data is stored in OSS, so metadata has to be managed separately. DLF provides this layer as a more service-oriented and standardized metadata service, especially for combining metadata with higher-level tools such as BI. Data permissions, data quality, and similar capabilities built on metadata make this layer easier to govern. The connection between DataWorks and the data lake also enables better data governance.

In an enterprise, data takes many forms, some in the data lake and some in the warehouse. You may have heard the term LakeHouse in the industry; it is often described as building a warehouse on the lake. In fact, an enterprise does not only need to build warehouses on the lake from scratch, because many traditional data warehouses already exist, and well-organized ones, as orderly as Excel sheets, are genuinely useful. How to link the flexibility of the lake with the structure of the warehouse therefore shapes the tools and methodologies we use for building, managing, and governing the lake.
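Fine-grained access control at the database, table, and column level can be sketched as a lookup of granted columns per table. This is an illustrative model only: the `Policy` class, its methods, and the sample names are hypothetical and do not represent the DLF API.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    # Grants keyed by (database, table); the value is the set of readable columns.
    # "*" stands for every column of that table. All names are hypothetical.
    grants: dict = field(default_factory=dict)

    def grant(self, database, table, columns):
        """Allow reading the given columns of database.table."""
        self.grants.setdefault((database, table), set()).update(columns)

    def can_read(self, database, table, column):
        """Return True if the column is granted explicitly or through "*"."""
        allowed = self.grants.get((database, table), set())
        return "*" in allowed or column in allowed


def filter_columns(policy, database, table, requested):
    """Keep only the requested columns the policy allows, preserving their order."""
    return [c for c in requested if policy.can_read(database, table, c)]


if __name__ == "__main__":
    policy = Policy()
    policy.grant("sales", "orders", ["order_id", "amount"])
    print(filter_columns(policy, "sales", "orders",
                         ["order_id", "customer_email", "amount"]))
```

Keeping the policy in one place, next to the metadata, is what allows several engines to enforce the same rules rather than each maintaining its own.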

The upgrade from "single computing" to "full-scenario intelligent computing"
• Real-time data lake
Real-time data ingestion into the lake, with minute-level updates
• Lake and warehouse integration
Connects lakes and warehouses, improves enterprise data business capabilities, and lets data flow intelligently
• Data science
From BI to AI scenarios, with support for deep learning and heterogeneous computing frameworks
• Multi-ecosystem computing engines
Supports Databricks, Cloudera, and other diverse computing and analysis capabilities

How can the data lake become more real-time? Real-time data lake functionality is achieved through open source components such as Hudi. How can it better meet the needs of data science? In the AI field, for example, data scientists often prefer Python-based, programming-oriented development experiences; how can these be combined with the underlying data lake storage and management system? How can mature enterprise-level ecosystem products such as Databricks and Cloudera be combined with our underlying data lake? These are some of the enterprise-level capabilities we have been building over the past year, along with capabilities that make the data lake easier for developers and engineers to use. How to do storage, how to do management, and how to support more diverse computing: these are the core points as the data lake develops into the 3.0 stage.
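Real-time updates in a lake depend on upsert semantics: an incoming change either inserts a new record or replaces an older version of the same record. The sketch below shows only that merge rule in plain Python. It is not Hudi's API; the field names `id` and `updated_at` are assumptions chosen for illustration.

```python
def upsert(table, changes, key="id", ts="updated_at"):
    """Merge change records into a table and return the merged copy.

    table   : dict mapping record key -> record (a dict)
    changes : iterable of records, possibly out of order
    A change wins only if its timestamp is at least as new as the stored one.
    """
    merged = dict(table)  # shallow copy; the input table is left unmodified
    for record in changes:
        current = merged.get(record[key])
        if current is None or record[ts] >= current[ts]:
            merged[record[key]] = record
    return merged


if __name__ == "__main__":
    table = {1: {"id": 1, "updated_at": 10, "status": "new"}}
    changes = [
        {"id": 1, "updated_at": 12, "status": "paid"},   # newer version of record 1
        {"id": 2, "updated_at": 11, "status": "new"},    # brand-new record
        {"id": 1, "updated_at": 11, "status": "stale"},  # late, older change: ignored
    ]
    print(upsert(table, changes))
```

Comparing timestamps rather than arrival order is a deliberate choice here: it keeps the result stable when change records arrive late or out of order.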

Thousands of Enterprises and Alibaba Cloud Work Together to Open Data Lake 3.0 Best Practices
• 6000+ data lake customers
• Exabyte-scale data lake capacity
• Real-time data ingestion into the lake with minute-level latency
• TB-level data lake throughput

On Alibaba Cloud, many enterprises are using data lakes. They use very large amounts of storage and highly diverse computing, and through that use we have polished this product together. Since 2019, the continuous iteration of the data lake has been inseparable from the trust of our partners. Thank you, everyone.
