On March 31, 2022, at the Alibaba Cloud Global Data Lake Summit, Alibaba Cloud presented its major "Data Lake 3.0" upgrade plan, covering three aspects: lake management, lake storage, and lake computing. A little more than 200 days later, at the Apsara Conference, Alibaba Cloud Storage upgraded its data lake access capabilities again.

A data lake is a unified storage platform that stores various types of data centrally, provides elastic capacity and throughput, covers a wide range of data sources, and lets multiple computing and analysis engines access the data directly. It supports data analysis, machine learning, and data access and management, along with functions such as fine-grained authorization and auditing.

More and more enterprises are choosing the data lake as their solution for data storage and management. At the same time, the application scenarios of the data lake keep expanding, and companies in every industry are building data lakes on the cloud. From simple analysis in the early days, to Internet search and promotion and in-depth analysis, to the large-scale AI training of the past two years, all of these workloads are built on the data lake architecture.

1. Separation of storage and compute, intelligent tiering of hot and cold data

Many Alibaba Cloud customers already run cloud data lakes larger than 100 PB, so data analysis architectures based on the data lake are clearly the direction of travel. Why is such an architecture needed?

Alex Chen, a researcher at Alibaba Group and senior product director of Alibaba Cloud Intelligence, believes the reason is that enterprises produce data constantly, and this data must be analyzed to unlock its value. Data analysis can be divided into real-time analysis and exploratory analysis. Real-time analysis uses known data to answer known questions. Exploratory analysis uses known data to answer unknown questions, so all of the data must be saved in advance, which inevitably increases storage costs.

To reduce storage costs, Alibaba Cloud chose an architecture that separates storage from compute, so that each can scale independently. Customers can load data into the lake, and computing engines can be scaled out on demand. This decoupling yields better cost-effectiveness. Alibaba Cloud Object Storage Service (OSS) is the unified storage layer of the data lake and can connect to a variety of business applications and computing and analysis platforms.

At the Apsara Conference, Alibaba Cloud Storage officially released the Deep Cold Archive storage class for OSS. At 0.0075 yuan/GB/month, it is billed as the lowest-cost cloud storage class in the industry. With lifecycle rules based on last access time, the server automatically identifies hot and cold data and tiers it accordingly. Even when a bucket contains many objects, lifecycle management can be applied to each object according to its last modification time or last access time.
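The access-time-based tiering described above can be pictured with a small Python sketch. It is not the OSS lifecycle engine and does not use the OSS rule format. The idle-time thresholds (30, 90, 180, and 365 days), the function name `pick_storage_class`, and the sample object keys are all assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical tiering thresholds: (minimum idle days, storage class).
# Ordered from coldest to hottest so the first match wins.
TIERS = [
    (365, "DeepColdArchive"),
    (180, "ColdArchive"),
    (90, "Archive"),
    (30, "IA"),
]


def pick_storage_class(last_access: datetime, now: datetime) -> str:
    """Return the storage class for an object based on how long it has been idle."""
    idle_days = (now - last_access).days
    for min_idle_days, storage_class in TIERS:
        if idle_days >= min_idle_days:
            return storage_class
    return "Standard"


# Sample objects with made-up last access times.
now = datetime(2022, 11, 5)
objects = {
    "logs/today.log": now - timedelta(days=1),
    "logs/last-quarter.log": now - timedelta(days=120),
    "backups/2020.tar": now - timedelta(days=800),
}
plan = {key: pick_storage_class(ts, now) for key, ts in objects.items()}
```

In this sketch each object is evaluated independently, which mirrors the per-object lifecycle management mentioned in the text. A real rule would be configured on the bucket rather than computed by the client.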

Objects in the OSS Archive or Cold Archive storage class can be read only after they are restored. Restoring an Archive object usually takes several minutes. Restoring a Cold Archive object usually takes several hours, depending on the restore priority, which is a significant inconvenience for some users.

To let users read Archive and Cold Archive data directly, OSS has added an archive direct-read capability, so data can be accessed without being restored first. In addition, combining data lifecycle management policies with the OSS Deep Cold Archive storage class can, according to Alibaba Cloud, reduce the cost of the entire data lake by 95%.
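To make the cost argument concrete, here is a minimal back-of-the-envelope sketch in Python. Only the Deep Cold Archive price (0.0075 yuan/GB/month) comes from the article. The hot-tier price, the 1 PB data set, and the 90% cold fraction are assumptions for illustration, and the sketch ignores retrieval fees, request fees, and minimum storage durations, so it does not attempt to reproduce the 95% figure.

```python
def monthly_cost(total_gb: float, cold_fraction: float,
                 hot_price: float, cold_price: float) -> float:
    """Monthly storage cost in yuan when a fraction of the data sits in the cold tier."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    cold_gb = total_gb * cold_fraction
    hot_gb = total_gb - cold_gb
    return hot_gb * hot_price + cold_gb * cold_price


DEEP_COLD_PRICE = 0.0075  # yuan/GB/month, taken from the article
ASSUMED_HOT_PRICE = 0.12  # yuan/GB/month, assumed for illustration only

total_gb = 1_000_000  # 1 PB expressed in GB (decimal units)
baseline = monthly_cost(total_gb, 0.0, ASSUMED_HOT_PRICE, DEEP_COLD_PRICE)
tiered = monthly_cost(total_gb, 0.9, ASSUMED_HOT_PRICE, DEEP_COLD_PRICE)
saving = 1 - tiered / baseline
print(f"baseline={baseline:.0f} tiered={tiered:.0f} saving={saving:.1%}")
```

The saving grows with the share of data that can be moved to the cold tier and with the price gap between tiers, so the actual figure depends on how cold a given data lake really is.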

2. Multi-protocol compatibility, one copy of data serving multiple applications

With the development of AI, IoT, and cloud-native technology, demand for unstructured data processing keeps growing, and the trend toward using cloud object storage as unified storage is increasingly clear. The Hadoop ecosystem has gradually evolved from HDFS to data lake systems that use cloud storage such as S3 and OSS as unified storage. The data lake has now entered the 3.0 era. For storage, it centers on object storage, with full compatibility across multiple protocols and unified metadata management. For management, one-stop lake construction and management spanning lake storage and computing enables intelligent "lake building" and "lake governance".

Peng Yaxiong, a senior product expert at Alibaba Cloud Intelligence, pointed out that the Data Lake 3.0 architecture provides a fully HDFS-compatible service. Users no longer need to build metadata management clusters and can easily migrate self-built HDFS to the data lake architecture. The architecture also natively supports multi-protocol access and unified management of multiple kinds of metadata, integrating HDFS seamlessly with the underlying object storage. Data can therefore be ingested, managed, and used efficiently and uniformly across multiple ecosystems, helping users accelerate business innovation. Read/write throughput of 100 Gbps per PB further improves data processing efficiency.

The engines in a data analysis architecture keep iterating. In AI and autonomous driving scenarios, one copy of data needs to be shared by multiple applications. As the unified storage base of the cloud data lake, OSS provides low-cost, reliable storage for massive data. The CPFS file storage service is deeply integrated with OSS. When high-performance workloads such as inference and simulation are needed, CPFS provides fast access to and analysis of data in OSS, with on-demand data flow and block-level lazy loading.
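The idea of block-level lazy loading can be sketched in a few lines of Python. This is a generic illustration of the technique, not how CPFS implements it. The class `LazyBlockFile`, the `fetch` helper, and the tiny 4-byte block size are hypothetical names and values, and the in-memory `BACKING` bytes stand in for an object held in the storage layer.

```python
from typing import Callable, Dict


class LazyBlockFile:
    """Read-only view of a remote object that fetches fixed-size blocks on first use."""

    def __init__(self, size: int, block_size: int,
                 fetch_block: Callable[[int], bytes]) -> None:
        self.size = size
        self.block_size = block_size
        self._fetch_block = fetch_block      # hypothetical backend reader
        self._cache: Dict[int, bytes] = {}   # block index -> block bytes
        self.fetch_count = 0                 # number of backend fetches so far

    def _block(self, index: int) -> bytes:
        # Fetch a block from the backend only if it is not cached yet.
        if index not in self._cache:
            self._cache[index] = self._fetch_block(index)
            self.fetch_count += 1
        return self._cache[index]

    def read(self, offset: int, length: int) -> bytes:
        """Return up to `length` bytes starting at `offset`."""
        end = min(offset + length, self.size)
        if offset < 0 or offset >= end:
            return b""
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        data = b"".join(self._block(i) for i in range(first, last + 1))
        start_in_first = offset - first * self.block_size
        return data[start_in_first:start_in_first + (end - offset)]


# Stand-in for an object in the storage layer: 10 blocks of 4 bytes each.
BACKING = bytes(range(40))
BLOCK = 4


def fetch(index: int) -> bytes:
    """Return one block of the stand-in object."""
    return BACKING[index * BLOCK:(index + 1) * BLOCK]


f = LazyBlockFile(size=len(BACKING), block_size=BLOCK, fetch_block=fetch)
chunk = f.read(6, 5)   # touches blocks 1 and 2 only
again = f.read(8, 2)   # served from the already cached block 2
```

Only the blocks that a read actually touches are pulled from the backend, and repeated reads hit the cache. That is the property that lets a compute job start working on a large object without first copying all of it.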

In addition, CPFS supports mounting and accessing file systems through POSIX clients or NFS clients, and data written through one type of client can be accessed through the other, so that massive numbers of small files can be accessed with ease.

3. Connecting cloud and on-premises, business agility and innovation

With the rapid growth of cloud computing, more and more IT infrastructure is moving to the cloud, and data increasingly sits far from the enterprise data center. According to statistics, 80% of data is generated outside the data center. Enterprises can transfer this data to their own data centers or to the cloud through RESTful APIs, HTTP, or VPN.

When building an enterprise data lake, you can first use Data Lake Formation (DLF) to handle data ingestion and metadata management, then use Log Service (SLS) to deliver data from all sources to OSS in the data lake in real time, and finally use OSS to tier hot and cold data, so that the overall data lake solution reduces cost and increases efficiency.

To simplify data management, the cloud and on-premises data centers need not only a unified namespace but also data interoperability. With data interoperability, computing power can be shifted from on-premises to the cloud at any time and allocated on demand. The prerequisite is that the data of traditional applications and emerging applications (such as IoT, big data, and AI) can be integrated. Seamless cloud adoption through a hybrid cloud IT architecture has become the new normal for enterprise applications. Hybrid cloud storage serves as the bridge between the on-premises data center and the public cloud and has become an integral part of the overall data lake solution.

The data lake is a big data architecture for the future. Only a data lake that integrates file and object storage, intelligently tiers hot and cold data, and connects data on and off the cloud has broad prospects. Alibaba Cloud's Data Lake 3.0 solutions have already been deployed in the Internet, finance, education, gaming, and other industries, and are widely used in fields with massive data, such as artificial intelligence, the Internet of Things, and autonomous driving. In the future, Alibaba Cloud hopes to work with its partners to bring the cloud-native data lake to businesses of all kinds and help more enterprises achieve digital innovation.

Innovation Motive Force behind Data Lake 3.0 Cost Reduction and Efficiency Enhancement
