Cloud-Lake Symbiosis: Is the Next Generation of the Data Lake Coming?
Introduction: Alibaba Cloud's data lake practice takes an application-oriented view of data value and supports rapid data insight and fast iteration of data outputs.
The data lake is not a new concept, but it has recently been mentioned by more and more people and has become something of an internet sensation, with different people understanding it in different ways.
At this year's Yunqi Conference, the cloud-native data lake system was officially released online and attracted the attention of enterprises. Had it not been for the special circumstances of 2020, the offline "Data Lake Summit Forum" held on October 23 would probably have been several times larger.
At the forum, enterprise users spoke candidly about the pitfalls they had run into. They wanted to know how architectures are evolving behind the latest big data innovations, and how those changes could improve their core competitiveness and business agility.
Against this backdrop, the cloud-native enterprise-level data lake solution released by Alibaba Cloud, described as the industry's first, has become a new option for them. The solution will be used at scale during this year's Double 11, supporting the Alibaba economy and millions of customers as they migrate fully to the cloud.
The polarization of data value
In 2020, data volumes continue to grow explosively, and digital transformation has once again become an industry hot topic. The social effects of the "new infrastructure" built on cloud computing, big data, and AI can be felt first-hand.
Data needs deeper value mining. In Chen Qikun's view, the value of data is polarizing. One pole is timely discovery and real-time analysis that quickly drives business development. The other is finding the patterns in data, analyzing its value in a unified way, and providing a reference for business development.
Earlier architectures that couple compute and storage suffer from very low resource utilization. Data accumulates and grows continuously, while demand for computing power has peaks and valleys. To store more data, an enterprise has to buy more compute as well, because the two must be expanded together. The result is suboptimal stability, no way to scale the two resources independently, and suboptimal cost.
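The utilization problem can be made concrete with a little arithmetic. The sketch below is illustrative only: the node sizes and workload figures are made-up assumptions, not numbers from Alibaba Cloud, and the function names are hypothetical.

```python
import math


def coupled_cluster(storage_tb, peak_cores, tb_per_node, cores_per_node):
    """Size a cluster whose nodes bundle storage and compute together.

    The node count is driven by whichever resource needs more nodes,
    so the other resource ends up over-provisioned.
    """
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(peak_cores / cores_per_node)
    nodes = max(nodes_for_storage, nodes_for_compute)
    return {"nodes": nodes, "cores": nodes * cores_per_node}


def compute_utilization(avg_cores_used, provisioned_cores):
    """Fraction of provisioned cores that are busy on average."""
    return avg_cores_used / provisioned_cores


# Hypothetical workload: a lot of data, modest and bursty compute demand.
cluster = coupled_cluster(storage_tb=1000, peak_cores=400,
                          tb_per_node=20, cores_per_node=32)
coupled_util = compute_utilization(avg_cores_used=160,
                                   provisioned_cores=cluster["cores"])

# With storage and compute separated, compute is sized for its own peak only.
decoupled_util = compute_utilization(avg_cores_used=160, provisioned_cores=400)

print(cluster, coupled_util, decoupled_util)
```

With these assumed numbers, storage alone forces 50 nodes (1,600 cores), so average compute utilization is 10%. Sizing compute for its own peak of 400 cores raises that to 40%, and elastic scaling below the peak would raise it further.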
In the traditional architecture, raw data is stored uniformly on HDFS, and the engines are mainly Hadoop and Spark. Because of limitations in the open-source software itself, this traditional technology cannot meet enterprise users' needs in areas such as data scale, storage cost, query performance, and upgrading to an elastic compute architecture.
Redefine the Next Generation Data Lake
Although the data lake concept has existed for a long time, the reason it is being discussed so much recently is application requirements. As enterprise businesses evolve, they need lower data storage costs, more fine-grained data asset management, shareable metadata, more real-time data updates, and more powerful data access tools. In response, Alibaba Cloud officially released its cloud-native enterprise-level data lake solution, which has the following main parts:
Unified data lake storage replaces HDFS with the cloud object storage service OSS to increase data scale, reduce storage cost, and separate compute from storage;
The Data Lake Formation (DLF) service provides unified metadata and unified permission management, and supports access from multiple engines;
Making compute engines cloud native, such as Spark on EMR, makes better use of elastic computing resources;
DataWorks, a cloud platform for data development and governance, addresses data lake metadata governance, data integration, and data development.
In Chen Qikun's view, Alibaba Cloud's cloud-native data lake solution redefines the next-generation data lake system and is more enterprise-grade in character.
First, it must support the core production environment of mobile internet and IoT services. For enterprises, the production environment for new internet applications must be enterprise-grade. It is impractical to move the PB-scale data generated by mobile or social media applications to a separate analysis engine for real-time analysis, so big data analysis has to run in the production environment itself.
Second, the data lake must be able to hold exabyte-scale data. With Alibaba Cloud Object Storage Service (OSS) as the big data store, operations such as instantaneous renames of large files and cache acceleration are not a problem.
At the same time, real-time analysis of data that is tightly coupled to the business requires elastic computing power, along with an SLA guarantee on elastic performance. OSS serves as the unified storage layer of the data lake. Because storage and compute are separated, enterprises can choose different compute engines while storing data of any scale, which makes OSS well suited as the foundation for an enterprise data lake.
In addition, at this forum Alibaba Cloud released the OSS accelerator. Unlike a self-built cache on a traditional cluster, the OSS accelerator scales elastically: it provides 200 MBps of throughput per TB, scales linearly, and can be enabled at any time. Building on OSS's intelligent metadata architecture, the accelerator also provides a consistency guarantee that traditional caching solutions lack. When a file on OSS is updated, the accelerator automatically detects the change so that the engine reads the latest data.
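The two properties described here, throughput that grows linearly with capacity and a cache that never serves stale data, can be illustrated with a toy model. This is a minimal sketch under stated assumptions: the article does not describe how the accelerator detects updates, so the per-read version check below is an assumption, and `ObjectStore` and `Accelerator` are hypothetical classes, not the OSS API.

```python
from dataclasses import dataclass, field

THROUGHPUT_MBPS_PER_TB = 200  # figure quoted in the article


@dataclass
class ObjectStore:
    """Toy stand-in for an object store: key -> (version, data)."""
    objects: dict = field(default_factory=dict)

    def put(self, key, data):
        # Every write bumps the object's version number.
        version = self.objects.get(key, (0, b""))[0] + 1
        self.objects[key] = (version, data)

    def head(self, key):
        """Cheap metadata lookup: return only the current version."""
        return self.objects[key][0]

    def get(self, key):
        return self.objects[key]


@dataclass
class Accelerator:
    """Toy cache that checks object metadata before serving a cached copy."""
    store: ObjectStore
    capacity_tb: float
    cache: dict = field(default_factory=dict)
    backend_reads: int = 0

    @property
    def throughput_mbps(self):
        # Linear scaling: throughput is proportional to provisioned capacity.
        return THROUGHPUT_MBPS_PER_TB * self.capacity_tb

    def read(self, key):
        current_version = self.store.head(key)
        cached = self.cache.get(key)
        if cached is None or cached[0] != current_version:
            # Cache miss or stale entry: refetch from the backing store.
            self.backend_reads += 1
            cached = self.store.get(key)
            self.cache[key] = cached
        return cached[1]


store = ObjectStore()
store.put("logs/a", b"v1")
acc = Accelerator(store, capacity_tb=5)

first = acc.read("logs/a")       # miss: fetched from the store
second = acc.read("logs/a")      # hit: served from the cache
store.put("logs/a", b"v2")       # the object is updated on the store
third = acc.read("logs/a")       # stale entry detected and refreshed
```

In this model a 5 TB accelerator is rated at 1,000 MBps, and the third read returns the updated contents because the version in the metadata no longer matches the cached copy.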
Finally, it must offer secure storage and unified management to protect both the business and its data. Alibaba Cloud's end-to-end encryption, multi-layer cloud protection, and built-in defenses help secure data in the cloud. In addition, globally deployed clusters, end-to-end CRC checks, and proactive hardware fault detection help keep the production environment of internet applications running safely.
Manage it, use it, and use it well
Wherever the data is, that is where the analysis should be. On the question of how to store and analyze data and extract patterns and value from it, Jia Yangqing, vice president of Alibaba Group and head of the Alibaba Cloud Intelligent Computing Platform Division, believes the core of Alibaba Cloud's data lake system is that data can be managed, can be used, and can be used well. This principle comes from customers' real-world needs.
Managing data means building the data lake on OSS. By managing metadata, we know where the data is. For future data lake scenarios involving massive data, OSS object storage is well suited to building large-scale, efficient, and secure data lakes.
Using data requires a variety of compute engines. Whether an engine is a traditional open-source one or a horizontal compute engine that Alibaba Cloud has built from its own applications, it can be connected to business applications and to various computing and analysis platforms, making data easier for users to work with.
Integration with the data lake happens mainly at two levels: metadata and the storage engine. Metadata is shared by all users through a unified metadata access interface, and each engine accesses it with its own customized metadata client. The metadata service provides tenant isolation and authentication for each user.
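The arrangement described above, one shared metadata service, engine-specific clients, and per-tenant isolation with authentication, can be sketched as a small in-memory model. Everything here is hypothetical and for illustration only: the class names, the token scheme, and the `oss://acme-lake/...` location are assumptions, not the DLF API.

```python
class AuthError(Exception):
    """Raised when a request carries an unknown token."""


class MetadataService:
    """Toy shared catalog: one interface, namespaced per tenant."""

    def __init__(self):
        self._tokens = {}   # token -> tenant
        self._tables = {}   # (tenant, table) -> storage location

    def register_tenant(self, tenant, token):
        self._tokens[token] = tenant

    def _authenticate(self, token):
        try:
            return self._tokens[token]
        except KeyError:
            raise AuthError("unknown token") from None

    def create_table(self, token, table, location):
        tenant = self._authenticate(token)
        self._tables[(tenant, table)] = location

    def get_location(self, token, table):
        tenant = self._authenticate(token)
        # Lookups are scoped to the caller's tenant, so another tenant's
        # table is simply not found (KeyError).
        return self._tables[(tenant, table)]


class EngineClient:
    """Engine-specific client; every engine talks to the same service."""

    def __init__(self, engine_name, service, token):
        self.engine_name = engine_name
        self.service = service
        self.token = token

    def plan_scan(self, table):
        location = self.service.get_location(self.token, table)
        return f"{self.engine_name} scan {location}"


service = MetadataService()
service.register_tenant("acme", "t-acme")
service.register_tenant("globex", "t-globex")
service.create_table("t-acme", "events", "oss://acme-lake/events/")

spark = EngineClient("spark", service, "t-acme")
sql = EngineClient("sql", service, "t-acme")
other = EngineClient("spark", service, "t-globex")
```

Both `spark` and `sql` resolve the `events` table to the same storage location, which is the point of sharing metadata across engines. The `other` client belongs to a different tenant and cannot see the table at all.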
Together, the OSS-based data lake and the MaxCompute data warehouse let enterprises quickly build the integrated lake-and-warehouse solution they want. Data flows seamlessly between the lake and the warehouse under unified, intelligent management and scheduling, bridging the differences between the two at the storage and compute levels. This greatly improves the platform's service capabilities, so that data can truly be used well.
Fully Evolving to Cloud Native
Li Feifei, vice president of Alibaba Group and head of the Alibaba Cloud Intelligent Database Product Division, believes the industry is moving from traditional self-built data analysis systems, traditional big data platforms, traditional data warehouses, and traditional analytical databases to a cloud-native database era defined by three keywords: extreme elasticity, low cost, and service orientation.
Specifically, this means combining technologies such as serverless, storage-compute separation, resource pooling, and containerized deployment to deliver cloud-native data services, lowering the barrier to entry and the learning cost for customers.
Unlike traditional big data solutions, the serverless approach offers an integrated service covering one-click lake building, lake management, and computing and analysis. DLA connects to OSS to provide open storage services and open analysis and computing services. Multiple data sources can be brought in through one-click lake building, which automatically discovers and manages the raw data. OSS provides low-cost, high-performance, and highly secure cloud-native storage. Data lake management and cache acceleration build on community capabilities, and the integrated Spark and Crystal engines provide interactive queries and complex ETL computation and analysis.
Because compute resources are invoked in a serverless fashion, enterprises using DLA get automatic discovery and management of multi-source heterogeneous data, with resources allocated on demand and by usage, keeping costs as low as possible.
Today, IT systems have shifted from cost center to innovation center, and cloud-lake symbiosis is the architecture of the next-generation data lake 2.0. The well-known English learning platform Liulishuo launched an efficient AI English teacher in 2016, delivered as AI courses in the customized section of its self-developed app. Its adaptive course system, based on AI deep learning, offers users a systematic English learning solution. As of June 30, 2020, users had recorded a cumulative total of about 50.4 billion sentences, and their spoken practice time had reached 3.7 billion minutes.
Faced with voice data at this scale, Liulishuo designed its architecture around OSS on Alibaba Cloud to keep data storage simple and efficient. Building on Alibaba Cloud's data lake architecture, it efficiently set up a data lake system that supports its entire data iteration cycle.
Using the Alibaba Cloud data lake solution, a well-known Chinese social gaming company collects and processes global data in real time through Log Service (SLS) and delivers it to OSS for unified storage. It uses the massive elastic capacity of OSS for hot and cold data tiering and connects OSS to EMR and DLA to form a big data architecture with separated storage and compute. This supports real-time analysis of the intelligent recommendation pipeline for tens of millions of daily active players, real-time channel statistics, and fine-grained operations, helping the company raise user retention by 30%.
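Hot and cold tiering comes down to a simple decision: data that has not been touched for a while moves to cheaper storage, and data that becomes active again moves back. The sketch below illustrates only that decision. The 30-day threshold, the object keys, and the function names are assumptions made for the example, and a real deployment would normally express this as the storage service's own lifecycle rules rather than application code.

```python
from datetime import datetime, timedelta

COLD_AFTER = timedelta(days=30)  # hypothetical threshold, not from the article


def choose_tier(last_access, now, cold_after=COLD_AFTER):
    """Return "cold" for objects untouched for at least `cold_after`, else "hot"."""
    return "cold" if now - last_access >= cold_after else "hot"


def plan_transitions(objects, now):
    """List (key, new_tier) for objects whose current tier is out of date.

    `objects` maps key -> (current_tier, last_access).
    """
    moves = []
    for key, (tier, last_access) in sorted(objects.items()):
        target = choose_tier(last_access, now)
        if target != tier:
            moves.append((key, target))
    return moves


now = datetime(2020, 10, 23)
objects = {
    "logs/2020-10-22.gz": ("hot", datetime(2020, 10, 22)),   # recent, stays hot
    "logs/2020-07-01.gz": ("hot", datetime(2020, 7, 1)),     # idle, goes cold
    "logs/2020-01-01.gz": ("cold", datetime(2020, 10, 20)),  # re-read, goes hot
}
moves = plan_transitions(objects, now)
print(moves)
```

Because the storage layer is separate from compute, moving an object between tiers changes its cost without changing how engines such as EMR or DLA address it.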
Thousands of companies have now built cloud data lakes on Alibaba Cloud. A data lake should be an evolving, scalable infrastructure for big data storage, processing, and analysis. It should provide full collection, full storage, multi-mode processing, and full lifecycle management for data of any speed, any scale, and any type. And by integrating with external heterogeneous data sources, it should support a wide range of enterprise-level applications.
Looking ahead, cloud-native enterprises can readily enjoy the dividends of big data analysis. Many other enterprises, however, are at different stages of cloud migration and need to connect the data lake in the cloud with data outside it. Hybrid cloud storage or hybrid cloud products can link a customer's off-cloud data with public cloud data, so that it can be managed and tiered in a unified way in the cloud and connected to different cloud compute engines. In the data-driven era, Alibaba Cloud will help customers iterate quickly and innovate collaboratively.