What Is a Data Lake | Basic Architecture of a Data Lake

Data lakes are a hot topic at the moment. Many people are discussing how to build a data lake, whether Alibaba Cloud has a mature data lake solution, whether that solution has actually been put into production, how to understand data lakes, and what the difference is between a data lake and a big data platform. This series of articles analyzes the data lake from these angles.

This article includes the following:
1. What is a data lake
2. Basic architecture of a data lake
3. Cloud-based data lake framework

The next article will introduce the difference between a data lake and a data warehouse.

First. What is a data lake

Before planning to build a data lake, it is very important to understand what a data lake is, to clarify the basic components of a data lake project, and then to design the basic architecture of the data lake.

A data lake is a unified storage pool that can ingest data through multiple input methods. You can store structured, semi-structured, and unstructured data at any scale. The data lake connects seamlessly with various computing and analysis platforms, so data can be processed and analyzed in place, breaking down silos and surfacing business value. At the same time, the data lake provides conversion between hot and cold storage tiers, covering the entire life cycle of the data.
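
The hot/cold tier conversion mentioned above can be sketched as a simple lifecycle rule: objects untouched for longer than a threshold are demoted to cheaper storage. This is a minimal illustration using a local directory to stand in for object storage; the `hot/`/`cold/` layout and 30-day threshold are assumptions for the example, and a real lake would use the storage service's own lifecycle API instead.

```python
import shutil
import time
from pathlib import Path

def demote_cold_objects(lake_root: str, max_age_days: float = 30.0) -> list:
    """Move objects older than max_age_days from the hot tier to the cold tier.

    A local directory stands in for object storage here; a real deployment
    would configure storage-class transition rules rather than move files.
    """
    hot = Path(lake_root) / "hot"
    cold = Path(lake_root) / "cold"
    cold.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    moved = []
    for obj in list(hot.rglob("*")):          # materialize before moving
        if obj.is_file() and obj.stat().st_mtime < cutoff:
            target = cold / obj.relative_to(hot)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(obj), str(target))
            moved.append(str(obj.relative_to(hot)))
    return moved
```

Because demotion only relocates objects within the lake, analysis engines can still reach the cold tier; only latency and cost change.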

There are many definitions of data lakes, but they basically revolve around the following characteristics:
(1) The data lake needs to provide sufficient storage capacity to hold all the data in an enterprise/organization.
(2) The data lake can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.
(3) The data in the data lake is raw data, a complete copy of the business data, preserved exactly as it was in the business system.
(4) The data lake needs to have perfect data management capabilities (perfect metadata), which can manage various data-related elements, including data sources, data formats, connection information, data schema, and permission management.
(5) The data lake needs to have diversified analysis capabilities, including but not limited to batch processing, stream computing, interactive analysis, and machine learning; at the same time, it also needs to provide certain task scheduling and management capabilities.
(6) The data lake needs complete data life cycle management. It must not only store the raw data but also save the intermediate results of analysis and processing, and record the processing applied to the data, so that users can trace the lineage of any piece of data in a complete and detailed manner.
(7) The data lake needs complete data acquisition and data publishing capabilities. It must support a variety of data sources, obtain full/incremental data from them, and store it in a standardized way. It must also be able to push the results of analysis and processing to a suitable storage engine to meet different application access requirements.
(8) Support for big data, including ultra-large-scale storage and scalable large-scale data processing capabilities.
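
Characteristic (4), the metadata layer, is what ties the other capabilities together: every dataset landed in the lake gets a catalog record covering its source, format, location, schema, and permissions. The sketch below is a deliberately minimal in-memory catalog; the field names and the `Catalog` class are illustrative, and real lakes use services such as a Hive Metastore or a managed cloud data catalog.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetEntry:
    """One catalog record per dataset landed in the lake (illustrative fields)."""
    name: str
    source: str                                   # originating system, e.g. "crm-mysql"
    fmt: str                                      # "json", "csv", "parquet", ...
    path: str                                     # location inside the lake
    schema: dict = field(default_factory=dict)    # column -> type, if known
    owners: list = field(default_factory=list)    # basis for permission management

class Catalog:
    """A minimal in-memory metadata catalog, keyed by dataset name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> DatasetEntry:
        return self._entries[name]

    def export(self) -> str:
        """Serialize the catalog, e.g. for publishing to downstream tools."""
        return json.dumps({n: asdict(e) for n, e in self._entries.items()})
```

Even this tiny structure shows why a catalog matters: without the `lookup` step, consumers would have to guess each dataset's format and schema from the raw files.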

Therefore, a data lake should be an evolving and scalable infrastructure for big data storage, processing, and analysis; data-oriented, to achieve full acquisition and full storage of any source, any speed, any scale, and any type of data, multi-mode processing, and full life cycle management; and through the interaction and integration with various external heterogeneous data sources, it supports various enterprise-level applications.

Second. Basic architecture of a data lake

A data lake has a flat architecture, because the data may be unstructured, semi-structured, or structured and is collected from various sources within an organization, whereas a data warehouse stores data in a hierarchy of files and folders. Data lakes can be hosted on-premises or in the cloud.

Because of this architecture, a data lake can scale massively, up to exabytes. This matters because when you create a data lake you often do not know in advance how much data it will need to hold, and traditional data storage systems cannot scale in this way.

This architecture greatly helps data scientists: they can mine and explore enterprise data, and share and cross-reference data (including heterogeneous data from different domains) to ask new questions and run new analyses. They can also apply big data analytics and machine learning to the data in the lake.

Although data has no fixed schema before it is stored in a data lake, with proper data governance you can still avoid creating a data swamp. Data should be tagged with metadata as it is stored in the lake to ensure it remains accessible later.
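
Tagging on write can be as simple as landing each raw object together with a sidecar metadata file. This is a sketch under assumptions: the sidecar `.meta.json` convention and the chosen tag fields are illustrative, and managed lakes record these tags in a catalog service rather than next to the data.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest_with_metadata(raw_bytes: bytes, lake_dir: str, name: str,
                         source: str, fmt: str) -> Path:
    """Land a raw object and write a sidecar .meta.json tag next to it.

    Tagging at ingest time is what keeps a lake searchable
    instead of letting it degrade into a data swamp.
    """
    root = Path(lake_dir)
    root.mkdir(parents=True, exist_ok=True)
    data_path = root / name
    data_path.write_bytes(raw_bytes)
    meta = {
        "source": source,
        "format": fmt,
        "size_bytes": len(raw_bytes),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = data_path.parent / (data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path
```

The checksum and timestamp make later governance tasks (deduplication, retention, lineage checks) possible without reopening the raw object itself.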

Third. Cloud-based data lake framework

The pillars of a data lake are scalable and durable data storage, mechanisms to collect and organize data, and tools to process and analyze data and share findings. We therefore focus on the key technologies that any modern data lake should include in order to support every type of data that big data implies.

The cloud has virtually unlimited resources. Cloud-based services are especially suitable for data lakes because cloud infrastructure can provide almost unlimited resources on demand, within minutes or seconds. Organizations pay only for what they use, so users and workloads of any size can be supported dynamically without compromising performance.

Save money and focus on the data. Cloud-built solutions spare an organization the expense of hardware, software, and other upfront infrastructure investments, as well as the cost of maintaining, updating, and securing on-premises systems.

Cloud technology comes with natural integration points. It is estimated that up to 80% of the data you want to analyze comes from business applications, operational data stores, clickstreams, social media platforms, IoT devices, and real-time streams. Integrating this data in the cloud is much easier and less expensive than building an on-premises data center.

Built-in support for NoSQL. NoSQL describes technologies that can store and analyze newer forms of data, such as data generated by machines and social media, to enrich and expand an organization's analytics. Traditional data warehouses are known to handle these data types poorly, so newer systems have emerged in recent years to process semi-structured and unstructured formats such as JSON, Avro, and XML.
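
"Schema-on-read" over semi-structured data often starts with flattening: nested JSON records are turned into dotted column names at query time rather than at load time. A minimal sketch, assuming plain Python dictionaries as input; the dotted-path naming convention is one common choice, not a standard.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten one nested JSON object into dotted columns (schema-on-read).

    Nested dicts become "parent.child" keys; scalar values pass through.
    """
    flat = {}
    for key, value in record.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=col + "."))
        else:
            flat[col] = value
    return flat
```

Because flattening happens on read, the raw JSON in the lake stays untouched, which preserves the "complete copy of the business data" property described earlier.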

Support for existing skills and expertise. A data lake supports the functionality needed to store and process any type of data efficiently: data management, data transformation, integration, visualization, and business intelligence and analytical tools that communicate easily with SQL data warehouses. The entrenched role of standard SQL also means that a large pool of people already have the skills to query the lake, while other programming languages can extract and analyze the data as well.
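
The value of those SQL skills is that lake data, once flattened, can be queried with ordinary SQL. The sketch below uses Python's built-in SQLite as a stand-in for the warehouse or lake query engine a real deployment would attach; the `events` table name and JSON-lines input format are assumptions for the example.

```python
import json
import sqlite3

def query_json_lines(jsonl_text: str, sql: str):
    """Load flat JSON-lines records into an in-memory SQLite table named
    'events' and run standard SQL against them.

    sqlite3 stands in for the SQL engines a real lake would expose;
    column names are taken from the union of keys across records.
    """
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    cols = sorted({k for r in rows for k in r})
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE events ({', '.join(cols)})")
    con.executemany(
        f"INSERT INTO events ({', '.join(cols)}) VALUES ({', '.join('?' * len(cols))})",
        [tuple(r.get(c) for c in cols) for r in rows],
    )
    return con.execute(sql).fetchall()
```

An analyst who knows only SQL can aggregate lake data this way without learning a new processing framework, which is exactly the skills-reuse argument made above.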

The inherent advantages of the cloud in terms of cost, scale, performance, ease of use, and security should be clearly recognized because of their impact on overall data lake plans and outcomes.

If you want to know more about how cloud and data lake coexist, please attend the "2022 Alibaba Cloud Global Online Data Lake Summit" to learn about the latest trends!
