Foreword: the concept of the data lake is very hot at the moment. Many frontline engineers are asking: how do we build a data lake? Does Alibaba Cloud have a mature data lake solution? Has Alibaba Cloud's data lake solution actually been put into production? How should we understand the data lake? What is the difference between a data lake and a big data platform? What kind of data lake solutions have the top cloud vendors launched? With these questions in mind, we tried to write this article, hoping it serves as a modest starting point that sparks further thinking and discussion. Thanks to Nanjing for writing the case study in section 5.1, and thanks to Xibi for the review.
This article has seven sections: 1. What is a data lake; 2. Basic features of the data lake; 3. Basic architecture of the data lake; 4. Data lake solutions of the major vendors; 5. Typical data lake application scenarios; 6. Basic process of data lake construction; 7. Summary. Limited by the author's level, errors are inevitable; readers are welcome to discuss, criticize, and correct them without hesitation.
1. What is Data Lake
The data lake is a hot concept at present, and many enterprises are building or planning to build their own data lakes. Before planning one, however, it is crucial to figure out what a data lake is, define the basic components of a data lake project, and then design its basic architecture. On the question of what a data lake is, the following definitions are a good starting point.
Wikipedia defines it as follows:
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value.
AWS's definition is more concise:
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics-from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
Microsoft's definition is vaguer: it does not clearly say what a data lake is, but skillfully describes what a data lake does:
Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics. Azure Data Lake works with existing IT investments for identity, management and security for simplified data management and governance. It also integrates seamlessly with operational stores and data warehouses so you can extend current data applications. We've drawn on the experience of working with enterprise customers and running some of the largest scale processing and analytics in the world for Microsoft businesses like Office 365, Xbox Live, Azure, Windows, Bing, and Skype. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets with a service that's ready to meet your current and future business needs.
There are many definitions of the data lake, but they basically converge on the following characteristics:
1. The data lake needs to provide sufficient storage capacity to hold all of the data in an enterprise/organization.
2. The data lake can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.
3. The data in the data lake is raw data, a complete copy of the business data, kept exactly as it is in the business systems.
4. The data lake needs comprehensive data management capabilities (complete metadata) covering all data-related elements, including data sources, data formats, connection information, data schemas, permission management, etc.
5. The data lake requires diversified analysis capabilities, including but not limited to batch processing, streaming computing, interactive analysis, and machine learning, along with task scheduling and management capabilities.
6. The data lake needs comprehensive data lifecycle management capabilities. Not only the raw data but also the intermediate results of analysis and processing need to be saved, and the processing steps need to be recorded completely, so that users can trace in detail how any piece of data was produced.
7. The data lake needs comprehensive data acquisition and data publishing capabilities. It must support various data sources, pull full/incremental data from them, and store it in a standardized way; it must also be able to push analysis and processing results to appropriate storage engines to meet different application access requirements.
8. Support for big data, including ultra-large-scale storage and scalable large-scale data processing capabilities.
To sum up, I think the data lake should be an evolving, scalable infrastructure for big data storage, processing, and analysis: it takes in data of any speed, size, and type, and provides full acquisition, full storage, multi-mode processing, and full lifecycle management, while supporting enterprise-level applications by integrating with a variety of external heterogeneous data sources.
Figure 1. Basic capabilities of the data lake
Two points need to be called out here: 1) Scalability means scalability of both scale and capability: the data lake should not only provide sufficient storage and computing capacity as data volume grows, but also provide new data processing modes as needed. For example, a business may only need batch processing at first; as it develops, interactive ad hoc analysis may be required; and as latency requirements tighten, real-time analysis, machine learning, and other capabilities may be needed. 2) Data-oriented means the data lake should be simple and easy to use, freeing users from complex IT infrastructure operations and maintenance so that they can focus on the business, on models, on algorithms, and on data. The data lake serves data scientists and analysts. At present, cloud native is arguably the ideal way to build a data lake, which will be discussed in detail in the later section on the basic architecture of the data lake.
2. Basic features of the data lake
After building a basic understanding of the concept, we need to look further into the basic characteristics of the data lake, especially how it differs from a big data platform or a traditional data warehouse. Before the detailed analysis, it is worth looking at a comparison table from the AWS official website (quoted from: https://aws.amazon.com/cn/big-data/datalakes-and-analytics/what-is-a-data-lake/). That table compares data lakes with traditional data warehouses; I think we can further analyze the characteristics of data lakes at the data level and the computing level. On the data side:
1) "fidelity". Data Lake stores an identical copy of the data in the business system. Unlike data warehouses, a copy of raw data must be stored in the data Lake. No matter the data format, data mode, or data content is modified. In this regard, data Lake emphasizes the preservation of "original" business data. At the same time, the data Lake should be able to store data of any type/format.
2) "Flexibility": one row in the preceding table contrasts "schema-on-write" with "schema-on-read". Essentially this is a question of which stage the data schema is designed in. Schema design is essential for any data application; even databases that emphasize being "schemaless", such as MongoDB, still recommend in their best practices that records adopt the same or similar structures wherever possible. The logic behind schema-on-write is that the schema is determined according to how the business will access the data before the data is written, and data is then imported according to that established schema. The benefit is a good fit between data and business, but it also means a relatively high upfront cost of ownership; when the business model is unclear and still in an exploratory stage, the data warehouse lacks flexibility.
The logic behind the "schema-on-read" that the data lake emphasizes is that business uncertainty is the norm: since we cannot anticipate how the business will change, we keep our flexibility and defer schema design, so that the whole infrastructure can fit data to the business on demand. I therefore think "fidelity" and "flexibility" are of a piece: since there is no way to predict business changes, we simply keep the data in its original state and process it as needed once a need arises. This makes the data lake better suited to innovative enterprises and businesses that change rapidly. Correspondingly, the data lake demands more of its users; data scientists and business analysts (equipped with some visualization tools) are its target customers.
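To make the contrast concrete, here is a minimal, purely illustrative sketch of schema-on-read in plain Python (the field names and records are invented for illustration): raw records are landed untouched, and each consumer applies its own schema only at query time.

```python
import json

# Raw events land in the lake exactly as produced ("fidelity"):
# no schema is enforced at write time.
raw_events = [
    '{"user": "alice", "amount": "9.99", "channel": "web"}',
    '{"user": "bob", "amount": "15.00"}',            # no channel field
    '{"user": "carol", "amount": "7.50", "extra": 1}',
]

def read_with_schema(lines, schema):
    """Schema-on-read: each consumer projects and casts the raw data
    to the shape it needs, at read time rather than write time."""
    for line in lines:
        rec = json.loads(line)
        yield {field: cast(rec.get(field)) for field, cast in schema.items()}

# A reporting job only cares about the user and a numeric amount;
# it simply ignores fields it does not need.
report_schema = {
    "user": str,
    "amount": lambda v: float(v) if v is not None else 0.0,
}

rows = list(read_with_schema(raw_events, report_schema))
```

A different consumer could read the very same raw events with a different schema, which is exactly the flexibility the deferred design buys.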
3) "manageable": the data Lake should provide perfect data management capabilities. Since data requires fidelity and flexibility, at least two types of data exist in the data Lake: raw data and processed data. Data in the data Lake is constantly accumulated and evolved. Therefore, data management capabilities are required. At least the following data management capabilities are required: data sources, data connections, data formats, and data schemas (databases, tables, columns, and rows). At the same time, data Lake is a unified data storage place in a single enterprise/organization. Therefore, it also needs to have certain permission management capabilities.
4) "traceability": Data Lake is a storage place for full data in an organization/enterprise. It needs to manage the whole lifecycle of data, including data definition, access, storage, the whole process of processing, analysis and application. To implement a powerful data lake, you must be able to trace the process of accessing, storing, processing, and consuming any data in it, the complete process of data generation and flow can be clearly reproduced.
On the computing side, I think the data lake should support a broad range of computing capabilities, driven entirely by the computing requirements of the business.
5) Rich computing engines: batch processing, streaming computing, interactive analysis, and machine learning should all be covered. In general, batch engines are used for data loading, conversion, and processing; streaming engines are used for real-time computing; and exploratory analysis scenarios may also call for an interactive analysis engine. As big data and AI technologies converge, various machine learning/deep learning frameworks are being brought in; for example, TensorFlow/PyTorch can read training samples from HDFS, S3, or OSS. For a qualified data lake project, extensibility/pluggability of the computing engines should therefore be a basic capability.
6) Multi-modal storage engines. In theory, the data lake itself should have built-in multi-modal storage engines to meet the access needs of different applications (considering response time, concurrency, access frequency, cost, and so on). In practice, however, the data in a lake is usually not accessed frequently and the applications on it are mostly exploratory; to achieve acceptable cost performance, data lakes are usually built on relatively cheap storage engines (such as S3/OSS/HDFS/OBS) and work with external storage engines when necessary to meet diverse application requirements.
3. Basic architecture of the data lake
The data lake can be considered a new generation of big data infrastructure. To better understand its basic architecture, let us first look at how big data infrastructure has evolved.
1) Phase 1: offline data processing infrastructure, represented by Hadoop. As shown in the figure below, Hadoop is a batch data processing infrastructure with HDFS as its core storage and MapReduce (MR) as its basic computing model. Around HDFS and MR, a series of components emerged to round out the platform's data processing capabilities, such as HBase for online KV operations, Hive for SQL, and Pig for workflows. As performance requirements for batch processing kept rising, new computing models were proposed and engines such as Tez, Spark, and Presto appeared, and the MR model gradually evolved into the DAG model. On the one hand, the DAG model raises the computing model's capacity for concurrency: each computation is decomposed and logically split at its aggregation points into stages, each stage consisting of one or more tasks that can execute in parallel, which improves the parallelism of the whole computation. On the other hand, to reduce writes of intermediate results during processing, engines such as Spark and Presto cache data in the memory of the compute nodes wherever possible, improving the efficiency and throughput of the whole data pipeline.
Figure 2. Hadoop architecture diagram
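The stage-splitting idea behind the DAG model can be sketched in a few lines. This toy scheduler (the stage names are invented for illustration) groups stages into waves at their dependency boundaries, i.e. the aggregation/shuffle points; every stage within a wave could run concurrently.

```python
# Toy DAG: stages are split at aggregation (shuffle) points; a stage
# can start only when all of its upstream stages have finished.
stages = {
    "read_logs": set(),                   # no dependencies
    "read_users": set(),
    "join": {"read_logs", "read_users"},  # shuffle boundary
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

def schedule(stages):
    """Group stages into waves; all stages in one wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(stages):
        wave = sorted(s for s, deps in stages.items()
                      if s not in done and deps <= done)
        if not wave:
            raise ValueError("cycle in DAG")
        waves.append(wave)
        done.update(wave)
    return waves

waves = schedule(stages)
```

Here the two read stages form the first wave and run concurrently, which is exactly the extra parallelism the DAG model offers over a rigid map-then-reduce sequence.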
2) Phase 2: the Lambda architecture. As processing capabilities and requirements kept changing, more and more users found that no matter how much batch performance improved, it could not cover scenarios with strict real-time requirements, so streaming engines such as Storm, Spark Streaming, and Flink emerged. As more applications went online, however, it turned out that most application requirements could only be met by combining batch processing with stream computing, and that users do not really care about the underlying computing model: they want both batch and streaming results served under a unified data model. This is how the Lambda architecture came about, as shown in the following illustration. (For convenience, the Lambda and Kappa architecture diagrams are both taken from the web.)
Figure 3. Lambda architecture diagram
The core concept of the Lambda architecture is combining stream and batch. As shown in the preceding figure, the data stream flows into the platform from the left, and once inside it is split into two paths: one processed in batch mode and one in streaming mode. Whichever path the computation takes, the final results are exposed to applications through a unified serving layer to ensure consistent access.
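A toy rendering of that split, with made-up numbers: the same aggregation logic runs over a historical batch path and a realtime speed path, and the serving layer merges the two views into one consistent answer.

```python
# Toy Lambda architecture: the same events feed a batch path and a
# speed path; the serving layer merges both views for one answer.
historical = [("alice", 10), ("bob", 5), ("alice", 3)]  # already batch-processed
recent = [("alice", 2), ("bob", 1)]                     # not yet in a batch run

def batch_view(events):
    """Batch layer: recompute the aggregate over the full history."""
    view = {}
    for user, amount in events:
        view[user] = view.get(user, 0) + amount
    return view

def speed_view(events):
    """Speed layer: same aggregation over only the recent events
    (incremental in a real system)."""
    return batch_view(events)

def serve(user):
    """Serving layer: merge the (stale) batch view with the realtime delta."""
    return batch_view(historical).get(user, 0) + speed_view(recent).get(user, 0)
```

The cost Lambda pays is visible even here: the same aggregation logic exists twice, once per path, which is the duplication the Kappa architecture later set out to remove.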
3) Phase 3: the Kappa architecture. The Lambda architecture solved the problem of consistent reads for applications, but its separate stream and batch pipelines increase development complexity, which prompted the question of whether a single system could solve all the problems. The currently popular answer is stream computing: its naturally distributed nature gives it good scalability, and by increasing the concurrency of the stream processor and enlarging the time window over the streaming data, batch processing and stream processing are unified.
Figure 4. Kappa architecture diagram
From the traditional Hadoop architecture to Lambda, and from Lambda to Kappa, the big data infrastructure has gradually absorbed all the data processing capabilities that applications need, evolving step by step into a platform that handles the full data of an enterprise/organization. In current enterprise practice, apart from the relational databases that remain attached to individual business systems, almost all other data is expected to be consolidated into the big data platform for unified processing. However, today's big data infrastructure focuses on storage and computing and neglects managing data as an asset, and that is precisely one of the focal points of the data lake as the new generation of big data infrastructure.
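The Kappa idea of "batch as a replayed stream" can be sketched as follows (a toy aggregation with invented data): one incremental operator serves both the realtime path and a full replay of the log, so there is only one code path to maintain.

```python
# Toy Kappa architecture: a single streaming code path; "batch" is just
# a replay of the log from the beginning through the same operator.
log = [("alice", 2), ("bob", 1), ("alice", 4), ("alice", 1)]

def streaming_sum(events, state=None):
    """A single incremental operator used for both realtime and replay."""
    state = dict(state or {})
    for user, amount in events:
        state[user] = state.get(user, 0) + amount
    return state

# Realtime: fold events in one at a time as they arrive.
state = {}
for event in log:
    state = streaming_sum([event], state)

# "Batch": replay the entire log through the same operator.
replayed = streaming_sum(log)
```

Because replaying the log yields exactly the state the realtime path accumulated, the separate batch pipeline of the Lambda architecture becomes unnecessary.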
1) "river" emphasizes mobility. "The sea contains all rivers", and the river will eventually flow into the sea, while enterprise-level data needs to be accumulated for a long time, therefore, it is more appropriate to call "lake" than "River"; At the same time, the lake water is naturally layered and meets the requirements of different ecosystems, which is in line with the construction of a unified data center for enterprises, the requirements for storing management data are the same. Hot data is stored on the upper layer to facilitate application use at any time. Warm data and cold data are stored in different storage media in the data center, balance data storage capacity and cost.
2) the reason why it is not called "sea" is that the sea is boundless and boundless, while "lake" has boundaries, which is the business boundary of enterprises/organizations; therefore, data lake requires more data management and permission management capabilities.
3) another important reason for the "lake" is that the data Lake needs fine governance. A data Lake lacking control and governance will eventually degenerate into a "data Swamp", therefore, applications cannot effectively access data and the data stored in the data is worthless.
The evolution of big data infrastructure reflects one point: within an enterprise/organization, data has become an important asset. To make better use of that asset, the enterprise/organization needs to: 1) store data assets as-is for the long term; 2) manage them effectively and govern them centrally; 3) provide multi-mode computing to meet processing requirements; and 4) provide unified data views, data models, and data processing results. The data lake arose in this context: beyond the basic capabilities of a big data platform, it emphasizes managing, governing, and operating data as an asset. Concretely, the data lake needs a series of data management components, including: 1) data access; 2) data migration; 3) data governance; 4) quality management; 5) asset catalog; 6) access control; 7) task management; 8) task orchestration; and 9) metadata management. The following figure shows a reference architecture for a data lake system. Like a big data platform, a typical data lake has the storage and computing capabilities to handle ultra-large-scale data and can provide multi-mode data processing; beyond that, the data lake provides more comprehensive data management capabilities, embodied in:
1) Stronger data access capabilities, reflected in the definition and management of various external heterogeneous data sources and in the ability to extract and migrate their data, where what is extracted and migrated includes both the metadata of the external source and the actual stored data.
2) Stronger data management capabilities, which can be divided into basic and extended management. Basic management includes metadata management, data access control, and data asset management, which are mandatory for a data lake system; how the various vendors support them is discussed later in the solutions section. Extended management includes task management, process orchestration, and capabilities related to data quality and data governance. Task management and orchestration are mainly used to manage, orchestrate, schedule, and monitor the tasks that process data in the lake; data lake builders usually obtain these capabilities by purchasing or developing customized data integration or data development subsystems/modules, which can read the lake's metadata in order to integrate with the lake. Data quality and data governance are more complex matters: data lake systems generally do not provide these functions directly, but instead expose interfaces or metadata so that capable enterprises/organizations can integrate existing data governance software or build their own.
3) Shared metadata. The computing engines in the lake are deeply integrated with the data in the lake, and that integration rests on the lake's metadata. In a good data lake system, a computing engine can obtain the storage location, format, schema, distribution, and other information about the data directly from the metadata and then process it, with no manual or programmatic intervention. Furthermore, a good data lake system can control access to the data in the lake at the database, table, column, and row levels.
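A minimal sketch of what shared metadata buys an engine, with an invented catalog entry (the bucket path, table name, and grant structure are all hypothetical, not any vendor's schema): location, format, schema, and column-level permissions all come from one catalog lookup before any data is touched.

```python
# Toy shared catalog: any engine looks up location/format/schema here
# instead of hard-coding them into each job.
catalog = {
    "sales.orders": {
        "location": "s3://my-lake/raw/orders/",  # hypothetical bucket
        "format": "json",
        "schema": {"order_id": "string", "amount": "double"},
        "grants": {"analyst": {"order_id"}},     # column-level access
    }
}

def plan_scan(table, user_role):
    """What any engine (batch, SQL, ML) derives from shared metadata
    before touching the data itself."""
    meta = catalog[table]
    allowed = meta["grants"].get(user_role, set())
    columns = [c for c in meta["schema"] if c in allowed]
    if not columns:
        raise PermissionError(f"{user_role} may not read {table}")
    return {"path": meta["location"], "format": meta["format"],
            "columns": columns}

plan = plan_scan("sales.orders", "analyst")
```

Because every engine plans its scans from the same catalog, adding a new engine requires no per-job wiring, and access control is enforced in one place.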
Figure 5. Data lake component reference architecture
It should also be pointed out that the "centralized storage" in the figure above is a business-level concept: the point is that the data of an enterprise/organization settles in one clear, unified place. In practice, data lake storage should be an elastically scalable distributed file system; most data lake practices recommend distributed systems such as S3, OSS, OBS, and HDFS as the lake's unified storage.
We can also switch to the data dimension and look at how data is processed in the lake across its lifecycle. Theoretically, in a well-managed data lake the raw data is retained permanently, while the processed data is continually improved and evolved to meet business needs, as shown in figure 6.
Figure 6. Data lifecycle in the data lake
4. Data lake solutions of the major vendors
With data lakes being the trend, the major cloud vendors have each launched their own data lake solutions and related products. This section analyzes the solutions of the mainstream vendors and maps them onto the data lake reference architecture, to help clarify the strengths and weaknesses of each.
4.1 AWS data lake solution
Figure 7. AWS data lake solution
Figure 7 shows the data lake solution recommended by AWS. The whole solution is built around AWS Lake Formation, which is essentially a management component that works with other AWS services to build an enterprise-level data lake. The figure above shows four steps: data inflow, data storage, data computing, and data application. Let us look at the key points:
1) Data inflow. Data inflow is the start of data lake construction and includes both metadata inflow and business data inflow. Metadata inflow covers creating data sources and capturing metadata, producing a data resource catalog along with the corresponding security settings and access control policies. The solution provides a dedicated component to obtain metadata from external data sources: it connects to the source, detects the data format and schema, and creates the corresponding metadata in the lake's data resource catalog. Business data flows in through ETL.
In terms of product form, metadata capture, ETL, and data preparation are abstracted out by AWS into a separate product called AWS Glue. AWS Glue shares the same data resource catalog with AWS Lake Formation, as the AWS Glue documentation states clearly: "Each AWS account has one AWS Glue Data Catalog per AWS region".
Heterogeneous data sources are supported: the AWS data lake solution supports S3, AWS relational databases, and AWS NoSQL databases, and uses components such as Glue, EMR, and Athena to let data flow freely among them.
2) Data storage.
Amazon S3 serves as the centralized storage of the entire data lake, scaling on demand and billed pay-as-you-go.
3) Data computing.
The whole solution uses AWS Glue for basic data processing. Glue's basic computing mode is batch ETL tasks, which can be started manually, on a schedule, or by events. It must be said that AWS's services integrate very well with each other: in event-triggered mode, AWS Lambda can be used for custom extension to trigger one or more tasks at once, which greatly improves the customizability of task triggering, and the various ETL tasks can be monitored well through CloudWatch.
4) Data application.
Beyond the basic batch computing mode, AWS hooks in various external computing engines to support more computing modes: for example, Athena/Redshift provide SQL-based interactive analysis, and EMR provides the full range of Spark-based computing, including the stream computing and machine learning capabilities that Spark offers.
5) Permission management.
AWS's data lake solution provides comprehensive permission management through Lake Formation, at "database-table-column" granularity. The one exception is that when Glue accesses Lake Formation, the granularity is only two levels, "database-table". From another angle, this shows how tightly Glue and Lake Formation are integrated: Glue has broader access to the data managed by Lake Formation.
Lake Formation permissions are further divided into access permissions on the data resource catalog and access permissions on the underlying data, corresponding respectively to metadata and the actual stored data. Permissions on the stored data are further divided into data access permissions and data storage access permissions: the former resemble access permissions on databases and tables in a DBMS, while the latter refine access to specific directories in S3 (both explicit and implicit). As shown in figure 8, user A cannot create a table in the specified S3 bucket without permission on the data storage location.
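That split can be mirrored in a toy check (the user names and S3 path are invented, and this is not the Lake Formation API): an operation such as creating a table succeeds only when the user holds both the catalog permission and the storage-location permission.

```python
# Toy two-layer check mirroring the split described above: metadata
# (catalog) permissions and underlying storage permissions are granted
# separately, and an operation needs both to succeed.
catalog_grants = {"userA": {"create_table"}, "userB": {"create_table"}}
storage_grants = {"userB": {"s3://lake-bucket/orders/"}}  # hypothetical path

def can_create_table(user, path):
    """Both layers must grant: the catalog action AND the storage location."""
    has_catalog = "create_table" in catalog_grants.get(user, set())
    has_storage = path in storage_grants.get(user, set())
    return has_catalog and has_storage

# userA holds the catalog permission but not the storage location,
# so the create fails, matching the situation shown in figure 8.
```

Separating the two layers lets metadata administration and storage administration be delegated to different roles without either one silently overriding the other.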
I think this further shows that the data lake needs to support multiple storage engines. In the future, a data lake may not only support core storage such as S3, OSS, OBS, and HDFS, but also bring in more types of storage engines according to application access patterns: for example, S3 for raw data, a NoSQL store for processed data suited to key-value access, and an OLAP engine for data that must serve reports or ad hoc queries in real time. Although all kinds of material stress the difference between the data lake and the data warehouse, in essence the data lake should be a concrete implementation of an integrated data management philosophy, and the "lakehouse" is likely to be a development trend.
Figure 8. Permission separation in the AWS data lake solution
In summary, the AWS data lake solution is highly mature: with metadata management and permission management at its core, it connects heterogeneous data sources upstream with the various computing engines downstream, and lets data move freely. For stream computing and machine learning, AWS has complete complementary offerings. On the streaming side, AWS provides the dedicated Kinesis family: Kinesis Data Streams processes data in real time, while Kinesis Data Firehose offers a fully managed delivery service that can easily write data into S3, with support for format conversion such as converting JSON to Parquet. Best of all, Kinesis can access the metadata in Glue, which fully reflects the ecosystem completeness of AWS's data lake solution. Likewise for machine learning, AWS provides the SageMaker service, which can read training data from S3 and write trained models back to S3. Note, though, that in AWS's data lake solution, stream computing and machine learning are not fixed parts of the solution; they are extensions of computing capability that are easy to integrate.
Finally, let us return to the data lake component reference architecture in figure 5 and see how AWS's data lake solution covers its components, as shown in figure 9.
Figure 9. Mapping of AWS data lake solution in Reference Architecture
In summary, the AWS data lake solution covers all capabilities except quality management and data governance. In practice, quality management and data governance are strongly tied to an enterprise's organizational structure and business type and require substantial custom development, so it is understandable that general-purpose solutions leave this part out. In fact, there are excellent open-source projects in this area, such as Apache Griffin; if you have strong requirements for quality management and data governance, you can build on them with custom development.
4.2 Huawei data lake solution
Figure 10. Huawei data lake solution
The information on Huawei's data lake solution comes from Huawei's official website. The relevant products on the website include Data Lake Insight (DLI) and the intelligent data lake operation platform (DAYU). DLI is roughly equivalent to a combination of AWS Lake Formation, Glue, Athena, and EMR (Flink & Spark). No overall architecture diagram of DLI could be found on the official website, so I drew one based on my own understanding, keeping the form as consistent as possible with the AWS solution for ease of comparison; if you know Huawei DLI well, corrections are welcome.
Huawei's data lake solution is complete: DLI covers all the core functions of data lake construction, data processing, data management, and data application. DLI's strength lies in the completeness of its analysis engines, including SQL-based interactive analysis and Spark- and Flink-based stream-batch unified processing engines. For the core storage engine, DLI still relies on the built-in OBS, which basically matches the capabilities of AWS S3. Huawei's data lake solution is stronger than AWS's in the upstream and downstream ecosystem: for external data sources, it supports almost all data source services currently available on Huawei Cloud.
DLI can connect with Huawei's CDM (Cloud Data Migration) and DIS (data ingestion service): 1) with DIS, DLI can define various data ingestion channels, which Flink jobs can use as sources or sinks; 2) with CDM, DLI can even ingest data from on-premises IDCs and third-party cloud services.
To better support advanced data lake functions such as data integration, data development, data governance, and quality management, Huawei Cloud provides the DAYU platform. DAYU is the implementation of Huawei's data lake governance and operation methodology: it covers the core processes of data lake governance and provides corresponding tool support, and Huawei's official documentation even offers suggestions on building a data governance organization. The landing of DAYU's data governance methodology is shown in Figure 11 (from the Huawei Cloud official website).
Figure 11 DAYU data governance methodology process
As you can see, DAYU's data governance methodology is essentially an extension of traditional data warehouse governance methodology onto the data lake infrastructure: from the data model perspective, it still includes a source layer, a multi-source integration layer, and a detail data layer, exactly as in a data warehouse. Quality rules and transformation models are generated from data models and metric models, and DAYU connects with DLI, directly calling the data processing services DLI provides to carry out data governance. Huawei Cloud's entire data lake solution covers the full lifecycle of data processing, explicitly supports data governance, and provides model- and metric-based data governance process tools; it, too, has gradually begun to evolve toward lake-warehouse integration.
4.3 Alibaba Cloud data lake solution
There are many data-related products on Alibaba Cloud. Because I currently work in the database BU, this section focuses on how to build a data lake with the database BU's products; other cloud products are only touched on briefly. Alibaba Cloud's database-based data lake solution focuses more on data lake analysis and federated analysis. The solution is shown in Figure 12.
Figure 12. Alibaba Cloud data lake solution
The entire solution still uses OSS as the centralized storage of the data lake. For data sources, the solution supports all Alibaba Cloud databases, including OLTP, OLAP, and NoSQL databases. The key points are as follows:
1) Data access and migration. For lake building, the Formation component of DLA provides metadata discovery and one-click lake building. At the time of writing, one-click lake building supports only full-volume lake building, but binlog-based incremental lake building is under development and expected to launch soon. Incremental lake building will greatly improve the freshness of data in the lake while minimizing pressure on the source business databases. Note that DLA Formation is an internal component and is not exposed externally.
2) Data resource catalog. DLA provides a Meta data catalog component to manage data assets in the data lake in a unified way, whether the data is "in the lake" or "outside the lake". The Meta data catalog is also the unified metadata entry point for federated analysis.
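The idea of one catalog describing tables wherever they physically live can be sketched in a few lines. This is a hypothetical stand-in for illustration only, not DLA's actual Meta data catalog API; all table names, locations, and fields are invented:

```python
class DataCatalog:
    """Minimal metadata catalog sketch: one registry for tables
    regardless of whether they live in the lake (object storage)
    or outside it (databases, NoSQL stores)."""

    def __init__(self):
        self._tables = {}

    def register(self, name, location, schema, source_type):
        self._tables[name] = {
            "location": location,        # e.g. an OSS path or a JDBC URL
            "schema": schema,            # column name -> type
            "source_type": source_type,  # "oss", "rds", "nosql", ...
        }

    def lookup(self, name):
        return self._tables[name]

    def tables_by_source(self, source_type):
        return [n for n, t in self._tables.items()
                if t["source_type"] == source_type]

catalog = DataCatalog()
catalog.register("clicks", "oss://bucket/clicks/", {"ad_id": "bigint"}, "oss")
catalog.register("users", "jdbc:mysql://host/users", {"uid": "bigint"}, "rds")
print(catalog.tables_by_source("oss"))  # -> ['clicks']
```

A federated query engine consults exactly this kind of registry to resolve a table name to its physical location and schema before planning the query.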
3) DLA provides two built-in computing engines: a SQL engine and a Spark engine. Both are deeply integrated with the Meta data catalog and can easily obtain metadata. Based on Spark, DLA supports batch processing, stream computing, and machine learning.
4) For the surrounding ecosystem, in addition to supporting access to and aggregation of various heterogeneous data sources, DLA is deeply integrated with the cloud-native data warehouse (formerly ADB) on the outbound side. On the one hand, DLA's processing results can be pushed directly to ADB to serve real-time, interactive, and ad hoc complex queries; on the other hand, data in ADB can also be easily written back to OSS through external tables. With DLA, the various heterogeneous data sources on Alibaba Cloud can be fully connected and data can flow freely.
5) For data integration and development, Alibaba Cloud's data lake solution offers two options, DataWorks and DMS, both providing visualized process orchestration, task scheduling, and task management capabilities. For data lifecycle management, DataWorks has more mature data map capabilities.
6) DMS provides powerful data management and data security capabilities. DMS manages data at database-table-column-row granularity, supporting enterprise-grade data security control. Beyond permission management, DMS extends the database-oriented DevOps concept to the data lake, making lake O&M and development more refined.
Refining the data application architecture of the entire data lake solution further yields the figure below.
From left to right: data producers generate various types of data (off-cloud, on-cloud, or on other clouds) and use various tools to upload it to common, standard data sources, including OSS, HDFS, and databases. DLA supports data discovery, data access, and data migration against these sources; it provides SQL- and Spark-based data processing capabilities, plus visualized data integration and development through DataWorks and DMS. For outbound application services, DLA provides a standard JDBC interface that can connect directly to reporting tools and large-screen displays. The distinguishing feature of Alibaba Cloud DLA is that it is backed by the entire Alibaba Cloud database ecosystem, OLTP, OLAP, and NoSQL alike, and offers SQL-based data processing; for enterprises with a traditional database-centered development stack, the migration cost is relatively low and the learning curve relatively flat.
Another feature of Alibaba Cloud's DLA solution is "cloud-native lake-warehouse integration". Traditional enterprise data warehouses remain irreplaceable for reporting applications in the big data era, but they cannot satisfy the flexibility that big data analysis and processing demand. We therefore recommend treating the data warehouse as an upper-layer application of the data lake: the data lake is the single authoritative store of an enterprise's or organization's raw business data; it processes that raw data according to business application requirements to form reusable intermediate results; and when the schema of an intermediate result becomes relatively fixed, DLA can push it into the data warehouse, where the enterprise or organization runs its warehouse-based business applications. Alongside DLA, Alibaba Cloud also provides the cloud-native data warehouse (formerly ADB). DLA and the cloud-native data warehouse are deeply integrated in two respects: 1) they share the same SQL parsing engine, and DLA SQL is fully compatible with ADB SQL syntax, so developers can use a single technology stack to build both data lake and data warehouse applications; 2) both can access OSS: OSS serves directly as DLA's native storage, and ADB can easily access structured data on OSS through external tables. With external tables, data can move freely between DLA and ADB, achieving true lake-warehouse integration.
The combination of DLA and ADB realizes a truly cloud-native lake-warehouse integration (what "cloud-native" means is beyond the scope of this article). In essence, DLA can be regarded as a data warehouse staging (source-aligned) layer with extended capabilities. Compared with the staging layer of a traditional warehouse, it: 1) can store all kinds of structured, semi-structured, and unstructured data; 2) can connect to all kinds of heterogeneous data sources; 3) has metadata discovery capabilities; 4) has stronger data processing power through its built-in SQL/Spark engines, meeting diversified processing needs; 5) has full data lifecycle management capabilities. A lake-warehouse integrated solution based on DLA and ADB covers the processing capabilities of both a big data platform and a data warehouse. Another important capability of DLA is building a data flow system that extends in all directions, exposed through a database-like experience. Whether data is on or off the cloud, inside or outside the organization, with the help of the data lake there are no longer barriers between systems, and data can flow freely in and out. More importantly, this flow is regulated: the data lake completely records how data moves.
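The lake-to-warehouse flow described above, processing raw data in the lake and pushing a fixed-schema intermediate result into the warehouse, can be sketched in miniature. Here sqlite stands in for the warehouse and an in-memory list stands in for raw lake files; the event data is invented:

```python
import sqlite3

# Lake side: raw events (day, channel, clicks) as they might sit in
# object storage before processing
raw_events = [
    ("2020-06-01", "app", 120), ("2020-06-01", "web", 80),
    ("2020-06-02", "app", 150),
]

# Processing in the lake produces an intermediate result with a
# fixed, reusable schema: total clicks per day
daily = {}
for day, _, clicks in raw_events:
    daily[day] = daily.get(day, 0) + clicks

# Warehouse side: once the schema is stable, push the result into
# the warehouse for interactive/reporting queries
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_clicks (day TEXT PRIMARY KEY, clicks INT)")
conn.executemany("INSERT INTO daily_clicks VALUES (?, ?)",
                 sorted(daily.items()))
total, = conn.execute("SELECT SUM(clicks) FROM daily_clicks").fetchone()
print(total)  # -> 350
```

The point of the pattern is the division of labor: the flexible lake absorbs raw, schema-diverse data, while the warehouse serves only stabilized, fixed-schema results.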
4.4 Azure data lake solution
Azure's data lake solution includes data lake storage, an interface layer, resource scheduling, and a computing engine layer, as shown in Figure 15 (from Azure's official website). The storage layer is built on Azure object storage and again supports structured, semi-structured, and unstructured data. The interface layer is WebHDFS; notably, HDFS interfaces are implemented on top of Azure object storage, a capability Azure calls "multi-protocol access on Data Lake Storage". Resource scheduling is based on YARN, and Azure provides multiple processing engines, such as U-SQL, Hadoop, and Spark.
Figure 15. Azure data lake analysis architecture
What makes Azure special is the development support it provides through Visual Studio.
1) Development tool support, deeply integrated with Visual Studio. Azure recommends U-SQL as the development language for data lake analysis applications, and Visual Studio provides a complete development environment for it. To reduce the complexity of developing against a distributed data lake system, Visual Studio encapsulates work as projects: when developing in U-SQL, you can create a U-SQL database project, within which Visual Studio makes coding and debugging easy, and a wizard is provided to publish the finished U-SQL script to the production environment. U-SQL also supports Python and R extensions to meet customized development needs.
2) Automatic conversion between different engines' tasks. Microsoft recommends U-SQL as the default development language for the data lake and provides various conversion tools supporting translation between U-SQL scripts and Hive, Spark (HDInsight & Databricks), and Azure Data Factory Data Flow.
This article discusses data lake solutions as a whole and does not evaluate any cloud vendor's individual products. From the aspects of data access, data storage, data computing, data management, and application ecosystem, a summary table along those lines can be drawn up. For reasons of length, the data lake solutions of other well-known cloud vendors, such as Google and Tencent, are not analyzed in detail; according to their official websites, those solutions are relatively simple, remaining at the level of conceptual explanation, with a recommended implementation of "OSS + Hadoop (EMR)". In fact, a data lake should not be viewed as merely a technology platform; there are many ways to implement one. The key to evaluating the maturity of a data lake solution is its data management capability, including but not limited to metadata, data asset catalogs, data sources, data processing tasks, data lifecycle, data governance, and permission management, as well as its ability to connect with the surrounding ecosystem.
5. Typical data lake application cases
5.1 Analysis of advertising data
In recent years, the cost of acquiring traffic has risen steadily. The doubling of customer acquisition costs on online channels poses severe challenges to every industry. Against the backdrop of ever more expensive Internet advertising, a strategy of simply paying for traffic to attract new users is bound to fail, and optimizing the front end of the traffic funnel has become a top priority. Using data tools to improve conversion once traffic reaches the site, and to refine the operation of every stage of advertising, is the more direct and effective way to change the situation; after all, improving the conversion rate of advertising traffic ultimately depends on big data analysis.
To provide a better basis for decision-making, more tracking data needs to be collected and analyzed, including but not limited to channels, delivery time, and target audiences, with click-through rate as the analysis metric, so as to deliver better and faster solutions and suggestions and achieve high efficiency and output. Facing the advertising domain's requirements for collecting, storing, and analyzing multi-dimensional, multi-media, multi-placement structured, semi-structured, and unstructured data and turning it into decision suggestions, data lake analysis products are warmly favored by advertisers and publishers when selecting new-generation technology.
DG is a world-leading international intelligent marketing service provider. Based on advanced advertising technology, big data, and operational capabilities, it provides customers with global high-quality user acquisition and traffic monetization services. DG decided from its founding to build its IT infrastructure on the public cloud. Initially, DG chose the AWS cloud platform, storing its advertising data mainly in S3 as a data lake and running interactive analysis through Athena. However, with the rapid development of Internet advertising, the industry presented several major challenges that DG's mobile ad publishing and tracking system had to solve:
1) Concurrency and traffic peaks. Traffic peaks are common in the advertising industry, with instantaneous clicks possibly reaching tens or even hundreds of thousands, requiring very good system scalability to respond to and process every click quickly.
2) Real-time analysis of massive data. To monitor advertising effectiveness, the system needs to analyze every user click and activation in real time while transmitting the relevant data to downstream media.
3) Rapidly growing data volume. Daily business log data is continuously generated and uploaded, and exposure, click, and push data are continuously processed, adding roughly 10-50 TB per day, which places higher demands on the entire data processing system: it must efficiently complete offline/near-real-time statistics on advertising data and perform aggregate analysis along the dimensions advertisers require.
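The dimensional aggregation mentioned here, for example computing click-through rate per channel, can be sketched as follows (field names and events are invented for illustration; a real system would run this as a distributed SQL aggregation over the lake):

```python
from collections import defaultdict

def ctr_by_dimension(events, dim):
    """Aggregate impressions and clicks by one dimension and
    compute the click-through rate (clicks / impressions)."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for e in events:
        shown[e[dim]] += 1
        clicked[e[dim]] += e["clicked"]
    return {k: clicked[k] / shown[k] for k in shown}

# Four impression events, each recording whether it was clicked
events = [
    {"channel": "app", "clicked": 1},
    {"channel": "app", "clicked": 0},
    {"channel": "web", "clicked": 1},
    {"channel": "web", "clicked": 1},
]
print(ctr_by_dimension(events, "channel"))  # -> {'app': 0.5, 'web': 1.0}
```

The same function works for any dimension (media, market, campaign) simply by changing the `dim` argument, which is the essence of aggregating "along the dimensions advertisers require".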
Facing these three business challenges, and with DG's daily incremental data growing rapidly (daily data scans now exceed 100 TB), continuing on AWS ran into the bandwidth bottleneck of Athena reading data from S3: analysis lag grew longer and longer, while the investment needed to keep up with data and analysis growth rose sharply. After careful testing and analysis, DG finally decided to migrate all its sites from AWS to Alibaba Cloud. The new architecture is shown below.
Figure 16. Architecture of the advertising data lake solution after migration
After moving from AWS to Alibaba Cloud, we designed a "Data Lake Analytics + OSS" architecture whose elastic analysis capability copes with business peaks and troughs. On the one hand, it easily handles ad hoc analysis requests from brand customers; on the other hand, the powerful computing capability of Data Lake Analytics is used to analyze monthly and quarterly ad delivery, accurately accounting for the campaigns under each brand and breaking down each campaign's delivery effect by media, market, channel, and DMP, further improving the sales conversion that the Jiahe intelligent traffic platform brings to brand marketing. In terms of the total cost of ownership of advertising analysis, the serverless Data Lake Analytics service is billed on demand, with no need to purchase fixed resources; this matches the resource fluctuations of the business cycle, meets elastic analysis needs, and greatly reduces O&M and usage costs.
Figure 17. Data lake deployment diagram
Overall, after DG switched from AWS to Alibaba Cloud, hardware, labor, and development costs all fell sharply. Thanks to the serverless DLA cloud service, DG no longer needs large upfront investments in servers and storage hardware, nor one-off purchases of large amounts of cloud services; its infrastructure scales entirely on demand, expanding when demand is high and shrinking when it falls, which improves capital utilization. The second significant benefit of the Alibaba Cloud platform is performance. During DG's rapid growth and the subsequent onboarding of multiple business lines, traffic to its mobile advertising system often grew explosively; on AWS, Athena hit a huge bandwidth bottleneck reading data from S3 and analysis times grew longer and longer. Alibaba Cloud's DLA team, together with the OSS team, carried out extensive optimization and re-engineering. At the same time, DLA's analysis is based on a computing engine shared with AnalyticDB (which ranked first in the TPC-DS benchmark), delivering performance dozens of times higher than the native Presto engine and greatly improving analysis performance for DG.
5.2 Game operation data analysis
A data lake is big data infrastructure with an excellent TCO. For many fast-growing game companies, a hit game often brings very rapid data growth in a short time, while the technology stack of the company's R&D staff struggles to keep pace with that data volume and growth rate; the explosively growing data then goes underused. The data lake is a technical option for solving exactly this kind of problem.
YJ is a fast-growing game company that hopes to conduct in-depth analysis of user behavior data to guide game development and operations. The core logic behind the data analysis: as competition in the game industry intensifies, players demand ever higher quality and game project lifecycles grow ever shorter, directly hurting the return on investment; data-driven operations can effectively extend a project's lifecycle and precisely steer the business at each stage. Meanwhile, with traffic costs rising, building an economical and efficient refined data operation system to better support business development is increasingly important. A data operation system needs supporting infrastructure, and choosing that infrastructure is a question for the company's technical decision-makers. The considerations include:
1) Sufficient elasticity. Games often see short bursts of explosive growth in data volume, so whether the infrastructure, both compute and storage, can adapt to such growth and meet the elasticity demand is a key consideration.
2) Sufficient cost-effectiveness. User behavior data often needs to be analyzed and compared over long periods; retention, for example, may need to be tracked over 90 or even 180 days. How to store large volumes of data long-term in the most cost-effective way is therefore a key question.
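Long-window retention, such as the 90-day retention mentioned above, boils down to checking whether each user was still active a given number of days after install; a minimal sketch (user IDs and dates are invented):

```python
from datetime import date

def retention_rate(installs, logins, days):
    """Fraction of installed users who logged in at least `days`
    days after their install date."""
    retained = 0
    for uid, install_day in installs.items():
        cutoff = install_day.toordinal() + days
        if any(d.toordinal() >= cutoff for d in logins.get(uid, [])):
            retained += 1
    return retained / len(installs)

installs = {"u1": date(2020, 1, 1), "u2": date(2020, 1, 1), "u3": date(2020, 1, 5)}
logins = {
    "u1": [date(2020, 1, 2), date(2020, 4, 10)],  # still active after 90 days
    "u2": [date(2020, 1, 3)],                      # churned
    "u3": [date(2020, 4, 20)],                     # still active after 90 days
}
print(round(retention_rate(installs, logins, 90), 2))  # -> 0.67
```

Because computing this requires raw login events from months ago, the cheap long-term storage a lake provides is what makes such metrics affordable at scale.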
3) Sufficient analysis capability and scalability. User behavior is largely captured in tracking data, which must be joined with structured data such as user registration, login, and billing records. Data analysis therefore requires at least big data ETL capability, access to heterogeneous data sources, and complex analytical modeling capability.
4) A match with the company's existing technology stack that also helps future recruitment. For YJ, an important factor in technology selection is its staff's skills: most of YJ's technical team is familiar only with traditional database development, namely MySQL, and the team is short-handed, with just one person doing data operation analysis, so independently building big data analysis infrastructure in a short time is out of reach. From YJ's perspective, it is best if most analysis can be done in SQL; moreover, the hiring market holds far more SQL developers than big data engineers. Based on the customer's situation, we helped redesign the existing setup.
Figure 18. Plan before the transformation
Before the transformation, all of the customer's structured data lived in a high-specification MySQL instance, while player behavior data was collected through LogTail into Log Service (SLS) and then delivered from Log Service to OSS and ES respectively. The problems with this architecture were: 1) behavioral data and structured data were completely separated and could not be analyzed together; 2) behavioral data supported only retrieval and could not be deeply mined and analyzed; 3) OSS served merely as a storage resource, with none of its data value mined.
In fact, the customer's existing architecture already contained the prototype of a data lake: the full data was already saved in OSS; what was needed was to strengthen the customer's ability to analyze that data. Moreover, the data lake's SQL-based data processing model matched the customer's development stack. We therefore made the following adjustments to help the customer build a data lake.
Figure 19. Data lake solution after the transformation
In general, we did not change the customer's data flow; we added the DLA component on top of OSS to perform secondary processing of the OSS data. DLA provides a standard SQL computing engine and supports access to various heterogeneous data sources; processing OSS data with DLA produces data directly usable by the business. DLA's limitation, however, is that it cannot support low-latency interactive analysis scenarios. To solve this, we introduced the cloud-native data warehouse ADB, and added QuickBI at the front end as the customer's visual analysis tool. The YJ solution is a classic implementation in the gaming industry of the lake-warehouse integration pattern shown in Figure 14.
5.3 Data intelligence services in SaaS mode
YM is a data intelligence service provider offering a range of data analysis and operation services to small and medium-sized merchants. Its technical logic is shown in the figure below.
Figure 20. YM's SaaS model for intelligent data services
The platform provides multi-terminal SDKs through which users (merchants with web pages, apps, mini-programs, and other access forms) feed in all kinds of tracking data, and the platform delivers unified data access and data analysis services in SaaS form. Merchants use the various analysis services to perform fine-grained tracking data analysis and complete basic functions such as behavior statistics, customer profiling, audience selection, and advertising monitoring. However, this SaaS model has some problems:
1) Because merchant types and needs are diverse, the platform's SaaS analysis functions struggle to cover every merchant type, let alone customized needs: some merchants care about sales volume, some about customer operations, some about cost optimization, and one product cannot satisfy them all.
2) A unified data analysis service cannot support some advanced analysis functions, such as audience selection based on custom tags or customer-defined extensions; in particular, some custom tags depend on a merchant's own algorithms, which the platform cannot accommodate.
3) Data asset management needs. In the big data era, it has become a consensus that data is an asset of an enterprise or organization. How to let merchants' own data accumulate reasonably over the long term is also something a SaaS service must consider.
To sum up, we introduced the data lake into the basic model above, letting the data lake serve as the infrastructure on which each merchant accumulates data, builds models, and analyzes operations. The SaaS data intelligence service model after introducing the data lake is shown below.
Figure 21. Data intelligence services based on the data lake
As shown in Figure 21, the platform provides each user with a one-click lake building service, with which merchants build their own data lakes. On the one hand, one-click lake building synchronizes the data models (schemas) of all tracking data into the merchant's lake; on the other hand, it synchronizes all tracking data belonging to the merchant into the lake and archives daily incremental data into the lake in T+1 mode. On top of the traditional data analysis services, the lake-based service model offers users three new capabilities: data capitalization, analytical modeling, and service customization.
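T+1 archiving of daily increments into a date-partitioned lake layout can be sketched as follows. The local filesystem stands in for OSS here, and the `dt=YYYY-MM-DD` path convention is an illustrative assumption, not necessarily the platform's actual layout:

```python
import json
import tempfile
from pathlib import Path

def archive_increment(lake_root, merchant, day, records):
    """Archive one day's incremental tracking data into the merchant's
    lake under a date partition (T+1 style), producing paths like
    <lake_root>/<merchant>/dt=2020-06-01/events.json"""
    partition = Path(lake_root) / merchant / f"dt={day}"
    partition.mkdir(parents=True, exist_ok=True)
    (partition / "events.json").write_text(
        "\n".join(json.dumps(r) for r in records))
    return partition

lake = tempfile.mkdtemp()
p = archive_increment(lake, "shop_42", "2020-06-01",
                      [{"event": "click"}, {"event": "view"}])
print(p.name)  # -> dt=2020-06-01
```

Partitioning by date keeps each nightly load cheap (only the new partition is written) and lets query engines prune old partitions when only recent days are analyzed.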
1) Data asset capability. With the data lake, merchants can continuously accumulate their own data, deciding for themselves how long to store it and at what cost. The data lake also provides data asset management capabilities: merchants can manage raw data and store processed intermediate and result data in separate categories, greatly increasing the value of tracking data.
2) Analysis and modeling capability. The data lake contains not only raw data but also the schemas of the tracking data. The tracking data model reflects the platform's global abstraction of business logic; by outputting the data model through the lake alongside the raw data, merchants gain a deeper understanding of the user behavior behind the tracking data, better insight into customer behavior, and a clearer grasp of user needs.
3) Service customization capability. Using the data integration and data development capabilities the lake provides, merchants can customize their data processing according to their understanding of the tracking data model, iterating continuously on the raw data to extract valuable information and obtain value beyond the original analysis services.
6. Basic process of data lake construction
I believe the data lake is better big data processing infrastructure than the traditional big data platform, and a mature data lake is technology that sits closer to the customer's business. All the capabilities that distinguish a data lake from a big data platform, such as metadata, data asset catalogs, permission management, data lifecycle management, data integration, data development, data governance, and quality management, exist to move closer to the business and make things easier for customers. The basic technical characteristics the data lake emphasizes, such as elasticity, independently scalable storage and compute, a unified storage engine, and multi-modal computing engines, likewise exist to meet business needs and offer the business side the best-value TCO.
The construction process of the data Lake should be closely integrated with the business. However, the construction process of the data Lake should be different from the traditional data warehouse and even the hot data mid-end. The difference is that the data Lake should be built in a more agile way, "using while building, and managing while using". To better understand the agility of data Lake construction, let's take a look at the construction process of traditional data warehouses. The industry has proposed two modes of "bottom-up" and "top-down" for the construction of traditional data warehouses, which are respectively proposed by Inmon and KimBall. The specific process will not be described in detail, otherwise you can write hundreds of pages, here only briefly explain the basic ideas.
1)Inmon proposes a bottom-up (EDW-DM) data warehouse construction mode, that is, data sources of operational or transactional systems are extracted, converted and loaded to the ODS layer of the data warehouse through ETL; Data in the ODS layer, process data based on the pre-designed EDW (enterprise-level data warehouse) paradigm, and then go to EDW. EDW is generally a common data model for enterprises or organizations. It is not convenient for upper-layer applications to perform data analysis directly. Therefore, each business department will, according to its own needs, the data mart layer (DM) is processed from EDW.
Advantages: easy maintenance and high integration; Disadvantages: Once the structure is determined, the flexibility is insufficient, and the deployment cycle is long to adapt to the business. This type of data warehouse is suitable for mature and stable businesses, such as finance.
2)KimBall propose a top-down (DM-DW) data architecture. Extract or load data sources from operational or transactional systems to the ODS layer, use the dimension modeling method to build a multi-dimensional topic Data Mart (DM). Each DM is associated with a consistent dimension to form a common data warehouse for enterprises and organizations.
Advantages: rapid construction, quick return on investment, agile and flexible. Disadvantages: hard to maintain as an enterprise resource, complex structure, and difficult to integrate the data marts. This mode is often used in small and medium-sized enterprises or the Internet industry.
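Kimball-style dimensional modeling can be illustrated with a toy example: a fact table joined to a conformed dimension that multiple marts share. All table names, columns, and data below are invented for illustration; a real mart lives in a warehouse engine, not in plain Python.

```python
# Toy sketch of a fact table aggregated along a conformed dimension.
# All names and data are hypothetical.

dim_date = {  # conformed date dimension, keyed by surrogate key
    1: {"date": "2020-06-01", "month": "2020-06"},
    2: {"date": "2020-06-02", "month": "2020-06"},
}

fact_sales = [  # fact table: one row per sale
    {"date_key": 1, "amount": 100.0},
    {"date_key": 1, "amount": 50.0},
    {"date_key": 2, "amount": 70.0},
]

def monthly_revenue(facts, dates):
    """Aggregate the fact table by month via the conformed dimension."""
    totals = {}
    for row in facts:
        month = dates[row["date_key"]]["month"]
        totals[month] = totals.get(month, 0.0) + row["amount"]
    return totals

print(monthly_revenue(fact_sales, dim_date))  # {'2020-06': 220.0}
```

Because the date dimension is conformed, a second mart (say, inventory) keyed to the same `dim_date` would aggregate consistently with this one, which is what lets the marts "add up" to a warehouse.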
In fact, the above is only the theoretical process. In practice, whether you build the EDW or the DMs first, you cannot avoid a data survey and data model design before warehouse construction. Even the currently popular "data mid-end" cannot escape the basic construction process shown in the figure below. Figure 22. Basic process of data warehouse/data mid-end construction
1) Data mapping. For an enterprise/organization, the initial work is a comprehensive survey of the data within the enterprise/organization, including data sources, data types, data forms, data schemas, total data volume, and incremental data volume. An implicit but important task at this stage is to further sort out the enterprise's organizational structure and, with the help of the data survey, clarify the relationship between the data and the organizational structure. This lays the foundation for later defining user roles, permission design, and service methods.
2) Model abstraction. Sort and classify the various types of data according to the business characteristics of the enterprise/organization, divide the data into domains, form metadata for data management, and build a common data model on top of the metadata.
3) Data access. Determine the data sources to be connected based on the results of step 1, choose the required data access technology based on those sources, and bring the data in. The accessed data includes at least data source metadata, raw data metadata, and the raw data itself. All types of data are classified and stored according to the results of step 2.
4) Integrated governance. Simply put, the various computing engines provided are used to process the data, producing intermediate/result data that must be properly managed and saved. The platform should have complete data development, task management, and task scheduling capabilities and should record the data processing process in detail. More data models and indicator models take shape during governance.
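The four steps above can be sketched, in a deliberately simplified form, as a small in-memory pipeline. Every name and record here is hypothetical, and a real implementation runs on a warehouse engine rather than in plain Python; the point is only the flow: register metadata, ingest, then govern into model-conformant data.

```python
# Minimal sketch of survey -> model -> access -> governance.
# All names and data are hypothetical.

raw_orders = [
    {"id": 1, "amt": "100.5", "ts": "2020-06-01"},
    {"id": 2, "amt": "abc",   "ts": "2020-06-02"},  # bad record
]

catalog = {}  # metadata formed during the survey/model steps

def register(name, schema):
    """Steps 1-2: record source metadata and the agreed data model."""
    catalog[name] = {"schema": schema, "rows": 0}

def ingest(name, records):
    """Step 3: land the raw data and update its metadata."""
    catalog[name]["rows"] = len(records)
    return records

def govern(records):
    """Step 4: cleanse into model-conformant rows, tracking quality."""
    good, bad = [], 0
    for r in records:
        try:
            good.append({"id": r["id"], "amount": float(r["amt"])})
        except ValueError:
            bad += 1
    return good, bad

register("orders", {"id": "int", "amount": "float"})
rows = ingest("orders", raw_orders)
clean, rejected = govern(rows)
print(len(clean), rejected)  # 1 1
```

Note how the quality result (one rejected row) is itself a governance artifact: in a real platform it would feed the quality-management and tracing capabilities described above.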
The above process is too heavy for a fast-growing Internet enterprise and in many cases cannot be implemented. The most realistic obstacle is step 2, model abstraction: in many cases the business is still in trial-and-error exploration, so no general data model can be formed, and without a data model the subsequent steps cannot proceed. This is also one of the important reasons why many fast-growing enterprises feel that data warehouses and data mid-ends cannot be implemented to meet their needs.
Data lakes should be built in a more agile way. We recommend that you use the following steps to build data Lakes. Figure 23. Basic process of data Lake construction
Compared with Figure 22, these are still five steps, but the five steps are a comprehensive simplification and a "landing-oriented" improvement.
1) Data mapping. You still need to understand the basic information of the data, including the data source, data type, data form, data schema, total data volume, and data increment. But that is all: since the data lake stores the raw data in full, there is no need for in-depth design in advance.
2) Technical selection. Determine the technology choices for data lake construction based on the data survey. In fact, this step is also very simple, because the industry has many common practices here. There are three basic principles: computing-storage separation, elasticity, and independent scaling. We recommend selecting a distributed object storage system (such as S3, OSS, or OBS) as the storage. For computing engines, we recommend focusing on batch processing and SQL processing capabilities, because in practice these two are the key to data processing; stream computing engines are discussed later. We recommend giving priority to serverless computing and storage and evolving gradually through application; if an independent resource pool is truly needed, an exclusive cluster can be built later.
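As an illustration only, the outcome of this selection step might be recorded as a small manifest and checked against the three principles. The service names and fields below are examples invented for this sketch, not recommendations or a real configuration format.

```python
# Hypothetical record of a technology-selection outcome, checked
# against the three principles named above. Names are examples only.

selection = {
    "storage":      {"type": "object_store", "example": "OSS",
                     "scales_independently": True},
    "batch_engine": {"serverless": True, "scales_independently": True},
    "sql_engine":   {"serverless": True, "scales_independently": True},
}

def follows_principles(sel):
    """Computing-storage separation, elasticity, independent scaling."""
    separated = "storage" in sel and any(k.endswith("_engine") for k in sel)
    elastic = all(v.get("serverless", False)
                  for k, v in sel.items() if k.endswith("_engine"))
    independent = all(v.get("scales_independently", False)
                      for v in sel.values())
    return separated and elastic and independent

print(follows_principles(selection))  # True
```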
3) data Access . Determine the data source to be connected, complete full data extraction and incremental access.
4) Application governance. This step is the key to the data lake. I personally changed "integrated governance" to "application governance" because, from the data lake's perspective, data application and data governance should be integrated and inseparable. Start from data applications and define requirements within applications; in the process of data ETL, business-usable data gradually takes shape, and with it the data models, indicator systems, and corresponding quality standards. The data lake emphasizes storing raw data and exploratory analysis and application of that data, but this does not mean the data lake needs no data models. On the contrary, business understanding and abstraction will greatly promote the development and application of the data lake. What data lake technology enables is for data processing and modeling to retain great agility and quickly adapt to business development and change.
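"Governing while using" can be sketched as an application-driven ETL run that records lineage and basic metrics as a by-product of doing useful work. The function and field names below are hypothetical; a real lake would persist this into its data management layer rather than a Python list.

```python
# Sketch of application-driven governance: each ETL run records
# lineage and row counts as a by-product. Names are hypothetical.

lineage = []  # governance metadata accumulated while the app runs

def etl(name, inputs, transform):
    """Run an application transform and record its lineage."""
    rows_in = sum(len(v) for v in inputs.values())
    out = transform(inputs)
    lineage.append({"output": name, "inputs": list(inputs),
                    "rows_in": rows_in, "rows_out": len(out)})
    return out

raw = {"clicks": [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]}
daily_users = etl("daily_users", raw,
                  lambda d: sorted({r["user"] for r in d["clicks"]}))
print(daily_users)             # ['u1', 'u2']
print(lineage[0]["rows_out"])  # 2
```

The application (daily active users) came first; the model and its lineage record emerged from running it, which is the inversion this step argues for.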
From a technical point of view, a data lake differs from a big data platform in that the lake requires relatively complete data management, category management, process orchestration, task scheduling, data tracing, data governance, quality management, and permission management capabilities. In terms of computing power, mainstream data lake solutions currently support SQL and programmable batch processing (for machine learning, the built-in capabilities of Spark or Flink can be used); as the processing paradigm, almost all adopt a workflow model based on directed acyclic graphs and provide a corresponding integrated development environment. Various data lake solutions support stream computing in different ways. Before discussing the specific methods, let us first classify stream computing:
1) Mode 1: real-time mode. This stream computing mode processes data "one record at a time" or in "micro-batches". It is often used in online services such as risk control, recommendation, and early warning.
2) Mode 2: stream-like. This mode needs to obtain data that changed after a specified point in time, read data of a certain version, read the latest data, and so on; it is a quasi-streaming mode. It is commonly used in data exploration, for example analyzing daily active users, retention, and conversion within a certain period.
The essential difference between the two: in Mode 1, the data being processed is usually not yet stored in the data lake but only flowing over the network or in memory; in Mode 2, the data has already been stored in the lake. To sum up, I personally recommend the pattern shown in Figure 24 (the data flow diagram of the data lake). When the data lake needs Mode 1 processing capability, Kafka-like middleware should be introduced as the infrastructure for data forwarding, and a complete data lake solution should be able to route raw data into Kafka. The stream computing engine reads data from the Kafka-like component and, after processing, writes the results to OSS, RDBMS, NoSQL, or a DW as needed for access. In a sense, the Mode 1 stream computing engine does not necessarily have to be an integral part of the data lake; it only needs to be easy to introduce when applications need it. However, the following should be noted:
1) the streaming engine still needs to be able to easily read the metadata of the data Lake;
2) streaming engine tasks need to be integrated into the task management of the data Lake;
3) streaming tasks still need to be incorporated into unified permission management.
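The Mode 1 data flow can be simulated in memory to make the distinction concrete: raw events pass through a Kafka-like bus without landing in the lake first, and only the results reach a sink. Every component below is a stand-in invented for this sketch, not a real Kafka or database client.

```python
# In-memory simulation of Mode 1: events flow through a Kafka-like
# bus; the stream job writes only results to a sink. All components
# are hypothetical stand-ins for real services.

from collections import deque

bus = deque()   # stands in for a Kafka topic
sink = {}       # stands in for RDBMS/NoSQL/DW

def forward(event):
    """The lake routes raw data into the bus rather than storing it."""
    bus.append(event)

def stream_job():
    """Consume events and maintain a running per-user risk score."""
    while bus:
        e = bus.popleft()
        sink[e["user"]] = sink.get(e["user"], 0) + e["risk"]

for ev in [{"user": "u1", "risk": 2}, {"user": "u1", "risk": 3}]:
    forward(ev)
stream_job()
print(sink)  # {'u1': 5}
```

Note that after the job runs, the bus is empty and nothing was persisted except the result, matching the "flowing in network/memory" characterization of Mode 1 above.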
Mode 2 is essentially closer to batch processing. Many classic big data components, such as HUDI, Iceberg, and Delta, support it on top of classic computing engines such as Spark and Presto. For example, HUDI's table types (COW/MOR) support accessing snapshot data (a specified version), incremental data, and quasi-real-time data. Currently, AWS and Tencent have integrated HUDI into their EMR services, and Alibaba Cloud DLA is also planning to launch DLA on HUDI capabilities.
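The snapshot/incremental semantics that make Mode 2 "stream-like" can be simulated on a commit-versioned table. This is only a toy model of the semantics; HUDI's actual interface (Spark DataSource query options) is not shown here.

```python
# Toy simulation of snapshot vs. incremental queries on a
# commit-versioned table, in the spirit of HUDI/Iceberg/Delta.
# Data and functions are hypothetical.

table = [  # each record carries the commit that produced it
    {"commit": 1, "id": "a", "val": 10},
    {"commit": 2, "id": "b", "val": 20},
    {"commit": 3, "id": "a", "val": 11},  # update to "a"
]

def snapshot(commits, as_of):
    """Latest value per key among commits <= as_of (a table version)."""
    state = {}
    for r in sorted(commits, key=lambda r: r["commit"]):
        if r["commit"] <= as_of:
            state[r["id"]] = r["val"]
    return state

def incremental(commits, begin):
    """Records produced strictly after commit `begin` (change feed)."""
    return [r for r in commits if r["commit"] > begin]

print(snapshot(table, 2))     # {'a': 10, 'b': 20}
print(incremental(table, 2))  # [{'commit': 3, 'id': 'a', 'val': 11}]
```

A downstream batch job that remembers the last commit it consumed and repeatedly calls the incremental query behaves like a stream consumer over stored data, which is exactly Mode 2.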
Let's return to the first chapter of this article. We said that the main users of the data lake are data scientists and data analysts, for whom exploratory analysis and machine learning are common operations; stream computing (real-time mode) is mostly used by online business and, strictly speaking, is not a need of the data lake's target users. However, stream computing (real-time mode) is an important part of the online business of most Internet companies, and the data lake, as the centralized data store within an enterprise/organization, needs to retain some architectural extensibility so that stream computing capabilities can be integrated easily.
5) Business support. Although most data lake solutions provide standard interfaces such as JDBC, so that popular BI and dashboard tools can access data in the lake directly, in practice we recommend pushing data processed by the lake into the corresponding data engines built to support online services.
As a new generation of big data analysis and processing infrastructure, the data lake needs to go beyond the traditional big data platform. In my opinion, the following are possible future directions for data lake solutions.
1) Cloud-native architecture. There are different opinions on what cloud-native architecture means and it is hard to find a unified definition. In the data lake scenario, however, I personally see three features: (1) storage and computing are separated, and each scales independently; (2) multi-modal computing engines are supported: SQL, batch processing, stream computing, machine learning, etc.; (3) serverless services are provided, ensuring sufficient elasticity and supporting pay-as-you-go.
2) sufficient data management capabilities . Data lake requires more powerful data management capabilities, including but not limited to data source management, data category management, process orchestration, task scheduling, data tracing, data governance, quality management, permission management, etc.
3) Big data capabilities with a database experience. At present, the vast majority of data analysts only have database experience. Big data platforms, however capable, are not friendly to such users. Data scientists and data analysts should focus on data, algorithms, models, and their fit with business scenarios, rather than spending a great deal of time and energy on big data platform plumbing. If the data lake is to develop rapidly, providing users with a good experience is the key. SQL-based database application development is deeply rooted in people's minds; exposing the capabilities of the data lake through SQL is a major direction for the future.
4) Comprehensive data integration and data development capabilities. Management of and support for heterogeneous data sources, full/incremental migration of heterogeneous data, and various data formats all need continuous improvement, along with a complete, visual, and extensible integrated development environment.
5) Deep integration with business. The typical data lake architecture, distributed object storage plus multi-modal computing engines plus data management, has basically become an industry consensus. The key to the success of a data lake solution lies in data management: whether for raw data, data categories, data models, data permissions, or processing tasks, none of it can be separated from adaptation to and integration with the business. In the future, more and more industry-specific data lake solutions will emerge, forming a virtuous cycle of development and interaction with data scientists and data analysts. How to preset industry data models, ETL processes, analysis models, and customized algorithms in data lake solutions may be a key point of differentiated competition in the data lake field.