
Data Lake: Concepts, Characteristics, Architecture, and Case Studies

This article provides deep insights into the data lake concept and compares some common solutions available in the market.

By Jingxuan

The concept of data lakes has recently become a hot topic. There are currently heated discussions among frontline personnel on the best way to build a data lake. Does Alibaba Cloud have a mature data lake solution? If so, has this solution been applied in actual scenarios? What is a data lake? What are the differences between a data lake and a big data platform? What data lake solutions are provided by major players in the field of cloud computing? This article attempts to answer these questions and provide deep insights into the concept of data lakes. I would like to thank Nanjing for compiling the cases in Section 5.1 of this article and thank Xibi for his review.

This article consists of seven sections:

  1. What Is a Data Lake?
  2. The Basic Characteristics of a Data Lake
  3. The Basic Architecture of a Data Lake
  4. Alibaba Cloud Data Lake Solutions
  5. Typical Data Lake Scenarios
  6. Basic Data Lake Construction Process
  7. Summary

If you have any questions when reading this article, please do not hesitate to let me know.

1. What Is a Data Lake?

The data lake concept has recently become a hot topic. Many enterprises are building or plan to build their own data lakes. Before starting to plan a data lake, we must answer the following key questions:

  1. What is a data lake?
  2. What are the building blocks of a data lake project?
  3. How can we build a basic data lake architecture?

First, I want to look at the data lake definitions provided by Wikipedia, Amazon Web Services (AWS), and Microsoft.

Wikipedia defines a data lake as:

A data lake is a system or repository of data stored in its natural/raw format,[1] usually object blobs or files. A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks, such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). [2]A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value.

AWS defines a data lake in a more direct manner:

A data lake is a centralized repository that allows you to store all of your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Microsoft's definition of a data lake is vaguer; it simply lists the capabilities of Azure Data Lake.

Azure Data Lake includes all of the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics. Azure Data Lake works with existing IT investments for identity, management, and security for simplified data management and governance. It also integrates seamlessly with operational stores and data warehouses so you can extend current data applications. We've drawn on the experience of working with enterprise customers and running some of the largest scale processing and analytics in the world for Microsoft businesses like Office 365, Xbox Live, Azure, Windows, Bing, and Skype. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets with a service that's ready to meet your current and future business needs.

Regardless of the source, most definitions of the data lake concept focus on the following characteristics of data lakes:

  1. A data lake provides sufficient data storage to store all of the data of an enterprise or organization.
  2. A data lake can store massive amounts of data of all types, including structured, semi-structured, and unstructured data.
  3. The data stored in a data lake is raw data or a complete replica of business data. Data is stored in a data lake as it is in a business system.
  4. A data lake provides full metadata to manage all types of data-related elements, including data sources, data formats, connection information, data schemas, and permission management capabilities.
  5. A data lake provides diverse analytics capabilities, including batch processing, stream computing, interactive analytics, and machine learning, along with job scheduling and management capabilities.
  6. A data lake supports comprehensive data lifecycle management. In addition to raw data, a data lake stores the intermediate results of analytics and processing and keeps complete records on these processes. This helps you trace the entire production process of any data record.
  7. A data lake provides comprehensive capabilities for data retrieval and publishing. A data lake supports a wide variety of data sources. It retrieves full and incremental data from data sources and stores the retrieved data in a standard manner. A data lake pushes the results of data analytics and processing to appropriate storage engines, which support access from different applications.
  8. A data lake provides big data capabilities, including the ultra-large storage space and scalability needed to process data on a large scale.

In short, a data lake is an evolving and scalable infrastructure for big data storage, processing, and analytics. Oriented toward data, a data lake can retrieve and store full data of any type and source at any speed and scale. It processes data in multiple modes and manages data throughout its lifecycle. It also supports enterprise applications by interacting and integrating with a variety of disparate external data sources.

Figure 1: A schematic drawing of a data lake's basic capabilities

Note the following two points:

  1. Scalability means a data lake is scalable in terms of size and capabilities. Specifically, a data lake not only provides sufficient storage and computing capabilities for an increasing amount of data but also constantly provides new data processing models to meet emerging needs. Business demands always evolve along with business growth. For example, we have seen how business needs have evolved from batch processing to interactive and instant analytics and then to real-time analytics and machine learning.
  2. Data orientation means a data lake is simple and easy to use, helping you focus on your business, models, algorithms, and data without having to work on a complex IT infrastructure. Data lakes are designed for data scientists and analysts. Currently, a cloud-native architecture is the ideal way to build a data lake. This view is discussed in detail in Section 3: The Basic Architecture of a Data Lake.

2. The Basic Characteristics of a Data Lake

This section introduces the basic characteristics of a data lake, especially the characteristics that differentiate a data lake from a big data platform or a traditional data warehouse. First, let's take a look at a comparison table from the AWS website.

[Table: Comparison of a data warehouse and a data lake, adapted from the AWS website]

The preceding table contrasts a data lake with a traditional data warehouse. We can further analyze the characteristics of a data lake in terms of data and computing:

1.  Data Fidelity: A data lake stores data exactly as it exists in the business system. Unlike a data warehouse, a data lake keeps raw data whose format, schema, and content are not modified. The stored data can be of any format and any type.

2.  Data Flexibility: As shown in the "Schema" row of the preceding table, schema-on-write or schema-on-read indicates the phase in which the data schema is designed. A schema is essential for any data application. Even schema-less databases, such as MongoDB, recommend using identical or similar structures as a best practice. Schema-on-write means a schema for data importing is determined based on a specific business access mode before data is written. This enables effective adaptation between data and your businesses but increases the cost of data warehouse maintenance at the early stage. You may be unable to flexibly use your data warehouse if you do not have a clear business model for your start-up.

A data lake adopts schema-on-read, meaning it treats business uncertainty as the norm and can adapt to unpredictable business changes (a minimal schema-on-read code sketch follows item 6 below). You can design a data schema in any phase as needed, so the entire infrastructure generates data that meets your business needs. Fidelity and flexibility are closely related to each other. Since business changes are unpredictable, you can always keep data as-is and process data as needed. Therefore, a data lake is more suitable for innovative enterprises and enterprises with rapid business changes and growth. A data lake is intended for data scientists and business analysts, who usually need highly efficient data processing and analytics and prefer to use visual tools.

3.  Data Manageability: A data lake provides comprehensive data management capabilities. Due to its fidelity and flexibility, a data lake stores at least two types of data: raw data and processed data. The stored data constantly accumulates and evolves. This requires robust data management capabilities, which cover data sources, data connections, data formats, and data schemas. A data schema includes a database and related tables, columns, and rows. A data lake provides centralized storage for the data of an enterprise or organization. This requires permission management capabilities.

4.  Data Traceability: A data lake stores the full data of an organization or enterprise and manages the stored data throughout its lifecycle, from data definition, access, and storage to processing, analytics, and application. A robust data lake fully reproduces the data production process and data flow, ensuring that each data record is traceable through the processes of access, storage, processing, and consumption.

A data lake requires a wide range of computing capabilities to meet your business needs.

5.  Rich Computing Engines: A data lake supports a diversity of computing engines, including batch processing, stream computing, interactive analytics, and machine learning engines. Batch processing engines are used for data loading, conversion, and processing. Stream computing engines are used for real-time computing. Interactive analytics engines are used for exploratory analytics. The combination of big data and artificial intelligence (AI) has given rise to a variety of machine learning and deep learning algorithms. For example, TensorFlow and PyTorch can be trained on sample data from the Hadoop Distributed File System (HDFS), Amazon S3, or Alibaba Cloud Object Storage Service (OSS). Therefore, a qualified data lake project should provide support for scalable and pluggable computing engines.

6.  Multi-Modal Storage Engine: In theory, a data lake should provide a built-in multi-modal storage engine to enable data access by different applications, while considering a series of factors, such as the response time (RT), concurrency, access frequency, and costs. However, in reality, the data stored in a data lake is not frequently accessed, and data lake-related applications are still in the exploration stage. To strike a balance between cost and performance, a data lake is typically built by using relatively inexpensive storage engines, such as Amazon S3, Alibaba Cloud OSS, HDFS, or Object-Based Storage (OBS). When necessary, a data lake can collaborate with external storage engines to meet the needs of various applications.
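To make schema-on-read (discussed in item 2 above) concrete, here is a minimal PySpark sketch. The bucket path, field names, and OSS connector setup are assumptions for illustration, not part of any specific vendor's solution: raw JSON events stay untouched in the lake, and a schema is applied only when the data is read.

```python
# A minimal schema-on-read sketch with PySpark. The oss:// path assumes an
# object storage connector is configured; paths and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Today's reading of the raw events: (user_id, event_type, event_time).
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", StringType()),
])

# The same raw JSON files can be re-read later with a different schema if the
# business model changes -- nothing was decided or lost at write time.
events = spark.read.schema(event_schema).json("oss://my-bucket/raw/events/")
events.groupBy("event_type").count().show()
```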

3. The Basic Architecture of a Data Lake

A data lake is a next-generation big data infrastructure. First, let's take a look at the evolution of the big data infrastructure.

Phase 1: The offline data processing infrastructure, represented by Hadoop. As shown in Figure 2, Hadoop is a batch data processing infrastructure that uses HDFS as its core storage and MapReduce (MR) as its basic computing model. A series of components have been developed around HDFS and MR, such as HBase for online key-value (KV) access, Hive for SQL, and Pig for workflows, continuously improving big data platforms' data processing capabilities. New computing models were constantly proposed to meet the increasing demand for batch processing performance, resulting in computing engines such as Tez, Spark, and Presto, and the MR model evolved into the directed acyclic graph (DAG) model. The DAG model raises the level of abstraction and increases concurrency: a job is split into logical stages at aggregation (shuffle) boundaries, and each stage consists of one or more tasks that run in parallel. To reduce the frequency of writing intermediate results to disk, computing engines such as Spark and Presto cache data in the memory of compute nodes whenever possible, improving data processing efficiency and system throughput.

Figure 2: Hadoop Architecture Diagram
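The following PySpark sketch illustrates the DAG model and in-memory caching described above. The input path and column names are hypothetical; the point is that the shuffle introduced by groupBy creates a stage boundary, and cache() lets two downstream jobs reuse the intermediate result without recomputing it.

```python
# A small sketch of the DAG model: a wide (shuffle) operation splits the job
# into stages, and cache() keeps an intermediate result in executor memory
# instead of writing it back to disk. The input path is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

logs = spark.read.parquet("oss://my-bucket/raw/access_logs/")

# Stage boundary: groupBy triggers a shuffle.
per_user = (logs.groupBy("user_id")
                .count()
                .withColumnRenamed("count", "pv")
                .cache())

# Two downstream jobs reuse the cached intermediate result.
per_user.filter("pv > 100").count()
per_user.agg(F.avg("pv")).show()
```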

Phase 2: Lambda architecture. As data processing capabilities and demands evolved, it became clear that no amount of improvement in batch processing performance could deliver the real-time results certain scenarios required. This problem was solved by stream computing engines, such as Storm, Spark Streaming, and Flink. Batch processing is combined with stream computing to meet the needs of many emerging applications. The Lambda architecture exposes a unified view of the results returned by batch processing and stream computing, so applications do not have to concern themselves with which underlying computing model is used. Figure 3 shows the Lambda architecture. The Lambda and Kappa architecture diagrams were sourced from the Internet.

Figure 3: Lambda Architecture Diagram

The Lambda architecture integrates stream computing and batch processing. Data flows through the Lambda platform from left to right, as shown in Figure 3. The incoming data is divided into two parts. One part is subject to batch processing, and the other part is subject to stream computing. The final results of batch processing and stream computing are provided to applications through the service layer, ensuring access consistency.

Phase 3: Kappa architecture. The Lambda architecture allows applications to read data consistently. However, the separation of batch processing and stream computing complicates research and development. Is there a single system to solve all these problems? A common practice is to use stream computing, which features an inherent and highly scalable distributed architecture. The two computing models of batch processing and stream computing are unified by improving the stream computing concurrency and increasing the time window of streaming data.

Figure 4: Kappa Architecture Diagram
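A hedged sketch of the Kappa idea with Spark Structured Streaming: one streaming job with a wide event-time window produces what a nightly batch job would otherwise compute, so a single engine covers both paths. The built-in rate source stands in for a replayable log such as Kafka; in production you would read from that log instead.

```python
# Kappa-style sketch: widen the window and the streaming job subsumes the
# batch layer. The "rate" source is synthetic (columns: timestamp, value).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kappa-demo").getOrCreate()

stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load())

# A one-day event-time window produces what a nightly batch job used to.
daily_counts = (stream
                .groupBy(F.window("timestamp", "1 day"))
                .count())

query = (daily_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
# query.awaitTermination()  # uncomment to keep the job running
```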

In short, the big data infrastructure has evolved from the Hadoop architecture to the Lambda and Kappa architectures. Big data platforms process the full data of an enterprise or organization while providing a full range of data processing capabilities to meet application needs. In current enterprise practice, each business system typically keeps its own data in a relational database, while other data is stored on a big data platform for unified processing. This big data infrastructure focuses on storage and computing but largely ignores data asset management, which is exactly the consideration a data lake is designed around.

Once, I read an interesting article that raised this question: Why do we use the term "data lake" instead of data river or data sea? I would like to answer this question by making the following points:

  1. A river flows freely and eventually converges with the sea. Enterprise data accumulates over a long period of time, which is analogous to how rain fills a lake. Lakes are naturally stratified to adapt to different ecosystems. This can be compared to the scenario where an enterprise builds a unified data center to store managed data at different layers. Hot data is stored at the data center's upper layer for easy access by applications. Warm and cold data are stored in different storage media in the data center. This achieves a balance between data storage capacity and cost.
  2. A sea appears boundless whereas a lake is clearly bounded. The boundary of a lake is analogous to an enterprise or organization's business boundary. Therefore, a data lake requires sufficient data and permission management capabilities.
  3. A data lake requires fine-grained management. A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended applications or provides little value.

As the big data infrastructure evolves, enterprises and organizations manage data as an important asset type. To make better use of data, enterprises and organizations must take the following measures to manage data assets:

  1. Store data assets as-is over a long term
  2. Implement effective management and centralized governance of data assets
  3. Provide multi-modal computing capabilities to meet data processing needs
  4. Provide unified data views, data schemas, and data processing results for businesses

A data lake not only provides the basic capabilities of a big data platform, but also data management, data governance, and data asset management capabilities. To implement these capabilities, a data lake provides a series of data management components, including data access, data migration, data governance, quality management, asset catalog, access control, task management, task orchestration, and metadata management. Figure 5 shows the reference architecture of a data lake system. Similar to a big data platform, a typical data lake provides the storage and computing capabilities needed to process data at an ultra-large scale, as well as multi-modal data processing capabilities. In addition, a data lake provides the following more sophisticated data management capabilities:

  1. Improved Data Access Capabilities: A data lake provides data access capabilities that allow you to define and manage a variety of disparate external data sources and extract and migrate data from these sources. The extracted and migrated data can include metadata from external sources and actually stored data.
  2. Improved Data Management Capabilities: A data lake provides basic and extended data management capabilities. Basic data management capabilities are required of any data lake and include metadata management, data access control, and data asset management. Section 4 discusses how these basic capabilities are provided in Alibaba Cloud's data lake solution. Extended data management capabilities include job management, process orchestration, and capabilities related to data quality and data governance. Job management and process orchestration allow you to manage, orchestrate, schedule, and monitor the jobs that process data in the data lake; these capabilities are typically obtained by purchasing or customizing data integration and data development subsystems or modules from the data lake vendor, and such custom subsystems can be integrated with the data lake by reading its metadata. Data quality and data governance capabilities are complex and not directly provided by the data lake system. However, the data lake system provides interfaces or metadata that allow capable enterprises and organizations to integrate it with existing data governance software or to build custom governance capabilities.
  3. Shared Metadata: A data lake provides metadata as the basis for integrating all of its computing engines with the stored data. In an effective data lake system, computing engines directly retrieve information from metadata while processing data. Such information includes the data storage location, data format, data schema, and data distribution and is directly used for data processing without the need for manual intervention or programming. In addition, an effective data lake system controls access to the stored data at the levels of the database, table, column, and row.
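The sketch below shows what shared metadata looks like in practice with Spark and a Hive-compatible metastore, which is one common way to implement it; the database and table names are hypothetical. The engine asks the catalog for table locations, formats, and schemas instead of hard-coding them in every job.

```python
# A minimal shared-metadata sketch: the engine discovers tables, schemas, and
# storage locations from the catalog rather than from job code.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("catalog-demo")
         .enableHiveSupport()   # assumes a shared, Hive-compatible metastore
         .getOrCreate())

# Discover what the lake already knows about.
for table in spark.catalog.listTables("lake_db"):
    print(table.name, table.tableType)

# Location, format, and schema come from metadata, not from the job.
spark.sql("DESCRIBE FORMATTED lake_db.user_events").show(truncate=False)
spark.table("lake_db.user_events").groupBy("event_type").count().show()
```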

Figure 5: The Reference Architecture of Data Lake Components

The centralized storage shown in Figure 5 is a business-related concept. It provides a unified area for the storage of the internal data of an enterprise or organization. A data lake uses a scalable distributed file system for storage. Most data lake practices recommend using distributed systems, such as Amazon S3, Alibaba Cloud OSS, OBS, and HDFS, as the data lake's unified storage.

Figure 6 illustrates the overall data lifecycle in a data lake. In theory, a well-managed data lake retains raw data permanently, while constantly improving and evolving process data to meet your business needs.

Figure 6: Data Lifecycle in a Data Lake

4. Alibaba Cloud's Data Lake Solution

Alibaba Cloud provides a wide range of data products. I currently work in the data business unit. In this section, I will focus on how to build a data lake using the products of the data business unit. Other cloud products may also be involved. Alibaba Cloud's data lake solution is specially designed for data lake analytics and federated analytics. It is based on Alibaba Cloud's database products. Figure 12 illustrates Alibaba Cloud's data lake solution.

Figure 12: Data Lake Solution Provided by Alibaba Cloud

The solution uses Alibaba Cloud OSS as the data lake's centralized storage. The solution can use all Alibaba Cloud databases as data sources, including online transaction processing (OLTP), OLAP, and NoSQL databases. Alibaba Cloud's data lake solution provides the following key features:

  1. Data Access and Migration: Alibaba Cloud's Data Lake Analytics (DLA) provides the Formation component, which is capable of metadata discovery and one-click data lake setup. Currently, one-click full setup is supported, while binlog-based incremental setup is under development and expected to go online soon. The incremental setup capability will significantly improve the freshness of data in the lake and minimize the load on the source business databases. DLA Formation is an internal component and is not exposed externally.
  2. Data Catalog: DLA provides a metadata catalog to centrally manage the data assets inside and outside of a data lake. The metadata catalog provides a unified metadata entry for federated analytics.
  3. Built-In Computing Engines: DLA provides two built-in computing engines, SQL and Spark, both of which are deeply integrated with the metadata catalog and can easily retrieve metadata. Based on its Spark capabilities, DLA supports computing models such as batch processing, stream computing, and machine learning.
  4. In the peripheral ecosystem, DLA implements data access and aggregation across all types of disparate data sources. DLA is deeply integrated with AnalyticDB, a cloud-native data warehouse, allowing it to provide external access capabilities. DLA directly pushes data processing results to AnalyticDB to support real-time, interactive, and complex ad hoc queries. AnalyticDB uses foreign tables to implement data backflow to Alibaba Cloud OSS. DLA streamlines all disparate data sources in Alibaba Cloud to achieve data mobility.
  5. Alibaba Cloud's data lake solution implements data integration and development by using two methods: DataWorks and Data Management Service (DMS). Both methods provide visual process orchestration, task scheduling, and task management capabilities to external users. DataWorks provides sophisticated data map capabilities for data lifecycle management.
  6. DMS provides powerful data management and data security capabilities. DMS manages data at four granularities: database, table, column, and row, giving enterprises the data security controls they require. In addition to fine-grained permission management, DMS deeply refines data lake O&M and development by applying database-style DevOps practices to the data lake field.

Figure 13 further refines the data application architecture of Alibaba Cloud's data lake solution.

Figure 13: Data Application Architecture of Alibaba Cloud's Data Lake Solution

Data flows from left to right. Data producers produce all types of data, including on-premises data, off-premises data, and data from other clouds, and use tools to upload the produced data to generic or standard data sources, including Alibaba Cloud OSS, HDFS, and databases. DLA implements data discovery, data access, and data migration to build a complete data lake that is adaptable to all types of data sources. DLA processes incoming data based on SQL and Spark and externally provides visual data integration and development capabilities based on DataWorks and DMS. To implement external application service capabilities, DLA provides a standard Java database connectivity (JDBC) interface that can be directly connected to all types of report tools and dashboards. Based on Alibaba Cloud's database ecosystem, including OLTP, OLAP, and NoSQL databases, DLA provides SQL-based external data processing capabilities. If your enterprise develops technology stacks based on databases, DLA allows you to implement transformations more easily and at a lower cost.
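As a rough illustration of the JDBC-style access path described above, the Python sketch below sends SQL to the lake through a MySQL-compatible endpoint; the host, credentials, and table name are placeholders, and from Java the standard JDBC driver would play the same role. This is a sketch of the access pattern, not an exact DLA client.

```python
# Consuming lake data through a standard SQL endpoint (hypothetical host,
# credentials, and table). A report tool or dashboard would send the same SQL
# over JDBC.
import pymysql

conn = pymysql.connect(
    host="your-dla-endpoint",   # placeholder endpoint
    port=3306,
    user="analyst",
    password="***",
    database="lake_db",
)
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT channel, COUNT(*) AS clicks
            FROM ad_click_log
            GROUP BY channel
            ORDER BY clicks DESC
            LIMIT 10
        """)
        for channel, clicks in cur.fetchall():
            print(channel, clicks)
finally:
    conn.close()
```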

DLA integrates data lakes and data warehouses on a cloud-native foundation. Traditional enterprise data warehouses are still essential for report applications in the era of big data. However, data warehouses do not support flexible data analytics and processing. Therefore, we recommend deploying a data warehouse as an upper-layer application of a data lake. The data lake is the only place where your enterprise or organization's raw business data is stored. It processes raw data as required by business applications to generate reusable intermediate results. DLA pushes the intermediate results, with a relatively fixed data schema, to the data warehouse, so you can implement business applications based on the data warehouse. DLA is deeply integrated with AnalyticDB in the following two aspects:

  1. DLA and AnalyticDB share the same SQL parsing engine. DLA's SQL syntax is fully compatible with that of AnalyticDB, allowing you to develop applications on the data lake and the data warehouse using the same technology stack.
  2. Both DLA and AnalyticDB inherently support OSS access. DLA uses OSS as its native storage, whereas AnalyticDB provides easy access to OSS's structured data through foreign tables. Foreign tables enable data mobility between DLA and AnalyticDB.
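The following sketch illustrates, in simplified form, the "push intermediate results into the warehouse" step: an aggregate computed over lake data is written into a warehouse table with a fixed schema for report queries. The endpoint, table names, and the single hard-coded row are placeholders; in a real pipeline the rows would come from the lake-side engine, and the backflow could instead go through foreign tables as described above.

```python
# A simplified sketch of pushing lake-side results into the warehouse over a
# MySQL-compatible connection. Endpoints, schemas, and data are hypothetical.
import pymysql

warehouse = pymysql.connect(host="your-adb-endpoint", port=3306,
                            user="etl", password="***", database="dw")
try:
    with warehouse.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS daily_channel_summary (
                stat_date DATE,
                channel   VARCHAR(64),
                clicks    BIGINT,
                PRIMARY KEY (stat_date, channel)
            )
        """)
        # In practice the rows come from the lake-side engine; one illustrative
        # row stands in for that result set here.
        cur.execute("""
            INSERT INTO daily_channel_summary VALUES ('2020-07-01', 'search', 12345)
        """)
    warehouse.commit()
finally:
    warehouse.close()
```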

The combination of DLA and AnalyticDB integrates data lakes and data warehouses in a cloud-native manner. DLA can be viewed as the near-source layer of a scalable data warehouse. Compared with a traditional data warehouse, this near-source layer provides the following advantages:

  1. It stores structured, semi-structured, and unstructured data
  2. It supports access to all types of disparate data sources
  3. It discovers, manages, and synchronizes metadata
  4. It provides the built-in SQL and Spark computing engines, which can process various types of data more effectively
  5. It manages full data throughout its lifecycle.

By integrating DLA and AnalyticDB, you can enjoy the processing capabilities of a big data platform and a data warehouse at the same time.


DLA enables "omnidirectional" data mobility, allowing you to access data in any location just as you would access data in a database, regardless of whether the data is on-premises or off-premises, inside or outside your organization. In addition, the data lake monitors and records inter-system data mobility so you can trace the data flow.

Summary

A data lake is more than a technical platform and can be implemented in many ways. The maturity of a data lake is primarily evaluated based on its data management capabilities and its interworking with peripheral ecosystems. Data management capabilities include capabilities related to metadata, data asset catalogs, data sources, data processing tasks, data lifecycles, data governance, and permission management.

5. Case Studies

5.1 Advertising Data Analytics

In recent years, the cost of traffic acquisition has been increasing, forcing many companies to invest heavily to attract new online customers. Increasing Internet advertising costs have made companies give up the strategy of expanding their customer bases by buying traffic. Frontend traffic optimization no longer works well. An effective way to break out of ineffective online advertising methods is to use data tools to convert more of your website visitors into paying customers and refine ad serving comprehensively. Big data analytics is essential for the conversion from advertising traffic into sales.

To provide a stronger foundation for decision support, you can collect more tracking data, including channels, ad serving times, and target audiences. Then, you can analyze the data based on the click-through rate (CTR) to develop strategies that lead to better performance and higher productivity. Data lake analytics products and solutions are widely favored by advertisers and ad publishers. They provide next-generation technologies to help collect, store, and analyze a wide variety of structured, semi-structured, and unstructured data related to ad serving.

DG is a leading global provider of intelligent marketing services to enterprises looking to expand globally. Based on its advanced advertising technology, big data, and operational capabilities, DG provides customers with high-quality services for user acquisition and traffic-to-sales conversion. When it was founded, DG decided to build its IT infrastructure on a public cloud. Initially, DG chose the AWS cloud platform. It stored its advertising data in a data lake built on Amazon S3 and used Amazon Athena for interactive analytics. However, the rapid development of Internet advertising has created several challenges for the advertising industry. This has given rise to mobile advertising and tracking systems designed to solve the following problems:

  1. Concurrency and Traffic Peaks: Traffic peaks are a frequent occurrence in the advertising industry, where ads can receive tens or even hundreds of thousands of clicks nearly simultaneously. This requires good system scalability to quickly respond to and process each click.
  2. Real-Time Analysis of Massive Data Volumes: A performance monitoring system analyzes the data of each click and activation and transmits related data to downstream media.
  3. Exponential Data Growth on Advertising Platforms: Business log data is continuously generated and uploaded, while impression, click, and push data is constantly processed. About 10 to 15 TB of new data is created each day, which requires high data processing performance. In this context, DG must efficiently collect offline and near real-time advertising statistics and aggregate and analyze the collected statistics based on the needs of customers.

Apart from the preceding three business challenges, DG was also facing rapidly increasing daily data volumes, having to scan more than 100 TB of data daily. The AWS platform no longer provided sufficient bandwidth for Amazon Athena to read data from Amazon S3, and data analytics seriously lagged. To drive down analytics costs resulting from exponential data growth, DG decided on full migration from the AWS platform to the Alibaba Cloud platform after meticulous testing and analysis. Figure 16 shows the architecture of DG's transformed advertising data lake solution.

Figure 16: DG's Transformed Advertising Data Lake Solution

After the migration, we integrated DLA and OSS to provide superior analytics capabilities for DG and allow it to better handle traffic peaks and valleys. On the one hand, this made it easy to perform ad hoc analysis on data collected from brand customers. On the other hand, DLA provides powerful computing capabilities, allowing DG to analyze ad serving on a monthly and quarterly basis, accurately count the campaigns run for each brand, and analyze each campaign's ad performance in terms of media, markets, channels, and data management platforms (DMPs). This allows an intelligent traffic platform to better improve the conversion rate for brand marketing. In terms of the total cost of ownership (TCO) for ad serving and analytics, DLA provides serverless elastic services that are billed in pay-as-you-go mode, with no need to purchase fixed resources. Customers can purchase resources based on the peaks and valleys of their businesses. This meets the needs of elastic analytics and significantly lowers O&M and operational costs.

Figure 17: The Deployment of a Data Lake

Overall, DG was able to significantly lower its hardware, labor, and development costs after migrating from AWS to Alibaba Cloud. By using DLA's serverless cloud services, DG avoided large upfront payments for servers, storage, and other devices and did not have to purchase large amounts of cloud services all at once. Instead, DG can scale out its infrastructure as needed, adding servers during business peaks and removing them during business valleys, improving its capital utilization. The Alibaba Cloud platform also empowered DG with improved performance. DG's mobile advertising system frequently encountered exponential increases in traffic volume during its rapid business growth and the introduction of multiple business lines. After DG migrated to Alibaba Cloud, Alibaba Cloud's DLA team worked with the OSS team to implement deep optimizations and transformations that significantly improved DG's analytics performance through DLA. The DLA computing engine dedicated to database analytics and the AnalyticDB shared computing engine, which ranked first in the TPC Benchmark DS (TPC-DS), improve performance dozens of times over compared with the native Presto computing engine.

5.2 Game Operation Analytics

A data lake is a type of big data infrastructure with excellent total cost of ownership (TCO). For many fast-growing game companies, a popular game often results in extremely fast data growth in a short time. When this happens, it is difficult for R&D personnel to adapt their technology stacks to the amount and speed of data growth, and it is also difficult to put data that grows at such a rate to good use. A data lake is a technical solution that can solve these problems.

YJ is a fast-growing game company. It plans to develop and operate games based on an in-depth analysis of user behavior data. The core logic behind this data-driven approach is as follows. As the gaming industry becomes more competitive, gamers demand higher-quality products and the lifecycles of game projects become increasingly short, which directly affects projects' return on investment (ROI). Through data operations, developers can effectively extend their project lifecycles and precisely control the various business stages. In addition, traffic costs are constantly increasing. Therefore, it is increasingly important to create an economical and efficient precision data operations system to better support business development. A company relies on its technical decision makers to select an appropriate infrastructure to support its data operations system by considering the following factors:

  1. High Elasticity: To adapt to short-term explosive data growth, a game company must have highly elastic computing and storage systems.
  2. High Cost-Performance: User behavior data is typically analyzed over a long time frame. For example, the customer retention rate is often analyzed over a time frame of 90-180 days. Therefore, it is important to find the most cost-efficient way to store massive data volumes.
  3. Sufficient Analytic Capabilities and Scalability: In many cases, user behaviors are reflected in tracking data, and tracking data is usually analyzed in association with structured data, such as user registration information, login information, and bills. Analyzing such data requires at least the following capabilities: big data ETL capabilities, disparate data source access capabilities, and modeling capabilities for complex analytics.
  4. Compatibility With the Company's Existing Technology Stacks and Consideration of Future Recruitment: YJ's technology selection is primarily based on the technology stacks familiar to its technical personnel. At YJ, most members of the technical team are only familiar with traditional database development, namely MySQL. Moreover, YJ has a talent shortage: only one technical engineer works on data operations analytics. Therefore, YJ cannot independently build a big data analytics infrastructure in a short time. For YJ, the ideal situation is for the majority of the data analytics work to be done in SQL, since SQL developers far outnumber big data developers in the talent market. We helped YJ transform its data analytics solution.

Figure 18: YJ's Data Analytics Solution Before the Transformation

Before the transformation, YJ stored all its structured data in a max-specification MySQL database. Gamer behavior data was collected by Logtail in Log Service (SLS) and then shipped to OSS and Elasticsearch. This architecture had the following problems:

  1. Behavior data and structured data were separated and could not be associated for analysis
  2. Intelligent behavior data retrieval was supported, but deep data mining and analytics services were not
  3. OSS was only used for data storage, and its deeper data value was not utilized

Our analysis showed that YJ's architecture was already a prototype of a data lake because its full data was stored in OSS. A simple way to transform the architecture was to improve YJ's ability to analyze the data in OSS. A data lake would support SQL-based data processing, allowing YJ to keep developing on its existing SQL-centric technology stack. In short, we transformed YJ's architecture to help it build a data lake, as shown in Figure 19.

Figure 19: YJ's Data Lake Solution After the Transformation

While retaining the original data flow, the data lake solution adds DLA for secondary processing of the data stored in OSS. DLA provides a standard SQL computing engine and supports access to various disparate data sources. The DLA-processed data can be directly used by businesses. The data lake solution introduces AnalyticDB, a cloud-native data warehouse, to support the low-latency interactive analytics that cannot be implemented by DLA alone. The solution also introduces Quick BI in the frontend for visual analysis. The solution shown in Figure 19 is a classic implementation of data lake-data warehouse integration in the gaming industry.
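To illustrate why the transformed architecture matters, here is a hedged sketch of the kind of federated analysis it enables, written with open-source PySpark rather than DLA itself: tracking data that SLS shipped to OSS is joined with structured data still living in MySQL. Paths, table names, and connection details are hypothetical, and the MySQL JDBC driver is assumed to be available to Spark.

```python
# Federated-analysis sketch: behavior events from object storage joined with
# structured data read from MySQL over JDBC. All identifiers are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("game-ops-demo").getOrCreate()

# Gamer behavior events archived in OSS by the log pipeline.
events = spark.read.json("oss://yj-lake/sls-shipped/player_events/")

# Structured data still living in MySQL (assumes the JDBC driver is on the
# Spark classpath).
players = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://mysql-host:3306/game")
           .option("driver", "com.mysql.cj.jdbc.Driver")
           .option("dbtable", "player_profile")
           .option("user", "reader").option("password", "***")
           .load())

# A retention-style question neither store could answer on its own.
(events.join(players, "player_id")
       .groupBy("register_channel")
       .count()
       .show())
```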

5.3 SaaS-Based Data Intelligence Services

YM is a data intelligence service provider. It provides data analytics and operations services to small and medium-sized merchants. Figure 20 shows the Software as a Service (SaaS) model of YM's data intelligence services.

Figure 20: SaaS Model of YM's Data Intelligence Services

The platform provides multi-client SDKs for merchants to access tracking data in diverse forms, such as webpages, apps, and mini programs. The platform also provides unified data access and analytics services in SaaS mode. Merchants can analyze this tracking data at a fine granularity through data analytics services. The analyzed data can be used for basic analytics functions, such as behavior statistics, customer profiling, customer selection, and ad serving monitoring. However, this SaaS model has the following problems:

  1. The platform cannot provide a full range of SaaS-based analytics functions to meet the various customization needs of all types of merchants. For example, some merchants focus on sales, some focus on customer management, and others focus on cost optimization.
  2. Unified data analytics services cannot support certain advanced analytics functions, such as user-defined extensions and customer selection based on custom labels, especially custom labels that depend on merchant-defined algorithms.
  3. Deep support for data asset management has yet to be achieved. In the era of big data, data has become a type of asset possessed by an enterprise or organization. The SaaS model has to figure out a way to appropriately accumulate merchant-owned data over the long term.

Therefore, we introduced a data lake to the SaaS model shown in Figure 20 to provide an infrastructure for data accumulation, modeling, and operations analytics. Figure 21 shows a SaaS-based data intelligence service model supported by a data lake.

Figure 21: SaaS-Based Data Intelligence Service Model Supported by a Data Lake

As shown in Figure 21, the platform allows each merchant to build its own data lake in one click. The merchant can synchronize its full tracking data and the data schema to the data lake, and also archive daily incremental data to the data lake in T+1 mode (a minimal archiving sketch follows the list below). The data lake-based service model not only provides traditional data analytics services, but also three major capabilities: data asset management, analytic modeling, and service customization.

  1. Data Asset Management: A data lake allows merchants to accumulate data based on a custom retention period, which allows them to control costs. The data lake provides data asset management capabilities, allowing merchants to both manage their raw data and classify and store process data and result data after data processing. This maximizes the value of tracking data.
  2. Analytic Modeling: A data lake stores the tracking data schema in addition to raw data. The tracking data schema is an abstraction of the business logic by the global data intelligence service platform. A data lake exports raw data as a type of asset along with the data schema. Merchants can use the tracking data schema to deeply analyze the user behavior logic behind tracking data and gain deep insights into user behavior. Ultimately, this will allow them to better identify user needs.
  3. Service Customization: A data lake supports data integration and development, as well as the export of the tracking data schema. By analyzing this schema, merchants can continuously process raw data in a custom and iterative manner to extract valuable information from it. This helps merchants get more value than they would from data analytics services alone.
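A minimal sketch of the T+1 archiving mentioned above, under assumed bucket layouts and field names: each day, the previous day's incremental tracking data for a merchant is appended to that merchant's own lake prefix as a date partition.

```python
# T+1 archiving sketch: yesterday's increment for one merchant is appended to
# the merchant's lake prefix as a date partition. Paths are hypothetical and
# assume an object storage connector is configured.
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("t-plus-1-archive").getOrCreate()

yesterday = (date.today() - timedelta(days=1)).isoformat()
merchant_id = "m_0001"  # hypothetical tenant

increment = spark.read.json(
    f"oss://platform-staging/tracking/dt={yesterday}/{merchant_id}/")

(increment.write
          .mode("append")
          .parquet(f"oss://lake-{merchant_id}/tracking/dt={yesterday}/"))
```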

6. Basic Data Lake Construction Process

A data lake is a more sophisticated big data processing infrastructure than a traditional big data platform. It is a technology that is better adapted to customers' businesses. A data lake provides more features than a big data platform, such as metadata, data asset catalogs, permission management, data lifecycle management, data integration and development, data governance, and quality management. All these features are easy to use and allow the data lake to better meet business needs. To meet your business needs at an optimal TCO, data lakes provide some basic technical features, such as independently scalable storage and computing, a unified storage engine, and multi-modal computing engines.

The data lake setup process is business-oriented and different from the process of building a data warehouse or data mid-end, which are also popular technologies. The difference is that a data lake is built in a more agile manner: it is operational during setup and manageable during use. To better understand the agility of the data lake setup, let's first review the process of building a data warehouse. A data warehouse can be built using a top-down or a bottom-up approach, proposed by Bill Inmon and Ralph Kimball, respectively. The two approaches are briefly summarized here:

1.  Bill Inmon proposed the top-down approach, in which a data warehouse is built using the enterprise data warehouse and data mart (EDW-DM) model. ETL tools are used to transfer data from the data sources of an operational or transactional system to the data warehouse's operational data store (ODS). Data in the ODS is processed based on the predefined EDW paradigm and then transferred to the EDW. An EDW is an enterprise- or organization-wide generic data schema, which is not suitable for direct data analytics by upper-layer applications. Therefore, business departments build DMs on top of the EDW.

A DM is easy to maintain and highly integrated, but it lacks flexibility once its structure is determined and takes a long time to deploy due to the need to adapt to your business needs. The Inmon model is suitable for building data warehouses for relatively mature businesses, such as finance.

2.  Ralph Kimball proposed the bottom-up (DM-DW) approach. Data from the data sources of an operational or transactional system is extracted or loaded into the ODS. Data in the ODS is used to build multidimensional, subject-oriented DMs by using the dimensional modeling method. All DMs are associated through conformed dimensions to form an enterprise- or organization-wide generic data warehouse.

The bottom-up (DM-DW) approach provides a fast warehouse setup, quick ROI, and agility. However, it is difficult to maintain as an enterprise resource, its structure is complex, and its DMs are difficult to integrate. The bottom-up (DM-DW) approach applies to small and medium-sized enterprises and Internet companies.

The division of implementation into top-down and bottom-up approaches is only theoretical. Whether you build an EDW or a DM first, you need to analyze data and design a data schema before building a data warehouse or data mid-end. Figure 22 illustrates the basic process of building a data warehouse or data mid-end.

Figure 22: Basic Process of Building a Data Warehouse or Data Mid-End

  1. Data Analysis: Before building a data lake, your enterprise or organization must comprehensively analyze and survey its internal data, including the data sources, data types, data forms, data schemas, total data volume, and incremental data volume. This allows you to sort out the organizational structure and clarify its relationship with data. This also helps you clarify the data lake's user roles, permission design, and service methods.
  2. Data Schema Abstraction: Sort and classify data based on your enterprise or organization's business characteristics, divide the data by domain to generate metadata for data management, and build a generic data schema based on the metadata.
  3. Data Access: Determine the data sources to be accessed based on the data analysis results in Step 1. Then, determine the required data access technology and capabilities based on the data sources. The accessed data must at least include the data source metadata, raw data metadata, and raw data. Store data by type based on the results in Step 2.
  4. Converged Governance: Use the computing engines provided by the data lake to process data to generate intermediate data and result data. Then, store the data appropriately. The data lake provides comprehensive capabilities for data development, task management and scheduling, and detailed recording of data processing. Data governance requires more data schemas and metric models.
  5. Business Support: Based on the generic data schema, each business department must customize a fine-grained data schema, data usage process, and data access service.

For a fast-growing Internet company, the preceding steps are cumbersome and difficult to implement, especially data schema abstraction. In many cases, business is conducted through trial-and-error exploration without a clear direction. This makes it impossible to build a generic data schema, without which the subsequent steps become unfeasible. This is one of the reasons why many fast-growing startups find it difficult to build a data warehouse or data mid-end that meets their needs.

In contrast, a data lake can be built in an agile manner. We recommend building a data lake according to the following procedure.

Figure 23: Basic Data Lake Construction Process

Compared with Figure 22, which illustrates the basic process of building a data warehouse or data mid-end, Figure 23 illustrates a five-step process for building a data lake in a simpler and more feasible manner:

1.  Data Analysis: Analyze basic information about the data, including the data sources, data types, data forms, data schemas, total data volume, and incremental data volumes. That is all that needs to be done in the data analysis step. A data lake stores full raw data, so it is unnecessary to perform in-depth design in advance.

2.  Technology Selection: Select the technologies used to build the data lake based on the data analysis results. The industry already has many universal practices for data lake technology selection. You should follow three basic principles: separation of computing and storage, elasticity, and independent scaling. We recommend using a distributed object storage system, such as Amazon S3, Alibaba Cloud OSS, or OBS, and selecting computing engines based on your batch processing needs and SQL processing capabilities. In practice, batch processing and SQL processing are essential for data processing; stream computing engines are discussed later. We also recommend using serverless computing and storage technologies, which can evolve with your applications to meet future needs. You can build a dedicated cluster when you need an independent resource pool.
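The sketch below shows what the separation-of-computing-and-storage principle looks like in a Spark configuration: the cluster holds no data of its own and reaches the object store through a connector. The endpoint and credential settings shown are the standard Hadoop S3A keys; an OSS or OBS connector would use analogous options, and the bucket and values are placeholders.

```python
# Compute is stateless and disposable; the data stays in the bucket.
# The fs.s3a.* keys are standard Hadoop S3A options; values are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-tech-selection-demo")
         .config("spark.hadoop.fs.s3a.endpoint", "s3.example-region.amazonaws.com")
         .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
         .getOrCreate())

# The cluster can be resized or torn down at any time without data loss.
df = spark.read.parquet("s3a://company-lake/raw/orders/")
df.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) FROM orders").show()
```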

3.  Data Access: Determine the data sources to be accessed and complete full data extraction and incremental data access.

4.  Application Governance: Application governance is the key to a data lake. Data applications in the data lake are closely related to data governance. Clarify your needs based on data applications, and generate business-adapted data during the data ETL process. At the same time, create a data schema, metrics system, and quality standards. A data lake focuses on raw data storage and exploratory data analytics and applications. However, this does not mean the data lake does not require any data schema. Business insights and abstraction can significantly promote the development and application of data lakes. Data lake technology enables agile data processing and modeling, helping you quickly adapt to business growth and changes.

From the technical perspective, a data lake is different from a big data platform in that the former provides sophisticated capabilities to support full-lifecycle data management and applications. These capabilities include data management, category management, process orchestration, task scheduling, data traceability, data governance, quality management, and permission management. In terms of computing power, current mainstream data lake solutions support SQL batch processing and programmable batch processing. You can use the built-in capabilities of Spark or Flink to support machine learning. Almost all processing paradigms use the DAG-based workflow model and provide corresponding integrated development environments. Support for stream computing varies in different data lake solutions. First, let's take a look at existing stream computing models:

1. **Real-Time Stream Computing:** This model processes data as it arrives, either record by record or in mini batches. It is often applied in online businesses, such as risk control, recommendations, and alerting.

2. **Quasi-Stream Computing:** This model is used to retrieve data changed after a specified time point, read data of a specific version, or read the latest data. It is often applied in exploratory data applications, such as for the analysis of daily active users, retention rates, and conversion rates.

When the real-time stream computing model is processing data, the data is still in the network or memory but has not yet been stored in the data lake. The quasi-stream computing model only processes data that is already stored in the data lake. I recommend using the stream computing model illustrated in Figure 24.

Figure 24: Data Flow in a Data Lake

As shown in Figure 24, to apply the real-time stream computing model to the data lake, you can introduce Kafka-like middleware as the data forwarding infrastructure. A comprehensive data lake solution can direct the raw data flow to Kafka. A stream engine reads data from the Kafka-like component. Then, the engine writes the data processing results to OSS, a relational database management system (RDBMS), NoSQL database, or data warehouse as needed. The processed data can be accessed by applications. In a sense, you can introduce a real-time stream computing engine to the data lake based on your application needs. Note the following points when using a stream computing engine (a minimal sketch follows these points):

1. The stream engine must be able to conveniently read data from the data lake.

2. Stream engine tasks must be included in task management in the data lake.

3. Streaming tasks must be included in unified permission management.
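Here is a hedged sketch of the real-time path in Figure 24 with Spark Structured Streaming: the engine reads the raw event stream from a Kafka-like component and lands it in the lake's object storage, with a checkpoint so the task can be managed and resumed. Brokers, topic, and paths are hypothetical, and the Spark Kafka connector package is assumed to be on the classpath.

```python
# Real-time ingest sketch: Kafka -> stream engine -> object storage, with a
# checkpoint for task management and recovery. All identifiers are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-ingest-demo").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "raw-events")
       .load())

events = raw.select(F.col("value").cast("string").alias("payload"),
                    F.col("timestamp"))

query = (events.writeStream
         .format("parquet")
         .option("path", "oss://company-lake/streaming/events/")
         .option("checkpointLocation", "oss://company-lake/checkpoints/events/")
         .outputMode("append")
         .start())
# query.awaitTermination()  # uncomment to keep the ingestion job running
```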

The quasi-stream computing model is similar to batch processing. Many big data components, such as Apache Hudi, Apache Iceberg, and Delta Lake, work with classic computing engines, such as Spark and Presto. For example, Apache Hudi provides special table types, such as copy-on-write (COW) and merge-on-read (MOR), that allow you to access snapshot data (of a specified version), incremental data, and quasi-real-time data. AWS and Tencent have integrated Apache Hudi into their EMR services, and Alibaba Cloud DLA is also planning to introduce a DLA on Apache Hudi capability.
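For the quasi-stream model, the sketch below uses Apache Hudi on Spark: changed records are upserted into a Hudi table on object storage, and a downstream job reads only the increment since a given commit time. Paths, field names, and the commit timestamp are hypothetical, the Hudi Spark bundle (and its required Spark settings) is assumed to be available, and exact option names may vary across Hudi versions.

```python
# Quasi-stream sketch with Apache Hudi: upsert changed records, then read only
# the increment after a given commit instant. Identifiers are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-demo").getOrCreate()

updates = spark.read.json("oss://company-lake/staging/user_events_delta/")

(updates.write.format("hudi")
        .option("hoodie.table.name", "user_events")
        .option("hoodie.datasource.write.recordkey.field", "event_id")
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .option("hoodie.datasource.write.precombine.field", "event_time")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("oss://company-lake/hudi/user_events/"))

# Incremental read: only records committed after the given instant.
increment = (spark.read.format("hudi")
             .option("hoodie.datasource.query.type", "incremental")
             .option("hoodie.datasource.read.begin.instanttime", "20200701000000")
             .load("oss://company-lake/hudi/user_events/"))
increment.groupBy("event_type").count().show()
```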

As mentioned above, data scientists and data analysts are the primary users of data lakes, mainly for exploratory analytics and machine learning. Real-time stream computing is typically applied to online businesses but is not essential for data lake users. The real-time stream computing model is essential for the online businesses of many Internet companies. A data lake not only provides centralized storage for an enterprise or organization's internal data but also a scalable architecture to integrate stream computing capabilities.

5.  Business Support: Many data lake solutions provide standard access interfaces, such as JDBC, to external users. You can also use business intelligence (BI) report tools and dashboards to directly access the data in a data lake. In practice, we recommend pushing the data processed in a data lake to data engines that support online businesses to improve the application experience.

7. Summary

A data lake is the infrastructure for next-generation big data analytics and processing. It provides richer functions than a big data platform. Data lake solutions are likely to evolve in the following directions in the future:

  1. Cloud-Native Architecture: The cloud-native architecture has not yet been defined in a unified manner. However, we can summarize the following three key features of a data lake-dedicated cloud-native architecture:

    1. Separation of storage and computing, both of which are independently scalable

    2. Support for multi-modal computing engines, such as SQL, batch processing, stream computing, and machine learning

    3. Provision of serverless services, which are elastic and billed in pay-as-you-go mode

  2. Sufficient Data Management Capabilities: A data lake provides robust data management capabilities, including data source management, data category management, processing flow orchestration, task scheduling, data traceability, data governance, quality management, and permission management.
  3. Big Data Capabilities and Database-Like Experience: Currently, most data analysts only have experience using databases. Big data platforms are not user-friendly though they provide robust capabilities. Data scientists and data analysts generally focus on data, algorithms, models, and their adaptation to business scenarios. They should not have to spend a lot of time and energy learning development skills for big data platforms. A good user experience is the key to rapid data lake development. Many database applications are developed using SQL. In the future, SQL will be the primary way to develop more data lake capabilities.
  4. Comprehensive Data Integration and Development Capabilities: Data lakes will evolve in the direction of better support and management of disparate data sources, support for full and incremental disparate data migration, and support for all data formats. In the future, data lakes will provide comprehensive, visual, and scalable integrated development environments.
  5. Deep Convergence and Integration With Businesses: A typical data lake architecture consists of distributed object storage, multi-model computing engines, and data management. This architecture has become a norm for data lake development. Data management is the key to data lake solutions and involves the management of raw data, data categories, data schemas, data permissions, and processing tasks. Data management must be adapted to and integrated with businesses. In the future, more industrial data lake solutions will be developed to meet the interactive analytic needs of data scientists and data analysts. The key to victory in the data lake field lies in how to predefine industrial data schemas, ETL processes, analytics models, and custom models in a data lake solution.