Community Blog Build a Real-time Cloud Data Lake Based on Alibaba Cloud DLA and Apache Hudi

Build a Real-time Cloud Data Lake Based on Alibaba Cloud DLA and Apache Hudi

This article describes how to build a real-time data lake in the cloud based on Alibaba Cloud's Data Lake Analytics (DLA) and Apache Hudi

Conventional big data architecture is facing increasing pressure due to the increasing and diversified analytic requirements of modern day enterprises. With the development and popularization of cloud technologies, real-time data lakes in the cloud has become a popular solution, which typically serve as the big data platform for enterprises. This article describes how you can build a real-time enterprise-level data lake in the cloud by using Alibaba Cloud's Data Lake Analytics (DLA) and Apache Hudi.

What Is a Real-time Data Lake?

In the big data era, traditional data warehouses can no longer meet the requirements for storing data in diversified formats, such as structured data, semi-structured data, and unstructured data. In addition, they fail to meet a range of other requirements posed by upper-layer applications such as interactive analytics, streaming analytics, and machine learning. The "T+1" data delay mechanism of data warehousing leads to a significant delay in analytics, which is not conducive to companies' timely mining of data value. Meanwhile, with the evolvement of cloud computing technologies and the affordability of cloud object storage services, more and more companies are building data lakes in the cloud. However, traditional data lakes do not support the Atomicity, Consistency, Isolation, Durability (ACID) properties of database transactions. As a result, their tables do not support transactions and cannot handle data updates and deletions, which prohibits data lakes from further unleashing their capabilities. To help companies gain faster insights into data value and support ACID properties of transactions, we need to introduce a real-time data lake as the big data processing architecture to meet various analytic requirements of upper-layer applications.

The Big Data Platform Solution

The Traditional Hadoop Solution

In the big data era, big data analytics platforms headed by the Hadoop system gradually show their superiority, along with the improving ecosystem surrounding the Hadoop system. The Hadoop system has provided an effective solution to the bottlenecks of traditional data warehouses.


Traditional batch processing features a long latency. As the data size grows, the following problems emerge.

  • Limitations of Hadoop Distributed File System (HDFS). Many companies that rely on HDFS to expand their big data infrastructure face this problem. HDFS is designed to rely on the memory capacity of the single NameNode to handle the namespace. Therefore, storing a large number of small files can significantly undermine its performance. The problem begins to emerge when the data size exceeds 10 petabytes, and gets serious when the data size grows to 50 to 100 petabytes. Fortunately, some easy solutions are available for scaling HDFS from dozens of petabytes to hundreds, such as ViewFS and HDFS NameNode Federation. Alternatively, you can limit the number of small files and group different data to separate clusters to alleviate HDFS bottlenecks.
  • Fast update. It is necessary to query new data as much as possible in many use cases. However, the "T+1" delay mechanism of traditional data warehouses incurs a long update latency, and is not suitable for scenarios with demanding data timeliness requirements. Meanwhile, the significant data latency is not conducive to timely decision-making of companies.
  • Updates and deletions in Hadoop. Distributed storage in big data solutions stresses the read-only property of data. Therefore, similar to Hive, HDFS does not support UPDATE and parallel WRITE operations in its storage methods, which results in HDFS' limitations and failure to update or delete existing data. To cope with data size growth on platforms, we must find a method to remove these limitations in HDFS so as to support UPDATE and DELETE operations.

The Lambda Solution

Batch processing, despite its high data quality, produces a significant latency. In view of this problem, Lambda, the popular architecture industry-wide, may be a solution that ensures both low latency and high stability.


In the Lambda architecture, a data copy is transmitted to two layers for separate processing. That is, the speedy layer for streaming processing to generate a real-time incremental view, and the batch processing layer for batch processing to generate a stable and reliable historical view. When an upper-layer application queries the data, Lambda merges the incremental view and the historical view to form and return a complete view, which ensures the low latency of the data query. However, the architecture renders it necessary to maintain two data copies, store two results, and employ multiple processing frameworks, which increases system maintenance costs.

So, can we build a scalable real-time data lake that delivers a low data latency with less burdensome system O&M while avoiding the disadvantages of traditional HDFS solutions? The answer is yes. By using Alibaba Cloud Data Lake Analytics (DLA) and Apache Hudi, we can easily build a real-time data lake.

Alibaba Cloud's Real-time Data Lake Solution

The DLA + Hudi solution enables you to easily build a real-time data lake for analytics in Alibaba Cloud Object Storage Service (OSS).

The following steps demonstrate a typical data process in a company.

  • Collect various app data to Kafka or other MQ services
  • Process the data in Kafka by using an engine such as Spark or Flink
  • Write the processing results to a database such as DB, HDFS, or OSS
  • Analyze the results to generate reports by using an analysis engine such as Presto, Hive, or Spark

With the integrated Hudi and DLA's built-in out-of-the-box Spark capabilities, you can quickly build a Hudi data lake in DLA. The following figure shows the architecture.


Users consume input data by using DLA's Spark Streaming extension. The results are then written into OSS as Hudi increments, and the metadata is automatically synchronized to DLA Meta. This solution also supports user-created Spark clusters, in which scenario you only need to write input data into OSS in the Hudi format and automatically associate the data with DLA Meta. Then, you can use DLA-SQL for online interactive analytics, or use DLA-Spark for machine learning and offline analytics. Both of the preceding solutions can greatly lower the threshold for DLA use and embrace ultimate openness of DLA. The real-time data lakes based on DLA and Hudi have the following advantages.

  • Minutes of overall data latency in building a "T+0" data lake
  • Storage of incremental data in OSS to support UPSERT and DELETE operations, and automatic metadata management
  • Diversified data sources that cover more than 95% of those on Alibaba Cloud
  • Fully-managed SQL and Spark databases to free you from cluster O&M
  • Elastic serverless SQL and Spark databases to meet the needs of interactive processing, batch processing, machine learning, and other workloads
  • Only one data copy is stored in OSS and its increments are managed by DLA Meta, reducing storage cost
  • Multi-tenant and scans-based billing to effectively manage the query requests and SQL usage of multiple analysts

In the next section, we'll discuss the details of DLA and Apache Hudi.

What Is DLA?

As a core database service independently developed by Alibaba Cloud, Alibaba Cloud Data Lake Analytics (DLA) is a next-generation cloud-native analytics platform. DLA supports open computing, the MySQL protocol, and the Presto and Spark engines. It features a low cost and cost-free serverless management, and provides unified metadata to present a unified data view. DLA is serving thousands of customers of Alibaba Cloud.

For more information, visit https://www.alibabacloud.com/product/data-lake-analytics


DLA's serverless capabilities free companies from high O&M costs and the complexity of scaling in or out during data peaks and valleys. This service is billed on a pay-as-you-go basis without any ownership overhead. Meanwhile, DLA does not store user data separately. Instead, it stores user data in OSS in an open format, and you only need to associate the metadata with DLA Meta to analyze the data by using DLA SQL, and perform complex extract, transform, and load (ETL) operations by using DLA Spark.

What Is Apache Hudi?

Apache Hudi is a processing framework for incremental data lakes and supports data insertion, update, and deletion. You can use it to manage distributed file systems such as HDFS and ultra-large datasets in clouds such as OSS and S3. Apache Hudi has the following key features.

  • Provide a pluggable index mechanism that supports fast UPSERT and DELETE operations
  • Support incremental pulling of table changes for processing
  • Support time-travel queries to view historical data
  • Support ACID properties and the COMMIT or ROLLBACK statement of transactions
  • Automatically manage small files to optimize query performance
  • Support row-store-based fast data writes and asynchronous compression of data into a column-store table to facilitate analysis
  • Provide a metadata timeline for audit trail

For more information, visit https://hudi.apache.org/

Demo of Building a Real-time Data Lake with DLA

For more information about how to build a real-time data lake demo by using DLA and Hudi, please refer to the following guide: Build a real-time data lake by using DLA and DTS to synchronize data from ApsaraDB RDS.


This article introduces the data lake concept and common big data solutions. It also describes the real-time data lake solution of Alibaba Cloud, which uses DLA and Hudi in combination to quickly build a quasi-real-time data lake for data analytics, and explains the advantages of the solution. In the last section, the article provides a simple demo to show how to integrate DLA and Hudi.

0 0 0
Share on


314 posts | 41 followers

You may also like