
Databricks Data Insight Open Course - How to Use Delta Lake to Build a Batch-Stream Unified Data Warehouse

This article discusses how to use Delta Lake to build a batch-stream unified data warehouse and how to put it into practice.

By Li Yuanjian, Databricks Software Engineer, Feng Jialiang, Alibaba Cloud Open-source Big Data Platform Technical Engineer

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake runs on top of an existing data lake and is fully compatible with the Apache Spark API. We hope this article gives you a deeper understanding of Delta Lake and helps you put it into practice.

This article will break down the features of Delta Lake into two parts:

  1. Project Background and Problems to be Solved
  2. Implementation Principle of Delta Lake

1. Project Background and Problems to be Solved in the Delta Lake Project

1.1 Background


Some readers may already have experience building data warehouses to process data; the industry spends considerable resources building such systems.

In practice, a wide range of data (such as semi-structured data, real-time data, batch data, and user data) is stored in various places and serves users through different processing paths.

What kind of ideal system do we expect?

  • A more integrated or focused system that lets professionals focus on their specialties
  • Processes streaming and batch data at the same time
  • Recommendation services
  • Alerting services
  • Helps users analyze a whole range of problems


However, the reality is:

  • Low-quality, unreliable data makes integration difficult.
  • Poor performance may not meet the requirements of real-time warehousing and query.

Delta Lake was created in this context.

1.2 Problems to be Solved


Let's take a common user scenario as an example and look at how such a problem would be solved without Delta Lake.

This is probably one of the most common Delta Lake scenarios. For example, we have streaming data flowing in from a Kafka system, and we expect real-time processing capabilities while also periodically landing the data into the data lake. At the same time, we need the output end of the whole system to support AI and Reporting workloads.

1. Historical Query


The first processing stream is simple. For example, we can open a real-time stream through Apache Spark for Streaming Analytics.

Meanwhile, when offline processing is required, historical queries can be served by the corresponding batch branch of a Lambda architecture. Apache Spark provides a good abstraction here: we can use the same code or API to complete both the streaming and batch sides of a Lambda architecture design.

For queries over historical data, we can also use Spark for SQL analysis and run AI workloads in the form of Spark SQL jobs.
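The idea of sharing one code path between the batch and streaming sides can be illustrated with a toy sketch. This is plain Python, not the Spark API; the `enrich` function and its fields are invented for illustration. The point is that the same business logic serves both a full historical dataset and a simulated incremental stream.

```python
def enrich(event):
    """Business logic shared by the batch and streaming paths."""
    return {**event, "amount_usd": event["amount"] * event["fx_rate"]}

def run_batch(events):
    # Batch path: process the full historical dataset at once.
    return [enrich(e) for e in events]

def run_streaming(event_source):
    # Streaming path: process events incrementally as they arrive.
    for event in event_source:
        yield enrich(event)

history = [{"amount": 10, "fx_rate": 7.0}, {"amount": 2, "fx_rate": 7.0}]

batch_out = run_batch(history)
stream_out = list(run_streaming(iter(history)))
assert batch_out == stream_out  # one code path, two execution modes
```

This is the same abstraction Spark offers with DataFrames: the transformation is written once, and the engine decides whether it runs over a bounded table or an unbounded stream.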

2. Data Verification


The first problem we need to face is data verification.

If our streaming and batch data coexist in the form of a Lambda architecture, how can we confirm that the data queried at a certain point in time is correct? What is the difference between the streaming copy and the batch copy of the data? When should our batch data be synchronized with the streaming data?

A Lambda architecture therefore also needs a Validation step to confirm the data. Validation is indispensable, especially for systems that demand accurate data analysis (such as user-facing reporting systems).

So we may need a side branch to handle synchronization between the streaming and batch layers and the corresponding verification.
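A minimal sketch of such a validation side branch, in plain Python (the `user_id` key and the count-based check are illustrative assumptions, not a real Lambda-architecture API): compare an aggregate computed on the batch layer against the same aggregate from the streaming layer before serving the data.

```python
def validate(batch_rows, stream_rows, key="user_id"):
    # Count records per key on the batch layer.
    batch_counts = {}
    for r in batch_rows:
        batch_counts[r[key]] = batch_counts.get(r[key], 0) + 1
    # Count records per key on the streaming (speed) layer.
    stream_counts = {}
    for r in stream_rows:
        stream_counts[r[key]] = stream_counts.get(r[key], 0) + 1
    # Report keys whose counts disagree between the two layers.
    return {k for k in set(batch_counts) | set(stream_counts)
            if batch_counts.get(k) != stream_counts.get(k)}

batch = [{"user_id": "a"}, {"user_id": "a"}, {"user_id": "b"}]
stream = [{"user_id": "a"}, {"user_id": "b"}]
assert validate(batch, stream) == {"a"}   # the two layers disagree on "a"
assert validate(batch, batch) == set()    # identical layers pass validation
```

Even this trivial check requires maintaining an extra pipeline branch, which is exactly the overhead the Lambda architecture imposes.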

3. Data Repair


Assuming the preceding problem is solved, what if something goes wrong with a partition of our data, and dirty data written on a given day needs to be corrected several days later? What do we need to do in that situation?

Generally, we have to stop online queries before repairing the data and resume online tasks after the repair. This process adds yet another patch to the system architecture, along with the need to go back to past versions. Thus, Reprocessing was born.
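A toy versioned table in plain Python (not Delta Lake's transaction-log implementation; the row fields are invented) shows why versioning removes the need to stop online queries: each write appends an immutable new version, so readers keep querying the last committed snapshot while a repair job commits a corrected one, and old versions remain available for rollback.

```python
class VersionedTable:
    def __init__(self):
        self.versions = [[]]                  # version 0: empty table

    def commit(self, rows):
        self.versions.append(list(rows))      # writers append, never mutate

    def read(self, version=None):
        # Readers see the latest committed version by default,
        # or any historical version on request.
        return self.versions[-1 if version is None else version]

t = VersionedTable()
t.commit([{"day": "2023-01-01", "clicks": 10},
          {"day": "2023-01-02", "clicks": -5}])      # dirty partition
fixed = [r if r["clicks"] >= 0 else {**r, "clicks": 5} for r in t.read()]
t.commit(fixed)                                      # repair = a new version
assert t.read()[1]["clicks"] == 5                    # live queries see the fix
assert t.read(version=1)[1]["clicks"] == -5          # history is preserved
```

Without this, the repair must happen in place, which is why the Lambda approach forces downtime.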

4. Data Update


Assuming the Reprocessing problem is solved, a new series of requirements emerges at the final output end for AI and Reporting. For example, one day the business department, a superior department, or a partner department asks whether a Schema Change can be made: as more people use the data, they want to add a UserID dimension. What should be done then? Without better tooling, we have to change the table schema, stall the pipeline, and reprocess the corresponding data.

Therefore, new problems keep appearing after each one is solved. If they are handled case by case, the system keeps accumulating patches, and an otherwise simple, integrated requirement becomes redundant and complex.
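A plain-Python sketch of schema evolution (analogous in spirit to schema merging in table formats like Delta Lake, but not its actual API; the column names are invented) shows the alternative to full reprocessing: old rows are simply read back with the new column filled with a default.

```python
OLD_SCHEMA = ["event", "ts"]
NEW_SCHEMA = ["event", "ts", "user_id"]   # the UserID dimension added later

def read_with_schema(row, schema):
    # Columns missing from old data surface as None instead of errors,
    # so historical rows need no rewrite when the schema grows.
    return {col: row.get(col) for col in schema}

old_row = {"event": "click", "ts": 1}                 # written pre-change
new_row = {"event": "view", "ts": 2, "user_id": "u42"}  # written post-change

assert read_with_schema(old_row, NEW_SCHEMA)["user_id"] is None
assert read_with_schema(new_row, NEW_SCHEMA)["user_id"] == "u42"
```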

5. Ideal Delta Lake


What should the ideal Delta Lake look like?

Each entry and exit point of the system does exactly its own job, and the only core is the Delta Lake layer. In other words, the corresponding data processing and the entire data warehousing process can be accomplished around Delta Lake itself.

  • Data is processed in a continuous processing mode.
  • Incremental data can also be processed as an incremental stream.
  • There is no longer a forced choice between batch and streaming. In other words, batch and streaming no longer make concessions to each other: streaming should not have to be designed around batch, nor batch around streaming.
  • With the entire architecture integrated around Delta Lake, maintenance costs drop.

2. Implementation Principle of Delta Lake

2.1 Delta Lake's Capabilities


Let's take a look at how this series of problems is solved in Delta Lake:

  1. Simultaneous reads and writes with data consistency guarantees: In Delta Lake, Readers and Writers are isolated through a snapshot mechanism, and writers commit with optimistic concurrency control, so reads and writes do not affect each other.
  2. High-throughput metadata reads for large tables: As a table grows, its metadata (snapshots, checkpoint versions, and every operation that changes the schema) becomes a big data problem in its own right. A great design choice in Delta Lake is that the metadata is itself treated as a big data problem and handled by the Spark framework, so there is no single-point bottleneck when processing the metadata of large tables.
  3. Rollback of historical and dirty data: We need Time Travel, the ability to trace back to a certain point in time, for data cleansing.
  4. Online processing of historical data during backfilling: We can still process newly arriving data in real time during a historical backfill, without stalling or having to distinguish which data is real-time and which is offline.
  5. Late-arriving data can be processed without blocking downstream jobs and can be written directly into tables.

After these five points are addressed, we can replace the Lambda architecture with Delta Lake, or use a Delta Lake architecture for a whole series of batch and streaming systems.
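Points 1 and 3 above can be illustrated together with a toy snapshot-isolation and optimistic-concurrency sketch in plain Python (not Delta Lake's transaction-log implementation): a writer's commit succeeds only if the table version it read is still current, and readers always see a consistent committed snapshot, never a half-written one.

```python
class Table:
    def __init__(self):
        self.version = 0
        self.snapshots = {0: []}

    def snapshot(self):
        # Readers get a consistent, committed view plus its version.
        return self.version, list(self.snapshots[self.version])

    def try_commit(self, read_version, rows):
        if read_version != self.version:   # another writer committed first
            return False                   # optimistic: caller must retry
        self.version += 1
        self.snapshots[self.version] = rows
        return True

t = Table()
v, data = t.snapshot()
assert t.try_commit(v, data + ["row1"]) is True
# A second writer that read the old version must retry:
assert t.try_commit(v, ["conflicting"]) is False
reader_version, reader_data = t.snapshot()
assert reader_data == ["row1"]             # consistent committed view
assert t.snapshots[0] == []                # old snapshots enable Time Travel
```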

2.2 Delta Lake-Based Architecture Design


What is Delta Lake-based architecture design?

The lowest-level building block in a Delta Lake architecture design is the table. We can divide our data, layer by layer, into basic data tables, intermediate data tables, and high-quality data tables. Each layer only needs to focus on the tables immediately upstream and downstream of it, so the dependencies between them become simpler and cleaner, and we only need to focus on data organization at the business level. In this sense, Delta Lake is a model for a unified batch and streaming continuous data flow.
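The layered table design described above can be sketched in plain Python (the table contents and field names are illustrative, not an API): a basic (raw) table, an intermediate table derived only from it, and a high-quality table derived only from the intermediate one.

```python
# Basic data table: raw records, possibly dirty.
raw_table = [{"user": "a", "amount": "10"},
             {"user": None, "amount": "3"},    # bad record
             {"user": "b", "amount": "7"}]

# Intermediate data table: cleaned and typed, depends only on the raw table.
intermediate = [{"user": r["user"], "amount": int(r["amount"])}
                for r in raw_table if r["user"] is not None]

# High-quality data table: business-level aggregate,
# depends only on the intermediate table.
high_quality = {}
for r in intermediate:
    high_quality[r["user"]] = high_quality.get(r["user"], 0) + r["amount"]

assert high_quality == {"a": 10, "b": 7}
```

Because each table depends only on the layer above it, a fix or schema change in one layer propagates downstream without entangling the whole pipeline.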
