
Zuoyebang's Best Practices for Building Data Lakes Based on Delta Lake

This article aims to solve the performance problems of offline data warehouses (daily and hourly) during production and usage.

By Liu Jin (Head of the Big Data Platform Technology Department of Zuoyebang), Wang Bin (Senior Architect of the Big Data Platform Technology Department of Zuoyebang), and Bi Yan (Technical Expert of the Open-Source Big Data Platform for Computing Platform of Alibaba Cloud)

Contents

  • Background
  • Problems & Pain Points
  • Solutions
  • Delta Lake-Based Offline Data Warehouse
  • Future Planning
  • Acknowledgments

1. Background

Zuoyebang is a technology-driven online tutoring enterprise. Its offerings include tool products (such as the homework helper and Zuoyebang oral calculation apps), live-lesson products for K12 students (such as Zuoyebang Live Lessons), essential-qualities-oriented education products (such as Xiaolu programming, Xiaolu calligraphy, and Xiaolu art), and intelligent learning hardware (such as the Miaomiao machine). Zuoyebang runs several business systems, including the teaching and research middle platform, the teaching middle platform, the tutoring operation middle platform, and the big data middle platform, to continue empowering more essential-qualities-oriented education products and to keep bringing users a better learning and user experience. Among them, the big data middle platform, as a foundational middle platform, is mainly responsible for constructing the enterprise-level data warehouse and providing each product line with business-oriented data, such as retention rate, attendance rate, and the number of active users, to improve the efficiency and quality of operations and decision-making.

[Figure 1: Overview of the Zuoyebang data middle platform]

The preceding figure shows the overview of the data middle platform of Zuoyebang. It is mainly divided into three layers:

  • The first layer is data products and empowerment.

It mainly includes data tools and products based on the thematic data domain to support application scenarios, such as business intelligence (BI) and trend analysis.

  • The second layer is the all-domain data layer.

Using OneModel unified modeling, we standardize the modeling of the ingested data and build subject-oriented data for each business domain across scenarios with different timeliness requirements, improving the efficiency and quality of the upper-layer products.

  • The third layer is the data development layer.

A series of systems and platforms are built to support all data development projects within the enterprise, including data integration, task development, data quality, data services, and data governance.

This article focuses on solving the performance problems of the offline data warehouse (daily and hourly tables) in both production and consumption.

2. Problems & Pain Points

Zuoyebang's offline data warehouse, built on Hive, provides data construction capabilities from the ODS layer to the ADS layer. When ADS tables are generated, they are written to the OLAP system through data integration to provide BI services for managers. DWD, DWS, and ADS tables also serve analysts for offline data exploration and ad hoc data retrieval.

[Figure 2: Architecture of the Hive-based offline data warehouse]

As the business grows and the data volume increases accordingly, the offline data warehouse mainly faces the following problems:

  • The output of ADS tables is increasingly delayed.

Due to the increase in data volume, building the complete pipeline from the ODS layer to the ADS layer takes longer and longer. Although the pipelines of core ADS tables can be protected in the short term by skewing resources toward them, this approach sacrifices less critical tables to protect the most important ones, so it cannot be replicated on a large scale, and the timely output of other important ADS tables suffers. For analysts, this latency means that, in the worst case, T+1 tables cannot be seen until T+2.

  • The requirement for hourly tables is difficult to satisfy.

Some scenarios require tables to be produced hourly. For example, some activities need hourly feedback so that operation strategies can be adjusted in time. With the growth in data volume and the shortage of computing cluster resources, it is often difficult to guarantee the timeliness of hourly tables, and sufficient resources usually have to be reserved in advance to improve computing performance. In the worst case, when full data has to be computed every hour instead of once a day, computing resources need to be expanded by 24 times.

  • Data exploration is slow, and data retrieval is unstable.

After the data is produced, it is mostly consumed by analysts. Querying Hive directly takes dozens of minutes or even hours, which is unacceptable to users, who often complain about it. Although Presto is used to accelerate queries on Hive tables, the architecture of Presto means the queried tables cannot be too large and the logic cannot be too complicated; otherwise, Presto runs out of memory (OOM). In addition, existing Hive UDFs and VIEWs cannot be used directly in Presto, which further limits the scenarios available to analysts.

3. Solutions

Problem Analysis

Whether it is the slow output of the complete pipeline from the ODS layer to the ADS layer, or slow data exploration and retrieval on specific tables, the root cause is insufficient computing performance at the Hive layer. Based on the scenario analysis above:

  • The Reason for Slow Pipeline Computing: Hive does not support incremental updates, while the MySQL-Binlog data from the business layer contains a large number of updates. Therefore, at the ODS layer, the incremental data has to be merged and deduplicated with the historical full data to produce a new full snapshot, and the downstream DWD, DWS, and ADS layers follow a similar pattern. This brings a large amount of repeated computation and delays data output.
  • The Reason for Slow Data Queries: Hive lacks the necessary indexes, so both high-throughput batch computation and queries that are expected to return within minutes are translated into MapReduce jobs, which leads to slow results in interactive data exploration scenarios.

Solution Research

Based on the analysis above, if the problem of incremental data updates in the offline data warehouse can be solved, pipeline computing performance will improve. If data tables also support indexing, query latency can be reduced without sacrificing query functionality.

  • Solution Based on HBase and ORC

HBase can handle data updates: with the RowKey as the primary key and a Column per field, data can be written in real time. However, due to the HBase architecture, query performance on non-primary-key columns is very poor. To improve it, HBase tables have to be periodically sorted by specific fields (for example, every hour for hourly tables and every day for daily tables), exported to HDFS, and stored in ORC format. However, ORC only supports single-column min/max indexes, so query performance still falls short of the requirements. Moreover, since HBase writes never stop, the timing of the export is hard to control, and the data may change during the export. For example, to export the data before 21:00 on December 11 as the 21:00 partition of the data table, we would have to filter by version, which adds overhead in version management, storage capacity, and computing. Introducing HBase would also sharply increase system complexity and the costs of operations and maintenance.

  • Data Lake

A data lake table format sits between mainstream computing engines (such as Flink and Spark) and data storage (such as Object Storage Service). Without additional services, it supports real-time upsert operations, maintains multiple data versions, and allows reading the data of any version.

Currently, data lake solutions mainly include Delta Lake, Iceberg, and Hudi. We have investigated the three solutions on Alibaba Cloud. Their differences and characteristics are shown below:

[Figure 3: Comparison of Delta Lake, Iceberg, and Hudi]

Considering usability and functionality, we finally selected Delta Lake as our data lake solution based on our scenarios. (Delta Lake's semantics are clear, and Alibaba Cloud supports the full SQL syntax for it, which makes it easy to use; the other two solutions have a higher threshold for use. In addition, only Delta Lake supports Zorder and Data Skipping for query acceleration.)

4. Delta Lake-Based Offline Data Warehouse

After the introduction of Delta Lake, the architecture of the offline data warehouse is shown below:

[Figure 4: Architecture of the offline data warehouse after introducing Delta Lake]

First, binlogs are collected by Canal and then written to Kafka through our self-developed data distribution system. Note in advance that the distribution system must keep binlogs strictly ordered at the table level; the reasons are detailed below. Data is then written to Delta Lake in micro-batches using Spark. Finally, we upgraded the data retrieval platform to use Spark SQL to query Delta Lake.

We need to deal with the following key technical points in the process of using Delta Lake:

Converting Streaming Data into Batch Data

In our business scenarios, ETL tasks in the offline data warehouse are triggered by the readiness of table partitions. For example, a task on December 31, 2021 may be triggered only after the table partition of December 30, 2021 is ready. This is easy to support on Hive, since Hive naturally supports partitioning by a date field (such as dt). With Delta Lake, however, data is written as a stream, so the streaming data needs to be converted into batch data: only after the data of a given day is completely ready should the corresponding daily partition become readable.

How Can We Define When Data Is Completely Ready?

Streaming data is generally out of order; even with a watermark mechanism, order can only be guaranteed within a certain time window. However, the data in an offline data warehouse must be complete, without any loss. If the ordering of the data source itself can be guaranteed, the readiness problem becomes much simpler: if the data is partitioned by day, the data of December 30 can be considered ready as soon as data of December 31 starts to appear.

Therefore, our solution is broken down into two sub-issues:

  • Define batch data boundaries once the streaming data is ordered.
  • Provide a mechanism that keeps the streaming data ordered.

First of all, the overall plan for the first sub-issue is shown below:

[Figure 5: Converting ordered streaming data into daily batch versions]

  • Set the logical partition field (dt) of data tables and the corresponding time unit information

[Figure 6: Table metadata with the logical partition field (dt) and time unit]

  • After Spark reads a micro-batch of data, it uses the event time in the data, together with the table metadata shown above, to generate the corresponding dt value. For example, when event times belonging to day T+1 appear in the stream, the snapshot for data version T is generated. When data is read, the corresponding data version is selected based on that snapshot. A minimal sketch of this readiness rule follows.
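The sketch below assumes a daily dt partition, an event-time column in milliseconds, and a fixed time zone; it illustrates the idea rather than our production trigger logic.

```scala
import java.time.{Instant, LocalDate, ZoneId}

// Hypothetical sketch: derive the logical partition value (dt) from the event
// time of each record, and decide when the previous day's snapshot is complete.
object PartitionReadiness {

  // Map an event timestamp (ms) to its dt value, e.g. "2021-12-30".
  def eventTimeToDt(eventTimeMs: Long): String =
    Instant.ofEpochMilli(eventTimeMs)
      .atZone(ZoneId.of("Asia/Shanghai"))   // assumed time zone
      .toLocalDate
      .toString

  // Called once per micro-batch with the max event time seen so far.
  // Because the upstream stream is guaranteed to be ordered, day T can be
  // published as a snapshot as soon as data belonging to day T+1 appears.
  def readyPartition(maxEventTimeMs: Long, lastPublishedDt: Option[LocalDate]): Option[LocalDate] = {
    val currentDay = LocalDate.parse(eventTimeToDt(maxEventTimeMs))
    val candidate  = currentDay.minusDays(1)   // day T is ready once T+1 shows up
    if (lastPublishedDt.forall(_.isBefore(candidate))) Some(candidate) else None
  }
}
```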

How Can We Solve the Out-of-Order Problem of Streaming Data?

At the source, both app logs and MySQL-Binlogs are ordered. Take MySQL-Binlog as an example: the binlogs of a single physical table are necessarily ordered. However, business systems often use database sharding and table sharding, so one logical table is split into several physical tables, such as Table1, Table2, and so on. For ODS tables in the offline data warehouse, the details of MySQL table sharding on the business side must be hidden. The problem is therefore narrowed down to keeping data ordered in sharding scenarios.

  • Ensure that data written to Kafka is ordered when database sharding and table sharding are used, even when the shards reside in different clusters, so that the data of a logical table that Spark reads from a topic and writes to Delta Lake is ordered at the partition granularity.
  • Ensure the timely readiness of ODS tables. For example, even when no binlog data arrives, the ODS-layer data should still become ready on schedule.

The original system needs to be upgraded, and the solutions are listed below:

[Figure 7: The upgraded binlog collection and distribution pipeline]

As shown in the preceding figure, the binlogs of a MySQL cluster are collected by Canal and written to a specified Kafka topic. However, when binlogs are written, the partition is chosen by hashing the database name and the logical table name (with the _* suffix removed from the table name), so a single partition contains the binlogs of multiple physical tables, which is unfriendly to writing data into Delta Lake. Considering compatibility with other data consumers, we added a new data distribution service:

  • Write the data of each logical table (with the _* suffix removed from the table name) to its own topic, and hash on the physical table name to choose the partition. This guarantees that the data within a single partition is always ordered and that a single topic contains only the data of one logical table. A sketch of this routing rule follows the list.
  • An internal heartbeat table is created in the MySQL cluster to monitor abnormal latency in Canal collection. Based on it, a threshold is set to distinguish between a system problem and a genuine absence of binlog data when no binlogs arrive. In the latter case, Delta Lake is triggered to perform a savepoint and then a snapshot in time, so that the ODS tables still become ready.
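Below is a minimal sketch of the routing rule from the first bullet, assuming shard suffixes of the form `_<digits>`, a hypothetical topic naming scheme, and a given partition count; the heartbeat-based readiness check from the second bullet is not modeled here.

```scala
// Hypothetical sketch of the binlog distribution rule: one Kafka topic per
// logical table, and the partition is chosen by hashing the physical table
// name, so binlogs of one physical table always stay ordered in one partition.
object BinlogRouter {

  // "order_info_003" -> "order_info" (strip the shard suffix "_<digits>")
  def logicalTable(physicalTable: String): String =
    physicalTable.replaceAll("_\\d+$", "")

  // Topic per logical table; a stable non-negative hash picks the partition.
  def route(db: String, physicalTable: String, numPartitions: Int): (String, Int) = {
    val topic     = s"binlog_${db}_${logicalTable(physicalTable)}"   // assumed naming scheme
    val partition = Math.floorMod(physicalTable.hashCode, numPartitions)
    (topic, partition)
  }
}
```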

With the preceding solution, binlog data is written to Delta Lake in a streaming manner, and table partitions become ready with a latency of less than ten minutes.

The Optimization of Read/Write Performance

Let's talk about the performance problems encountered when we use Delta Lake and the corresponding solutions.

Improve Write Performance with DPP

Delta Lake allows you to write data using Spark Streaming SQL.

[Figure 8: Writing data to Delta Lake with Spark Streaming SQL]
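The figure above shows the Spark Streaming SQL statement we use on EMR. The sketch below is a rough open-source equivalent using the Structured Streaming DataFrame API, with an assumed Kafka topic, broker address, checkpoint path, and an `upsertToDelta` helper that is filled in later in this section.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Rough open-source equivalent of the streaming write path: read the binlog
// records of one logical table from Kafka and hand every micro-batch to a
// merge ("upsert") function.
object BinlogStreamJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("binlog-to-delta").getOrCreate()

    val binlog = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")     // assumed address
      .option("subscribe", "binlog_orderdb_order_info")     // one topic per logical table
      .load()
      .selectExpr("CAST(value AS STRING) AS json")           // JSON parsing into columns omitted

    binlog.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        upsertToDelta(batch, batchId)                        // merge sketched below
      }
      .option("checkpointLocation", "/tmp/checkpoints/order_info")  // assumed path
      .start()
      .awaitTermination()
  }

  // Placeholder; a sketch of the merge logic appears after the two-step list below.
  def upsertToDelta(batch: DataFrame, batchId: Long): Unit = ???
}
```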

Since records need to be merged and deduplicated, data is written with a merge statement (a sketch of such a merge follows the list below). Delta Lake updates data in two steps:

  1. Locate the files that need to be updated. By default, all files of the table are read and joined in Spark with the incremental data of the batch to find the files to be rewritten.
  2. Rewrite these files after merging them and mark the old files as deleted.
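As a hedged illustration of such a merge, the sketch below fills in the `upsertToDelta` helper with the Delta Lake Scala API; the table name `ods.order_info`, the key column `id`, and the rule of keeping the latest change per key by `event_time` are assumptions, not our exact production logic.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical upsert: deduplicate the micro-batch (keep the latest change per
// primary key), then merge it into the Delta table. batchId is unused here.
def upsertToDelta(batch: DataFrame, batchId: Long): Unit = {
  val latestPerKey = batch
    .withColumn("rn", row_number().over(
      Window.partitionBy("id").orderBy(col("event_time").desc)))
    .filter(col("rn") === 1)
    .drop("rn")

  DeltaTable.forName(batch.sparkSession, "ods.order_info")  // assumed table name
    .as("t")
    .merge(latestPerKey.as("s"), "t.id = s.id")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}
```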

[Figure 9: Files read by the merge before (left) and after (right) DPP optimization]

As shown on the left of the figure, Delta Lake reads all files of the previous version by default. As a result, write performance is extremely low, and a single merge operation cannot be completed within one Spark batch.

In this scenario, Delta Lake was enhanced to use DPP (dynamic partition pruning) for the MERGE INTO operation, as shown on the right of the preceding figure (a user-level sketch of the same idea follows the list):

  • Analyze the merge-on condition to obtain the fields in the source table that correspond to the partition fields of the Delta Lake table.
  • Compute the distinct values of those fields in the source data to get the enumerated list of partition values.
  • Convert the result of the previous step into a filter and apply it to prune the list of data files.
  • Read the pruned data file list and join it with the source data of the batch to obtain the final list of files to be updated.
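The DPP optimization itself was implemented inside the engine on Alibaba Cloud EMR. A user-level sketch of the same idea is to collect the partition values present in the batch and add them to the merge condition so that only the matching partition files are read; the `dt` and `id` column names and the table name below are assumptions.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Hypothetical user-level partition pruning for MERGE INTO: restrict the merge
// condition to the dt partitions actually present in this micro-batch.
// (Guarding against an empty batch is omitted for brevity.)
def mergeWithPartitionPruning(batch: DataFrame): Unit = {
  val dts = batch.select("dt").distinct().collect()
    .map(r => s"'${r.getString(0)}'")
    .mkString(",")

  DeltaTable.forName(batch.sparkSession, "ods.order_info")  // assumed table name
    .as("t")
    .merge(batch.as("s"), s"t.dt IN ($dts) AND t.id = s.id")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
}
```

This keeps the merge semantics unchanged while ensuring that only the files of the touched dt partitions are scanned.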

After the DPP optimization, the processing latency of a Spark batch (at 5-minute granularity) dropped from more than 20 minutes to at most 3 minutes, eliminating the problem of latency continually accumulating because processing took longer than the batch interval.

Improve Read Performance with Zorder

After solving the data write performance problem, we encountered a data read performance problem.

On the same data set (20 billion+ records), the average query latency with Hive was over 10 minutes, and it was still 11 minutes or more after migrating to Delta Lake. Analysis showed that Zorder had not been applied to the filtered columns. After enabling Zorder, the latency dropped to 24 seconds, an improvement of nearly 25 times.

[Figure 10: Query latency with Hive, with Delta Lake before Zorder, and with Delta Lake after Zorder]

Query optimization for Delta Lake tables based on Zorder involves two improvements:

  • Data Skipping

Delta Lake records the min and max values of each field at the file granularity, which allows data files to be filtered out directly during queries.

  • Zorder

Zorder is a data layout technique that rearranges data so that records with similar values in the Zorder fields are stored close together as much as possible, which makes Data Skipping far more effective. An example is shown below.
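For example, on an engine that supports the OPTIMIZE statement (EMR Spark or recent open-source Delta Lake releases), Zorder can be applied to the commonly filtered columns roughly as follows; the table and column names are assumptions.

```scala
// Hypothetical example, given an existing SparkSession `spark`: cluster the
// data files of the table on a commonly filtered column so that file-level
// min/max statistics (Data Skipping) can prune most files.
spark.sql("OPTIMIZE ods.order_info ZORDER BY (user_id)")

// A later selective query on user_id now only reads the few files whose
// [min, max] range for user_id can contain the requested value.
spark.sql("SELECT count(*) FROM ods.order_info WHERE user_id = 12345").show()
```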

Optimize the Duration of Zorder Construction

For the columns on which Zorder is enabled, construction normally takes about 30 minutes; in the case of data skew, it can take up to 90 minutes.

Zorder is optimized for these two situations:

  • In conventional cases, for a Zorder over multiple columns, construction efficiency is improved by traversing the data set once instead of multiple times, reducing the construction duration from an average of about 30 minutes to about 20 minutes.
  • In the case of data skew, the buckets of the skewed column are split to disperse the hotspot, reducing the construction duration from an average of about 90 minutes to about 30 minutes.

[Figure 11: Zorder construction duration before and after optimization]

Overall Effects

After more than half a year of development and optimization, the Delta Lake-based offline data warehouse was recently launched. It focuses on improving query performance for analysts while also supporting scenarios that require hourly full-scale tables. The overall results are listed below:

  • Faster Readiness Time: After the ODS layer was replaced with Delta Lake, the output time moved from 2:00 a.m. or 3:00 a.m. to about 00:10 a.m., more than two hours earlier.
  • More Capabilities: The big data platform now supports hourly full-scale tables. By using the incremental update capability of Delta Lake, the hourly full-scale requirement is met at a low cost, avoiding the overhead of repeatedly reading the full data as in the conventional approach. This has been applied in some core businesses to build hourly full-scale tables, and their timeliness has improved from up to 40 minutes in the past to within 10 minutes.
  • Higher Query Speed: We focused on improving the efficiency of ad hoc queries for analysts. After migrating the data warehouse tables commonly used by analysts to Delta Lake and accelerating queries with Zorder, the query time dropped from more than ten minutes to about three minutes.

5. Future Planning

With the use of Delta Lake in Zuoyebang, there are still some problems to be solved:

  • Improve the Efficiency of Data Repairing

With Hive, a single historical partition can be repaired independently, but to repair data in a Delta Lake table, all versions after the faulty one have to be rolled back.

  • Fully Support the Hive Engine

Currently, Delta Lake mainly addresses the slow queries we used to run on Hive and the limits on complex queries in Presto, providing a solution for complex queries with low latency. However, the Hive engine does not support the aforementioned features (such as GSCD and Data Skipping), so users who rely on Hive cannot yet benefit from Delta Lake.

  • Support Flink Access

Our stream computing ecosystem is mainly built on Flink. Introducing Delta Lake adds Spark to the stack, which increases the maintenance cost of the stream computing ecosystem.

6. Acknowledgments

We appreciate the efforts of the Alibaba Cloud EMR Data Lake Team. They helped us solve many essential problems during the migration of our data lakes with their efficient support and professional capabilities in Delta Lake.
