By Liu Jin (Head of Technology Department of Big Data Platform of Zuoyebang), Wang Bin (Senior Architect of Technology Department of Big data Platform of Zuoyebang), and Bi Yan (Technical Expert of Open-Source Big Data Platform for Computing Platform of Alibaba Cloud)
Zuoyebang is an online tutoring enterprise based on technology. It has tool products (such as homework helper and Zuoyebang oral calculation), live lessons products for K12 students (such as Zuoyebang live lessons), essential-qualities-oriented education products (such as Xiaolu programming, Xiaolu calligraphy, and Xiaolu art), and intelligent learning hardware (such as Miaomiao machine). Zuoyebang has several business systems, including the teaching and research middle platform, teaching middle platform, tutoring operation middle platform, and big data middle platform, to continue to empower more products of essential-qualities-oriented education, continuously bringing a better learning and user experience to users. Among them, the big data middle platform, as the basic system middle platform, is mainly responsible for the constructing the enterprise-level warehouse, providing each product line with business-oriented data information, including retention rate, attendance rate, and the number of active users, to improve the efficiency and quality of operations and decision-making.
The preceding figure shows the overview of the data middle platform of Zuoyebang. It is mainly divided into three layers:
It mainly includes data tools and products based on the thematic data domain to support application scenarios, such as business intelligence (BI) and trend analysis.
We have standardized the modeling of the connected data and constructed the theme data of the business domain for different time-sensitive scenarios using OneModel unified modeling, improving the efficiency and quality of upper-layer products.
A series of systems and platforms are built to support all data development projects within the enterprise, including data integration, task development, data quality, data services, and data governance.
This article aims to solve the performance problems of offline data warehouses (daily and hourly) during production and usage.
Based on Hive, the offline data warehouse of Zuoyebang provides the capability for data construction from the ODS layer to the ADS layer. When ADS tables are generated, they are written into the OLAP system through data integration to provide BI services for administrators. DWD, DWS, and ADS tables also provide offline data exploration and data retrieval services for analysts.
With the gradual development of the business and the correspondingly increasing volume of data, the offline data warehouse system mainly experiences the following problems:
Due to the increase in data volume, it takes longer to build a comprehensive procedure from the ODS layer to the ADS layer. Although the procedure of core ADS tables can be dealt with in the short term through the mode of tilting resources, this mode aims to make minor sacrifices to safeguard major interests. Thus, this mode cannot be replicated on a large scale, which affects the timely output of other important ADS tables. For example, for analysts, due to the latency of data tables, tables for T+1 cannot be seen until T+2 in the worst case.
Some scenarios are hourly output tables. For example, some activities require hourly feedback to adjust operation strategies in a timely manner. For such scenarios, with the increase in data volume and the shortage of resources in computing clusters, it is often difficult to ensure the timeliness of hourly tables. It is often necessary to prepare sufficient resources in advance to improve computing performance. Computing resources need to be expanded by 24 times in the worst case, especially when hourly data is required to compute daily data.
After the data is output, it is often used by analysts. It takes dozens of minutes or hours to directly access Hive, which is unacceptable for users that often complain about the problem. Although Presto is used to accelerate the query of Hive tables, the queried data table cannot be too large, and the logic cannot be too complicated because of the architecture characteristics of Presto. Otherwise, OOM will occur on Presto. Moreover, existing the UDF and VIEW of Hive cannot be directly used in Presto, which also limits the usage scenarios for analysts.
Whether the output of a comprehensive procedure from the ODS layer to the ADS layer is slow, or data exploration and data retrieval of specific tables are slow, it means the Hive layer is insufficient for the computing performance. According to the scenario analysis above:
Based on the analysis above, if you can solve the problem of incremental data updates in offline data warehouses, you can improve the performance of procedure computing. If data tables support the capability for index, you can reduce the query latency without degrading the query feature.
You can use HBase to deal with the data update. Setting RowKey as the primary key and Column for each column allows you to write data in real-time. However, due to the HBase architecture, the query performance for non-primary key columns is very poor. It is necessary to regularly (for example, perform operations every hour for hourly tables and every day for daily tables) sort HBase tables according to specific fields, export them to HDFS, and store them in ORC format to improve its query performance. However, the ORC format only supports min and max indexes in a single column. Therefore, the query performance still cannot meet the requirements. Moreover, since data writes of HBase continue to occur, the timing of export is difficult to control, and the data may change during the export process. For example, if we want to export the data before 21:00 on December 11 as the data of the 21:00 partition in the data table, we need to consider factors, such as the number of versions, storage capacity, and computing performance caused by filtering. The complexity of the system has increased sharply, and the costs of operations and maintenance have increased by introducing the HBase system.
A data lake is a data format that can be integrated between mainstream computing engines (such as Flink and Spark) and data storage (such as Object Storage Service). Without additional services, it supports real-time upsert operations. It provides support for multiple data versions and can read data of any version.
Currently, data lake solutions mainly include Delta Lake, Iceberg, and Hudi. We have investigated the three solutions on Alibaba Cloud. Their differences and characteristics are shown below:
In addition, considering the usability and functionality, we finally selected Delta Lake as the data lake solution based on our scenarios. (Delta Lake semantics are clear, and Alibaba Cloud supports the full-featured SQL syntax, which is easy to use. The latter two solutions have a high threshold for usage. Only Delta Lake supports Zorder and Data Skipping query acceleration.)
After the introduction of Delta Lake, the architecture of the offline data warehouse is shown below:
First, Binlogs are collected through Canal and written into Kafka through our self-developed data distribution system. It should be noted in advance that our distribution system needs to keep Binlogs in strict order at the table level. The reasons are detailed below. Then, data is written to Delta Lake in batches using Spark. Finally, we upgrade the data retrieval platform and use Spark SQL to retrieve data from Delta Lake.
We need to deal with the following key technical points in the process of using Delta Lake:
In business scenarios, ETL tasks for the offline data warehouse are triggered based on the readiness of data table partitioning. For example, tasks on December 31, 2021, may be triggered only after the readiness of data table partitioning on December 30, 2021. This scenario can be easily supported on the Hive system because Hive naturally supports partitioning by date fields (such as dt). However, data writes are streaming writes for Delta Lake. Therefore, you need to convert streaming data to batch data. After the data is completely ready on a certain day, you can provide the read capability of the corresponding daily partition.
Streaming data is generally out of order. Therefore, even if the watermark mechanism is used, it can only ensure that the data is ordered within a certain time. However, the data for offline data warehouses need to be reliable without any loss. If we can solve the problem of the orderliness of data sources, the solution to the problem of data readiness will be simplified. If the data is partitioned each day, data on December 30 can be considered ready when data on December 31 appears.
Therefore, our solution is broken down into two sub-issues:
First of all, the overall plan for the first sub-issue is shown below:
Logs are ordered for app-log and MySQL-Binlog. Let's take MySQL-Binlog as an example. Binlogs of a single physical table must be ordered. However, the business system often uses database sharding and table sharding in actual business scenarios. A logical table is divided into several tables for scenarios where table sharding is used, like Table1, Table2, etc. You need to mask the details and logic of MySQL table sharding on the business side for ODS tables in the offline data warehouse. This way, the problem is narrowed down to dealing with orderly data in table sharding scenarios.
The original system needs to be upgraded, and the solutions are listed below:
As shown in the preceding figure, the binlogs of a MySQL cluster are collected by Canal and written to a specified Kafka-topic. However, the partition is determined based on the hash value of db and tables (the _* part in the table name is deleted) when writing binlogs. Therefore, a single partition contains binlogs of multiple physical tables, which is unfriendly for writing data to Delta Lake. Considering the compatibility with other data application parties, we have added a new data distribution service:
With the preceding solution, we write the binlog data to Delta Lake in the streaming manner, and the readiness of table partitioning is latened by less than ten minutes.
Let's talk about the performance problems encountered when we use Delta Lake and the corresponding solutions.
Delta Lake allows you to write data with the SparkStreamingSQL method.
Since you need to merge and deduplicate records, you are required to write data by merging. Delta Lake updates data in two steps:
As shown in the figure on the left, Delta Lake reads all files of the previous version by default. As a result, the write performance is extremely low. One merging operation cannot be completed within one batch of Spark.
In this scenario, Delta Lake is upgraded using DPP to perform partition pruning to optimize the performance of merging into, as shown in the preceding figure on the right:
After DPP optimization, the processing latency of a Spark batch (5min granularity) is reduced from over 20 minutes to up to three minutes. This removes the problem that the latency is constantly superimposed due to too much processing time.
After solving the problem of data write performance, we encounter the problem of data read performance.
When we use the same data (20 billion +) and use Hive to calculate, the average latency is over 10min. However, the average latency reaches 11min and more after using Delta Lake. After analysis, it is found that Zorder is not used to sort the filtered columns. When Zorder is enabled, the latency is reduced to 24s, which improves the performance by nearly 25 times.
Query optimization for Delta Lake tables based on Zorder involves two improvements:
Delta Lake counts the max and min values of each field based on the file granularity to directly filter data files.
It is a method of data layout, allowing you to rearrange data to ensure the data locality of Zorder fields as much as possible.
For those columns on which Zorder is enabled, the regular construction duration is 30min. In the case of data skew, the construction duration is up to 90 minutes.
Zorder is optimized for these two situations:
After the development and optimization for more than half a year, the offline data warehouse based on Delta Lake was launched recently. Its focus is to improve the query optimization of analysis. At the same time, it supports scenarios with hourly full-scale requirements. The overall effects are listed below:
With the use of Delta Lake in Zuoyebang, there are still some problems to be solved:
When using Hive, we can independently fix a historical partition, but when repairing the data of Delta Lake tables, we need to roll back all versions after the failed version.
Currently, we use Delta Lake, which mainly deals with slow queries by using Hive and limits of complex queries by using Presto in the past. We provide solutions for complex queries and low latency. However, Hive does not support the aforementioned features (such as GSCD and Data Skipping), which makes users unable to use Delta Lake.
Our stream computing ecosystem is mainly built based on Flink. After Delta Lake is introduced, Spark is also used at the same time, which will increase the maintenance cost of our stream computing ecosystem.
We appreciate the efforts of the Alibaba Cloud EMR Data Lake Team. They helped us solve many essential problems during the migration of our data lakes with their efficient support and professional capabilities in Delta Lake.
The Spark and Delta Lake Engine Enterprise Edition of Databricks Helps Efficiently Access Lake Houses
56 posts | 4 followersFollow
Alibaba EMR - October 12, 2021
Alibaba EMR - July 9, 2021
Alibaba EMR - July 19, 2021
Hologres - December 2, 2022
Alibaba EMR - June 22, 2021
Alibaba Cloud MaxCompute - July 15, 2021
56 posts | 4 followersFollow
Alibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.Learn More
Alibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.Learn More
A real-time data warehouse for serving and analytics which is compatible with PostgreSQL.Learn More
This all-in-one omnichannel data solution helps brand merchants formulate brand strategies, monitor brand operation, and increase customer base.Learn More
More Posts by Alibaba EMR