How to Build a Cloud-Native Open-Source Big Data Platform | The Application Practice of Weimiao

By Qiao Dan, Senior Big Data Development Engineer of Weimiao

The tight integration of open-source technologies with cloud-native infrastructure has allowed the open-source big data platform of Alibaba Cloud to accumulate rich practical experience in functionality, usability, and security. It serves thousands of enterprises, helping them focus on their core business, shorten development cycles, reduce the difficulty of operations and maintenance, and pursue more business innovation.

This article shares the application practice of Weimiao based on the big data ecosystem of Alibaba Cloud. It also summarizes Weimiao's experience with fully managed Alibaba Cloud Realtime Compute for Apache Flink.

1. Enterprise Introduction

Weimiao is an enterprise specializing in financial management and entrepreneurial skills training. Instead of selling or acting as an agent for financial management and insurance products, Weimiao is committed to helping users establish sound views of money, financial management, and entrepreneurship and master sound financial management methods and entrepreneurial skills, with the aim of improving financial literacy and entrepreneurial skills nationwide. Weimiao has more than 8 million paying users and 15 million followers across its self-media channels.

2. Construction of Big Data Platform

Early Big Data Platform Architecture of Weimiao

Weimiao established a big data department in 2020 and began to build a cluster in August 2020 to help the company make better business decisions and provide users with better services.

Background

In the early days of the big data department, demand for real-time data was relatively low, and most requirements were for offline analysis.

Advantages of Alibaba Cloud E-MapReduce (EMR)

Convenient and fast creation of a cluster
Integration of a large number of open-source components and frameworks
Low operations and maintenance costs
Easy expansion
High stability

First, EMR makes it easy to create a cluster: after you select a cluster template, the cluster is created automatically with one click, which lowers the difficulty of building a cluster and avoids common pitfalls. Second, EMR integrates and adapts a large number of open-source components, avoiding version incompatibilities among them. The Alibaba Cloud EMR Team also optimizes most of these components to improve their performance, and EMR integrates the Flink real-time computing engine. Third, Alibaba Cloud provides 24-hour online enterprise-level support to help solve cluster-related problems and offers suggestions on cluster optimization, which reduces operations and maintenance costs. Fourth, EMR is easy to expand: expansion completes automatically within a few minutes of the request, with no need to deploy and start services manually. Finally, Alibaba Cloud Object Storage Service (OSS) provides file storage in place of the Hadoop Distributed File System (HDFS). It ensures security and performance, reduces storage costs, and offers unlimited scalability without constant maintenance.

Given the background and business requirements of the newly established big data department and the advantages of Alibaba Cloud EMR, Weimiao built an EMR-based big data platform. This architecture made the platform convenient to build in the early stage. The Flink components handle real-time analysis tasks, and the Hive components handle the layered modeling of the offline data warehouse, which fits the original intention of building a big data platform and supports the business of Weimiao.

Challenges Faced by Weimiao

As Weimiao's business grew rapidly across multiple business lines, bottlenecks in the big data platform were exposed:

The rapid growth of business has led to explosive growth in data volume and task requirements.
Daily data growth in the second half of 2021 was more than ten times that of the first half of the year.
The number of scheduled tasks increased more than eight times year on year.
T+1 offline data analysis can no longer meet business demands.
Real-time and quasi-real-time analysis tasks have increased dramatically.
The existing real-time computing architecture cannot keep up with rapidly growing business requirements.
Siloed (chimney-style) development leads to serious code coupling as the number of data metrics grows.
Requirements are increasingly varied: some need detailed data, and others need OLAP analysis. A single Flink development model struggles to meet them all.
Each requirement has to apply for its own resources, so resource costs expand rapidly. EMR cluster resources are tight, and real-time and offline tasks compete for them.
It is difficult to upgrade the core component, Flink. A major-version upgrade is equivalent to recreating a cluster, so the labor cost is high.

Solution Architecture of the Open-Source Big Data Platform of Alibaba Cloud

To address the problems caused by the business growth described above, the Weimiao Big Data Research and Development Team began to build version 2.0 of the real-time computing architecture. On the one hand, a batch of new components was introduced to enrich the architecture of the entire platform. On the other hand, the real-time computing architecture was optimized and upgraded, and the conceptual model of a real-time data warehouse was introduced.

For the surge of data volume:

Use OSS to reduce storage pressure and storage costs
Isolate and optimize cluster resources

Alibaba Cloud OSS is widely used in place of HDFS as the storage service for the offline data warehouse, which reduces storage pressure and storage costs. Separating compute from storage lets the computing resources of the cluster be used more fully: with computing resources matched to storage resources, the resource utilization of the entire cluster exceeds 50%. The cluster is also optimized for resource isolation, which saves considerable cost.

For the surge of demands for real-time analysis tasks:

Introduce Hudi and OLAP components
Use incremental updates to deliver data faster
Introduce OLAP components to take over part of the real-time OLAP workload, reducing development costs

The data lake framework Hudi and the OLAP component Doris were introduced to meet the increasing requirements for real-time and quasi-real-time analysis tasks. Integrated with the Presto and Spark engines, Hudi supports near-real-time queries and analysis and covers most quasi-real-time requirements. Doris is a database with an MPP architecture that analyzes large amounts of data quickly. It is easy to use and performs well in data analysis. It supports detail queries, aggregate analysis, and multi-dimensional analysis. Its response times, which range from seconds down to milliseconds, meet most real-time requirements.
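The idea behind incremental updates can be illustrated with a small, library-free sketch. It is not Hudi's API; it only mimics upsert semantics in which each record has a key and an ordering field, and a newer version of a record replaces the stored one. The field names `order_id` and `updated_at` are assumptions made for the example.

```python
from typing import Dict, Iterable

Record = Dict[str, object]


def upsert(table: Dict[object, Record], batch: Iterable[Record],
           key: str = "order_id", ordering: str = "updated_at") -> int:
    """Merge a batch of changed records into `table`, keyed by `key`.

    A record replaces the stored one only if its `ordering` value is not older.
    Returns the number of rows inserted or updated.
    """
    changed = 0
    for rec in batch:
        current = table.get(rec[key])
        if current is None or rec[ordering] >= current[ordering]:
            table[rec[key]] = rec
            changed += 1
    return changed


# Initial load, then a small incremental batch instead of a full reload.
table: Dict[object, Record] = {}
upsert(table, [
    {"order_id": "o1", "status": "created", "updated_at": 1},
    {"order_id": "o2", "status": "created", "updated_at": 1},
])
touched = upsert(table, [
    {"order_id": "o1", "status": "paid", "updated_at": 2},   # newer: applied
    {"order_id": "o2", "status": "stale", "updated_at": 0},  # older: ignored
])
```

Only the changed keys are touched on each batch, which is why quasi-real-time queries can see fresh data without waiting for a T+1 rebuild.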

For the existing real-time computing architecture's inability to keep up with rapid business development:

Explore building a real-time data warehouse
Layer the real-time data warehouse to avoid siloed (chimney-style) development
Introduce the OLAP analysis engine to flexibly handle analysis requirements
Introduce Alibaba Cloud Realtime Compute for Apache Flink to allow more flexible version selection and more thorough isolation of compute resources

Fully managed Alibaba Cloud Realtime Compute for Apache Flink was introduced to solve the problems related to Flink versions. It provides flexible scaling and a wider range of versions, so a Flink version can be selected for each task based on its requirements. Its version support closely follows the community, and tasks can switch flexibly between versions from 1.10 to 1.13, which solves the Flink upgrade problem. In addition, the resources of the fully managed service are completely isolated from EMR resources, which solves the problem of resource preemption between real-time and offline tasks.

Technology Evolution Brought by Architecture Upgrading

The Weimiao Big Data Research and Development Team also upgraded the real-time computing architecture. Taking the offline data warehouse as a reference, the team divided the real-time data warehouse into four layers following a layered design.

The four-layer model of real-time data warehouses:

1. ODS Layer:

Store the data of event tracking and various logs in the ODS Layer

The ODS Layer is often called the real-time data access layer. Data collection tools collect real-time data from various business systems and process it into a unified, structured form. The data is not filtered during this process, so it keeps its original form. The data in this layer comes from three main sources: MQ messages received from business teams, binlogs of the business databases, and logs from event tracking and applications. All three are finally written into Kafka.
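The access step described above can be sketched as follows. This is a minimal illustration, not Weimiao's collector: the topic names and the envelope fields (`source`, `ingest_ts`, `raw`) are hypothetical, and the function only prepares the topic and message value, leaving the actual Kafka producer call out.

```python
import json
import time

# Hypothetical topic names; the article does not list the real ones.
ODS_TOPICS = {
    "mq": "ods_business_mq",
    "binlog": "ods_db_binlog",
    "tracking": "ods_event_tracking_log",
}


def to_ods_message(source, raw, ingest_ts=None):
    """Return (topic, value) for one raw record, leaving the payload untouched."""
    if source not in ODS_TOPICS:
        raise ValueError(f"unknown ODS source: {source!r}")
    envelope = {
        "source": source,
        "ingest_ts": ingest_ts if ingest_ts is not None else int(time.time() * 1000),
        "raw": raw,  # stored as-is: the ODS layer does no filtering
    }
    return ODS_TOPICS[source], json.dumps(envelope, ensure_ascii=False)


topic, value = to_ods_message(
    "binlog", '{"table":"orders","op":"INSERT"}', ingest_ts=1
)
```

Wrapping each record in a uniform envelope gives downstream jobs one structure to parse while the original payload stays byte-for-byte intact.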

2. DW Detail Layer:

Stream data is joined with dimension tables and uniformly processed through ETL, deduplication, filtering, splitting, and other operations to generate common behavior detail tables and business behavior detail tables.
Business detail tables are joined with their respective business dimension tables to form business-theme detail tables.
Detail tables are written to the OLAP engine for OLAP analysis and quick summary.

The DW Detail Layer is the detail middle layer. Its modeling is driven by business processes, and it is built from specific business process events. For example, the transaction process includes order events and payment events, and the DW Detail Layer is constructed from these events. At this layer, the detailed data is divided by subject domain, following the offline data warehouse. The data is also organized by dimensional modeling, with some important dimension fields stored redundantly where appropriate. The data in this layer comes from the ODS Layer. Flink cleans the data, fills in dimensions through multi-stream joins, and finally writes the result into Kafka. A layer of real-time dimension tables stores the dimension data and is mainly used to complete records when the DW Detail Layer is widened. The data in this layer is mainly stored in HBase. In the future, a more suitable storage medium, such as Redis, may be chosen based on QPS and data volume.
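The cleaning, deduplication, and dimension-completion steps can be sketched in a few lines. This is a simplified, in-memory illustration rather than a Flink job: the dictionary `dim_users` stands in for a dimension table lookup (HBase in the article), and all field names are assumptions made for the example.

```python
def build_detail_rows(events, dim_users):
    """Clean, deduplicate, and widen ODS events into DW detail rows."""
    seen = set()
    rows = []
    for ev in events:
        # Cleaning: drop records missing the fields needed downstream.
        if not ev.get("event_id") or not ev.get("user_id"):
            continue
        # Deduplication by event id (first occurrence wins).
        if ev["event_id"] in seen:
            continue
        seen.add(ev["event_id"])
        # Dimension completion: look up the user and copy selected fields in.
        dim = dim_users.get(ev["user_id"], {})
        rows.append({**ev,
                     "user_city": dim.get("city"),
                     "user_level": dim.get("level")})
    return rows


events = [
    {"event_id": "e1", "user_id": "u1", "type": "order"},
    {"event_id": "e1", "user_id": "u1", "type": "order"},    # duplicate
    {"event_id": "e2", "user_id": None, "type": "payment"},  # dirty record
    {"event_id": "e3", "user_id": "u2", "type": "payment"},
]
dim_users = {"u1": {"city": "Hangzhou", "level": "vip"}}
rows = build_detail_rows(events, dim_users)
```

A missing dimension entry leaves the widened fields empty instead of dropping the event, which keeps the detail layer complete even when the dimension table lags behind.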

3. DWS Layer:

Read the business-theme detail tables to calculate the common dimensions and metrics that each business theme cares about and store them in the OLAP engine.

The DWS Layer is the real-time summary layer. This layer summarizes data from the DW Detail Layer across multiple dimensions for downstream business teams to use. Different businesses may summarize by different dimensions, and different requirements can be met by different technical solutions in practice. The first solution is to use Flink for real-time summary and write the result metrics to a database, such as HBase or MySQL. Its advantage is flexible implementation logic, but its disadvantage is that the aggregation granularity is fixed and hard to extend. The second solution is to use real-time OLAP tools for summary. This solution is easy to extend, but its business logic needs to be preprocessed in the middle layer.
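The trade-off in the first solution can be seen in a small sketch of a pre-aggregated summary. It is a plain in-memory illustration of a tumbling-window sum, not Flink code, and the field names (`ts`, `course`, `channel`, `amount`) and the one-minute window are assumptions made for the example.

```python
from collections import defaultdict


def summarize(rows, dims, window_ms=60_000):
    """Sum `amount` per tumbling window and per combination of `dims`."""
    totals = defaultdict(float)
    for row in rows:
        # Align the event timestamp to the start of its window.
        window_start = row["ts"] - row["ts"] % window_ms
        key = (window_start,) + tuple(row[d] for d in dims)
        totals[key] += row["amount"]
    return dict(totals)


rows = [
    {"ts": 1_000, "course": "A", "channel": "app", "amount": 10.0},
    {"ts": 59_000, "course": "A", "channel": "web", "amount": 5.0},
    {"ts": 61_000, "course": "B", "channel": "app", "amount": 7.0},
]
by_course = summarize(rows, dims=("course",))
by_course_channel = summarize(rows, dims=("course", "channel"))
```

Once only `by_course` has been written to the result database, a question about channels cannot be answered from it; the job has to be changed and rerun over the detail rows. Keeping detail data in an OLAP engine and aggregating at query time avoids that, at the cost of preparing the business logic in the middle layer.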

4. ADS Layer:

Provide ad-hoc queries and real-time dashboard services

The ADS Layer is the real-time application layer. The data in this layer has already been written into the storage of the application systems. For example, data written into Doris serves as a real-time dataset for BI dashboards or provides real-time OLAP services, and data written into HBase or MySQL backs a unified data service interface.

Business Value Promoted by Flink and OLAP Real-Time Data Warehouses

With the new platform architecture and real-time data warehouse architecture, Weimiao supports business needs quickly and stably. Over the past two months, the company's business has been supported in the following ways:

Five large and medium-sized projects have been developed and deployed.
Twenty tasks have been developed and scheduled.
Five business systems have been supported.
Seven real-time visualized dashboards have been supported.

Improvements in operations:

The Urging to Class feature has increased the attendance rate by 10.5%.
Real-time monitoring of live metrics has improved the renewal rate by 1.5%.
Real-time monitoring of landing page access has driven product optimizations on 13 landing pages.

Accurate data produced in real time has bought valuable decision-making time for the Operation and Delivery Team. It gives teachers strong support in the form of real-time teaching data and has been unanimously praised by all stakeholders.

3. Future Cooperation Planning with Alibaba Cloud

In moving from real-time computing to real-time data warehouses, Weimiao has built up greater depth and breadth in data architectures and technical solutions.

The rapid development of the company's business and the continuous introduction of new technologies mean that the real-time data warehouse is upgraded and optimized continually. For example, the OLAP engine currently uses Apache Doris. Later, Weimiao will have more in-depth communication with Alibaba Cloud in this field. Weimiao will also further improve the service capabilities of the real-time data warehouse in the following ways:

Continue to follow up on the user experience of Alibaba Cloud Realtime Compute for Apache Flink
Improve the lineage of real-time data warehouses and the quality monitoring of tasks and tables
Improve the metadata management system
Improve the monitoring of Flink operations, establish a value evaluation system for real-time data warehouses, and quantify inputs and outputs
Strengthen the robustness of real-time tasks

This concludes the application practice of Weimiao based on the big data ecosystem of Alibaba Cloud and its experience with fully managed Realtime Compute for Apache Flink.
