By Qiao Dan, Senior Big Data Development Engineer of Weimiao
The deep integration of open-source technologies with cloud-native infrastructure has allowed the open-source big data platform of Alibaba Cloud to accumulate rich practical experience in functionality, usability, and security. It serves thousands of enterprises, helping them focus on their core business, shorten development cycles, reduce the difficulty of operations and maintenance, and pursue more business innovation.
This article shares how Weimiao applies the big data ecosystem of Alibaba Cloud and summarizes its experience with fully managed Alibaba Cloud Realtime Compute for Apache Flink.
Weimiao is an enterprise specializing in financial management and entrepreneurial skills training. Instead of selling or acting as an agent for financial management and insurance products, Weimiao is committed to helping users establish sound views of money, financial management, and entrepreneurship and master sound financial management methods and entrepreneurial skills, with the goal of improving financial literacy and entrepreneurial ability nationwide. Weimiao has more than 8 million paying users and 15 million fans across its self-media matrix.
Weimiao established a big data department in 2020 and began to build a cluster in August 2020 to help enterprises make better business decisions and provide users with better services.
Background
In the early days of the establishment of the big data department, the demand for real-time data was relatively low, and most requirements were offline analysis.
Advantages of Alibaba Cloud E-MapReduce (EMR)
First of all, EMR makes it very convenient to create a cluster. After a cluster template is selected, the cluster is created automatically with one click, which lowers the difficulty of building a cluster and avoids common pitfalls. Second, EMR integrates and adapts a large number of open-source components, which avoids version incompatibility among them. The Alibaba Cloud EMR Team also optimizes most of the open-source components to improve their performance. In addition, EMR integrates the Flink real-time computing engine. Alibaba Cloud provides 24-hour online enterprise-level services to help enterprises solve cluster-related problems and offers suggestions on cluster optimization to reduce the costs of operations and maintenance. EMR is also easy to scale out. Expansion completes automatically within a few minutes of the request, without manually deploying and starting services. Alibaba Cloud Object Storage Service (OSS) provides file storage in place of the Hadoop Distributed File System (HDFS). It ensures security and performance, reduces storage costs, and scales without needing constant maintenance.
Based on the background and business requirements of the newly established big data department and the advantages of Alibaba Cloud EMR, Weimiao built an EMR-based big data platform. This architecture made the platform convenient to build in the early stage. The Flink components handle real-time analysis tasks, and the Hive components handle the layered modeling of the offline data warehouse, which fits the original intention of building a big data platform and supports the business of Weimiao.
As the business of Weimiao grew rapidly, multiple business lines advanced in parallel, and bottlenecks in the big data platform were exposed:
Based on the problems caused by the business growth mentioned above, the Weimiao Big Data Research and Development Team began to build the 2.0 version of the real-time computing architecture. On the one hand, a batch of new components has been introduced to enrich the architecture of the entire platform. On the other hand, the real-time computing architecture has been optimized and upgraded, and the concept model of real-time data warehouses has been introduced.
For the surge of data volume:
Alibaba Cloud OSS is widely used to replace HDFS as the storage service for the offline data warehouse, which reduces storage pressure and storage costs. Separating compute from storage lets the computing resources of the cluster be used more fully. Computing resources are matched to storage resources, which raises the resource utilization of the entire cluster above 50%. At the same time, the cluster is optimized for resource isolation, which saves costs to a large extent.
For the surge of demands for real-time analysis tasks:
The data lake Hudi and the OLAP component Doris are introduced to meet the increasing requirements for real-time and quasi-real-time analysis tasks. With Hudi integrated with the Presto and Spark engines, near real-time queries and analysis become possible, which covers most quasi-real-time requirements. Doris is an MPP database that analyzes large amounts of data quickly. It is easy to use and performs well in data analysis. It supports detail queries, aggregate analysis, and multi-dimensional analysis. Its response times, ranging from seconds down to milliseconds, meet most real-time requirements.
For the inability of the existing real-time computing architecture to meet the rapid development of the business:
Fully managed Alibaba Cloud Realtime Compute for Apache Flink was introduced to solve problems related to Flink versions, providing flexible scaling and a wider range of versions. You can select different versions of Flink based on task requirements. Its version support closely follows the community, and it can flexibly switch between versions from 1.10 to 1.13, which solves the problem of upgrading Flink. In addition, the resources of fully managed Alibaba Cloud Realtime Compute for Apache Flink and those of EMR are completely isolated, which solves the problem of resource preemption between real-time and offline tasks.
The Weimiao Big Data Research and Development Team also upgraded the real-time computing architecture. Referring to the offline data warehouse, the team divided the real-time data warehouse into four layers according to the idea of layering design.
The four-layer model of real-time data warehouses:
1. ODS Layer:
The ODS Layer is often called the real-time data access layer. Data collection tools collect real-time data from various business systems for unified, structured processing. During this process, the data is not filtered, so its original form is preserved. The data in this layer comes from three main sources. The first is MQ messages from business teams, the second is the binlogs of the business databases, and the third is event tracking logs and application logs. All three are finally written into Kafka.
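The access pattern described for the ODS Layer can be sketched in a few lines. This is only an illustration, not Weimiao's implementation: the topic names, the envelope fields, and the `to_ods_record` helper are all hypothetical, and the actual write to Kafka is left out so the sketch stays self-contained. It shows the key property of the layer, namely that each record is wrapped in a uniform structure while the payload itself is kept verbatim.

```python
import json
import time

# Hypothetical topic names. The article only states that all three
# sources are finally written into Kafka.
ODS_TOPICS = {
    "mq": "ods_mq_messages",       # MQ messages from business teams
    "binlog": "ods_db_binlog",     # binlogs of the business databases
    "log": "ods_tracking_logs",    # event tracking and application logs
}


def to_ods_record(source, payload, ingest_ts=None):
    """Wrap a raw payload in a uniform envelope without filtering or altering it.

    Returns the (topic, serialized message) pair a producer would send.
    """
    if source not in ODS_TOPICS:
        raise ValueError(f"unknown source: {source}")
    envelope = {
        "source": source,
        # Ingestion time in milliseconds; assumed field, useful for debugging lag.
        "ingest_ts": ingest_ts if ingest_ts is not None else int(time.time() * 1000),
        "raw": payload,  # kept verbatim: the ODS Layer preserves the original data
    }
    return ODS_TOPICS[source], json.dumps(envelope, ensure_ascii=False)
```

A producer would then send the returned message to the returned topic. Keeping the payload untouched means that cleaning rules can change later without re-collecting data.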
2. DW Detail Layer:
The DW Detail Layer is the detail middle layer. This layer uses business processes as modeling drivers and is built based on specific business process events. For example, the transaction process includes order events and payment events, and the DW Detail Layer is constructed based on these events. At this layer, the detailed data is divided by referring to the subject domains of the offline data warehouse. The data is also organized by dimensional modeling, with some important dimension fields made appropriately redundant. The data in this layer comes from the ODS Layer. Flink is used to clean the data, the dimensions are filled in through multi-stream joins, and the result is written into Kafka.
The layer of real-time dimension tables stores dimension data. It is mainly used to fill in dimensions when the DW Detail Layer is widened. The data in this layer is mainly stored in HBase. In the future, a more suitable storage medium, such as Redis, may be selected based on the QPS and data volume.
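The clean-then-widen step can be sketched as follows. This is a simplified illustration in plain Python rather than a Flink job: the in-memory `USER_DIM` dictionary stands in for the HBase dimension table, and the event fields (`order_id`, `user_id`, `amount`, `event_ts`) and dimension fields (`channel`, `city`) are assumptions made for the example.

```python
# Stand-in for the real-time dimension table (HBase in the article).
# Keys and fields are hypothetical.
USER_DIM = {
    "u1": {"channel": "app", "city": "Hangzhou"},
    "u2": {"channel": "web", "city": "Beijing"},
}


def clean(event):
    """Normalize types and drop malformed events; return None to discard."""
    if not event.get("order_id") or not event.get("user_id"):
        return None
    try:
        return {
            "order_id": str(event["order_id"]),
            "user_id": str(event["user_id"]),
            "amount": float(event.get("amount", 0)),
            "event_ts": int(event["event_ts"]),
        }
    except (KeyError, TypeError, ValueError):
        return None


def widen(event, dim_lookup):
    """Copy selected dimension fields onto the event (intentional redundancy)."""
    dim = dim_lookup(event["user_id"]) or {}
    widened = dict(event)
    widened["channel"] = dim.get("channel", "unknown")
    widened["city"] = dim.get("city", "unknown")
    return widened


def build_detail(events, dim_lookup=USER_DIM.get):
    """Yield cleaned, widened detail rows for the events that pass cleaning."""
    for raw in events:
        cleaned = clean(raw)
        if cleaned is not None:
            yield widen(cleaned, dim_lookup)
```

Passing the lookup as a function keeps the widening logic independent of the storage medium, which matches the article's note that the dimension store may later move from HBase to something like Redis.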
3. DWS Layer:
The DWS Layer is the real-time summary layer. This layer summarizes data from the DW Detail Layer along multiple dimensions for downstream business teams to use. Different businesses may summarize along different dimensions, and different requirements can be satisfied by different technical solutions in practice. The first solution is to use Flink for real-time summary and write the result metrics to databases, such as HBase or MySQL. The advantage of this method is the flexible implementation logic, but the disadvantage is that the aggregation granularity is fixed and hard to extend. The second solution is to use real-time OLAP tools for summary. This method is easy to extend, but its business logic needs to be preprocessed in the middle layer.
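The trade-off of the first solution can be made concrete with a small sketch. It is a batch simulation in plain Python, not Flink API code, and the row fields (`event_ts`, `channel`, `amount`), the one-minute window, and the `summarize` helper are assumptions for illustration. The summary is keyed by a window start time plus a fixed tuple of dimensions, so the dimensions are decided when the job is written.

```python
from collections import defaultdict


def summarize(detail_rows, window_ms=60_000, dims=("channel",)):
    """Aggregate detail rows into tumbling windows keyed by fixed dimensions.

    Returns {(window_start, *dim_values): {"orders": int, "amount": float}}.
    """
    totals = defaultdict(lambda: {"orders": 0, "amount": 0.0})
    for row in detail_rows:
        # Align the event time to the start of its tumbling window.
        window_start = row["event_ts"] - row["event_ts"] % window_ms
        key = (window_start,) + tuple(row[d] for d in dims)
        totals[key]["orders"] += 1
        totals[key]["amount"] += row["amount"]
    return dict(totals)
```

Because only the pre-aggregated totals are written to the result store, answering a question along a new dimension (for example, by city) means changing `dims` and recomputing from the detail data. An OLAP engine avoids this by aggregating at query time over the detail rows.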
4. ADS Layer:
The ADS Layer is the real-time application layer. The data in this layer has already been written into the storage of the application systems. For example, data written into Doris serves as a real-time dataset for BI dashboards or for real-time OLAP services, and data written into HBase or MySQL backs a unified data service interface.
Based on the new platform architecture and real-time warehouse architecture, Weimiao quickly and stably supports the needs of the business. Over the past two months, the company business has been supported in the following parts:
Improvements in operations:
Accurate data produced in real time has bought the Operation and Delivery Team valuable decision-making time. It also gives teachers strong support in the form of real-time teaching data, and it has been well received by all the teams that requested it.
Weimiao has accumulated more depth and breadth in data architectures and technical solutions from real-time computing to real-time data warehouses.
The rapid development of the company business and the continuous introduction of new technologies mean that the real-time data warehouse is continually upgraded and optimized. For example, the OLAP engine currently uses Apache Doris. Later, Weimiao will have more in-depth communication with Alibaba Cloud in this field. Weimiao will also further improve the service capabilities of real-time data warehouses from the following aspects:
This concludes the overview of how Weimiao applies the big data ecosystem of Alibaba Cloud and how it uses fully managed Realtime Compute for Apache Flink.