Weimiao's application practice based on the Alibaba Cloud big data ecosystem
This article shares Weimiao's application practice based on Alibaba Cloud's big data ecosystem, and summarizes its experience with Alibaba Cloud's fully managed real-time computing Flink edition.
Guest: Jordan, senior big data development engineer of Weimiao Big Data
Video address: https://yqh.aliyun.com/live/bigdataop
1. Company Profile
Weimiao is a company that specializes in training in financial management and entrepreneurial skills. It does not sell or act as an agent for financial or insurance products. It is committed to helping users develop a sound view of money, financial management, and entrepreneurship, master sound financial management methods and entrepreneurial skills, and raise the public's overall financial literacy and entrepreneurial ability. At present, it has more than 8 million paying users and 15 million followers across its self-media (we-media) accounts.
2. Construction history of the big data platform
Weimiao's initial big data platform architecture
To help the company make better business decisions and serve users better, Weimiao established a big data department in 2020 and began to build clusters in August of the same year.
In the department's early days, there was little demand for real-time data, and offline analysis was the main workload.
Advantages of EMR:
• Clusters are convenient and fast to create
• Integrates a large number of open source components and frameworks
• Low operation and maintenance costs
• Convenient expansion
• High stability
Creating a cluster on EMR is very convenient. After selecting a cluster template, you can create a cluster automatically with one click, which greatly reduces the difficulty of cluster construction and avoids the pitfalls commonly encountered along the way. EMR also integrates a large number of community open source components and adapts them to one another, which avoids version incompatibilities between components. The Alibaba Cloud EMR team has optimized most of these components as well, greatly improving their performance. In addition, EMR integrates the Flink real-time computing engine.
Alibaba Cloud's 24-hour online enterprise-level service helps enterprises solve cluster problems and provides cluster tuning suggestions, which reduces operation and maintenance costs. Expansion is also convenient: it completes automatically within a few minutes of the request, with no manual deployment or service startup. Alibaba Cloud Object Storage Service (OSS) replaces HDFS for file storage, which significantly reduces storage costs while ensuring security and performance. It offers unlimited scalability and requires no maintenance on the user's side.
Given the background and business needs of the big data department in its early days, and the advantages of Alibaba Cloud EMR described above, Weimiao built its big data platform on EMR, which made the early stage of platform construction much easier. The Flink component handled real-time analysis tasks, and the Hive component handled the layered modeling of the offline data warehouse. This matched the original intent of the platform and supported the business.
As Weimiao's business grew rapidly and multiple business lines advanced side by side, the bottlenecks of the big data platform became apparent:
• Rapid business growth leads to explosive growth in data volume and task demand
• Daily data volume increased by more than 10 times year-on-year in the first half of the year
• The number of scheduled tasks increased more than 8 times year-on-year
• T+1 offline data analysis can no longer meet business demands
• Rapid increase in real-time and quasi-real-time analysis tasks
• The existing real-time computing architecture cannot keep up with the fast-changing needs of the business
• Data indicators keep multiplying, and silo-style ("chimney") development leads to serious code coupling
• Requirements keep multiplying: some need detail data and some need OLAP analysis, so a single Flink development model struggles to cover them all
• Every requirement needs its own resource request, so resource costs grow rapidly, EMR cluster resources are tight, and real-time and offline tasks preempt each other's resources
• The core component Flink is difficult to upgrade. A major version upgrade is equivalent to re-creating the cluster, and the labor cost is high
Alibaba Cloud open source big data platform solution architecture
To address the problems brought by this business growth, the Weimiao Big Data R&D team began to build version 2.0 of the real-time computing architecture. On the one hand, a number of new components were introduced to enrich the architecture of the entire platform. On the other hand, the real-time computing architecture was optimized and upgraded, and the concept of a real-time data warehouse was introduced.
For data explosion:
• Widely use object storage OSS to reduce storage pressure and storage cost
• Isolate and optimize cluster resources
Alibaba Cloud OSS is widely used in place of HDFS as the storage service for the offline data warehouse, which greatly reduces storage pressure and storage costs. Separating computing from storage lets the cluster's computing resources be fully used, and with computing and storage resources matched to each other, the resource utilization rate of the whole cluster exceeds 50%. At the same time, resource isolation within the cluster has been optimized, which greatly saves costs.
For the sharp increase in real-time analysis tasks:
• Introduce Hudi and OLAP components
• Incremental updates shorten the response time of data support
• OLAP components take over part of the real-time analysis workload, which greatly reduces development cost
To meet the increasing demand for real-time and quasi-real-time analysis tasks, the Hudi data lake and the OLAP component Doris were introduced. Integrated with the Presto and Spark engines, Hudi supports near-real-time query and analysis and covers most quasi-real-time requirements. Doris is an MPP database for fast analysis of massive data. It is simple to use and performs well, and it supports detail queries, aggregation analysis, multi-dimensional analysis, and so on. Its response times, from seconds down to milliseconds, meet most real-time requirements.
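The incremental-update idea can be illustrated with a minimal sketch of upsert semantics. This is plain Python rather than Hudi itself: the table is a dictionary, and the field names `order_id` and `updated_at` are hypothetical stand-ins for a record key and an ordering field. It only shows why applying changed rows is cheaper than rewriting a full snapshot.

```python
from typing import Dict, Iterable


def upsert(table: Dict[str, dict], batch: Iterable[dict],
           key: str = "order_id", ordering: str = "updated_at") -> int:
    """Apply an incremental batch to `table` in place; return the number of rows written."""
    written = 0
    for row in batch:
        current = table.get(row[key])
        # Keep the incoming row only if it is at least as new as the stored one.
        if current is None or row[ordering] >= current[ordering]:
            table[row[key]] = row
            written += 1
    return written


# Hypothetical order records: the first batch is the initial load,
# the second batch carries only the rows that changed since then.
orders: Dict[str, dict] = {}
upsert(orders, [
    {"order_id": "o1", "status": "created", "updated_at": 1},
    {"order_id": "o2", "status": "created", "updated_at": 1},
])
upsert(orders, [
    {"order_id": "o1", "status": "paid", "updated_at": 2},
    {"order_id": "o2", "status": "stale", "updated_at": 0},  # late, older record is ignored
])
print(orders["o1"]["status"], orders["o2"]["status"])  # prints: paid created
```

Only the changed keys are touched in the second call, so query engines reading the table see fresh data without waiting for a full T+1 rebuild.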
For the real-time computing architecture that cannot keep up with rapid business development:
• Explore the construction of real-time data warehouse
• Real-time data warehouse layering to avoid "chimney" development
• Introduce OLAP analysis engine to flexibly handle various analysis requirements
• Introduce Alibaba Cloud's fully managed real-time computing Flink edition, with more flexible version selection and more thorough isolation of computing resources
To address the Flink version issue, Alibaba Cloud's fully managed real-time computing Flink edition was adopted. It provides elastic scaling and a wider choice of versions, so a Flink version can be selected according to each task's requirements. It closely follows the community and allows flexible switching between versions 1.10 through 1.13, which solves the Flink upgrade problem. In addition, the resources of the fully managed Flink service are completely isolated from EMR resources, which resolves the resource preemption between real-time and offline tasks.
Technological evolution brought by architecture upgrading
The Weimiao Big Data R&D team also upgraded the real-time computing architecture. Following the layered design of the offline data warehouse, the real-time data warehouse is divided into four layers.
Real-time data warehouse four-tier model:
• ODS layer:
• Stays close to the source; stores event-tracking (buried-point) data and various logs
The ODS layer is also known as the real-time data access layer. Real-time data from each business system is collected by data collection tools and given a unified structure. This process does not filter the data and tries to preserve its original form. The data in this layer comes from three main sources: MQ messages received from the business side, binlogs of the business databases, and event-tracking logs together with application logs. All three are ultimately written into Kafka.
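As an illustration of "unified structure without filtering", the following sketch wraps each raw record in a common envelope. The envelope fields (`source`, `ingest_ts`, `payload`) and the source labels are assumptions made for this example, and an in-memory list stands in for the Kafka topic.

```python
import json
import time
from typing import Any, List, Optional

# Hypothetical labels for the three ODS sources described above.
SOURCES = ("mq", "binlog", "tracking_log")


def to_ods_record(source: str, raw: Any, ingest_ts: Optional[int] = None) -> str:
    """Wrap a raw record in a uniform envelope; the payload itself is left untouched."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source}")
    envelope = {
        "source": source,
        "ingest_ts": ingest_ts if ingest_ts is not None else int(time.time() * 1000),
        "payload": raw,
    }
    return json.dumps(envelope, ensure_ascii=False)


# An in-memory list stands in for the Kafka topic that receives ODS records.
ods_topic: List[str] = []
ods_topic.append(to_ods_record("mq", {"order_id": "o1", "event": "paid"}, ingest_ts=1))
ods_topic.append(to_ods_record("binlog", {"table": "orders", "op": "UPDATE"}, ingest_ts=2))
ods_topic.append(to_ods_record("tracking_log", "GET /landing?id=13", ingest_ts=3))
```

Because nothing is dropped or rewritten at this stage, downstream layers can always reprocess from the original records.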
• DW detail layer:
• Stream data is joined with dimension tables for unified ETL, de-duplication, filtering, stream splitting, and similar operations, producing common behavior details and business behavior details
• Business details are joined with their respective business dimension tables to form business subject details
• Some detail tables are written into the OLAP engine for OLAP analysis and quick summaries
The DW detail layer is the detail middle layer. It is modeled around business processes and built on specific business process events. For example, the transaction process includes order events, payment events, and so on, and the detail layer is built from these events. Detail data at this level is divided by the subject areas of the offline data warehouse and organized by dimensional modeling, with appropriate redundancy for some important dimension fields. The data in this layer comes from the ODS layer. Flink cleans the data and uses multi-stream joins to fill in dimensions, and the result is again written into Kafka.
The real-time dimension table layer stores dimension data, mainly to complete records when the DW layer widens them. Its data is mainly stored in HBase. Later, more suitable storage media, such as Redis, will be selected flexibly according to QPS and data volume.
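The cleaning and widening steps can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: in the architecture described above the work is done by Flink with dimension data in HBase, whereas here a dictionary plays the dimension table and the field names (`event_id`, `user_id`, `city`, `level`) are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def build_detail_rows(events: Iterable[dict], user_dim: Dict[str, dict]) -> Iterator[dict]:
    """Deduplicate, filter, and widen raw events into detail rows."""
    seen_ids = set()
    for event in events:
        event_id = event.get("event_id")
        # Filtering: drop records that lack the fields the detail layer relies on.
        if event_id is None or event.get("user_id") is None:
            continue
        # De-duplication: keep only the first occurrence of each event id.
        if event_id in seen_ids:
            continue
        seen_ids.add(event_id)
        # Dimension completion: copy a few important dimension fields onto the row.
        dim = user_dim.get(event["user_id"], {})
        yield {**event, "user_city": dim.get("city"), "user_level": dim.get("level")}


# Hypothetical dimension table and event stream.
user_dim = {"u1": {"city": "Beijing", "level": "paid"}}
events = [
    {"event_id": "e1", "user_id": "u1", "type": "order"},
    {"event_id": "e1", "user_id": "u1", "type": "order"},    # duplicate delivery
    {"event_id": None, "user_id": "u1", "type": "order"},    # malformed record
    {"event_id": "e2", "user_id": "u9", "type": "payment"},  # user missing from the dimension table
]
detail_rows = list(build_detail_rows(events, user_dim))
```

Two rows survive: `e1` widened with the user's city and level, and `e2` with empty dimension fields because its user is not in the dimension table. A production job would also need bounded state for the de-duplication set, which this sketch leaves out.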
• DWS summary layer:
• Read the business subject detail tables, calculate the common dimensions and indicators that each business subject cares about, and store them in the OLAP engine
The DWS layer is the real-time summary layer. It provides multi-dimensional summaries of DW-layer data for downstream business parties. In practice, different businesses may summarize along different dimensions, and different technical solutions can serve different needs. The first approach is to summarize in real time with Flink and then write the resulting indicators into a database such as HBase or MySQL. Its advantage is that the implementation logic is relatively flexible; its disadvantage is that the aggregation granularity is relatively fixed and hard to extend. The second approach is to summarize with a real-time OLAP tool. Its advantage is that it is easy to extend; its disadvantage is that business logic must be preprocessed in the middle layer.
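The trade-off between the two approaches can be seen in a small sketch. It is plain Python rather than Flink or an OLAP engine, and the rows and field names (`course`, `channel`, `city`, `amount`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def summarize(rows: Iterable[dict], dims: Sequence[str],
              metric: str = "amount") -> Dict[Tuple, int]:
    """Sum `metric` grouped by the given dimension fields."""
    totals: Dict[Tuple, int] = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[metric]
    return dict(totals)


# Hypothetical detail rows coming from the DW layer.
detail_rows = [
    {"course": "finance", "channel": "app", "city": "Beijing", "amount": 100},
    {"course": "finance", "channel": "web", "city": "Shanghai", "amount": 50},
    {"course": "startup", "channel": "app", "city": "Beijing", "amount": 70},
]

# Approach 1: aggregate once at write time on a fixed set of dimensions
# and store only the result. A later question by city cannot be answered from it.
stored_summary = summarize(detail_rows, dims=("course",))

# Approach 2: keep the preprocessed detail rows in the OLAP engine
# and aggregate at query time on whichever dimensions are asked for.
by_city = summarize(detail_rows, dims=("city",))
by_course_channel = summarize(detail_rows, dims=("course", "channel"))
```

With approach 1, adding a new dimension means changing the job and recomputing; with approach 2, it is only a new query, provided the detail rows were already cleaned and widened upstream.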
• ADS layer:
• Provide ad-hoc query and real-time dashboard services
The ADS layer is the real-time application layer. Data in this layer has already been written into the storage of the application systems: for example, into Doris as real-time datasets for BI dashboards or for real-time OLAP services, or into HBase and MySQL to provide a unified data service interface.
Flink+OLAP real-time data warehouse promotes business value
With the new platform architecture and real-time data warehouse architecture, Weimiao has supported business needs quickly and stably. In the past two months, the platform has delivered the following for the business:
• Developed and deployed 5 large and medium-sized projects
• Developed and scheduled 20 tasks
• Supported 5 business systems
• Supported 7 real-time visual dashboards
• The "class reminder" feature improved the class attendance rate by 10.5%
• Real-time monitoring of live broadcast indicators improved the renewal rate by 1.5%
• Real-time monitoring of landing page visits drove product optimization of 13 landing pages
The accurate data produced in real time has bought valuable decision-making time for the operation and delivery teams. It has also given teachers strong real-time teaching data support and won the approval of all stakeholders.
3. Future cooperation plan with Alibaba Cloud
From real-time computing to real-time data warehouse, Weimiao has accumulated more depth and breadth in terms of data architecture and technical solutions.
With the rapid development of the company's business and the continuous introduction of new technologies, the real-time data warehouse will continue to be optimized iteratively. For example, the OLAP engine currently in use is Apache Doris, and Weimiao will have more in-depth exchanges with Alibaba Cloud in this field. In addition, the service capability of the real-time data warehouse will be further improved in the following areas, which are also the directions that Weimiao Big Data and Alibaba Cloud will discuss in depth in the future:
• Continuously follow up on the experience of using real-time computing Flink
• Improve the data lineage of the real-time data warehouse and the quality monitoring of tasks and tables
• Improve the metadata management system
• Improve the monitoring of Flink jobs, establish a value evaluation system for the real-time data warehouse, and quantify inputs and outputs
• Further strengthen the robustness of real-time tasks
The above summarizes Weimiao's application practice based on Alibaba Cloud's big data ecosystem and its experience with the fully managed real-time computing Flink edition.
Knowledge Base Team