Construction and Application of a Flink+Hologres Real-Time Data Warehouse at Lazada
Big data technology has been developed and refined in China for nearly 20 years and is now very mature. Lazada's data system closely resembles Taobao's and is divided into six main parts.
In the architecture, the data sources are on the left, the data applications are on the right, and in between are data integration, data modeling and computation, data aggregation and data marts, and data services. Data integration relies mainly on the CDP and DataHub systems. Data modeling has an offline part and a real-time part: the offline part is built on MaxCompute, and the real-time part uses the Flink platform to implement the real-time data warehouse. Since 2020, the data mart layer has gradually adopted Hologres to replace the previously complicated middleware in the data result layer and data computing layer.
Promotions are a common and very important scenario in e-commerce. We expect real-time data and real-time technology to advance, or even transform, the entire business model.
A marketing promotion can be divided into three stages: before, during and after the promotion. The pre-promotion stage covers preparations such as merchant recruitment, product selection and building the venue pages, which are offline or long-running actions. During the promotion, the work includes warm-up and traffic build-up, add-to-cart campaigns, traffic distribution and dynamic regulation; this stage is dominated by real-time decision-making and real-time regulation, which need real-time data as support. After the promotion, business review, data analysis and data PR are required.
To support real-time decision-making and regulation on the business side, the original offline data architecture had to be reworked. We therefore used Flink+Hologres to upgrade all data links systematically to real time, and we use Hologres' OLAP query capabilities to support real-time business analysis on the day of the promotion, for example running fast queries and analyses within a very short time to select the target population for coupon issuance.
Coupon issuance in a promotion depends on strategy and rhythm. Questions such as which target users to select, whether users have claimed the coupon after it was issued, and whether they have used it after claiming it all affect how the promotion rhythm is regulated. Based on real-time data, business developers need to combine further data assets, such as users' purchasing power and consumption habits, to select target users and give real-time feedback during the promotion.
The implementation is as follows. Users' real-time coupon claim and usage data is consumed from the DataHub message middleware. Because the promotion cycle is long, the user's historical status data must be included in the calculation as well, so the offline ODPS table is also used as an initialization table, and the Flink job consumes both the real-time and the offline source. After processing, the data goes two ways: it is written back to the DataHub message middleware and pushed to downstream marketing systems for direct consumption, and it is stored in Hologres to give business personnel data indicators and data labels for real-time OLAP analysis. The data in Hologres supports analysis of basic data assets such as users' purchasing power, consumption habits and preferred categories, and it can also be joined in queries with the transaction data and benefit (equity) data that change in real time during the promotion, so that the target population for different business needs can be selected quickly.
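The associated query described above, joining slow-changing asset data with real-time coupon state to pick a target population, can be sketched in plain Python. This is only an in-memory illustration, not Hologres SQL or the authors' implementation; the record types, field names and the "claimed but not yet used" rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAsset:
    """Slow-changing asset data (hypothetical fields)."""
    user_id: str
    purchasing_power: str      # e.g. "high", "medium", "low"
    preferred_category: str


@dataclass(frozen=True)
class CouponState:
    """Real-time coupon state during the promotion (hypothetical fields)."""
    user_id: str
    claimed: bool
    used: bool


def select_target_users(assets, coupon_states, category, power):
    """Return users in a category/power segment who claimed but have not used a coupon."""
    state_by_user = {s.user_id: s for s in coupon_states}
    targets = []
    for asset in assets:
        state = state_by_user.get(asset.user_id)
        if state is None:
            continue  # no real-time record for this user, so nothing to follow up on
        if (asset.preferred_category == category
                and asset.purchasing_power == power
                and state.claimed and not state.used):
            targets.append(asset.user_id)
    return sorted(targets)


assets = [
    UserAsset("u1", "high", "electronics"),
    UserAsset("u2", "high", "electronics"),
    UserAsset("u3", "low", "fashion"),
]
coupon_states = [
    CouponState("u1", claimed=True, used=False),
    CouponState("u2", claimed=True, used=True),
    CouponState("u3", claimed=True, used=False),
]
print(select_target_users(assets, coupon_states, "electronics", "high"))  # ['u1']
```

In a real deployment this filter would more likely be a join between an asset table and a real-time state table inside the OLAP engine; the sketch only shows the shape of the question being asked.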
With these technical links and solutions, the operation strategy and operation actions in a promotion can be adjusted dynamically.
In this scheme, the Flink computing engine consumes real-time and offline data sources at the same time to achieve unified stream and batch computing. The user's historical cumulative benefit (equity) data is combined with the benefit data that changes in real time, which yields the user's complete, up-to-date benefit claim status.
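The idea of seeding state from a historical snapshot and then folding real-time changes into it can be sketched as follows. This is plain Python rather than Flink code, and the row layout, the `claim`/`use` actions and the function names are hypothetical; it only illustrates the bootstrap-then-stream pattern under those assumptions.

```python
from collections import defaultdict


def bootstrap_state(snapshot_rows):
    """Build initial per-user state from the offline snapshot (the initialization table)."""
    state = defaultdict(lambda: {"claimed": 0, "used": 0})
    for row in snapshot_rows:
        state[row["user_id"]]["claimed"] += row["claimed"]
        state[row["user_id"]]["used"] += row["used"]
    return state


def apply_event(state, event):
    """Fold one real-time change event into the state and return the updated row."""
    user = state[event["user_id"]]
    if event["action"] == "claim":
        user["claimed"] += 1
    elif event["action"] == "use":
        user["used"] += 1
    # Copy the counters so later events do not mutate rows already emitted.
    return {"user_id": event["user_id"], **user}


snapshot = [
    {"user_id": "u1", "claimed": 3, "used": 1},
    {"user_id": "u2", "claimed": 1, "used": 1},
]
events = [
    {"user_id": "u1", "action": "use"},
    {"user_id": "u3", "action": "claim"},
    {"user_id": "u1", "action": "claim"},
]

state = bootstrap_state(snapshot)
updates = [apply_event(state, e) for e in events]
for row in updates:
    print(row)
```

Each emitted row carries the full status for that user (history plus changes so far), which is the kind of record that could be pushed downstream or upserted into an analytical store.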
The architecture also implements mixed real-time and offline OLAP analysis. Offline data is stored in the offline computing engine (MaxCompute) for the more complex offline calculations, and the results are then synchronized to Hologres; the online state data that changes in real time is written to Hologres as well. Hologres therefore holds the complete status and overall data for a broad range of users. Besides observing the current status, it also supports comprehensive analysis of historical behavior and performance.
With this scheme, the marketing activity system moved successfully from its original offline state to a real-time system that supports adjustment, decision-making and execution.
The Lazada LAB experimentation platform has accumulated 10,000 experiments, which places it in the top 5 by number of experiments. It supports hundreds of sub service thresholds and thousands of experiments per month.
The LAB architecture is divided into three levels: data module, system module and application module from bottom to top.
All storage engines used by the data module layer have been switched to Hologres. General data and business indicators for experiments are precomputed, which reduces the computing pressure on Hologres. In addition, detail-level data and lightly aggregated summary data are written to Hologres through real-time computation to support customized, flexible and fast analysis in A/B experiment scenarios. Finally, the various experiment dimension data is synchronized to Hologres for customized analysis and queries.
The figure above shows the data processing flow of experiments on the LAB platform.
The data sources are common Binlog data, including collected logs and search and promotion log data. The offline data warehouse also processes data and then writes the results to the remote Hologres. In addition, the real-time detail layer and summary layer data is synchronized to Hologres through Flink real-time computation.
As a result, a complete real-time data warehouse has been built on Hologres, including the real-time DWD detail layer, the ADS layer, a large amount of precomputed offline data, and DWS data and dimension data. Many logical views and some materialized views have also been built on top of it. Because the query conditions and query patterns on these tables are very fixed in the experiment scenario, frequently used queries and indicators may need to be solidified through logical views and materialized views to improve query performance for the experiment front end.
This architecture uses Hologres' strong query, write and export capabilities to improve experiment speed and efficiency across the entire LAB platform.
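The benefit of solidifying a fixed query pattern can be illustrated with a small Python sketch: detail rows are aggregated once into a summary keyed by experiment and bucket, and the fixed front-end query then reads the summary instead of rescanning the detail rows. This mimics the effect of a precomputed or materialized result only in spirit; the field names and the "orders per user" indicator are assumptions for the example, not the platform's actual metrics.

```python
from collections import defaultdict


def precompute_summary(detail_rows):
    """Aggregate detail rows into one summary row per (experiment, bucket)."""
    summary = defaultdict(lambda: {"users": set(), "orders": 0})
    for row in detail_rows:
        key = (row["experiment_id"], row["bucket"])
        summary[key]["users"].add(row["user_id"])
        summary[key]["orders"] += row["orders"]
    # Replace the user sets with distinct-user counts so the summary stays small.
    return {
        key: {"users": len(agg["users"]), "orders": agg["orders"]}
        for key, agg in summary.items()
    }


def orders_per_user(summary, experiment_id, bucket):
    """Answer the fixed front-end query from the summary instead of the detail rows."""
    agg = summary[(experiment_id, bucket)]
    return agg["orders"] / agg["users"]


detail = [
    {"experiment_id": "exp1", "bucket": "control", "user_id": "u1", "orders": 1},
    {"experiment_id": "exp1", "bucket": "control", "user_id": "u2", "orders": 0},
    {"experiment_id": "exp1", "bucket": "treatment", "user_id": "u3", "orders": 2},
    {"experiment_id": "exp1", "bucket": "treatment", "user_id": "u4", "orders": 1},
    {"experiment_id": "exp1", "bucket": "treatment", "user_id": "u3", "orders": 1},
]
summary = precompute_summary(detail)
print(orders_per_user(summary, "exp1", "control"))    # 0.5
print(orders_per_user(summary, "exp1", "treatment"))  # 2.0
```

The trade-off is the usual one: the summary answers only the queries it was designed for, so ad hoc analysis still needs the detail layer.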
For Hologres usage and storage, distributed data must first of all be distributed reasonably and evenly. Whether data is stored in rows or in columns depends on the business scenario and usage requirements. When you use a partitioned table, you must set a distribution key. When allocating a TableGroup and shards, you need to verify the allocation, including checks on dimension tables and checks between fact tables. This takes continued practice and exploration in light of dimension table data volumes and business scenarios.
In terms of computation, Hologres provides optimization options such as primary key design, approximate computation, clustering indexes, time-segmented indexes and optimized dictionary encoding.
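One way to reason about whether a candidate distribution key spreads data evenly is to hash sample key values into shards and compare the busiest shard with the ideal even load. The sketch below does this in plain Python. The SHA-256 based hash is a stand-in chosen for the example and is not the hash Hologres uses internally; the shard count, key samples and function names are likewise hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a distribution-key value to a shard with a stable hash (illustrative only)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def skew_ratio(keys, shard_count):
    """Return the busiest shard's load divided by the ideal even load (1.0 is perfectly even)."""
    loads = Counter(shard_for(k, shard_count) for k in keys)
    ideal = len(keys) / shard_count
    return max(loads.values()) / ideal


shard_count = 8
# Candidate 1: a high-cardinality key such as a user ID.
user_ids = [f"user-{i}" for i in range(10_000)]
# Candidate 2: a low-cardinality key such as a country code (six hypothetical values).
countries = [["ID", "MY", "PH", "SG", "TH", "VN"][i % 6] for i in range(10_000)]

print(round(skew_ratio(user_ids, shard_count), 2))
print(round(skew_ratio(countries, shard_count), 2))
```

A ratio close to 1.0 suggests an even spread. The low-cardinality key cannot do better than about 1.33 here, because six distinct values can occupy at most six of the eight shards, which is why high-cardinality columns are usually the safer choice for a distribution key.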
Alibaba uses MaxCompute to build the infrastructure of its offline data warehouse. After Flink was introduced, the Alibaba data system moved from the original offline system to a real-time data warehouse system. With the arrival of Hologres and other cloud-native OLAP engines, we can already see how lakehouse integration (a unified data lake and data warehouse) might be implemented and used, along with support for heterogeneous and diverse intelligent computing.
We expect to use Hologres' integrated serving and analysis capabilities, combined with AI processing, to complete data processing quickly on one platform and one component, and to release business value effectively through the technology platform.
Newton said that standing on the shoulders of giants lets us see farther. We firmly believe that with a giant like Alibaba Cloud behind us, we can realize the value of data for the business more fully.
Knowledge Base Team