This topic describes the best practices for data warehouse layering in Hologres. You can use these best practices to develop your business by using real-time data warehouses that feature high performance and agility.

Background information

Hologres is highly compatible with Realtime Compute for Apache Flink, MaxCompute, and DataWorks, and provides data warehousing solutions that integrate stream processing and batch processing. These solutions are applicable to a wide range of scenarios, such as real-time dashboards, real-time risk control, and fine-grained operations. Different scenarios pose varied requirements for the amount of data to be processed, data complexity, data sources, and real-time performance. To develop a traditional data warehouse based on the classic methodology, you need to develop the following layers in sequence: Operational Data Store (ODS), Data Warehouse Detail (DWD), Data Warehouse Service (DWS), and Application Data Service (ADS). Data tasks are scheduled among the layers in an event-driven or micro-batch manner. Layering helps improve semantic abstraction and data reuse. However, layering also increases scheduling dependencies, reduces the real-time performance of data, and reduces the agility of data analysis.

Real-time data warehouses drive customers to make real-time business decisions. In most cases, rich contextual information is required to make business decisions. This poses challenges to the traditional development method that highly depends on business-oriented ADS customization. Thousands of ADS tables are difficult to maintain, and the utilization rate is low. An increasing number of customers expect to perform multi-dimensional data comparison and analysis at the DWS or even DWD layer. This poses higher requirements for computing efficiency, scheduling efficiency, and I/O efficiency of the query engine.

The computing capabilities of Hologres are improved in each new version due to the application of various query engine optimization technologies, such as computing operator vectorization and rewriting, fine-grained indexing, asynchronous execution, and multi-level caching. An increasing number of customers adopt an agile development method. In the pre-computing stage, the customers perform only data cleansing and basic large table association and widening. Data modeling stops at the DWD and DWS layers. This reduces the number of modeling layers. The customers use the interactive search engine of Hologres to perform flexible queries. Hologres performs second-level interactive analytics to support the trend of data democratization.

To meet the requirements of different business scenarios, we recommend that you design layers and process data by using the three solutions. This can help increase the agility of your development process.
  • Solution 1: ad hoc queries. In this solution, data is preprocessed at the DWD layer in Realtime Compute for Apache Flink. The processed data is directly written to Hologres. Hologres provides online analytical processing (OLAP) queries and online services. This way, data can be used immediately after it is written.
  • Solution 2: minute-level quasi-real-time data warehousing. In this solution, micro-batch processing is implemented. Data is preprocessed at the DWD layer in Realtime Compute for Apache Flink. After the processed data is written to Hologres, Hologres processes the data at the aggregation layer and then provides data services to upper-layer applications.
  • Solution 3: real-time statistics collection of incremental data. In this solution, event-driven processing is implemented. Data is processed at the DWD and DWS layers in Realtime Compute for Apache Flink. Then, the processed data is written to Hologres for upper-layer applications.

Rules for selecting solutions

After data is written to Hologres, you can use Hologres to implement one of the preceding data warehousing solutions.

  • Select Solution 1 if the following conditions are met: Your business requires high real-time performance. You expect data to be available for queries immediately after the data is written to Hologres and expect data updates to be synchronized to Hologres in real time. Ad hoc queries need to be supported. You have sufficient resources. For more information, see Solution 1: Ad hoc queries.
  • Select Solution 2 if your business requires real-time analytics and you prioritize development efficiency over real-time performance. The minute-level quasi-real-time solution is suitable for more than 80% of real-time data warehousing scenarios. For more information, see Solution 2: Minute-level quasi-real-time data warehousing.
  • Select Solution 3 if the following conditions are met: Your business focuses on providing online services such as dashboards and risk control. The data volume for your business is small and only incremental data is required to generate statistical results. You prioritize real-time performance over development efficiency and cost-effectiveness of computing. For more information, see Solution 3: Real-time statistics collection of incremental data.

Solution 1: Ad hoc queries

In this solution, the query patterns of upper-layer applications are unknown. The data is stored to support flexible ad hoc queries.

We recommend that you apply the following policies:
  • Perform simple cleansing and association on data from the ODS layer and store the processed data to the DWD layer. Then, write the detail data to Hologres without processing or aggregating the data.
  • Use Realtime Compute for Apache Flink to process incremental data and update detail data in Hologres in real time. Write the batch tables that are processed by MaxCompute to Hologres.
  • Encapsulate SQL logic into views at the common data model (CDM) or ADS layer because the analysis SQL statements of upper-layer applications are not fixed.
  • Query the encapsulated views in upper-layer applications to implement ad hoc queries.
Advantages:
  • The flexibility is high. The views can be adjusted based on your business logic in a flexible way.
  • The metrics are easy to correct. The logic is encapsulated into views and no aggregate tables exist in the upper layers. To update data, you need to only update data in underlying tables. This process involves only one layer. You do not need to update tables for upper-layer applications.

Disadvantages: If the logic of views is complex and the data volume is large, the query performance is low.

Use scenarios: Data originates from databases and event tracking systems, high flexibility is required, high queries per second (QPS) is not required, and the computing resources are sufficient.

Solution 2: Minute-level quasi-real-time data warehousing

The computing efficiency provided by Solution 1 cannot meet high QPS requirements. Solution 2 is an upgraded version of Solution 1. In Solution 2, the views are materialized into tables. Solution 2 uses the same logic as Solution 1, but a smaller volume of data is stored in tables. This helps improve query performance.

We recommend that you apply the following policies:
  • Perform simple cleansing and association on data from the ODS layer and store the processed data to the DWD layer. Then, write the detail data to Hologres without processing or aggregating the data.
  • Use Realtime Compute for Apache Flink to process incremental data and update detail data in Hologres in real time.
  • Store data in physical tables at the CDM or ADS layer. Schedule DataWorks to periodically write data to the tables.
  • Query the physical tables in real time from upper-layer applications. The real-time performance of data depends on the scheduling cycle that is configured in DataWorks. For example, DataWorks supports 5-minute and 10-minute scheduling cycles. This way, you can implement minute-level quasi-real-time data warehousing.
Advantages:
  • The query performance is high. Upper-layer applications query only aggregate data. Compared with view queries, table queries are performed on less data and provide higher query performance.
  • Data can be updated in a short period of time. If a step error or a data error occurs, you need to only run scheduled nodes again in DataWorks. All the logic is fixed. You do not need to perform complicated link revision operations.
  • The business logic can be adjusted in a short period of time. If you need to add or adjust the business code at each layer, you can develop business scenarios based on SQL statements in what you see is what you get (WYSIWYG) mode. This helps shorten the release cycle of your business application.

Disadvantages: The real-time performance provided by Solution 2 is lower than that provided by Solution 1 because more processing and scheduling steps are involved.

Use scenarios: Data originates from databases and event tracking systems, and high QPS and real-time performance are required. This solution is suitable for 80% of real-time data warehousing scenarios and can meet the requirements of most business scenarios.

Solution 3: Real-time statistics collection of incremental data

Incremental computing is required if your business is sensitive to data latency and your business requires data to be processed immediately after the data is generated. In this case, you can use Realtime Compute for Apache Flink to process and aggregate data at the DWD and DWS layers and store aggregated result sets for upper-layer applications.

We recommend that you apply the following policies:
  • Use Realtime Compute for Apache Flink to cleanse, transform, and aggregate incremental data. Store the application data that is generated at the ADS layer in Hologres.
  • Write the result sets that are generated in Realtime Compute for Apache Flink in dual-write mode. The result sets are delivered to the message topic at the next layer and exported to data sinks in Hologres at the same layer. This way, you can check and refresh the status of historical data in subsequent operations in a convenient manner.
  • Collect statistics in Realtime Compute for Apache Flink by using incremental streams, incremental streams connected to static dimension tables, or incremental streams connected to incremental streams. Write the collected data to Hologres.
  • Hologres provides tables for upper-layer applications to perform real-time queries.
Advantages:
  • The real-time performance is high. This solution can meet the requirements of latency-sensitive business scenarios.
  • The metrics are easy to correct. This solution is different from traditional incremental computing in that the intermediate status is persistently stored in Hologres. This helps improve the flexibility of subsequent analysis operations. If the quality of intermediate data cannot meet your requirements, you can modify tables to update data.

Disadvantages: Real-time incremental computing relies on Realtime Compute for Apache Flink. Users must be skilled and proficient in using Realtime Compute for Apache Flink. This solution cannot meet requirements in scenarios in which data is frequently updated and cannot be aggregated or complex overhead computing scenarios such as multi-stream join queries.

Use scenarios: The data volume is not large, the data is collected from event tracking systems, and only incremental data is required to generate statistical results. This solution provides the highest real-time performance among the three solutions.