Real-time data warehouses - AnalyticDB - Alibaba Cloud Documentation Center

This topic describes the technical framework conversion of business processing from traditional data warehouses to real-time data warehouses and the benefits of real-time data warehouses. This helps you select appropriate data warehouses.

Background information

With the development of information technology and the widespread use of the Internet, enterprises focus more on the timeliness of business. Especially in the field of data analysis, enterprises have higher requirements on real-time or near-real-time data. In data warehouse scenarios, the processing of substantial business data has started to be converted from the traditional batch mode to the real-time mode that uses incremental streaming. Therefore, building real-time data warehouses has become an important task for data analysis.

Building real-time data warehouses has the following pain points:

If you want to use real-time data warehouses in traditional batch processing mode, the code is difficult to reuse between two different architectures. This increases development and O&M costs.
If you want to batch process data in data warehouses, a scheduling system is required to implement complex scheduling methods. The batch processing mode must process full data. As a result, the cost-effectiveness of achieving data timeliness is low.
Pure stream processing cannot implement complex business scenarios, such as joining with large offline dimension tables, multi-layer nesting of tables, and backtracking of historical data.
In most business scenarios, multiple systems must work together and build environments to extract data, which results in complex O&M. When issues occur, a large number of modules are involved and a deep understanding of various new knowledge is required to identify the root cause. This increases the troubleshooting costs.

To address the preceding pain points, AnalyticDB for PostgreSQL provides an end-to-end real-time data warehouse solution that integrates stream and batch processing based on real-time materialized views. The solution can implement the entire process from real-time data source import to real-time analysis in data warehouses by using one copy of data, one write operation, and one system.

Technical architecture

Building real-time data warehouses involves three steps: real-time data writing, real-time data processing, and real-time data analysis. If a simple business process involves a small amount of data, such as statistics on the number of accumulated orders, you can use a stream processing engine to implement the business process. If a business process involves a large amount of data, nonstandard data formats, complex computing logic, or high dependency on intermediate tables, you must consider the hierarchical architecture of data warehouses that includes the operational data store (ODS), data warehouse detail (DWD), data warehouse service (DWS), and application data service (ADS) layers. AnalyticDB for PostgreSQL can integrate the hierarchical architecture of batch data warehouses with real-time materialized views to implement the features of real-time data warehouses and meet data timeliness requirements.

In most cases, a business process can be implemented by using the following steps: First, write the multi-source data (real-time data and batch data) in parallel to a real-time data warehouse in a real-time and high-performance manner. Second, use real-time materialized views in the data warehouse to perform real-time incremental data processing, such as data cleansing, field conversion, and JOIN, GROUP BY, and UNION ALL operations. Then, a result set is generated from the extract, transform, load (ETL) operations based on the real-time materialized views. Lastly, perform data analysis on the real-time materialized views, such as ad-hoc queries, report displays, and real-time dashboards.

Real-time data writing: supports various data formats and comprehensive transaction processing capabilities to implement high performance and low latency. The written data can be immediately viewed.
Real-time data processing: performs real-time ETL processing in incremental streaming mode based on real-time materialized views. Compared with standard materialized views, real-time materialized views can be automatically and incrementally refreshed when data changes without the need to execute REFRESH statements. You can create a real-time materialized view based on a base table, and then create another real-time materialized view based on the previously created real-time materialized view. When the base table changes, the two real-time materialized views can be automatically and incrementally updated. This helps build a real-time ETL processing chain for data analysis.
Real-time data analysis: implements efficient real-time data analysis by using the vectorization execution engine based on single instruction, multiple data (SIMD) instruction sets, cost-based optimizer (CBO), and column-oriented storage engine.

Benefits

Simple architecture and cost-effectiveness

AnalyticDB for PostgreSQL allows you to write, process, and analyze data in real time. You can use a single data warehouse system to build an end-to-end real-time data warehousing process. AnalyticDB for PostgreSQL can perform data warehousing development and data transfer within a single data warehouse system. This provides high stability and simple O&M at low costs. The simple architecture provides the following benefits:

Cost-effectiveness. Only a single copy of data needs to be written to a single system, which achieves optimal costs.
Simple development process. The development process can be complete by using only one set of SQL scripts, without the need to integrate with multiple systems.
High performance. Simple links and incremental data processing require a small amount of resources and result in low latency.
Simple O&M. You need to only maintain a single system. Data exceptions can be easily identified and corrected.

Scheduling-free real-time ETL data processing

You can build an ETL processing chain based on real-time materialized views to process data dependencies in data warehouses. After data is written to base tables in a data warehouse, the associated real-time materialized views can be used to automatically process data dependencies to implement real-time updates of incremental data. AnalyticDB for PostgreSQL provides a chain of real-time materialized views to perform real-time ETL processing without the need to maintain additional scheduling systems. This reduces the building and maintenance costs for real-time data warehouses.

Real-time processing jobs that support complex SQL statements

A major challenge when you build real-time data warehouses is converting complex batch processing jobs into real-time processing jobs. Batch processing jobs can be performed by using complex SQL statements, such as nested or multi-table JOIN statements. However, stream processing imposes numerous limits on SQL statements.

AnalyticDB for PostgreSQL provides real-time materialized views based on traditional data warehouses. Therefore, real-time materialized views can support complex SQL statements. Data warehouses that are built based on real-time materialized views can support complex join relationships between dozens of tables without computing window limits. Compared with stream computing engines, complex batch processing jobs can be converted into real-time processing jobs at lower costs. In practical applications, a considerable number of batch processing jobs can be converted into real-time processing jobs by using real-time materialized views.

No data window limits

Real-time materialized views are implemented based on database engines, and database engines are designed based on disk storage. Compared with the stream compute engines that are designed based on memory, real-time materialized views can implement real-time JOIN operations between large tables with higher performance, especially when the tables contain ultra-large amounts of data. Real-time materialized views also support backtracking of all historical data without data window limits. Real-time materialized views can be used to correct and backtrack historical data. Updates of raw data are automatically reflected in real-time links. In actual scenarios, AnalyticDB for PostgreSQL allows you to create a real-time materialized view by joining dozens of tables to implement real-time incremental updates without data window limits at a latency of several seconds.

Integrated stream and batch processing

Only one set of SQL scripts is required to support integrated stream and batch processing based on real-time materialized views. The first time a real-time materialized view is created, SQL statements are executed in batch processing mode to calculate the full result set. Then, the SQL result set is incrementally updated when data is written in real time. Real-time materialized views also support scenarios in which only specific data needs to be updated in real time and the remaining data needs to be updated in a scheduled manner. You can configure the tables that require scheduled updates to be excluded from real-time updates. If necessary, you can execute REFRESH statements to batch refresh the result sets that are configured with scheduled updates. The following figure shows the flow chart of integrated stream and batch processing.

Zero-ETL support

AnalyticDB for PostgreSQL provides the zero-ETL feature. After you create an AnalyticDB for PostgreSQL instance, you can use the instance to perform parallel data import, ETL processing, ad-hoc queries, and data analysis in real time.