Traditional data warehouse architectures such as Lambda and Kappa face three main challenges: high development and maintenance costs caused by separate batch and streaming frameworks, storage inefficiencies from keeping multiple copies of data, and consistency issues caused by unaligned logic and schemas across layers. To overcome these challenges, Realtime Compute for Apache Flink introduces materialized tables, which automatically derive table schemas from query statements and refresh their data at a declared freshness (from daily down to every few minutes), creating continuously refreshing data pipelines. This feature unifies batch and stream processing logic, not only eliminating redundant data copies but also ensuring consistent processing logic and table schemas from end to end, thereby greatly simplifying the maintenance of a real-time data warehouse.
Core concepts
How materialized tables work
When creating a materialized table, you must explicitly define the FRESHNESS parameter and the AS <select_statement> clause. The Flink engine automatically derives the schema for the materialized table based on the query results, and registers the schema in a catalog. It also creates a streaming or batch refresh job based on the FRESHNESS value.
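As a sketch, a materialized table over a hypothetical source table `orders` could be declared as follows (table and column names are illustrative, not from any real schema):

```sql
-- Declare a materialized table that Flink keeps at most 30 minutes stale.
-- The schema of dwd_orders is derived from the SELECT statement below and
-- registered in the catalog; Flink also creates the corresponding
-- streaming or batch refresh job based on the FRESHNESS value.
CREATE MATERIALIZED TABLE dwd_orders
FRESHNESS = INTERVAL '30' MINUTE
AS
SELECT
  order_id,
  user_id,
  order_amount,
  order_time
FROM orders
WHERE order_status = 'PAID';
```

A shorter freshness interval generally causes the engine to run the refresh as a streaming job, while a longer interval results in periodically scheduled batch refreshes.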
Assume materialized table C's freshness is set to 30 minutes. When its source, materialized table A, updates, Flink attempts to refresh materialized table C within 30 minutes. The freshness of C's downstream materialized tables, such as E and F, must be a positive multiple of C's freshness, for example 60 or 90 minutes. Increasing a freshness value (for example, from minutes to hours, capped at 1 day) lowers resource consumption by reducing the refresh frequency.
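The freshness chain described above can be sketched as follows (all table and column names are illustrative):

```sql
-- Materialized table C, refreshed at most every 30 minutes from its
-- source, materialized table A.
CREATE MATERIALIZED TABLE c_orders_cleaned
FRESHNESS = INTERVAL '30' MINUTE
AS
SELECT order_id, user_id, order_amount
FROM a_orders_raw;

-- Downstream materialized table E: its freshness (60 minutes) is a
-- positive multiple of C's freshness (30 minutes).
CREATE MATERIALIZED TABLE e_orders_by_user
FRESHNESS = INTERVAL '60' MINUTE
AS
SELECT user_id, SUM(order_amount) AS total_amount
FROM c_orders_cleaned
GROUP BY user_id;
```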
Scenarios
By unifying batch and stream processing, materialized tables provide notable technical and cost advantages for the following use cases:
Backfilling historical data.
Final data can sometimes be partially incorrect due to issues such as data transmission latency. Traditionally, correcting historical data requires a separate batch job. Materialized tables provide an on-demand refresh capability that lets you manually trigger a refresh for a specific materialized table and all of its downstream dependent materialized tables.
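A manual backfill for one historical partition might look like the following (the table name and partition key are illustrative, and the exact refresh syntax may vary by engine version):

```sql
-- Re-run the refresh for a single historical partition of the
-- materialized table to correct distorted data; downstream dependent
-- materialized tables can then be refreshed for the same partition
-- to propagate the correction.
ALTER MATERIALIZED TABLE dwd_orders REFRESH PARTITION (ds = '2024-01-01');
```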
Unifying data processing logic and table schemas.
In the Lambda architecture, historical and real-time data are stored in separate systems, making it challenging to align their processing logic and the schemas of tables hosting the data. With materialized tables, only a single copy of your data is stored, eliminating the need for complex joins and computations. This feature not only improves storage efficiency, but also aligns the logic of batch and stream processing and unifies the schemas of tables hosting historical and real-time data.
Building dynamic dashboards with adaptable data freshness.
Dynamic dashboards often require different data freshness for different business scenarios. Materialized tables cater to this by letting you adjust refresh intervals, from daily to every few seconds, simply by modifying the freshness value. This approach eliminates the need to build and maintain separate real-time pipelines.
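For illustration, tightening a dashboard table's refresh interval during peak hours could be expressed along these lines (the `SET FRESHNESS` clause and the table name are assumptions for this sketch; check your engine version for the supported way to change freshness):

```sql
-- Tighten freshness from 30 minutes to 1 minute so the dashboard
-- reflects near-real-time data; loosen it again during off-peak hours
-- to reduce resource consumption.
ALTER MATERIALIZED TABLE dashboard_metrics
SET FRESHNESS = INTERVAL '1' MINUTE;
```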
Use materialized tables
- This topic describes how to create a materialized table, backfill historical data, change the data freshness of a materialized table, and view the data lineage of a materialized table.
- This topic describes how to use materialized tables and Apache Paimon tables to build a stream-batch integrated data lakehouse. It also covers how to adjust the freshness of a materialized table to switch from batch to streaming execution mode, enabling real-time data updates.
References
Apache Paimon is a centralized lake storage platform that allows you to process data in batch and streaming modes. You can use Apache Paimon tables in Realtime Compute for Apache Flink to quickly build a data lake based on services such as Object Storage Service (OSS). For more information, see Use Apache Paimon to build a streaming lakehouse.