Traditional data warehouse architectures such as Lambda and Kappa suffer from three main challenges: high maintenance costs caused by separate batch and streaming frameworks, storage waste from duplicate data copies, and consistency risks from misaligned logic across layers. Materialized tables in Realtime Compute for Apache Flink address these issues. Flink automatically derives the table schema from the query statement and uses a configurable data freshness target (from daily to every few minutes) to create a continuously refreshed data pipeline. By unifying batch and stream processing into a single path, materialized tables eliminate redundant data copies and keep data processing logic and schemas consistent end to end, which simplifies real-time data warehouse maintenance.
Core concepts
How materialized tables work
When you create a materialized table, you must specify the FRESHNESS parameter and the AS <select_statement> clause. The Flink engine automatically derives and registers the table schema in a catalog, and creates a streaming or batch refresh job based on the FRESHNESS value.
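The two required parts of the statement can be illustrated with a small Python sketch that assembles the DDL text. This is a minimal illustration, not the service's implementation: the table name and query are hypothetical, and the 30-minute boundary between streaming and batch refresh is an assumption borrowed from the open-source Apache Flink default for `materialized-table.refresh-mode.freshness-threshold`, so the managed service may apply a different rule.

```python
from datetime import timedelta

# Assumption: 30 minutes mirrors the open-source Apache Flink default threshold;
# Realtime Compute for Apache Flink may use a different value.
ASSUMED_MODE_THRESHOLD = timedelta(minutes=30)

_UNIT_SECONDS = {"SECOND": 1, "MINUTE": 60, "HOUR": 3600, "DAY": 86400}


def build_create_statement(table_name: str, amount: int, unit: str,
                           select_statement: str) -> str:
    """Assemble a CREATE MATERIALIZED TABLE statement from FRESHNESS and AS."""
    unit = unit.upper()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported freshness unit: {unit}")
    if amount <= 0:
        raise ValueError("freshness must be positive")
    return (
        f"CREATE MATERIALIZED TABLE {table_name}\n"
        f"FRESHNESS = INTERVAL '{amount}' {unit}\n"
        f"AS {select_statement}"
    )


def assumed_refresh_mode(amount: int, unit: str) -> str:
    """Guess the refresh job type implied by the freshness (illustrative only)."""
    freshness = timedelta(seconds=amount * _UNIT_SECONDS[unit.upper()])
    return "streaming" if freshness < ASSUMED_MODE_THRESHOLD else "batch"


ddl = build_create_statement(
    "dws_orders_per_city",  # hypothetical table name
    30,
    "MINUTE",
    "SELECT city, COUNT(*) AS order_cnt FROM orders GROUP BY city",  # hypothetical query
)
print(ddl)
print(assumed_refresh_mode(5, "MINUTE"))  # streaming
print(assumed_refresh_mode(1, "HOUR"))    # batch
```

No column list appears in the generated statement because the schema is derived from the query.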
For example, if materialized table C has a freshness of 30 minutes, Flink tries to keep C no more than about 30 minutes behind updates to its source table A. Downstream materialized tables such as E and F must use a freshness value that is a positive multiple of C's freshness, such as 60 or 90 minutes. Increasing the freshness value (for example, from minutes to hours, up to a maximum of 1 day) reduces the refresh frequency and lowers resource consumption.
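The multiple rule can be checked with a few lines of Python. This is a sketch under two assumptions that the text does not spell out: "multiple" is read as a whole-number multiple, and the 1-day cap is applied to every freshness value. The function name is hypothetical.

```python
from datetime import timedelta

MAX_FRESHNESS = timedelta(days=1)  # assumed to apply to every freshness value


def is_valid_downstream_freshness(upstream: timedelta, downstream: timedelta) -> bool:
    """Return True if downstream is a positive whole multiple of upstream, within the cap."""
    zero = timedelta(0)
    if upstream <= zero or downstream <= zero:
        return False
    if downstream > MAX_FRESHNESS:
        return False
    # timedelta % timedelta yields the remainder as a timedelta
    return downstream % upstream == zero


c_freshness = timedelta(minutes=30)
for minutes in (60, 90, 45):
    ok = is_valid_downstream_freshness(c_freshness, timedelta(minutes=minutes))
    print(minutes, ok)
```

With C at 30 minutes, 60 and 90 minutes pass the check and 45 minutes does not.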
Scenarios
By unifying batch and stream processing, materialized tables offer technical and cost advantages in the following use cases:
- Backfilling historical data.
  Final data can sometimes be partially distorted by issues such as transmission latency. Correcting historical data traditionally requires a separate batch job. Materialized tables support on-demand refresh, allowing you to manually trigger a refresh for a specific table and all its downstream dependents.
- Unifying data processing logic and table schemas.
  In the Lambda architecture, historical and real-time data reside in separate systems, making it difficult to align processing logic and table schemas. Materialized tables store only a single copy of the data, eliminating complex joins and computations. This improves storage efficiency while unifying batch and stream processing logic and the schemas for historical and real-time data.
- Building dynamic dashboards with adaptable data freshness.
  Dynamic dashboards often require different data freshness levels across business scenarios. Materialized tables let you adjust refresh intervals, from daily to every few seconds, by modifying the freshness value, without building and maintaining separate real-time pipelines.
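For the backfill scenario in the list above, refreshing "a specific table and all its downstream dependents" amounts to walking the lineage graph from the chosen table. The following Python sketch shows one way to compute such an order. The lineage dictionary and function name are hypothetical, the graph is assumed to be acyclic, and this is not how the service itself schedules refreshes.

```python
# Hypothetical lineage: each table maps to the materialized tables that read from it.
LINEAGE = {"A": ["C"], "C": ["E", "F"], "E": [], "F": []}


def refresh_order(start: str, lineage: dict) -> list:
    """Return start and everything downstream of it, upstream tables first."""
    visited = set()
    postorder = []

    def visit(table: str) -> None:
        if table in visited:
            return
        visited.add(table)
        for child in lineage.get(table, []):
            visit(child)
        # A table is recorded only after all of its dependents
        postorder.append(table)

    visit(start)
    # Reversing the post-order gives a topological order of the reachable tables
    return postorder[::-1]


print(refresh_order("C", LINEAGE))  # ['C', 'F', 'E']
```

Backfilling C touches C, E, and F but leaves the source table A alone. Each table appears only after every table it reads from within the affected set.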
Use materialized tables
| References | Description |
| --- | --- |
| | Learn how to create a materialized table, backfill historical data, change data freshness, and view data lineage. |
| | Learn how to use materialized tables and Apache Paimon tables to build a stream-batch integrated data lakehouse, and how to adjust freshness to switch from batch to streaming execution modes for real-time data updates. |
References
- Apache Paimon is a centralized lake storage platform for unified batch and streaming data processing. You can use Apache Paimon tables in Realtime Compute for Apache Flink to build a data lake on services such as Object Storage Service (OSS). For more information, see Streaming lakehouse with Paimon.