Traditional data warehouse architectures such as Lambda and Kappa face three main challenges: high development and maintenance costs caused by separate batch and streaming frameworks, storage inefficiencies from keeping multiple copies of data, and consistency issues caused by unaligned logic and schemas across layers. To overcome these challenges, Realtime Compute for Apache Flink introduces materialized tables, which automatically derive table schemas from query statements and refresh their data at a declared freshness (from daily down to every few minutes), creating continuously refreshing data pipelines. This feature unifies batch and stream processing logic, not only eliminating redundant data copies but also ensuring consistent processing logic and table schemas from end to end, thereby greatly simplifying the maintenance of a real-time data warehouse.
Core concepts
How materialized tables work
When creating a materialized table, you must explicitly define the FRESHNESS parameter and the AS <select_statement> clause. The Flink engine automatically derives the schema for the materialized table based on the query results, and registers the schema in a catalog. It also creates a streaming or batch refresh job based on the FRESHNESS value.
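As a sketch, a materialized table over a hypothetical source table `orders` could be declared as follows (table and column names are illustrative, not from any real schema):

```sql
-- Declare a materialized table that Flink keeps at most 30 minutes stale.
-- The schema of dwd_orders is derived from the SELECT statement below and
-- registered in the catalog; Flink also creates the corresponding
-- streaming or batch refresh job based on the FRESHNESS value.
CREATE MATERIALIZED TABLE dwd_orders
FRESHNESS = INTERVAL '30' MINUTE
AS
SELECT
  order_id,
  user_id,
  order_amount,
  order_time
FROM orders
WHERE order_status = 'PAID';
```

A shorter freshness interval generally causes the engine to run the refresh as a streaming job, while a longer interval results in periodically scheduled batch refreshes.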
Assume materialized table C's freshness is set to 30 minutes. When its source, materialized table A, updates, Flink attempts to refresh materialized table C within 30 minutes. The freshness of C's downstream materialized tables, such as E and F, must be a positive multiple of C's freshness, for example 60 or 90 minutes. Increasing a freshness value (for example, from minutes to hours, capped at 1 day) lowers resource consumption by reducing the refresh frequency.
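The freshness chain described above can be sketched as follows (all table and column names are illustrative):

```sql
-- Materialized table C, refreshed at most every 30 minutes from its
-- source, materialized table A.
CREATE MATERIALIZED TABLE c_orders_cleaned
FRESHNESS = INTERVAL '30' MINUTE
AS
SELECT order_id, user_id, order_amount
FROM a_orders_raw;

-- Downstream materialized table E: its freshness (60 minutes) is a
-- positive multiple of C's freshness (30 minutes).
CREATE MATERIALIZED TABLE e_orders_by_user
FRESHNESS = INTERVAL '60' MINUTE
AS
SELECT user_id, SUM(order_amount) AS total_amount
FROM c_orders_cleaned
GROUP BY user_id;
```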
Scenarios
By unifying batch and stream processing, materialized tables provide notable technical and cost advantages for the following use cases:
Backfilling historical data.
Final data can sometimes be partially incorrect due to issues such as data transmission latency. Traditionally, correcting historical data requires a separate batch job. Materialized tables provide an on-demand refresh capability that lets you manually trigger a refresh for a specific materialized table and all of its downstream dependent materialized tables.
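A manual backfill for one historical partition might look like the following (the table name and partition key are illustrative, and the exact refresh syntax may vary by engine version):

```sql
-- Re-run the refresh for a single historical partition of the
-- materialized table to correct distorted data; downstream dependent
-- materialized tables can then be refreshed for the same partition
-- to propagate the correction.
ALTER MATERIALIZED TABLE dwd_orders REFRESH PARTITION (ds = '2024-01-01');
```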
Unifying data processing logic and table schemas.
In the Lambda architecture, historical and real-time data are stored in separate systems, making it challenging to align their processing logic and the schemas of tables hosting the data. With materialized tables, only a single copy of your data is stored, eliminating the need for complex joins and computations. This feature not only improves storage efficiency, but also aligns the logic of batch and stream processing and unifies the schemas of tables hosting historical and real-time data.
Building dynamic dashboards with adaptable data freshness.
Dynamic dashboards often require different data freshness for different business scenarios. Materialized tables cater to this by letting you adjust refresh intervals, from daily to every few seconds, simply by modifying the freshness value. This approach eliminates the need to build and maintain separate real-time pipelines.
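For illustration, tightening a dashboard table's refresh interval during peak hours could be expressed along these lines (the `SET FRESHNESS` clause and the table name are assumptions for this sketch; check your engine version for the supported way to change freshness):

```sql
-- Tighten freshness from 30 minutes to 1 minute so the dashboard
-- reflects near-real-time data; loosen it again during off-peak hours
-- to reduce resource consumption.
ALTER MATERIALIZED TABLE dashboard_metrics
SET FRESHNESS = INTERVAL '1' MINUTE;
```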
Use materialized tables
- This topic describes how to create a materialized table, backfill historical data, change the data freshness of a materialized table, and view the data lineage of a materialized table.
- This topic describes how to use materialized tables and Apache Paimon tables to build a stream-batch integrated data lakehouse. It also covers how to adjust the freshness of a materialized table to switch from batch to streaming execution mode, enabling real-time data updates.
References
Apache Paimon is a centralized lake storage platform that allows you to process data in batch and streaming modes. You can use Apache Paimon tables in Realtime Compute for Apache Flink to quickly build a data lake based on services such as Object Storage Service (OSS). For more information, see Use Apache Paimon to build a streaming lakehouse.