This topic describes the major updates in the version of Realtime Compute for Apache Flink released on December 20, 2024.
The version upgrade is rolled out incrementally across all regions by using a canary release plan. You can use the new features in this version only after the upgrade is complete for your account. To request the upgrade at the earliest opportunity, submit a ticket.
Overview
This release introduces the materialized table feature. Materialized tables are designed to streamline both batch and streaming data pipelines, providing a consistent development experience.
In today's complex market, business teams rely on data for decision-making, so data teams must deliver accurate and timely data to support them. Different business scenarios have distinct data freshness requirements:
Risk control scenarios demand high data freshness, typically with latency at the millisecond to second level.
User profiling and real-time recommendations usually require data updates in minutes.
BI reporting and historical data analytics, such as year-on-year and month-on-month comparisons, can tolerate lower data freshness, usually at the day level.
A traditional data warehouse typically adopts one of two architectures, Lambda or Kappa. Both meet business needs to some extent, though with notable limitations. Therefore, it is essential to have an integrated architecture that can satisfy the varying data freshness requirements of different business scenarios.
Realtime Compute for Apache Flink serves as a unified stream and batch processing platform, providing a comprehensive technical solution that meets the diverse data timeliness needs of businesses. To this end, materialized tables are introduced. This feature is based on Apache Paimon, which supports integrated stream-batch storage. Different from the traditional way of separately defining streaming and batch job logic, materialized tables allow you to define data freshness using Flink SQL. This way, Flink can attempt to refresh data at the defined interval. This approach streamlines ETL processes, seamlessly transitions jobs between stream and batch modes, offers cascading update capabilities, and significantly enhances data update efficiency.
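The following is a minimal sketch of how a materialized table might be declared in Flink SQL. The table and column names are illustrative, and the syntax follows the open-source Apache Flink materialized table design (FLIP-435), in which a single FRESHNESS clause replaces separately maintained streaming and batch jobs:

```sql
-- Declare a materialized table whose data Flink keeps no staler than 1 minute.
-- Flink derives the schema from the query and manages the refresh pipeline,
-- so no separate streaming and batch jobs need to be defined.
CREATE MATERIALIZED TABLE dwd_orders          -- illustrative name
FRESHNESS = INTERVAL '1' MINUTE               -- desired data freshness
AS SELECT
  order_id,
  user_id,
  order_amount,
  order_time
FROM source_orders                            -- illustrative upstream table
WHERE order_status = 'PAID';
```

Raising or lowering the FRESHNESS value is how a pipeline moves between continuous streaming-style refresh and scheduled batch-style refresh without rewriting the query logic.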
Materialized tables are ideal for scenarios such as the following: the Lambda architecture cannot keep stream and batch processing logic consistent, offline reports require real-time statistics, or real-time dashboard applications rely on historical data for accuracy.