Hello everyone, and good afternoon. I'm glad you're joining us for today's discussion, where we'll delve into the intriguing world of Flink SQL. If you just attended Xintong's session on Flink 101, you'll find today's topic particularly complementary, as we'll be exploring Flink SQL 101.
To introduce myself, I'm from Alibaba Cloud, and since 2016, I've been deeply involved with Flink SQL, focusing on stream and batch unification. In today's session, I'll cover three essential areas: an overview of what Flink SQL is, its two core concepts, and its typical use cases and key features.
Apache Flink is a distributed processing engine renowned for stateful computation and its ability to process data streams, both unbounded and bounded, in a unified way. Flink SQL is constructed atop this unified Flink runtime, offering a single API for batch and streaming processing across bounded and unbounded datasets.
Flink SQL showcases a rich expressiveness, comprising numerous SQL operations and built-in functions. It also empowers users with extensibility interfaces to integrate with other systems seamlessly. When it comes to performance, the SQL engine's planner excels in auto-optimization, crafting the most effective execution plan using a cost-based methodology to ensure optimal stream and batch processing efficiency.
Let's delve into two key concepts within Flink SQL, starting with the idea of Stream-Table Duality. This concept is the basis for unifying stream and batch processing. Though Stream-Table Duality might seem abstract at first, it becomes more concrete when we compare it to relational databases.
Consider a table within a relational database. If an application consistently inserts records into a table—take a click table for instance, which lacks primary keys—every insertion operation is logged. The database records these changes as change logs, allowing for data recovery even if the database encounters a crash, thus ensuring data integrity. In this context, change logs represent another form of a table in a database since every modification is recorded. As depicted, these logs are append-only, capturing only insertion operations.
Naturally, databases support a broader range of operations, including updates and deletions. When a table has a primary key, the application may perform upsert or delete operations, which are then reflected in the change log as related updates. The example diagram demonstrates the deletion of old records and the insertion of new ones, illustrating how tables and data streams can represent data interchangeably.
Stream-Table Duality isn't limited to tables; it extends to operations within streaming computations. For example, the append-only change log from a click table might serve as input for a group aggregate calculation. This aggregation processes the input to generate an updated table and an accompanying new change log stream.
By understanding Stream-Table duality, we gain insight into how tables and streams can transform data and computation within Flink SQL. This is an important concept that underlies the unification of streams and batch processing.
Having explored the first concept in Flink SQL, let's now examine the critical notions of time within this framework. Time plays a pivotal role in event-driven computation and is equally important in the world of Flink SQL.
What are event time and processing time? As highlighted by Xintong, processing time refers to the machine's clock time when operators encounter incoming records. Event time, on the other hand, is the specific moment an event occurs, representing an immutable time attribute of the event itself. In our illustration, events 1, 2, and 3 each have distinct timestamps denoting their event times. Meanwhile, on the right side, the sequential processing by operators—such as filtering or computation—features timestamps that differ from the original event times.
In stream processing, two key realities emerge. First, out-of-order events are inevitable. As depicted, the red events are perfectly ordered, whereas the two green events arrive late, demonstrating the challenge of out-of-order data. How can we manage this issue? This brings us to the second reality: although we can't completely eliminate disordered events, they typically arrive in roughly sequential order, allowing us to employ mechanisms like watermarks to segment these events into manageable buckets.
Watermarks give operators a way to track all incoming records and distinguish timely events from late ones. They serve as a crucial mechanism for assessing whether the data within a given time range is complete. Consider a watermark definition example: a watermark defined for the column ts as ts - 3 seconds is calculated as the event time minus three seconds. This configuration tolerates data arriving up to three seconds late, so events within this window are deemed valid; data exceeding the threshold is classified as late, and Flink SQL disregards it.
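In DDL, such a watermark is declared on the table's event-time column. Here's a minimal sketch, assuming a hypothetical clicks table (the datagen connector is used only to keep the example self-contained):

```sql
CREATE TABLE clicks (
    user_id STRING,
    url     STRING,
    ts      TIMESTAMP(3),
    -- declare ts as the event-time column and tolerate up to 3 seconds of lateness
    WATERMARK FOR ts AS ts - INTERVAL '3' SECOND
) WITH (
    'connector' = 'datagen'  -- illustrative source; any event source works
);
```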
By comprehending these notions of time, we can better understand how Flink SQL can help us deal with out-of-order events in the real world for event-driven computations.
Having explored the second concept of Flink SQL—event time and watermarks—let's delve into some concrete use cases showcasing the versatility of Flink SQL.
Three classic use cases for Flink SQL are highlighted on the Flink website. The first scenario involves event-driven applications. Users often build these with the DataStream API, but Flink SQL supports this use case as well, providing functionality such as pattern matching for complex event processing.
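In Flink SQL, that pattern-matching functionality is exposed through the standard MATCH_RECOGNIZE clause. Here's a hedged sketch that reuses the hypothetical clicks table; the columns and the pattern itself are illustrative, and ts must be a declared event-time attribute:

```sql
-- detect, per user, a visit to the cart page followed by a checkout
SELECT *
FROM clicks
    MATCH_RECOGNIZE (
        PARTITION BY user_id
        ORDER BY ts
        MEASURES
            A.ts AS cart_time,
            B.ts AS checkout_time
        PATTERN (A B)
        DEFINE
            A AS A.url = '/cart',
            B AS B.url = '/checkout'
    );
```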
The second application scenario centers on data engineering, specifically unified streaming and batch analytics, a very common use case. Flink SQL excels at building data processing pipelines that run both offline (batch) and in real time.
Finally, Flink SQL is an effective choice for developing data pipelines and performing ETL operations for data warehouses and lakehouses.
Let's consider a practical example of Flink SQL in action, executing ETL in both the batch and real-time worlds. In the batch world, using Apache Iceberg, Flink can undertake various ETL operations on batch data; coupled with a scheduling system like Apache Airflow, these pipelines can be automated for daily or hourly execution. For real-time ETL, Apache Kafka is widely used; when integrated with Flink, it facilitates continuous data transformation.
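As a minimal sketch of such a pipeline (the table names and schemas are assumptions), a single continuous INSERT INTO ... SELECT expresses a streaming ETL job; run against a bounded source, the very same statement executes as a batch job:

```sql
-- cleanse raw click events (e.g. ingested from Kafka) and continuously
-- load them into a curated table (e.g. backed by Iceberg)
INSERT INTO cleaned_clicks
SELECT
    user_id,
    LOWER(TRIM(url)) AS url,   -- normalize the URL
    ts
FROM raw_clicks
WHERE user_id IS NOT NULL;     -- drop malformed records
```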
This capability is instrumental in building comprehensive solutions for data warehouses or lakehouses. Consider the classic layered warehouse architecture, consisting of three data layers. At each layer, Flink engines can collaborate with Kafka to perform necessary transformations. Ultimately, the processed data can be updated into relational databases or key-value systems, supporting BI visualizations and insights.
By understanding and implementing these use cases, users can leverage Flink SQL to optimize data workflows and enhance analytics capabilities, driving more informed decision-making and business growth.
Following our exploration of Flink SQL use cases, let's delve into some specific functionalities that make Flink SQL indispensable for data processing.
Data enrichment is a crucial aspect of data engineering, particularly when working with relational databases. To minimize redundancy, databases often require joining disparate data sources to provide comprehensive details for end-users. The accompanying diagram highlights the four commonly used join types supported by Flink SQL: inner join, left join, right join, and full outer join.
Here's a brief overview of these join types: an inner join emits only the rows whose join keys match on both sides; a left join additionally keeps unmatched rows from the left input, padding the right side with NULLs; a right join does the same for the right input; and a full outer join keeps unmatched rows from both inputs.
Beyond the join types themselves, Flink SQL supports various join operations, including lookup joins, regular joins, interval joins, and temporal table joins.
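As an example, a lookup join enriches each record of a stream against an external dimension table at processing time. Here's a sketch under stated assumptions: the orders and customers tables are hypothetical, and proc_time is assumed to be declared AS PROCTIME() in the orders DDL:

```sql
-- enrich each order with customer details looked up at processing time
SELECT
    o.order_id,
    o.amount,
    c.country
FROM orders AS o
JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
    ON o.customer_id = c.id;
```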
In addition to enrichment, aggregation plays a vital role in data processing. Before diving into aggregation techniques, it helps to understand the importance of time windows in stream processing. We've already covered event time and watermarks, and you may have seen examples of the tumble window.
Flink SQL offers four main types of time windows useful in data processing: tumble windows, session windows, hop (sliding) windows, and cumulative windows. Each is described below, with a short SQL sketch after each description.
These window functions are essential for organizing and analyzing streaming data efficiently in Flink SQL.
Understanding infinite data streams can be challenging for many users. However, windows offer a practical solution to organize and simplify stream processing. Windows segment infinite data streams into discrete data buckets, transforming the calculation process into a batch-like style where data becomes static and manageable.
Tumble Window: A tumble window is effective for calculating metrics like unique visitors or completed orders at regular intervals—such as every minute or hour. By defining a fixed window size based on the event time column, data can be aggregated systematically. In the example provided, the interval is set to 10 minutes, allowing Flink SQL to compute results for downstream processing every 10 minutes.
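That aggregation can be written with the TUMBLE windowing table-valued function; here's a sketch over the hypothetical clicks table from earlier:

```sql
-- count clicks per 10-minute tumbling window
SELECT
    window_start,
    window_end,
    COUNT(*) AS clicks_cnt
FROM TABLE(
    TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```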
Session Window: Slightly different from tumble windows, session windows include an additional definition based on partition keys. This example features a 5-minute session window centered on individual items. The session gap indicates a period of no activity, signifying the end of a previous window, allowing for deterministic results. Once a session window closes, a new one commences, following the fundamental mechanism of session windows.
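A sketch of that 5-minute per-item session window, written here with the classic group-window syntax (the item column is another assumption on the clicks table):

```sql
-- aggregate clicks per item; a window closes after 5 minutes of inactivity
SELECT
    item,
    SESSION_START(ts, INTERVAL '5' MINUTE) AS session_start,
    COUNT(*) AS clicks_cnt
FROM clicks
GROUP BY item, SESSION(ts, INTERVAL '5' MINUTE);
```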
Hop Window: Also known as sliding windows, hop windows have a total size and a step size. In the example, the total window size is 10 minutes with a 5-minute step. The window moves every 5 minutes, making it useful for calculating statistics over the most recent periods, such as the last five or ten minutes.
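The same pattern with the HOP table-valued function, where the 5-minute step (slide) precedes the 10-minute total size in the argument list:

```sql
-- a 10-minute window advancing every 5 minutes ("last 10 minutes" statistics)
SELECT
    window_start,
    window_end,
    COUNT(*) AS clicks_cnt
FROM TABLE(
    HOP(TABLE clicks, DESCRIPTOR(ts), INTERVAL '5' MINUTES, INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```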
Cumulative Window: The cumulative window differs from the previous types. It includes a total window size, such as one hour, along with a one-minute step. Unlike hop windows, cumulative windows do not slide until the window ends. For instance, a one-hour cumulative window may emit results every minute, accumulating data from the start until the current time.
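And a sketch with the CUMULATE table-valued function, where the step likewise precedes the maximum window size:

```sql
-- emit a running count every minute, accumulating from the start of each hour
SELECT
    window_start,
    window_end,
    COUNT(*) AS clicks_cnt
FROM TABLE(
    CUMULATE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE, INTERVAL '1' HOUR))
GROUP BY window_start, window_end;
```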
These windowing strategies empower users with efficient ways to manage and analyze infinite data streams in Flink SQL, providing flexibility and precise control in stream processing applications.
This example demonstrates how window aggregation operates using a click log as a data source. On the right, you’ll see two batches of window results. Window aggregation simplifies stream computing by breaking down endless streams into manageable data sets. However, it has limitations such as delayed data freshness due to waiting for late-arriving events.
Consider a scenario with a 5-second-sized tumble window and a 3-second delay to account for late events. This setup results in at least an 8-second latency. For those requiring more immediate data, group aggregation offers an alternative. By applying group aggregation to the same input table without a window, unbounded group aggregation is performed. This method processes calculations for every input record, enabling second-level real-time results, making it highly effective for tasks such as real-time analytics or anomaly detection.
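Here's a sketch of such an unbounded group aggregation: the same hypothetical input, but with no window, so every incoming record immediately updates the result for its group:

```sql
-- update the per-user click count on every incoming record;
-- downstream consumers receive a changelog of updates rather than final values
SELECT
    user_id,
    COUNT(*) AS clicks_cnt
FROM clicks
GROUP BY user_id;
```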
Beyond window aggregation, Flink SQL continues to expand its suite of built-in operators. In addition to foundational operations like SELECT, FILTER, ORDER BY, and LIMIT, Flink SQL offers advanced capabilities including window aggregation, distinct calculations, group aggregation, and over aggregation.
Flink SQL supports four types of joins and allows the utilization of user-defined table functions. Other advanced operators include Top-N, ranking, deduplication, window deduplication, and windowed Top-N.
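Top-N, for instance, is expressed as an over aggregation with ROW_NUMBER() filtered on the rank. A sketch, assuming a hypothetical user_url_counts table of per-user, per-URL click counts:

```sql
-- keep the 3 most-clicked URLs for each user
SELECT user_id, url, clicks_cnt
FROM (
    SELECT
        user_id,
        url,
        clicks_cnt,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY clicks_cnt DESC) AS row_num
    FROM user_url_counts)
WHERE row_num <= 3;
```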
In addition to the built-in operators, Flink SQL also provides a variety of built-in functions, commonly grouped into three categories: scalar functions, table functions, and aggregate functions. The most used are scalar functions, including string functions, JSON functions, and temporal functions, as well as basic comparison operations, logical predicates, and so on.
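A few representative built-in scalar functions in one projection (the payload JSON column is an assumption):

```sql
SELECT
    UPPER(user_id)                     AS user_id_uc,  -- string function
    JSON_VALUE(payload, '$.device.os') AS os,          -- JSON function
    DATE_FORMAT(ts, 'yyyy-MM-dd')      AS dt           -- temporal function
FROM clicks;
```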
Beyond the built-in system functions, Flink SQL offers users the opportunity to extend its capabilities by implementing custom functions via interfaces. In addition to the standard functions previously mentioned, two advanced user-defined functions are noteworthy: the async scalar function and the table aggregate function.
Async Scalar Function:
This function represents an innovative approach compared to traditional scalar functions. It allows asynchronous interactions with external storage systems, such as making HTTP requests or executing RPC calls. By leveraging asynchronous execution, the async scalar function can significantly boost throughput, enhancing performance and scalability in complex data processing environments.
Table Aggregate Function:
Think of this function as a fusion of table functions and aggregate functions. It enables sophisticated data manipulations by combining the operational aspects of both, offering users the flexibility to perform extensive aggregations across datasets efficiently.
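Whichever interface a custom function implements, it is registered and invoked the same way. A hedged sketch (the function and its implementing class are hypothetical):

```sql
-- register a custom function implemented in Java, then call it like a built-in
CREATE TEMPORARY FUNCTION ip_to_region AS 'com.example.udf.IpToRegionFunction';

SELECT user_id, ip_to_region(ip) AS region
FROM clicks;
```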
Let's recap today's discussion. We started with an introduction to Flink SQL and learned about its main application scenarios. We then delved into two key concepts in Flink SQL: stream-table duality, which enables the unification of stream and batch processing, and event time with watermarks, which are critical for managing event-driven computations.
We then explored various use cases that demonstrate the versatility of Flink SQL and detailed its key features. From managing unbounded data streams with window aggregation to extending the system's functionality with user-defined functions, Flink SQL provides a powerful toolkit for efficient data processing.
This concludes our exploration of Flink SQL. By understanding and utilizing these capabilities, you can build effective data pipelines and enhance business insights. Thank you for attending today's session.