
Compare Flink SQL and DataStream API: Comprehensive Guide for New Developers


Apache Flink has emerged as a leading framework for stream processing, offering two distinct programming interfaces: Flink SQL (declarative) and DataStream API (imperative). This guide helps developers understand their differences, strengths, and ideal use cases through code examples, architectural diagrams, and performance insights.

Flink SQL: SQL-Driven Stream Processing

Flink SQL provides a declarative approach using standard SQL syntax to process both bounded (batch) and unbounded (streaming) data. Key features include:

  • Unified batch/stream processing: Execute the same SQL query on historical or real-time data.
  • Window functions: Built-in support for tumbling, sliding, and session windows.
  • Connector ecosystem: Native integration with Kafka, JDBC, Hive, and more.
-- Create a Kafka-backed table
CREATE TABLE user_clicks (
  user_id INT,
  url STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clicks',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Tumbling window aggregation using Window TVF
SELECT 
  window_start,
  window_end,
  COUNT(user_id) AS click_count
FROM TABLE(
  TUMBLE(
    TABLE user_clicks,
    DESCRIPTOR(event_time),
    INTERVAL '1' MINUTE
  )
)
GROUP BY window_start, window_end;

Flink SQL: Core Concepts

Dynamic Tables

Flink SQL treats streams as dynamic tables that evolve over time. This enables SQL operations on unbounded data streams.

Tip: learn more about dynamic tables in Apache Flink's documentation.
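For intuition, here is a minimal sketch of a continuous query over the user_clicks table defined earlier. The result is itself a dynamic table whose per-user counts are updated as new clicks arrive, rather than a one-shot answer.

-- Continuous query on the dynamic table user_clicks (defined above).
-- The result is also a dynamic table: each user's count is updated
-- whenever a new click for that user arrives.
SELECT
  user_id,
  COUNT(url) AS total_clicks
FROM user_clicks
GROUP BY user_id;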

Time Attributes

Event time and processing time are two fundamental time attributes in Apache Flink. They are crucial for understanding how stream processing works, especially in scenarios involving real-time data.

  • Event Time: Event Time refers to the actual time when an event occurred. This time is typically embedded within the data itself as a timestamp, such as when a sensor captures a reading or when a log entry is generated. For example, in IoT applications, event time might be the timestamp recorded by a sensor at the moment it detected a change.
  • Processing Time: Processing Time refers to the time when the event is processed by the system, typically determined by the local system clock of the machine executing the operation. Processing Time is simpler to handle than Event Time because it relies only on the system clock, which is straightforward to manage.

Both event time and processing time can be used in Flink SQL, for example:

CREATE TABLE user_actions (
  user_id STRING,
  action STRING,
  event_time TIMESTAMP(3),                  -- Event time from the data
  processing_time AS PROCTIME(),            -- Processing time generated by Flink
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND  -- Watermark for event time
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_actions',
  'format' = 'json'
); 

Windowed Aggregations

Windowed aggregations are a powerful technique used in data analysis to perform calculations over a defined window of data, often based on time or other criteria. This allows for the computation of metrics such as sums, averages, counts, and more over specific intervals or partitions of the data.

Calculate hourly user activity:

SELECT 
  window_start,
  window_end,
  COUNT(DISTINCT user_id) AS active_users
FROM TABLE(
  TUMBLE(
    TABLE user_actions,               
    DESCRIPTOR(event_time),         
    INTERVAL '1' HOUR                
  )
)
GROUP BY window_start, window_end;

Stream Joins

Streaming joins in Flink SQL enable real-time correlation of unbounded data streams by leveraging Flink's unified batch/stream processing model. Unlike batch joins that process finite datasets, streaming joins must handle continuous data arrival, out-of-order events, and state management challenges. Below is a structured analysis of key concepts and implementation details:

Flink SQL supports four primary join patterns for streaming data:

| Join Type | Characteristics | State Management | Use Cases |
| --- | --- | --- | --- |
| Regular Join | Continuous updates as new data arrives; supports all SQL join types (INNER/LEFT/RIGHT/FULL) | Requires unbounded state retention (until TTL expiration) | Real-time dashboard metrics |
| Interval Join | Joins events within a time range; uses time attributes (processing or event time) | Bounded state (events outside the time window are discarded) | Fraud detection in 5-minute windows |
| Temporal Join | Joins a stream with a versioned table (e.g., a dimension table); uses FOR SYSTEM_TIME AS OF syntax | State tied to the temporal table's update frequency | Enriching orders with currency rates |
| Lookup Join | Enriches a stream with external table data; on-demand external system queries | No persistent state (external lookup) | Adding product info from a database to orders |
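Because regular joins keep state indefinitely, it is common to bound that state with a time-to-live. A minimal sketch, assuming the SQL Client where SET statements are available (the 12-hour value is an illustrative choice to tune per workload):

-- Drop join state that has been idle longer than the TTL,
-- trading result completeness for bounded state size.
SET 'table.exec.state.ttl' = '12 h';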

For example, you can enrich clickstream data with user profiles in Flink SQL:

CREATE TABLE user_profiles (  
  user_id STRING,  
  country STRING,  
  PRIMARY KEY (user_id) NOT ENFORCED  
) WITH (  
  'connector' = 'jdbc',  
  'url' = 'jdbc:mysql://localhost:3306/mydb',  
  'table-name' = 'users'  
);  

SELECT
  u.user_id,
  u.country,
  COUNT(c.click_id) AS click_count
FROM clicks c
JOIN user_profiles u
ON c.user_id = u.user_id
GROUP BY u.user_id, u.country;

Pattern Detection with MATCH_RECOGNIZE

The MATCH_RECOGNIZE clause is a powerful SQL feature introduced in 2016 as part of the SQL:2016 standard, designed for pattern recognition within relational data. It allows users to define and detect specific patterns in rows of data, making it particularly useful for complex event processing (CEP) and time-series analysis.

Identify failed login sequences:

SELECT *  
FROM login_attempts  
MATCH_RECOGNIZE (  
  PARTITION BY user_id  
  ORDER BY event_time  
  MEASURES  
    START_ROW.event_time AS start_time,  
    LAST(FAIL.event_time) AS end_time  
  AFTER MATCH SKIP TO LAST FAIL  
  PATTERN (START_ROW FAIL{3})
  DEFINE  
    FAIL AS action = 'login_failed'  
);  

Streaming vs. Batch in Flink SQL

Flink SQL treats batch processing as a special case of streaming where the input data is bounded. This unified approach enables:

  • Single SQL Interface: Use identical syntax for both modes
  • Dynamic Table Abstraction: All data (static/bounded or unbounded) is modeled as evolving tables
  • Shared Connector System: Same connectors (Kafka, JDBC, etc.) work in both modes
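In practice, switching between the two modes is a configuration change rather than a rewrite. A minimal sketch, assuming the SQL Client where SET statements are available:

-- Choose the runtime mode; the query below is identical in either mode.
SET 'execution.runtime-mode' = 'batch';   -- or 'streaming'

SELECT user_id, COUNT(*) AS clicks
FROM user_clicks
GROUP BY user_id;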

Time Handling in Streaming Mode and Batch Mode of Flink SQL

| Aspect | Streaming Mode | Batch Mode |
| --- | --- | --- |
| Time Semantics | Event-time/processing-time with watermarks | Implicit time (data order irrelevant) |
| ORDER BY Support | Only time-based sorting | Any column sorting |
| Watermarks | Required for event-time processing | Not applicable |

Example Watermark Definition

CREATE TABLE orders (
  order_id STRING,
  order_time TIMESTAMP(3),
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (...);

Join Operations

| Join Type | Streaming Mode | Batch Mode |
| --- | --- | --- |
| Regular Joins | Continuous updates as new data arrives | Single complete computation |
| Temporal Joins | Optimized using time attributes | Not available |
| State Management | Requires continuous state retention | No state retained during the job |

Streaming Join Example

SELECT * 
FROM orders
JOIN currency_rates FOR SYSTEM_TIME AS OF orders.order_time
ON orders.currency = currency_rates.currency

Time Semantics & Watermarks

Effective streaming joins depend on precise time handling:

Event-Time Joins:

-- Click events table (with watermark)
CREATE TABLE clicks (
  user_id STRING,
  click_time TIMESTAMP(3),
  WATERMARK FOR click_time AS click_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clicks',
  'format' = 'json'
);

-- Purchase events table (with watermark)
CREATE TABLE purchases (
  user_id STRING,
  purchase_time TIMESTAMP(3),
  amount DECIMAL(10,2),
  WATERMARK FOR purchase_time AS purchase_time - INTERVAL '20' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'purchases',
  'format' = 'json'
);

-- Event-time interval join query
SELECT
  c.user_id,
  c.click_time,
  p.purchase_time,
  p.amount,
  TIMESTAMPDIFF(MINUTE, c.click_time, p.purchase_time) AS mins_diff
FROM clicks c
JOIN purchases p 
ON c.user_id = p.user_id
AND p.purchase_time 
  BETWEEN c.click_time - INTERVAL '30' MINUTE 
  AND c.click_time + INTERVAL '15' MINUTE;

Key points of this event-time interval join:

  1. Watermark Propagation: watermarks from both tables propagate through the join operation, with the downstream watermark calculated as:

MIN(clicks_watermark, purchases_watermark)

  2. State Retention: Flink automatically retains join state for 30 + 15 = 45 minutes. State older than this interval is cleaned up automatically.
  3. Late Data Handling: once the watermark exceeds purchase_time + 15 minutes, the corresponding click events no longer wait for new purchase events, and the final join results are emitted.

Lookup Joins:

-- B is a lookup (dimension) table; A must declare a processing-time
-- attribute column, e.g. proc_time AS PROCTIME()
SELECT * FROM A
JOIN B FOR SYSTEM_TIME AS OF A.proc_time
ON A.key = B.key

A lookup join always uses the latest version of the dimension table at the time each record is processed.

When to Use Flink SQL?

| Use Case | Example |
| --- | --- |
| Real-time dashboards | Aggregating metrics every 5 seconds |
| ETL pipelines | Cleaning & transforming IoT device data |
| Fraud detection | Pattern matching on transaction streams |
| Customer analytics | Joining clickstream with user profiles |

Flink SQL's Relationship with Other Flink APIs

Apache Flink provides a layered API architecture designed to balance accessibility, flexibility, and performance. Flink SQL sits at the top of this hierarchy but maintains deep interoperability with lower-level APIs like DataStream API and Table API. Below is a detailed analysis of their relationships and integration mechanisms.


Flink's APIs are structured to address different abstraction levels and use cases:

| API Layer | Key Characteristics |
| --- | --- |
| Flink SQL | ANSI SQL-compliant; declarative syntax for batch/stream unification; highest abstraction layer |
| Table API | Language-integrated (Java/Scala/Python); relational operations (e.g., select, join); shares the planner with Flink SQL |
| DataStream API | Imperative programming (Java); fine-grained control over time, state, and windows; foundation for stream processing |
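These layers interoperate: the same logical query can be expressed in SQL or with the Table API, and both go through the same planner. Below is a minimal Table API sketch, assuming a user_clicks table has already been registered (for example via the DDL shown earlier):

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class TableApiExample {
  public static void main(String[] args) {
    TableEnvironment tableEnv =
        TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    // Equivalent of: SELECT user_id, COUNT(url) AS click_count
    //                FROM user_clicks GROUP BY user_id
    Table clickCounts = tableEnv
        .from("user_clicks")                  // assumes the table is registered
        .groupBy($("user_id"))
        .select($("user_id"), $("url").count().as("click_count"));

    clickCounts.execute().print();
  }
}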

DataStream API: Fine-Grained Control

The DataStream API offers low-level control over streaming logic, ideal for complex event processing:

  • State management: Manual control over stateful operations.
  • Custom operators: Build user-defined functions (UDFs) for unique requirements.
  • Time semantics: Explicit handling of event time vs. processing time.
DataStream<UserClick> clicks = env
  .addSource(new FlinkKafkaConsumer<>("clicks", new JSONDeserializer(), properties));

DataStream<ClickCount> counts = clicks
  .keyBy(click -> click.userId)
  .window(TumblingEventTimeWindows.of(Time.minutes(1)))
  .process(new CountClicksPerWindow());

counts.addSink(new ElasticsearchSink<>());

Processing Data with DataStream API

Programming Model

Every DataStream program follows this structure:

// 1. Create execution environment  
StreamExecutionEnvironment env =  
    StreamExecutionEnvironment.getExecutionEnvironment();  

// 2. Define data source  
DataStream<String> text = env.readTextFile("input.txt");  

// 3. Apply transformations  
DataStream<Tuple2<String, Integer>> counts =  
    text.flatMap(new Tokenizer())  
        .keyBy(0)  
        .sum(1);  

// 4. Define output sink  
counts.print();  

// 5. Execute  
env.execute("WordCount");  

Key Components:

| Component | Description |
| --- | --- |
| StreamExecutionEnvironment | Entry point for job configuration |
| DataStream | Immutable distributed data collection |
| Transformation | Operator defining data processing logic |
| Sink | Output system (Kafka, JDBC, files, etc.) |

Key Operations Explained

1. Data Ingestion (Sources)

Purpose: Read data from external systems into Flink.
Key Sources:

  • Message Queues: Kafka, Pulsar, RabbitMQ
  • Files: Local/HDFS/S3 files
  • Sockets: Network streams
  • Custom Sources: User-defined connectors

Example:

DataStream<String> stream = env.addSource(new FlinkKafkaConsumer<>(...));  

2. Data Transformations

Purpose: Process and manipulate streaming data.

| Transformation | Description |
| --- | --- |
| Map | 1:1 element transformation (e.g., parsing strings to objects) |
| Filter | Discard unwanted elements (e.g., remove invalid records) |
| KeyBy | Partition data by a key for stateful operations (e.g., group by user ID) |
| Window | Group events into time-based buckets (e.g., 5-minute aggregates) |
| Process | Custom logic via ProcessFunction (e.g., complex event detection) |

Example:

stream  
  .map(record -> parseRecord(record))  
  .filter(record -> record.isValid())  
  .keyBy(record -> record.getUserId())  
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))  
  .sum("value");  

3. State Management

Purpose: Track and update information across events in stateful operations.

| State Type | Use Case |
| --- | --- |
| ValueState | Single value per key (e.g., user session count) |
| ListState | Append-only list per key (e.g., recent transactions) |
| MapState | Key-value storage per key (e.g., user profile attributes) |

Why It Matters:

  • Enables stateful computations (e.g., running totals, session tracking)
  • Automatically fault-tolerant via checkpoints
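
As an illustration, here is a minimal ValueState sketch (the class name, state name, and event layout are assumptions for this example) that keeps a running total per key:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keeps a running total per key in ValueState. Input events are
// (key, amount) tuples; the emitted value is the updated total for that key.
public class RunningTotal
    extends KeyedProcessFunction<String, Tuple2<String, Double>, Double> {

  private transient ValueState<Double> totalState;

  @Override
  public void open(Configuration parameters) {
    totalState = getRuntimeContext().getState(
        new ValueStateDescriptor<>("running-total", Double.class));
  }

  @Override
  public void processElement(Tuple2<String, Double> event, Context ctx,
                             Collector<Double> out) throws Exception {
    double current = totalState.value() == null ? 0.0 : totalState.value();
    double updated = current + event.f1;
    totalState.update(updated);          // checkpointed automatically with the job
    out.collect(updated);
  }
}

In a pipeline it would be applied after a keyBy, e.g. stream.keyBy(e -> e.f0).process(new RunningTotal()).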


4. Time Handling

Purpose: Manage event ordering and out-of-order data.

| Time Concept | Description |
| --- | --- |
| Event Time | Timestamp embedded in the data (e.g., sensor reading time) |
| Processing Time | System time when Flink processes the event |
| Watermarks | Signal event-time progress (e.g., "no events older than X will arrive") |

Example Watermark:

stream.assignTimestampsAndWatermarks(  
  WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))  
    .withTimestampAssigner((event, ts) -> event.getTimestamp())  
);  

5. Data Output (Sinks)

Purpose: Write processed results to external systems.

| Sink Type | Example Use Case |
| --- | --- |
| Databases | Write aggregates to PostgreSQL/MySQL |
| Message Queues | Emit alerts to Kafka/Pulsar |
| Filesystems | Store results in HDFS/S3 |
| Dashboards | Stream metrics to Elasticsearch/Grafana |

Example:

stream.addSink(new ElasticsearchSink<>(...));  

6. Fault Tolerance

Purpose: Recover from failures without data loss.

| Mechanism | Description |
| --- | --- |
| Checkpoints | Periodic snapshots of state (configurable intervals) |
| Savepoints | Manual state snapshots for version upgrades or pipeline changes |
| State Backends | Storage for state (e.g., RocksDB for large state, memory for speed) |

Configuration:

env.enableCheckpointing(1000); // 1-second checkpoint interval  
env.setStateBackend(new RocksDBStateBackend(...));  

Why These Operations Matter:

  • Flexibility: Mix stateless (e.g., map, filter) and stateful (e.g., window, ProcessFunction) operations.
  • Precision: Fine-grained control over time, state, and partitioning.
  • Resilience: Built-in mechanisms to handle failures and late-arriving data.
  • Scalability: Parallel execution across distributed keys and operators.


For new developers, start with simple pipelines (e.g., Source → Map → Sink) and gradually incorporate stateful operations like keyBy and window as you gain familiarity. The DataStream API's power lies in its ability to combine these building blocks for complex, production-grade streaming logic.
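
A minimal Source → Map → Sink pipeline of that kind might look like the following sketch (a socket source and print sink are chosen purely for simplicity):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimplePipeline {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Source: read raw lines from a local socket (e.g. started with `nc -lk 9999`)
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    // Map: a simple stateless 1:1 transformation
    DataStream<String> upperCased = lines.map(String::toUpperCase);

    // Sink: print to stdout; swap for Kafka/JDBC/Elasticsearch in real jobs
    upperCased.print();

    env.execute("Simple Source -> Map -> Sink pipeline");
  }
}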

When to Choose DataStream API?

The DataStream API in Apache Flink is particularly advantageous in scenarios where users require fine-grained control over various aspects of stream processing. Here is when to choose the DataStream API:

| Scenario | DataStream API Advantage |
| --- | --- |
| Custom window triggers | Full control over windowing logic |
| Low-level state access | Direct state manipulation |
| Low-latency requirements | Bypass SQL optimizer overhead |
| Complex event patterns | Native ProcessFunction support |
  • Custom Window Triggers: The DataStream API offers full control over windowing logic, letting users define custom window assigners, triggers, and evictors. This flexibility is crucial for applications that need precise control over how data is grouped and processed over time intervals; for example, custom triggers or ProcessFunction can fire based on specific conditions or time-based logic.
  • Low-Level State Access: The DataStream API provides direct access to state management, enabling users to manipulate state within their transformations. This is particularly useful for complex stateful computations or for maintaining session-level information, using primitives such as ValueState, ReducingState, and ListState.
  • Low Latency: The DataStream API bypasses the overhead of SQL planning and optimization, allowing lower-latency processing. This matters for real-time applications where immediate analysis and decision-making are critical, such as real-time monitoring dashboards or financial trading platforms.
  • Complex Event Patterns: The DataStream API supports sophisticated event-driven logic, such as pattern recognition, anomaly detection, and rule-based alerting, through ProcessFunction; integration with Flink's Complex Event Processing (CEP) library further enhances this capability. A minimal sketch of this style of logic follows below.
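
As a sketch of this style of logic, the following KeyedProcessFunction flags three consecutive failed logins per user. The event layout, threshold, and class name are assumptions made for the example; Flink CEP or the MATCH_RECOGNIZE query shown earlier would express the same pattern declaratively.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input events are (user_id, action) tuples; emits an alert string when a
// user produces three consecutive failed logins. The threshold of 3 is an
// assumption for this sketch.
public class FailedLoginDetector
    extends KeyedProcessFunction<String, Tuple2<String, String>, String> {

  private transient ValueState<Integer> failureCount;

  @Override
  public void open(Configuration parameters) {
    failureCount = getRuntimeContext().getState(
        new ValueStateDescriptor<>("failures", Integer.class));
  }

  @Override
  public void processElement(Tuple2<String, String> event, Context ctx,
                             Collector<String> out) throws Exception {
    if ("login_failed".equals(event.f1)) {
      int failures = (failureCount.value() == null ? 0 : failureCount.value()) + 1;
      if (failures >= 3) {
        out.collect("ALERT: 3 consecutive failed logins for user " + event.f0);
        failureCount.clear();
      } else {
        failureCount.update(failures);
      }
    } else {
      failureCount.clear();                 // any successful action resets the sequence
    }
  }
}

It would be wired in after keying the stream by user, e.g. logins.keyBy(e -> e.f0).process(new FailedLoginDetector()).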

Key Differences Between Flink SQL and DataStream API at a Glance

Flink SQL offers a higher-level, declarative approach that is easier to learn and use, making it suitable for users who prefer SQL-like syntax and require automatic state management. On the other hand, the DataStream API provides a lower-level, imperative approach that offers full flexibility and control, making it better suited for complex scenarios where custom logic and fine-grained optimization are required.

Below is a summary of the key differences:

| Feature | Flink SQL | DataStream API |
| --- | --- | --- |
| Abstraction Level | High (declarative SQL) | Low (imperative Java/Scala) |
| Learning Curve | Easy for SQL users | Steeper (requires coding skills) |
| State Management | Automatic (managed by Flink) | Manual (developer-controlled) |
| Custom Logic | Limited to UDFs/UDTFs | Full flexibility (custom operators) |
| Performance | Optimized via Calcite planner | Depends on implementation efficiency |
| Use Cases | ETL, real-time analytics | Complex event processing, low latency |

Flink SQL vs DataStream API: When to Choose Which API?

| Criteria | Flink SQL | DataStream API |
| --- | --- | --- |
| Development speed | ✔️ Declarative, minimal coding | ❌ Requires Java/Scala expertise |
| Custom state logic | ✔️ Enhanced via UDAFs & PTFs | ✔️ Full control via keyed state |
| BI tool integration | ✔️ JDBC/ODBC connectors | ❌ Requires custom sink development |
| Latency profile | ✔️ Sub-second latency (typically 100 ms-1 s); micro-batch optimizations | ✔️ Millisecond-level latency; true event-at-a-time processing; native low-level optimizations |

Tradeoff

Use Flink SQL: For rapid prototyping, standard operations, and ease of development, especially when working with simple data processing tasks like data cleaning, real-time reporting, or data warehousing.

Switch to DataStream API: For advanced state management, complex logic, or scenarios requiring low latency. This is particularly useful when developers need full control over the processing pipeline and can invest the time and effort required for more complex development.

Hybrid Approach:

In some cases, a hybrid approach can be beneficial. For example, using Flink SQL for initial data processing and then converting the results to a DataStream for further complex processing can leverage the strengths of both APIs, as the sketch below shows.
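
A minimal sketch of such a hybrid pipeline, with the datagen connector standing in for a real source:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class HybridPipeline {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // 1. Declarative part: define a source and a first cleaning step in SQL.
    tableEnv.executeSql(
        "CREATE TABLE clicks (user_id STRING, url STRING) "
            + "WITH ('connector' = 'datagen')");
    Table cleaned = tableEnv.sqlQuery(
        "SELECT user_id, url FROM clicks WHERE user_id IS NOT NULL");

    // 2. Imperative part: continue with custom DataStream logic.
    DataStream<Row> rows = tableEnv.toDataStream(cleaned);
    rows.map(row -> "user=" + row.getField("user_id"))
        .print();

    env.execute("Hybrid SQL + DataStream pipeline");
  }
}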

The choice between Flink SQL and the DataStream API depends on the specific requirements of the project, including development speed, complexity of state logic, integration with BI tools, and latency requirements. A hybrid approach can also be considered to combine the strengths of both APIs.

Getting Started Tips

  1. Start with SQL: Use Flink SQL for proof-of-concepts before diving into DataStream.
  2. Mix APIs: Combine SQL queries with DataStream operators when needed.
  3. Leverage Connectors: Use pre-built connectors for Kafka, JDBC, etc., to avoid boilerplate code.
  4. Monitor Metrics: Track throughput (numRecordsOutPerSecond) and latency via Flink's REST API.

Ready to Get Started with Apache Flink?

Ready to unlock the full potential of real-time data processing? Dive into Alibaba Cloud's Realtime Compute for Apache Flink and experience the game-changing power of Flink SQL and DataStream API firsthand. Whether you're building complex event-driven applications or performing real-time analytics, our platform offers the tools you need to simplify development and scale effortlessly. Start your journey today with a free trial and see how Alibaba Cloud makes it easy to harness the speed and flexibility of stream processing. Want to learn more? Explore Flink's comprehensive documentation on Alibaba Cloud to get step-by-step guidance and best practices tailored to your needs. Don't wait: transform your data processing capabilities now!
