Community Blog How to Write Simple and Efficient Flink SQL

How to Write Simple and Efficient Flink SQL

This article is compiled from Xiaolin He’s presentation at the 2022 Flink Forward Asia (FFA) Conference, discussing Flink SQL insight, best practices, and future works.


This article is compiled from Xiaolin He, Alibaba's Senior Technical Expert and Apache Flink PMC Member and Committer, who shared it at the 2022 Flink Forward Asia (FFA) Conference. This article is mainly divided into three parts:

  1. Flink SQL Insight
  2. Best Practices
  3. Future Works

1. Flink SQL Insight

As a stream-batch integrated computing engine, Flink provides a unified API, unified operator description, and unified scheduling. However, there are still some subtle differences at the bottom of the Flink operator.

For a batch operator, its input is a finite data set. The batch operator performs calculations based on the complete dataset. If the memory is insufficient during the calculation, the data is spilled to the disk.

For a stream operator, its input is an infinite data set. Unlike batch operators, stream operators cannot start processing after all input data has been collected, nor can they save this data to the disk. Therefore, stream operators process the data one by one (or data may be calculated in small batches).

After the stream operator receives a piece of data from the upstream, the stateful operator reads the previous calculation result from the state, calculates the current data, and stores the calculation result in the state. Therefore, state optimization is a very important part of stream computing.

Batch processing only contains Append messages, but stream processing contains Append messages and UpdateBefore, UpdateAfter, and Delete messages, which are revisions of historical data. In the computational logic of the operator, if the revision message cannot be processed well, the output data will expand and cause new problems.

How can we make a Flink job run fast? There are six methods: reduce duplicate computing, reduce invalid data, solve data-skew issues, improve computing throughput, reduce state access, and reduce state size.

Except that the last two are for stream processing, the rest are all for common computing engines.

Next, by analyzing the process of how a Flink SQL becomes a Flink job, we introduce how the job is optimized.

After the Flink engine receives an SQL text, it parses it into a SqlNode using SqlParser. The Flink engine queries the metadata information in the catalog to verify the tables, fields, types, and UDFs in the SqlNode. After passing the verification, SqlNode is converted to LogicalPlan, and Optimizer is used to optimize LogicalPlan to ExecPlan. ExecPlan is the final execution plan at the SQL layer. The CodeGen technology is used to translate the execution plan into executable operators, which is described by Transformation, converted into JobGraph, and submitted to the Flink cluster for execution.

How can we make a SQL job execute fast? Optimizer plays an important role here. It converts LogicalPlan equivalent to SQL semantics into ExecPlan that can be executed efficiently, and the conversion process from ExecPlan to JobGraph is deterministic.

The following figure shows the internal details of the Optimizer. A Flink job usually contains multiple INSERT statements. When multiple INSERT statements are translated into a LogicalPlan, a DAG is formed. This optimizer is called the DAG Optimizer.

The input of the Optimizer includes five parts: LogicalPlan, Flink Conf, Constraints (such as PK), Statistics, and Hints from SQL.

The specific optimization process includes several steps. First, you need to break up DAG based on View, which ensures the execution plan can be reused as much as possible. After the breakup, an independent Tree will be obtained. From the leaf node to the root node, the Tree will be optimized by Calcite Optimizer. Then, the optimized results are reassembled into the DAG of PhysicalPlan. Then, the PhysicalPlan is rewritten using the sub-plan reuse technology to further eliminate duplicate computing. Finally, PhysicalPlan is converted to ExecPlan. Based on Exec DAG, more optimizations can be made (such as MultipleInput Rewrite), which reduces the overhead caused by network shuffling, and DynamicFiltering Rewrite, which reduces the reading and calculation of invalid data.

Calcite Optimizer is a classic relational algebra optimization in databases. It includes rule-based optimization (RBO) and cost-based optimization (CBO). It also defines a large number of optimization rules and deduces a large amount of meta information (such as PrimaryKey and RowCount) to help the Optimizer obtain the optimal execution plan.

In Calcite Optimizer, classic optimization ideas are combined. First, some deterministic rewrites are performed on LogicalPlan (such as SubQuery rewrites, decorrelation, and constant folding). Then, some general rewrites will be done (such as various push-down) to form FlinkLogicalPlan. This part of the optimization is generic to stream processing and batch processing. Then, Stream PhysicalPlan and Batch PhysicalPlan perform specific optimizations based on their respective underlying implementations. For example, Batch PhysicalPlan selects different Join algorithms based on Cost Model in batch processing, and Stream PhysicalPlan determines whether to convert Retract processing to Upsert processing and whether to enable Local/Global processing to resolve data hotspot issues based on the configurations.

The DAG optimizer solves a large number of various problems mentioned earlier. Whether Flink SQL is fast enough plays a very critical role.

2. Best Practices

Next, we will introduce some best practices of optimization based on some high-frequency usage scenarios and SQL in production.

2.1 Sub-Plan Reuse

The first is sub-plan reuse optimization. The two SQL statements in the following figure are optimized into two independent Pipelines. In the execution plan in the lower-left corner of the following figure, A> 10 is calculated twice in the Scan operator and the Filter operator.

By enabling sub-plan reuse optimization, the Optimizer automatically discovers and merges subgraphs that have identical computational logic in the execution plan to avoid duplicate calculations. In this example, since the two Scan operators are the same, after optimization, they become one Scan operator to reduce data reads. The optimized DAG is shown in the lower-right corner of the figure above.

Logically, Optimizer can also reuse the computation of a>10 in Filter. However, when the query is very complex and combined with various push-down optimizations, it is very troublesome to find an ideal reusable execution plan. The View-based sub-plan reuse technology may be used to maximize sub-plan reuse.

Here, View can be understood as a sub-plan reuse unit. Based on the user-defined View, the optimizer can automatically identify which Views can be reused. The SQL statements in the following figure are translated into the execution plan in the lower-left corner. After the execution plan is optimized based on the sub-plan reuse technology, Scan and Filter a>10 are reused (as shown in the execution plan in the lower-right corner of the following figure).


2.2 Fast Aggregation

Aggregation queries are frequently used in production. Here, select a,sum(b) from my_table group by a; is used as an example to describe the optimization of aggregation. The SQL statement is translated into the logical execution plan on the left in the following figure, and the right side of the figure shows the execution of the topology task. The concurrency of the Scan operator is 3, and the concurrency of the aggregation operator is 2.

A small box in the figure represents a piece of data. The value of field a is represented by color, and the value of field b is represented by number. All data of the Scan operator is shuffled by field a and output to the aggregate operator. In stream computing, when an aggregate operator receives a piece of data, it extracts the group key corresponding to the data, obtains the aggregation result before the corresponding key from the state, and performs aggregation based on the previous aggregation result and the current data. Finally, the result is updated to the state and output to the downstream. (As for logic, you can refer to the pseudo-code in the figure above.)

In the preceding figure, the input of the aggregate operator contains 15 data records, and the aggregate calculation triggers 15 access operations to state (15 times get and put operations). Such frequent operations may affect the performance. At the same time, the first aggregate instance receives eight red data entries, which will form a hotkey. How to solve the problem of frequent access to state and hotkey is often problematic in production.

You can enable MiniBatch to reduce state access and hotkey access. The corresponding configuration is:

  • table.exec.mini-batch.enabled: true
  • table.exec.mini-batch.allow-latency: 5s // Configure on demand.

allow-latency indicates that the size of the MiniBatch is five seconds, which means the data within five seconds forms a Mini Batch. After MiniBatch is enabled, the computational logic of aggregation changes. When an aggregate operator receives a piece of data, it does not directly trigger computing but puts the data into a Buffer. When the Buffer is full, the computing starts. The data in Buffer is grouped by key and calculated by each group. For each key, the previous aggregation results corresponding to the key are obtained from the state, and all data corresponding to the key is calculated one by one. After the calculation is completed, the result is updated to the state and output to the downstream.

Here, the number of visits to the state is equal to the number of keys. If all the data shown in the preceding figure is put into one Buffer, the number of keys is four, so there are only four accesses to state. Compared with the original 15 accesses, the overhead of accessing state is significantly reduced after the MiniBatch optimization is enabled.

MiniBatch is enabled for the following scenarios:

  • Low requirements for job delays (delays in collecting a batch)
  • Insufficient state processing capabilities
  • Insufficient processing capabilities of downstream operators. If MiniBatch is enabled, the amount of data output downstream is reduced.

In the preceding figure, we find that the first aggregate instance accesses the state less frequently, but the amount of data to be processed remains unchanged. The total data amount and the data amount of a single key of the aggregate instance are much larger than those of other aggregate instances, which indicates there is a data skew. We can use two-phase aggregation (local/global) to solve data skew and avoid data hotspot issues.

In order to enable two-phase aggregation, you must enable MiniBatch and set agg-phase-strategy to TWO PHASE/AUTO. The following figure shows the plan after the two-phase aggregation feature is enabled.

LocalAgg and the upstream operator are chained together, which means Scan and LocalAgg use the same concurrency, and there is no data shuffle in the middle of execution. In LocalAgg, aggregation is performed based on the size of MiniBatch, and the aggregation result is output to GlobalAgg.

In the preceding figure, the number of red data aggregated by LocalAgg is reduced from eight to three in the first GlobalAgg instance. There is no data skew between GlobalAgg instances.

Local/Global is enabled for the following scenarios:

  • All aggregate functions must implement the merge method. Otherwise, the results of LocalAgg cannot be merged in GlobalAgg.
  • LocalAgg has a high degree of aggregation, or GlobalAgg has data skew issues. Otherwise, additional computing overheads are incurred if you enable LocalAgg.

If a query contains distinct aggregation functions, Local/Global cannot resolve the data hotspot issues. As shown in the following figure, LocalAgg performs aggregation by field a and field b, and the aggregation does not work well. Therefore, the first GlobalAgg instance receives a large amount of hot data, a total of seven red data entries. If the aggregation keys (a and b) of LocalAgg are inconsistent with the downstream shuffle key (a), Local/Global cannot resolve the data hotspot issue of DistinctAgg.


We can perform partial/final aggregation to resolve the hotspot issue of DistinctAgg. PartialAgg shuffles using a group key and a distinct key. This ensures the same group key and distinct key can be shuffled to the same PartialAgg instance to complete the first layer of aggregation, reducing the amount of input data in FinalAgg. You can use the following configuration to enable Partial/Final optimization:

  • table.optimizer.distinct-agg.split.enabled: true;
  • table.optimizer.distinct-agg.split.bucket-num:1024

After the configuration is enabled, the optimizer translates the original plan into the plan shown in the following figure:

In the preceding example, after the aggregation by PartialAgg, the number of hotkeys in the first FinalAgg instance is reduced from seven to three. Partial/Final is enabled for the following scenarios:

  • Data skew in distinct functions
  • Only some built-in UDAFs are supported: count, sum, avg, max, and min.
  • The effect is better when the data set is relatively large because additional network shuffle overhead is introduced if Partial/Final is enabled.
  • PartialAgg also introduces additional computing and state overhead.

Unlike LocalAgg, PartialAgg stores the results in the state, which doubles the state overhead. Incremental optimization is introduced based on Partial/Final to solve the problem that PartialAgg introduces additional state overheads. It combines the Local/Global and Partial/Final optimizations and combines the Partial GlobalAgg and the Final LocalAgg into an IncrementalAgg operator. It only stores the values related to the Distinct Value. Therefore, the size of the state is reduced. The following figure shows the execution plan after IncrementalAgg is enabled.

Some query rewrites can also reduce the state size of the aggregate operator. Many users use COUNT DISTICNT + CASE WHEN to count the results of different conditions in the same dimension. In the aggregate operator, each COUNT DISTICNT function uses an independent state to store distinct-related data, which is redundant. You can reduce the state size by changing the CASE WHEN to FILTER, and the optimizer will share the state data of the same dimension.


2.3 Fast Join

Join is a very common query in our production, and the performance improvement effect brought by the optimization of Join is very obvious. In the following figure, a simple Join Query is used as an example to describe the optimization related to Join. The Query is translated into a Regular Join (also known as a dual-stream Join), and its state retains all data on the left and right sides. When the dataset is large, the Join State is also very large, which may cause serious performance problems in production. Optimizing Join State is very important in stream computing.

You can use the following two methods to optimize Regular Join:

  1. When Join Key contains PrimaryKey (PK), the state only stores the input data and the corresponding number of association times. When Join Input has PK, a Map is stored in State, Map Key is PK, and the Value is the input row and its number of associations. When neither the Join Key nor the Join Input has PK, State uses a Map for storage, the Map Key is the input row, and the Value is the number of occurrences and the number of associations. Although the two optimization methods use Map for storage, the state access efficiency when Join Input has PK is higher than without PK. Therefore, we recommend defining the PK information in the query to help the optimizer optimize better.
  2. We recommend only retaining the necessary fields before joining, so less data is stored in the Join State.

In addition to Regular Join, Flink provides other Joins: Lookup Join, Interval Join, Temporal Join, and Window Join. When meeting business requirements, you can rewrite Regular Join to other Joins to save State. The following figure shows the conversion between Regular Join and other Joins:


  • Rewrite Regular Join to Lookup Join. A piece of data from the mainstream will trigger JOIN operation. JOIN searches for the latest data in the dimension table based on the mainstream data. Therefore, LOOKUP JOIN does not need a State to store the input data. Currently, many dimension table connectors provide a point query mechanism and a cache mechanism. The execution performance is very good, and Lookup Join is widely used in production. The optimization of Lookup Join will be covered in a later section. The disadvantage of Lookup Join is that when the data in the dimension table is updated, the Join operation cannot be triggered.
  • Rewrite Regular Join to Interval Join. In Interval Join, Interval is specified for the Join Condition based on Regular Join. Therefore, only the data within the Interval is stored in the State, and the expired data will be cleared. Therefore, the State of Interval Join is much smaller than Regular Join.
  • Rewrite Regular Join to Window Join. Window Join defines that only data within the Window can be stored in the State. Therefore, the State only stores the latest Window, and the expired data will be cleared.
  • Rewrite Regular Join to Temporal Join. Temporal Join defines that only relevant versions of data can be joined based on Regular Join. Temporal Join retains the latest version of data, and the expired data will be cleared.

2.4 Fast Lookup Join

Lookup Join is widely used in production, so we will introduce its optimization separately here. Currently, Flink provides multiple optimization methods for Lookup Join:

The first optimization method is the synchronous query mechanism and the asynchronous query mechanism. As shown in the following figure, in a synchronous query, after a piece of data is sent to the dimension table collector, you need to wait for the result to be returned before you can process the next piece. There is a lot of waiting time. In an asynchronous query, a batch of data is sent to the dimension table collector. After the batch of data is queried, it is returned to the downstream. Using the asynchronous query mode significantly improves query performance. Flink provides query hints to enable synchronous and asynchronous modes, as shown in the following figure:

The second optimization method is the ordered mechanism and unordered mechanism in asynchronous mode. In the ordered mode, you need to wait until all data is returned and sort the data to ensure that the output sequence is the same as the input sequence. After the sorting is completed, the data can be sent to the downstream. In the unordered mode, the output sequence is not required. As long as the query result is returned, it can be directly sent to the downstream. Therefore, the unordered mode can greatly improve performance compared to the ordered mode. Flink provides query hints to enable the ordered and unordered modes, as shown in the following figure. (Flink also provides Job-level configuration, enabling all dimensional table queries to use the same output mode.)

The third optimization method is the cache mechanism. This is a very common optimization that uses local Memory Lookup to improve query efficiency. Currently, Flink provides three cache mechanisms:

  • Full Caching: Full caching mechanism caches all data to the memory. This mode is suitable for small data sets because excessive data volume may cause OOM. Flink provides table hints to enable the mechanism. You can also use hints to define a reload policy.
  • Partial Caching: Partial caching mechanism is suitable for large data sets. The underlying layer of the framework uses LRU Cache to store recently used data. You can also use hints to specify the size of the LRU cache and the expiration time of the cache.
  • No Caching: No caching mechanism disables caching.


2.5 Fast Deduplication

Data duplication often occurs in stream computing, so deduplication is performed very frequently. In earlier versions, Flink searches for the first row of data using group by and first_value and searches for the last row of data using group by and last_value. The following figure shows an example:


The preceding query is converted to the aggregate operator. Its State is very large and cannot guarantee complete semantics. Since the downstream output data may come from different rows, it is inconsistent with the semantics of deduplication by row.

In recent versions, Flink provides a more efficient deduplication method. Flink does not specifically provide the deduplication syntax. Currently, the Over statement is used to express the deduplication semantics, as shown in the following figure. If you want to search for the first row, output row_number in ascending order of time and filter the rows based on row_number. In this case, the State only needs to store Keys for deduplication. If you want to search for the last row, output row_number in descending order of time. The State will only store the last row of each Key.


2.6 Fast TopN

TopN (also known as Rank) is also frequently used in production. Flink does not provide a special syntax for TopN, which is also implemented through the Over statement. Currently, TopN provides three implementations (each with decreasing performance).

The first method, AppendRank, requires that the input is an Append message. In this case, the State in the TopN operator only needs to store N pieces of data that meet the requirements for each Key.

The second method, UpdateFastRank, requires that the input is an Update message and the upstream Upsert Key must contain Partition Key. The Order-by field must be monotonic, and the monotonic direction must be opposite to the Order-by direction. The query on the right in the preceding figure is used as an example. GROUP BY a, b generates an update stream with a, b as the keys. In the TopN operator, the Upsert Key (a, b) generated in the upstream contains the Partition Key (a, b) in the Over statement. The input parameters of the Sum function are all positive numbers (where c >= 0), so the result of Sum will be monotonically increasing, so the field c is monotonically increasing, which is opposite to the sorting direction. Therefore, this query can be translated into UpdateFastRank.

The third method, RetractRank, does not require any input. All input data needs to be stored in the State, and its performance is also the worst.

In addition to modifying queries to meet business requirements so the optimizer can select better operators, TopN has some other optimization methods:

  1. Do not output the row_number field. This can significantly reduce the amount of data processed downstream. If data needs to be sorted downstream, it can be resorted after it is obtained at the frontend.
  2. Increase the Cache size in the TopN operator to reduce access to the State. The Cache hit ratio is calculated by cache_hit = cache_size * parallelism / top_n_num / partition_key_num. This shows that increasing the Cache size can increase the Cache hit ratio (the Cache size can be modified by table.exec.rank.topn-cache-size, the default value is 10,000). Note: When the Cache size is increased, the memory of the TaskManager needs to be increased accordingly.
  3. The partition field is preferably time-related. If the partition field is not associated with the time attribute and cannot be cleared by TTL, the State will expand indefinitely. (Note: If TTL is configured, data may be cleared after expiration, and the result may be incorrect.)


2.7 Efficient User-Defined Connector

Flink provides a variety of connector interfaces to help the optimizer generate better execution plans. You can implement interfaces based on the capabilities of the connector to improve the execution performance of the collector.


  • SupportsFilterPushDown pushes down the Filter condition to Scan to reduce the Scan read I/O to improve performance.
  • SupportsProjectionPushDown tells Scan to only read the necessary fields and reduce invalid field reads.
  • SupportsPartitionPushDown tells Scan to only read valid partitions to avoid invalid partition reads during static optimization.
  • SupportsDynamicFiltering dynamically identifies which partitions are valid while the job is running to avoid invalid partition reads.
  • SupportsLimitPushDown pushes down the limit value to Scan, and only the limit data needs to be read, which significantly reduces Scan I/O.
  • SupportsAggregatePushDown reads aggregated results directly from Scan, reducing Scan read I/O while outputting less data to the downstream.
  • SupportsStatisticReport reports the statistics to the optimizer, so the optimizer can generate a better execution plan.

2.8 Use Hints Well

Any optimizer is not perfect. Flink provides a Hints mechanism to influence the optimizer to obtain a better execution plan. There are two types of hints:


  • Table hints are used to modify the configurations of a table. For example, you can use table hints to modify the cache policy of a lookup table.
  • Query hints are used to suggest the optimizer select an appropriate Join strategy. You can use hints to modify lookup policies. Please see the preceding description of Lookup Join optimization for more information. You can also use hints to help the optimizer to select the corresponding join algorithm for batch jobs, as shown in the preceding figure.

3. Future Works

In the future, we will provide more in-depth and more intelligent optimizations in addtional scenarios to improve the execution efficiency of the Flink engine.

First, we will continue to explore the depth of optimization. For example, if the problem of State duplication occurs in multiple consecutive Joins, we can optimize these Joins into one Join to avoid State duplication.

Secondly, we hope to expand the breadth of optimization and combine different business scenarios for targeted optimization.

Lastly, we hope the optimization will be more intelligent, and we will explore doing some dynamic optimization work (such as automatically optimizing the execution plan online according to traffic changes).


0 1 0
Share on

Apache Flink Community

144 posts | 41 followers

You may also like


Apache Flink Community

144 posts | 41 followers

Related Products