
The Next Generation of Apache Flink

This article discusses the main technical directions and plans of the Apache Flink community for the coming year, and the preparations for the Flink 2.0 release.

By Xintong Song

This article is a compilation of the speech given by Xintong Song at Flink Forward Asia 2023. Song is the director of Flink distributed execution at Alibaba Cloud Intelligence, an Apache Flink PMC member, and the Flink 2.0 Release Manager. The speech discusses the main technical directions and plans of the Apache Flink community for the coming year, as well as the preparations for the Flink 2.0 version.

Main Technical Directions of the Community


The Flink community is currently investing and will continue to invest in the following three directions:

  1. Ultimate optimization and technical evolution of streaming: While Flink is recognized as the de facto standard in the field of real-time computing, there is still considerable room for it to improve on its current capabilities. The Flink community will focus on optimizing and evolving streaming technology to maintain Flink's leading position and drive the development of streaming technology in the industry.
  2. Evolution of batch-streaming unification: The Flink community has been promoting the batch-streaming unification architecture, which has become a crucial feature of Flink. This architecture, including the proposed Streaming Lakehouse concept, attracts more users to choose Flink. The community will continue to heavily invest in the evolution of this architecture.
  3. User experience improvement: Although Flink boasts an advanced streaming architecture, its ease of use has room for improvement due to the inherent complexity of streaming and design issues from the early days. The Flink community recognizes this and is actively seeking improvements to enhance the user experience, making it a high priority for the project's future development.

1. Streaming

This section focuses on the future work to be done in the field of stream computing.

1.1 Storage-computing Separation State Management


The most important technical direction for Flink in the field of stream computing is the evolution of the state management mechanism toward a storage-computing separation architecture. Flink provides stateful stream computing capabilities, and state management is a crucial aspect of Flink. However, with the advent of the cloud-native era, moving big data to the cloud presents new challenges for stateful computing engines like Flink. These include resource isolation at the container level and the uncertainty of disk space requirements when Flink state is stored locally. In addition, the elasticity of cloud resources allows for dynamic scaling, but rescaling currently requires suspending and restarting the job, so the length of this interruption matters greatly during elastic scaling.

To address these challenges, the Flink community is investing in the design and discussion of the next-generation storage-computing separation state management architecture.

1.2 Continuous Refinement of Operator Semantics and Performance

In addition to state management, the Flink community continues to optimize and improve the semantics and functions of important operators in stream computing. Two key operators, window and join, are highlighted in this section.


Window is a unique concept in stream computing, primarily used for aggregation. In Flink SQL, table-valued functions are recommended for defining windows. The Flink community will further enhance window capabilities based on table-valued functions, including support for different window types, changes in inputs, and early or late firings.
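
As a rough illustration of what a window defined with a table-valued function looks like today, here is a minimal Java sketch using the Flink Table API; the table name, fields, and the datagen connector settings are made up for the example.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class WindowTvfSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical source table with an event-time attribute and a watermark.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  item STRING," +
                "  price DOUBLE," +
                "  order_time TIMESTAMP(3)," +
                "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
                ") WITH ('connector' = 'datagen')");

        // A tumbling window expressed with the TUMBLE table-valued function.
        tEnv.executeSql(
                "SELECT window_start, window_end, SUM(price) AS total_price " +
                "FROM TABLE(" +
                "  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES))" +
                " GROUP BY window_start, window_end")
            .print();
    }
}
```

The planned improvements, such as additional window types and early or late firings, would extend this same table-valued-function style of definition.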

Join is another important operator commonly used in data analysis, but it can become a computational bottleneck in large-scale data processing. The Flink community is exploring various performance optimization technologies for joins, such as Mini-batch Join and Multi-way Join.
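
Mini-batch processing is already exposed through configuration for aggregations, and the join optimizations mentioned above are expected to build on the same knobs. The sketch below shows how such settings are typically applied; whether a particular join picks up the optimization depends on the planner and the Flink version.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MiniBatchConfigSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Buffer incoming records and process them in small batches instead of one by one,
        // trading a bounded amount of latency for fewer state accesses and higher throughput.
        tEnv.getConfig().set("table.exec.mini-batch.enabled", "true");
        tEnv.getConfig().set("table.exec.mini-batch.allow-latency", "2 s");
        tEnv.getConfig().set("table.exec.mini-batch.size", "5000");

        // Queries planned after this point can apply mini-batch execution
        // where the planner supports it, including eligible join operators.
    }
}
```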

2. Batch-streaming Unification

This section focuses on the future work of the Alibaba Cloud Flink community in the area of batch-streaming unification.

2.1 Batch-streaming Unification APIs

Batch-streaming unification leverages a unified set of APIs for developing both streaming and batch tasks, using the same execution engine, and relying on unified operators to ensure data consistency. Unified APIs are a crucial prerequisite for achieving batch-streaming unification. Currently, Flink allows the use of the same APIs, whether it be SQL or DataStream, for developing streaming and batch tasks. However, it has been observed that in many cases, the code for streaming and batch tasks developed using these APIs is not the same, thus failing to achieve the desired flexibility for switching between the two running modes.
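
For reference, switching a DataStream program between the two running modes is already a one-line change (or a configuration setting) when the code itself is mode-agnostic; the sketch below is a minimal example of this, with the source data made up for illustration.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The same pipeline can run in STREAMING or BATCH mode; in practice the mode is
        // usually left to configuration (execution.runtime-mode) rather than hard-coded.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("flink", "unifies", "streaming", "and", "batch")
                .map(String::toUpperCase)
                .returns(Types.STRING)
                .print();

        env.execute("unified-mode-sketch");
    }
}
```

The difficulty described above arises when the business logic itself has to diverge between the two modes, which is what the work described next aims to reduce.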


To address this issue, the Flink community is exploring a new unified syntax and semantics for batch-streaming unification in SQL. The key idea, built on the concept of materialized views, is to allow a single Flink SQL program to run correctly in both batch and streaming modes. For the DataStream API, which is a lower-level, procedural API, it is difficult to completely hide the differences between streaming and batch modes. The community's main approach here is to clearly distinguish which operators and capabilities are unified across streaming and batch and which are specific to one of the two, so that users can develop batch-streaming unified business logic more effectively.

2.2 Batch Capability Improvement

Having unified APIs alone is not enough to achieve true batch-streaming unification. It is also essential to have an engine with excellent streaming and batch capabilities and performance. Flink is widely recognized as an industry leader in terms of its streaming capabilities, and the community is currently focused on significantly improving Flink's batch capability to achieve a top-tier position in the industry.

The following are three directions for future work to improve the batch capability.

First, in terms of fault tolerance, Flink supports fault tolerance at the individual task level. However, once a JobManager (JM) node fails, Flink has to rerun the entire job, including tasks that have already completed and produced results, which can be very costly. The Flink community has therefore discussed and proposed a solution that restores the results of completed tasks after a JM failure, greatly reducing the cost of JM failovers.

Second, Adaptive Query Execution (AQE). The AQE capabilities that the Flink community has already delivered include dynamic parallelism inference and dynamic partition pruning. We will continue to invest in this direction, adding capabilities such as dynamic load balancing, dynamic topology generation, and dynamic operator type selection.
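
As a rough sketch of how the existing adaptive batch capabilities are switched on, the example below sets the relevant options programmatically; the option names follow recent Flink releases and may differ across versions.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AdaptiveBatchConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Dynamic parallelism inference for batch jobs (adaptive batch scheduler).
        // Option names are taken from recent releases and may change over time.
        conf.setString("execution.batch.adaptive.auto-parallelism.enabled", "true");
        conf.setString("execution.batch.adaptive.auto-parallelism.max-parallelism", "128");

        // Dynamic partition pruning is controlled on the SQL side,
        // e.g. table.optimizer.dynamic-filtering.enabled.

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... define and execute a batch pipeline as usual ...
    }
}
```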

Third, Remote Shuffle Service (RSS). Large-scale batch tasks often require an RSS built on storage-computing separation, and Apache Celeborn is an open-source project that provides a general-purpose big data RSS solution. Flink has recently introduced a brand-new Hybrid Shuffle mode, which combines the advantages of streaming and batch shuffle and can fairly be described as purpose-built for batch-streaming unification. The Flink community is currently working with the Celeborn community to create an integrated solution of the Hybrid Shuffle mode and Apache Celeborn.
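
A minimal sketch of enabling the Hybrid Shuffle mode is shown below; the configuration key and value exist in recent Flink releases, while the Celeborn integration itself would be configured separately once the joint solution is available.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridShuffleSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Use hybrid shuffle for batch exchanges: downstream tasks may consume data in a
        // pipelined fashion when resources allow, while data is still spilled so the job
        // can fall back to blocking-style consumption.
        conf.setString("execution.batch-shuffle-mode", "ALL_EXCHANGES_HYBRID_FULL");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... define and execute a batch pipeline as usual ...
    }
}
```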


2.3 Batch-streaming Integration

With excellent streaming and batch capabilities, the Flink community also plans to further break the boundary between the streaming and batch modes.


Let's first consider a question: why are there two different modes at all? We observe that users' data processing business logic is essentially the same in streaming and batch modes. The difference mainly lies in which performance characteristics are prioritized at runtime.

In streaming mode, low latency is the priority. Users expect strong real-time performance, with data flowing through the entire system, from ingestion to output, as fast as possible. In pursuing low latency, some resource efficiency and computing performance are inevitably sacrificed.

In batch mode, real-time performance matters far less. Rather than the latency of processing each individual record, users care about the time and resources needed to process a complete data set, which means high throughput is preferred.

In common data processing scenarios:

• In offline computing, data freshness requirements are low, so high throughput is preferred.

• In real-time computing, low latency is preferred in most cases, but some scenarios call for high throughput. For example, when data traffic suddenly spikes and processing capacity falls short, a backlog builds up and latency grows; the same happens when data has to be reprocessed during failure recovery. In such cases the increased latency can no longer meet business requirements, and users care more about how quickly the backlog can be cleared than about how large the delay is, so high throughput is what matters.

• In full-plus-incremental scenarios, such as CDC data synchronization or data backfilling for a state hot start, high throughput matters more during the full (snapshot) phase, while low latency is preferred during the incremental phase.

Therefore, we propose the concept of batch-streaming integration: the engine should automatically identify whether a job currently favors high throughput or low latency, automatically select the appropriate execution mode, and switch modes automatically when the job's state and requirements change at runtime.

2.4 Streaming Lakehouse


The Flink community will continue to improve the Streaming Lakehouse architecture, which is built by combining the batch-streaming unified computing capability of Flink with the batch-streaming unified storage capability of Paimon.

From the perspective of the Flink computing engine, two main aspects remain to be completed: one is to optimize the performance of OLAP short queries and improve the capabilities of the SQL Gateway; the other is to enhance Flink SQL's data and metadata management capabilities for lake storage scenarios.

3. Experience Improvements

This section focuses on the work to improve the user experience, which involves many details. The following are some of the most representative items.


Firstly, let's discuss the upgrade of SQL jobs. Upgrades often involve changes in topology, which poses a long-standing challenge for state compatibility. The Flink community has a solution in the MVP stage, which will be further improved and optimized in the future.

In terms of serialization, Flink has a powerful type and serialization system that supports various serializers. However, there is room for improvement in the current serialization mechanism to enhance ease of use. For example, modifying the mapping between types and serializers or changing the rules for automatic serializer selection currently requires modifying code. The Flink community is currently discussing a serialization management solution based entirely on configuration files.
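
To make the pain point concrete, the sketch below shows the kind of code change that is currently needed to override how a type is serialized; the user type and the choice of Kryo's JavaSerializer are made up for the example. The proposal under discussion would move such mappings into configuration files instead.

```java
import com.esotericsoftware.kryo.serializers.JavaSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializerRegistrationSketch {

    // A hypothetical user type that Flink would otherwise serialize with a generic fallback.
    public static class LegacyEvent implements java.io.Serializable {
        public String payload;
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Today, changing the serializer for a type means editing and redeploying job code.
        env.getConfig()
           .registerTypeWithKryoSerializer(LegacyEvent.class, JavaSerializer.class);
    }
}
```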

Regarding the configuration system, Flink has numerous configuration items. However, some of them suffer from issues such as unreasonable default values, unclear semantics and scope, and exposure of internal details. The Flink community is reorganizing the configuration mechanism, including reevaluating all existing configuration items.

Lastly, in terms of API evolution, we will phase out certain older APIs, including the DataSet API, Scala API, and Legacy Table Source/Sink. Simultaneously, the Flink community will continue to enhance existing major APIs such as the DataStream APIs, REST APIs, and Metrics.

As you may have found, many of these optimizations involve modifications to the public interface, which may impact backward compatibility for the entire project. This leads us to the next topic: Flink 2.0.

4. Flink 2.0

Flink released version 1.0 in 2016 and version 1.18 in October 2023. Over the past seven and a half years, 19 minor versions have been released. The decision to prepare a new major version is primarily driven by the need to introduce incompatible API changes.


The Flink community has defined API compatibility levels, including @Public, @PublicEvolving, and @Experimental.

For compatibility guarantees, @Public APIs ensure compatibility within the same major version (the first digit of the version number remains the same), while @PublicEvolving APIs ensure compatibility within the same minor version (the first two digits remain the same). @Experimental APIs carry no compatibility guarantees.

In addition to compatibility guarantees, the Flink community has introduced requirements for the API migration cycle: there is a minimum period for which an API must be marked @Deprecated before it can be removed. For @Public APIs this period is two minor versions, and for @PublicEvolving APIs it is one. Any API modification must respect both the compatibility guarantees and the minimum migration cycle. For instance, if an @Public API needs to change as of the latest 1.18 release, it must first be marked deprecated in 1.19; after a migration cycle covering 1.19 and 1.20, it can then be removed in 2.0.

Based on the API compatibility requirements mentioned above, the tentative timeline for Flink 2.0 is as follows:


The Flink community released version 1.18 in October 2023. Versions 1.19 and 1.20 are planned for February and June 2024, respectively, to satisfy the API migration cycle requirements. Version 1.20 is expected to be the last release in the Flink 1.x series and, given the complexity and workload of a major version upgrade, is planned to serve as a long-term support (LTS) version with continued vulnerability fixes. The brand-new Flink 2.0 is expected to be released in October 2024. Work that involves compatibility-breaking API changes must strictly follow this version plan, whereas optimizations and other work that do not touch public APIs are not bound to specific versions.
