Flink Core Technology
Mei Yuan | Head of the Alibaba Cloud Flink storage engine team, Apache Flink engine architect, Apache Flink PMC member & Committer
Flink is the industry-recognized standard for real-time computing in big data. Even so, we have not slowed our efforts to push Flink's real-time computing further. This talk reports the progress and achievements of Flink fault tolerance over the past year, as well as future prospects.
The first part covers the improvements we have made to checkpointing over the past year. As generic log-based incremental checkpoints mature, fast and stable checkpoints are gradually becoming a reality.
In this talk, I will share experimental results on combining Changelog, unaligned checkpoints, and buffer debloating. I will also discuss improvements to rescaling in the cloud-native context, as well as the community's efforts to unify the concept and operations of snapshots.
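To make the changelog idea concrete, here is a minimal, dependency-free sketch of how a log-based incremental checkpoint works: every state update is also appended to a changelog, so a checkpoint is just the last materialized snapshot plus a short changelog suffix. All class and method names here are illustrative, not Flink's actual API.

```python
import copy

class ChangelogState:
    """Toy model of a changelog-based state backend: every update is applied
    to the working state AND appended to a changelog (names are illustrative)."""

    def __init__(self):
        self.state = {}             # working state
        self.changelog = []         # uploaded continuously in the real design
        self.materialized = {}      # last materialized snapshot
        self.materialized_upto = 0  # changelog position covered by the snapshot

    def put(self, key, value):
        self.state[key] = value
        self.changelog.append((key, value))

    def checkpoint(self):
        # A checkpoint is just a reference to the materialized snapshot plus
        # the short changelog suffix written since -- hence fast and stable.
        return self.materialized, self.changelog[self.materialized_upto:]

    def materialize(self):
        # Runs in the background, decoupled from checkpointing: snapshot the
        # full state and remember the changelog prefix it now covers.
        self.materialized = copy.deepcopy(self.state)
        self.materialized_upto = len(self.changelog)

def restore(snapshot, delta):
    """Recovery: load the materialized snapshot, then replay the changelog."""
    state = dict(snapshot)
    for key, value in delta:
        state[key] = value
    return state
```

Because the checkpoint itself no longer depends on how large the state is, its duration stays short and predictable, which is the core of the "fast and stable checkpoint" claim.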
Flink Shuffle 3.0: Vision, Roadmap and Progress
Song Xintong | Alibaba Cloud Senior Technical Expert, Apache Flink PMC Member & Committer
Since the birth of Flink, the evolution of its Shuffle architecture can be divided into two stages:
• The initial 1.0 stage supported only the most basic data transmission; in the 2.0 stage, a series of improvements were made around performance, stability, and the unified stream-batch architecture.
• Now, as Flink's application scenarios and deployment forms grow richer, the Shuffle architecture faces many new challenges and opportunities, and is about to enter its 3.0 stage.
In this talk, I will present our blueprint for Flink Shuffle 3.0 around the keywords cloud native, stream-batch unification, and adaptiveness, and introduce the planning and progress of related work. Specific topics include: stream-batch adaptive Hybrid Shuffle, cloud-native multi-tier storage adaptive Shuffle, and the overall architecture upgrade and functional consolidation of Shuffle.
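The idea behind a hybrid shuffle can be sketched in a few lines: records stay in a bounded in-memory buffer while the consumer keeps up, and spill to disk when it fills. This is a simplified, dependency-free model, not Flink's implementation; all names are illustrative.

```python
import os
import pickle
import tempfile

class HybridShuffleWriter:
    """Toy hybrid shuffle: a bounded in-memory buffer backed by a spill file,
    so fast consumers read from memory while slow ones fall back to disk."""

    def __init__(self, memory_limit=4):
        self.memory_limit = memory_limit
        self.buffer = []
        self.spill_file = tempfile.NamedTemporaryFile(delete=False)

    def write(self, record):
        if len(self.buffer) < self.memory_limit:
            self.buffer.append(record)            # fast path: stays in memory
        else:
            pickle.dump(record, self.spill_file)  # slow path: spill to disk

    def finish(self):
        self.spill_file.close()

    def read_all(self):
        # Consumer drains the in-memory portion first, then the spilled part.
        records = list(self.buffer)
        with open(self.spill_file.name, "rb") as f:
            while True:
                try:
                    records.append(pickle.load(f))
                except EOFError:
                    break
        os.unlink(self.spill_file.name)
        return records
```

The adaptive part of the real design is deciding, per partition and at runtime, how much data can stay in memory versus be spilled, rather than fixing one mode (pipelined or blocking) up front.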
Adaptive and flexible execution management for batch jobs
Zhu Zhu | Alibaba Cloud Senior Technical Expert, Apache Flink PMC member & Committer
To better support the execution of batch jobs and improve their ease of use, execution efficiency, and stability, we have improved Flink's execution control mechanisms:
1: Dynamic execution plans are supported, enabling Flink to adjust the execution plan based on runtime information. On this basis, we introduced the ability to automatically set parallelism and automatically balance the distribution of data.
2: More flexible (finer-grained) execution control is supported, enabling Flink to run multiple execution attempts of the same task at the same time. On this basis, we introduced speculative execution to Flink. These improved mechanisms also open up more possibilities for further optimizing Flink job execution in the future.
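The essence of speculative execution can be sketched with standard-library tools: run several attempts of the same idempotent task in parallel and take whichever finishes first, so one slow machine no longer dictates job latency. This is a conceptual model only; the function names and the use of threads are assumptions of the sketch, not Flink's scheduler.

```python
import concurrent.futures
import random
import time

def run_speculatively(task, attempts=2):
    """Run duplicate attempts of the same (idempotent) task in parallel and
    return the first result -- a toy model of speculative execution, which
    mitigates long-tail tasks caused by slow nodes."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=attempts) as pool:
        futures = [pool.submit(task) for _ in range(attempts)]
        done, not_done = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in not_done:  # try to cancel the slower attempts
            f.cancel()
        return next(iter(done)).result()

def flaky_task():
    # One attempt may land on a "slow node"; the duplicate attempt hides it.
    time.sleep(random.choice([0.01, 0.2]))
    return 42
```

In the real scheduler the speculative attempt is only launched after a task is detected to be abnormally slow, and the non-winning attempts' outputs are discarded so results stay correct.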
Flink OLAP Improvement of Resource Management and Runtime
Cao Dizhou | ByteDance infrastructure engineer
Job QPS and resource isolation are the biggest challenges for Flink OLAP computing, and the biggest pain points for ByteDance's internal businesses using Flink for OLAP. Building on last year's job scheduling and execution optimizations, the ByteDance Flink team has made extensive, in-depth optimizations to the Flink engine's architecture and implementation, raising the engine's QPS for complex jobs from 10 to more than 100, and for simple jobs on small data volumes from 30 to more than 1000. This talk covers an analysis of Flink OLAP difficulties and bottlenecks, job scheduling, runtime execution, benefits, and future plans.
1: Introduce the difficulties and bottlenecks Flink encounters in OLAP.
2: Job scheduling
• Changing job resource registration, application, and release from Slot granularity to TaskManager granularity
• Optimizing JobManager-TaskManager interaction during job initialization, and supporting cross-job TaskManager connection reuse
• Optimizing the deployment structure and serialization of job computing tasks
• Moving job submission and result fetching from a pull model to a push model
• Addressing TaskManager CPU bottlenecks and optimizing highly concurrent task initialization
• Optimizing job execution memory in JobManager and TaskManager
• Improved scheduling performance
• Core business onboarding
3: Benefits and future plans
• Serverless capability building
• Performance improvement
The Latest Progress of PyFlink and Its Typical Application Scenarios
Fu Dian | Alibaba Cloud Senior Technical Expert, Apache Flink PMC member & Committer
PyFlink is a feature introduced in Flink 1.9. It lets users develop Flink jobs in Python, which greatly benefits developers in the Python ecosystem, lowers the barrier to developing Flink jobs, and makes it easy to use the rich set of third-party Python libraries within Flink jobs, greatly expanding Flink's application scenarios. After iterations from Flink 1.9 to Flink 1.16, PyFlink has matured considerably and now covers most of the functionality provided by the Flink Java & Scala APIs.
In this talk, we will first briefly introduce the current status of the PyFlink project and the new features in the newly released Flink 1.16. We will then use concrete examples to walk through PyFlink's typical application scenarios, giving users an intuitive understanding of its typical uses.
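A typical PyFlink use case is registering a plain Python function as a scalar UDF so that third-party Python libraries can be used inside SQL. The sketch below shows such a function in plain Python; in an actual job one would wrap it with the `udf` decorator from `pyflink.table` and register it on the TableEnvironment. The decorator and field names here are omitted/illustrative so the sketch runs without a Flink installation.

```python
import json

# In a real PyFlink job this function would be wrapped with the udf
# decorator from pyflink.table and registered on the TableEnvironment;
# it is shown as plain Python here so the sketch has no Flink dependency.
def extract_city(profile_json):
    """Pull the 'city' field out of a JSON user-profile string, tolerating
    bad input. Leaning on Python's json module (or any third-party library)
    is exactly the kind of thing PyFlink UDFs make easy."""
    try:
        return json.loads(profile_json).get("city", "unknown")
    except (ValueError, TypeError):
        return "unknown"
```

The same function body works unchanged whether it is called in a unit test or executed per-row inside a Flink SQL query, which keeps Python-side logic easy to develop and verify.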
Apache Flink 1.16 Feature Interpretation
Huang Xingbo ｜ Alibaba Cloud Senior Development Engineer, Apache Flink Committer, Flink 1.16 Release Manager
Apache Flink has always been one of the most active and fastest-growing projects in the Apache community. Every past release has brought many new functions and features, and the newly released Flink 1.16 is no exception.
In this talk, we will first give an overview of Flink 1.16, and then explain its improvements along the general direction of stream-batch unification from three aspects:
1: more stable, easier-to-use, and higher-performance Flink batch processing;
2: continued leadership in Flink stream processing;
3: a flourishing Flink ecosystem.
Apache Flink + Volcano: Practice of Efficient Scheduling in Big Data Scenarios
Jiang Yikun ｜ Volcano Reviewer, openEuler Infra Maintainer
Wang Leibo | Cloud Container Service Architect and Volcano Community Leader
Flink on Kubernetes has gained more and more attention and adoption. Because Kubernetes lacks support for batch scheduling, resource deadlocks often occur when scheduling big data workloads; it also lacks advanced capabilities such as queues, priorities, resource reservation, and heterogeneous compute scheduling.
This talk introduces the latest progress and best practices around the Apache Flink community's FLIP-250: Support Customized Kubernetes Schedulers proposal.
Query Optimization and Production Practice of Flink OLAP at ByteDance
He Runkang | ByteDance infrastructure engineer
This talk is divided into five parts: Flink OLAP business practice at ByteDance and the problems encountered, OLAP cluster operations and stability governance, SQL Query Optimizer optimization, Query Executor optimization, and benefits and future plans.
1: Application of Flink OLAP in ByteDance
• Overall architecture and the adoption of Flink OLAP at ByteDance
• Requirements and problems around operations, monitoring, stability, and query performance encountered in business use
2: Flink OLAP cluster operations and stability
• Improvements to the Flink OLAP monitoring system, contrasted with stream computing
• Flink OLAP release and stability governance, contrasted with stream computing
3: Query Optimizer Optimization
• Faster plan construction, including Plan Cache, Catalog Cache, etc.
• Optimizer improvements with richer optimization rules, including enhanced pushdown, Join Filter propagation, improved statistics, etc.
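To illustrate why a plan cache matters for high-QPS OLAP: if the same SQL text arrives repeatedly, the expensive parse/optimize step can be skipped entirely. Below is a minimal LRU plan-cache sketch; the class and field names are illustrative, not ByteDance's or Flink's actual implementation.

```python
from collections import OrderedDict

class PlanCache:
    """Toy LRU plan cache: repeated queries skip parsing and optimization by
    keying the compiled plan on the SQL text (a simplified model)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.compilations = 0  # counts how often we actually had to plan

    def get_plan(self, sql, compile_fn):
        if sql in self.cache:
            self.cache.move_to_end(sql)       # mark as recently used
            return self.cache[sql]
        self.compilations += 1
        plan = compile_fn(sql)                # the expensive optimize step
        self.cache[sql] = plan
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return plan
```

In a real engine the cache key must also cover catalog versions and session settings, since a stale plan against a changed table would be incorrect.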
4: Query Executor Optimization
• Memory optimization, including Codegen cache and ClassLoader reuse
• RuntimeFilter optimization
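The RuntimeFilter idea can be sketched with a small Bloom filter: the build side of a join populates the filter, and the probe side drops rows that cannot possibly match before they ever reach the join operator. This is a conceptual, dependency-free model; the hashing scheme and function names are choices of the sketch, not the engine's.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter used to sketch a RuntimeFilter: no false
    negatives, a small chance of false positives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for seed in range(self.hashes):
            digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def runtime_filtered_probe(build_keys, probe_rows):
    # Build side -> Bloom filter -> early pruning on the probe side.
    bf = BloomFilter()
    for k in build_keys:
        bf.add(k)
    return [row for row in probe_rows if bf.might_contain(row[0])]
```

Because the filter only ever produces false positives, never false negatives, pruning is safe: every row that could match the join survives, while most non-matching rows are discarded cheaply.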
5: Benefits and future plans
• Product improvements, a vectorized engine, materialized views, and optimizer evolution
Yu Hangxiang | Apache Flink Contributor
The ChangelogStateBackend introduced in FLIP-158 provides more stable and faster checkpoints for user jobs, a faster failover process, and lower end-to-end latency for jobs with transactional sinks.
Starting from the history of checkpoint performance optimization, this talk introduces the basic mechanism, application scenarios, and future plans of ChangelogStateBackend, and shares relevant performance test results.
Optimization and Practice of Flink Unaligned Checkpoint at Shopee
Fan Rui | Tech Lead of the Shopee Flink Runtime Team, Apache StreamPark Committer & Flink/Hadoop/HBase/RocksDB Contributor
In production practice at Shopee, we encountered many checkpoint-related problems and tried introducing unaligned checkpoints to solve some of them. Investigation showed a gap between the actual effect and our expectations, so we made deep improvements in our internal version, most of which have been contributed back to the Flink community.
The lecture will include the following contents:
• Problems in Checkpoint
• Principle of Unaligned Checkpoint
• Greatly increasing the benefits of unaligned checkpoints
• Significantly reducing the risks of unaligned checkpoints
• Production practice of unaligned checkpoints at Shopee, and future plans
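The principle behind unaligned checkpoints can be shown in a few lines: instead of waiting for all in-flight buffers ahead of the checkpoint barrier to be processed (alignment), the barrier overtakes them, and the overtaken buffers are persisted as part of the checkpoint so nothing is lost on recovery. This is a deliberately simplified single-channel model; the names are illustrative.

```python
BARRIER = "barrier"

def unaligned_checkpoint(channel_buffers):
    """channel_buffers: buffers in arrival order, containing one BARRIER.
    The barrier 'overtakes' the buffers queued ahead of it; those buffers
    become in-flight data stored with the checkpoint and are replayed on
    recovery, so the operator need not drain them before checkpointing."""
    idx = channel_buffers.index(BARRIER)
    overtaken = channel_buffers[:idx]        # persisted with the checkpoint
    remaining = channel_buffers[idx + 1:]    # processed after the checkpoint
    checkpoint = {"in_flight": overtaken}
    return checkpoint, remaining
```

The benefit is that checkpoint duration stops depending on backpressure; the cost, and one source of the risks discussed in this talk, is that the in-flight buffers enlarge the checkpoint and must be replayed on restore.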
Flink state optimization and remote state exploration
Zhang Yang | Bilibili Senior Development Engineer
Flink RocksDB state optimization: the compaction process is optimized, improving throughput for large-state jobs.
Flink remote state exploration: built on Bilibili's internal KV storage, it separates storage from compute, speeds up task recovery, and better supports Flink's cloud-native deployment.
StateBackend performance improvement with TerarkDB
TerarkDB is an LSM storage engine derived from ByteDance's internal optimization of RocksDB, aimed at improving storage engine performance and reducing resource overhead. This talk covers five aspects: background, business pain points, Flink & TerarkDB integration, benefits, and future plans.
Problems encountered by RocksDBStateBackend in production practice:
• Write amplification causes high CPU consumption and disk IO
• The SST file cleanup strategy is not general, leading to continuous space growth
• Resonance between checkpoints and compaction causes sharp CPU spikes
Core optimizations of TerarkDB, compared with RocksDB:
• KV separation: reduces write amplification and resource usage
• Scheduled TTL GC: speeds up file reclamation and prevents files from never being cleaned up
Flink & TerarkDB integration scheme:
• Adapting the Flink and TerarkDB JNI interfaces to support the Flink compaction filter
• Supporting coexistence of RocksDB and TerarkDB
• Providing a new checkpoint mechanism to eliminate CPU spikes
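The KV-separation idea above can be sketched simply: large values live in an append-only value log, while the LSM tree stores only small pointers, so compaction rewrites pointers instead of big values, cutting write amplification. This is a conceptual model with illustrative names, not TerarkDB's actual design.

```python
class KVSeparatedStore:
    """Toy model of KV separation: values above a threshold go to an
    append-only value log, and the index (standing in for the LSM tree)
    keeps only (key -> log position), so rewriting index entries during
    compaction no longer copies large values."""

    def __init__(self, inline_threshold=16):
        self.inline_threshold = inline_threshold
        self.index = {}        # stands in for the LSM tree
        self.value_log = []    # append-only blob storage

    def put(self, key, value):
        if len(value) <= self.inline_threshold:
            self.index[key] = ("inline", value)  # small values stay inline
        else:
            self.value_log.append(value)         # big values go to the log
            self.index[key] = ("log", len(self.value_log) - 1)

    def get(self, key):
        kind, payload = self.index[key]
        return payload if kind == "inline" else self.value_log[payload]
```

The trade-off, which motivates the scheduled TTL GC above, is that deleted or overwritten values linger in the value log until a garbage-collection pass reclaims them.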
Progress of Generic Log-Based Incremental Checkpoint at Meituan
Wang Feifan | Computing Engine Engineer of Meituan Data Platform, Apache Flink Contributor
The State-Changelog-based checkpoint is an important evolution of the checkpoint mechanism. We believe it can solve some business pain points, so we have followed its development and co-built it. This talk covers the following aspects:
1: Meituan application scenarios and verification
2: State Changelog restore performance optimization
3: Exploration of storage model selection for State Changelog
Knowledge Base Team