Sort-Based Blocking Shuffle Implementation in Flink – Part 1

By Yingjie Cao (Kevin) & Daisy Tsang

Part 1 of this 2-part series will explain the motivation behind introducing the sort-based blocking shuffle, present benchmark results, and provide guidelines on how to use this new feature.

How Data Gets Passed between Operators

Data shuffling is an important stage in batch processing applications and describes how data is sent from one operator to the next. In this phase, the output data of the upstream operator is written to persistent storage such as disks. Then, the downstream operator reads the corresponding data and processes it. Blocking shuffle means that intermediate results from operator A are not sent to operator B until operator A has completely finished.

The hash-based and sort-based blocking shuffles are the two main blocking shuffle implementations widely adopted by existing distributed data processing frameworks:

Hash-Based Approach: The core idea behind the hash-based approach is to write data consumed by different consumer tasks to different files. Then, each file can serve as a natural boundary for the partitioned data.
Sort-Based Approach: The core idea behind the sort-based approach is to write all the produced data together first and then leverage sorting to cluster data belonging to different data partitions or keys.
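
To make the contrast concrete, here is a toy in-memory sketch of the two ideas. It is not Flink's implementation: Python lists stand in for files, and the `partition_of` helper, the modulo partitioning, and the sample records are all assumptions made for illustration.

```python
def partition_of(key, num_consumers):
    """Hypothetical partitioner: route a record to a consumer by integer key."""
    return key % num_consumers


def hash_based_write(records, num_consumers):
    """Hash-based idea: one output 'file' (here a list) per consumer task."""
    files = {c: [] for c in range(num_consumers)}
    for key, value in records:
        files[partition_of(key, num_consumers)].append((key, value))
    return files


def sort_based_write(records, num_consumers):
    """Sort-based idea: one shared 'file', sorted by partition, plus an index
    that maps each partition to its (start, end) offsets in that file."""
    data = sorted(records, key=lambda r: partition_of(r[0], num_consumers))
    index = {}
    for pos, (key, _) in enumerate(data):
        p = partition_of(key, num_consumers)
        start, _ = index.get(p, (pos, pos))
        index[p] = (start, pos + 1)
    return data, index


records = [(5, "a"), (2, "b"), (4, "c"), (1, "d"), (3, "e")]
files = hash_based_write(records, num_consumers=2)
data, index = sort_based_write(records, num_consumers=2)

# Each consumer reads one contiguous range of the single sorted file.
for consumer, (start, end) in sorted(index.items()):
    print(consumer, data[start:end])
```

In this sketch the hash-based writer needs one file per consumer, while the sort-based writer needs a single file and a small index, and each consumer reads one contiguous slice.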

The sort-based blocking shuffle was introduced in Flink 1.12. It was further optimized for stability and performance and made production-ready in Flink 1.13. We hope you enjoy the improvements, and any feedback is highly appreciated.

The Motivation behind the Sort-Based Implementation

The hash-based blocking shuffle has been supported in Flink for a long time. However, compared to the sort-based approach, it can have several weaknesses:

Stability: For batch jobs with high parallelism (tens of thousands of subtasks), the hash-based approach opens many files concurrently while writing or reading data, which can put high pressure on the file system (for example, maintaining too much file metadata, or exhausting inodes or file descriptors). We have encountered many stability issues when running large-scale batch jobs via the hash-based blocking shuffle.
Performance: For large-scale batch jobs, the hash-based approach can produce too many small files. For each data shuffle (or connection), the number of output files is (producer parallelism) * (consumer parallelism), and the average size of each file is (shuffle data size) / (number of files). The random IO caused by writing and reading these fragmented files can significantly degrade shuffle performance, especially on spinning disks. Please see the benchmark results section for more information.
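
The file-count arithmetic above can be worked through with a short sketch. The function name and the example job (1000 producers, 1000 consumers, 1 TiB of shuffle data) are hypothetical and chosen only to show the scale of the problem.

```python
def hash_shuffle_file_stats(producer_parallelism, consumer_parallelism, shuffle_bytes):
    """Return (number of files, average file size in bytes) for one data shuffle,
    using the formulas stated in the text."""
    num_files = producer_parallelism * consumer_parallelism
    avg_file_bytes = shuffle_bytes / num_files
    return num_files, avg_file_bytes


# Hypothetical job: 1000 producers, 1000 consumers, 1 TiB shuffled.
num_files, avg_bytes = hash_shuffle_file_stats(1000, 1000, 1024**4)
print(f"{num_files} files, about {avg_bytes / 1024**2:.2f} MiB each")
# prints: 1000000 files, about 1.05 MiB each
```

Even with a terabyte of shuffle data, each of the one million files in this example holds only about one mebibyte, which is the fragmentation the paragraph above describes.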

With the sort-based blocking shuffle implementation, fewer data files are created and opened, and more reads are sequential. As a result, better stability and performance can be achieved.

Moreover, the sort-based implementation can save network buffers for large-scale batch jobs. For the hash-based implementation, the network buffers needed for each output result partition are proportional to the consumers' parallelism. For the sort-based implementation, the network memory consumption can be decoupled from the parallelism, which means that a fixed size of network memory can satisfy requests for all result partitions, though more network memory may lead to better performance.
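
As a rough illustration of this scaling difference, the sketch below compares a buffer requirement that grows with the consumer parallelism against a fixed budget. It is a simplified model, not Flink's actual accounting: the 32 KiB buffer size, the one-buffer-per-consumer factor, and the fixed budget of 512 buffers are all assumptions.

```python
BUFFER_BYTES = 32 * 1024  # assumed size of one network buffer (32 KiB)


def hash_based_buffers(consumer_parallelism, buffers_per_consumer=1):
    """Assumed model: buffers per result partition grow with consumer parallelism."""
    return consumer_parallelism * buffers_per_consumer


def sort_based_buffers(fixed_budget=512):
    """Assumed model: a fixed buffer budget per result partition."""
    return fixed_budget


def to_mib(num_buffers):
    """Convert a buffer count to mebibytes under the assumed buffer size."""
    return num_buffers * BUFFER_BYTES / 1024**2


for parallelism in (100, 1_000, 10_000):
    hash_mib = to_mib(hash_based_buffers(parallelism))
    sort_mib = to_mib(sort_based_buffers())
    print(f"consumers={parallelism}: hash ~{hash_mib:.1f} MiB, sort ~{sort_mib:.1f} MiB")
```

Under these assumptions the hash-based requirement per result partition grows linearly with the number of consumers, while the sort-based budget stays constant.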

Benchmark Results on Stability and Performance

Aside from the problem of consuming too many file descriptors and inodes mentioned in the section above, the hash-based blocking shuffle also has a known issue of creating too many files, which blocks the TaskExecutor's main thread (FLINK-21201). In addition, some large-scale jobs like q78 and q80 of the TPC-DS benchmark failed to run on the hash-based blocking shuffle in our tests because of the "connection reset by peer" exception, which is similar to the issue reported in FLINK-19925. (Reading shuffle data by Netty threads can influence network stability.)

We ran the TPC-DS test suite (10 TB scale with a maximum parallelism of 1050) for both the hash-based and the sort-based blocking shuffle. The results show that the sort-based shuffle can be 2 to 6 times faster than the hash-based shuffle on spinning disks. If we exclude the computation time, some jobs run up to 10 times faster. Here are some performance results of our tests:

The throughput per disk of the new sort-based implementation can reach up to 160MB/s for writing and reading on our testing nodes:

Disk Name	Disk SDI	Disk SDJ	Disk SDK
Writing Speed (MB/s)	189	173	186
Reading Speed (MB/s)	112	154	158

Note: The following table shows the settings of our test cluster. Because each node has a large amount of available memory, jobs with a small shuffle size exchange their shuffle data purely via memory (page cache). As a result, evident performance differences are only seen for jobs that shuffle a large amount of data.

Number of Nodes	Memory Size Per Node	Cores Per Node	Disks Per Node
12	About 400G	96	3

How to Use This New Feature

The sort-based blocking shuffle is introduced mainly for large-scale batch jobs, but it also works well for batch jobs with low parallelism.

The sort-based blocking shuffle is not enabled by default. You can enable it by setting the taskmanager.network.sort-shuffle.min-parallelism config option to a value smaller than its default. The hash-based blocking shuffle is used for parallelism smaller than this threshold; otherwise, the sort-based blocking shuffle is used. Setting this option to 1 disables the hash-based blocking shuffle. The option has no influence on streaming applications.
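
The threshold rule described above can be sketched as a small function. This only models the selection rule as stated in the text, not Flink's internal code; the function name and the sample parallelism values are hypothetical.

```python
def select_blocking_shuffle(parallelism, min_parallelism):
    """Below the threshold use the hash-based shuffle, otherwise the sort-based one."""
    return "hash" if parallelism < min_parallelism else "sort"


# With the option set to 1, every batch job uses the sort-based shuffle.
for parallelism in (1, 64, 500):
    print(parallelism, select_blocking_shuffle(parallelism, min_parallelism=1))

# With a hypothetical threshold of 128, only jobs at or above it switch over.
for parallelism in (64, 128, 500):
    print(parallelism, select_blocking_shuffle(parallelism, min_parallelism=128))

# The corresponding line in the Flink configuration file would look like this:
print("taskmanager.network.sort-shuffle.min-parallelism: 1")
```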

You should use the sort-based blocking shuffle for spinning disks and large-scale batch jobs. Both implementations should be fine for low parallelism (several hundred processes or fewer) on solid-state drives.

There are several other config options that can have an impact on the performance of the sort-based blocking shuffle:

taskmanager.network.blocking-shuffle.compression.enabled: This enables shuffle data compression, which can reduce the network and disk IO with some CPU overhead. We recommend enabling shuffle data compression unless the data compression ratio is low. It works for the sort-based and hash-based blocking shuffles.
taskmanager.network.sort-shuffle.min-buffers: This declares the minimum number of required network buffers that can be used as the in-memory sort-buffer per result partition for data caching and sorting. Increasing the value of this option may improve the blocking shuffle performance. Several hundred megabytes of memory are usually enough for large-scale batch jobs.
taskmanager.memory.framework.off-heap.batch-shuffle.size: This defines the maximum memory size per task manager that can be used for reading the data of the sort-based blocking shuffle. Increasing the value of this option may improve the shuffle read performance. Usually, several hundred megabytes of memory are enough for large-scale batch jobs. You may also need to increase taskmanager.memory.framework.off-heap.size before you increase this value because this memory is cut from the framework off-heap memory.
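
Because taskmanager.network.sort-shuffle.min-buffers is a buffer count while the guidance above is given in megabytes, a small conversion helps when picking a value. The sketch below assumes a network buffer size of 32 KiB and renders the options as "key: value" lines for the Flink configuration file. The helper names and all concrete values (64 MiB of sort buffer, 256m, 384m) are illustrative assumptions, not recommendations.

```python
SEGMENT_KIB = 32  # assumed size of one network buffer in KiB


def buffers_for_megabytes(megabytes, segment_kib=SEGMENT_KIB):
    """Convert a memory budget in MiB into a number of network buffers."""
    return (megabytes * 1024) // segment_kib


def render_flink_conf(options):
    """Render options as 'key: value' lines, one per option."""
    return "\n".join(f"{key}: {value}" for key, value in options.items())


# Illustrative values only; tune them for your own cluster and workload.
options = {
    "taskmanager.network.sort-shuffle.min-parallelism": 1,
    "taskmanager.network.blocking-shuffle.compression.enabled": "true",
    "taskmanager.network.sort-shuffle.min-buffers": buffers_for_megabytes(64),
    # The read memory is cut from the framework off-heap memory,
    # so the off-heap size is raised above it in this example.
    "taskmanager.memory.framework.off-heap.batch-shuffle.size": "256m",
    "taskmanager.memory.framework.off-heap.size": "384m",
}

print(render_flink_conf(options))
```

Under the 32 KiB assumption, a 64 MiB sort buffer corresponds to 2048 buffers per result partition. The task manager's network memory must be large enough to satisfy that request for all result partitions running at the same time.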

For more information about blocking shuffle in Flink, please refer to the official documentation.

Note: As the optimization mechanism described in Part 2 shows, the IO scheduling relies on concurrent data read requests from the downstream consumer tasks to achieve more sequential reads. As a result, if the downstream consumer tasks run one by one (for example, because of limited resources), the advantage brought by IO scheduling disappears, which can hurt performance. We may optimize this scenario further in future versions.

What's Next?

Learn details on the design and implementation of this feature in Part 2 of this article!
