New Features of Alibaba Cloud Remote Shuffle Service: AQE and Throttling

Since the launch of Remote Shuffle Service (RSS) in 2020, Alibaba Cloud EMR has helped many customers solve performance and stability problems in Spark jobs and implement an architecture that separates storage from computing. Alibaba Cloud open-sourced RSS in early 2022 to make it easier to use and extend, and all developers are welcome to help build it. Please refer to [1] for the overall architecture of RSS. This article introduces the two latest major features of RSS: support for Adaptive Query Execution (AQE) and throttling.

Support of RSS for AQE

An Introduction to AQE

Adaptive Query Execution (AQE) is an important feature of Spark 3 [2]. It collects runtime statistics and uses them to adjust the remainder of the execution plan dynamically, which addresses the poor plans that result when the optimizer cannot estimate statistics accurately. AQE has three main optimization scenarios: partition coalescing, Join strategy switching, and skew Join optimization. All three impose new requirements on the shuffle framework.

Partition Coalescing

The purpose of partition coalescing is to make the amount of data each reducer processes moderate and as even as possible. First, the mappers perform shuffle write using a relatively large number of partitions. The AQE framework then collects the size of each partition. If several adjacent partitions each hold only a small amount of data, they are merged into one and handed to a single reducer. Here is the procedure:

According to the figure above, the optimized Reducer 2 needs to read the data that originally belonged to Reducers 2-4. The shuffle framework must therefore provide a ShuffleReader that can read a range of partitions:

def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext): ShuffleReader[K, C]
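
As a rough illustration of how coalescing produces the partition ranges passed to getReader, the sketch below greedily merges adjacent partitions until the next one would exceed a target size. The object name `CoalesceSketch`, the function `coalesce`, and the `targetSize` parameter are hypothetical; Spark's real coalescing rule has more inputs. Ranges are half-open, matching the `startPartition` (inclusive) and `endPartition` (exclusive) convention of getReader.

```scala
import scala.collection.mutable.ArrayBuffer

object CoalesceSketch {
  /** Greedily merges adjacent partitions while the merged size stays within
    * `targetSize`. Returns half-open ranges [startPartition, endPartition).
    * Hypothetical helper for illustration, not Spark's implementation. */
  def coalesce(partitionSizes: Seq[Long], targetSize: Long): Seq[(Int, Int)] = {
    val ranges = ArrayBuffer.empty[(Int, Int)]
    var start = 0
    var acc = 0L
    for (i <- partitionSizes.indices) {
      // Close the current range if adding partition i would overshoot the target.
      if (i > start && acc + partitionSizes(i) > targetSize) {
        ranges += ((start, i))
        start = i
        acc = 0L
      }
      acc += partitionSizes(i)
    }
    if (partitionSizes.nonEmpty) ranges += ((start, partitionSizes.length))
    ranges.toSeq
  }
}
```

With sizes `Seq(100, 10, 20, 30, 90)` and a target of 100, the three small partitions in the middle collapse into the single range (1, 4), so one reducer reads what three reducers would have read before.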

Join Strategy Switching

The purpose of Join strategy switching is to correct cases in which inaccurate stats estimation leads the optimizer to choose SortMerge Join or ShuffleHash Join when Broadcast Join would have been the right choice. Specifically, after the two joined tables finish shuffle write, the AQE framework measures the actual size of each table. If the small table meets the conditions for Broadcast Join, it is broadcast and joined with the local shuffle data of the large table. Here are the steps:

Join strategy switching involves two optimizations:

1. The Join is changed to Broadcast Join.
2. The data of the large table is read locally by LocalShuffleReader.

For the second optimization, the new requirement for the shuffle framework is to support local reads.
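
The re-planning decision can be pictured with a minimal sketch: once the actual shuffle sizes are known, the smaller side is compared against a broadcast threshold. The names `JoinSwitchSketch`, `replan`, and `broadcastThreshold` are hypothetical, and the real AQE rule checks more conditions than size alone.

```scala
object JoinSwitchSketch {
  sealed trait JoinStrategy
  case object SortMergeJoin extends JoinStrategy
  case object ShuffleHashJoin extends JoinStrategy
  case object BroadcastJoin extends JoinStrategy

  /** Keeps the planned strategy unless the smaller side, measured after
    * shuffle write, fits under the broadcast threshold (all sizes in bytes).
    * Illustrative only; size is the single condition modeled here. */
  def replan(planned: JoinStrategy,
             leftBytes: Long,
             rightBytes: Long,
             broadcastThreshold: Long): JoinStrategy =
    if (math.min(leftBytes, rightBytes) <= broadcastThreshold) BroadcastJoin
    else planned
}
```

For example, with a 10 MB threshold, a join between a 500 GB table and a table that turns out to be 8 MB at runtime is re-planned to `BroadcastJoin`, while a join between two 500 GB tables keeps its planned strategy.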

Skew Join Optimization

The purpose of skew Join optimization is to let more reducers handle skewed partitions and thus avoid long tails. Specifically, after shuffle write finishes, the AQE framework collects the size of each partition and determines whether any partition is skewed according to specific rules. If so, the partition is divided into multiple splits, and each split is joined with the corresponding partition of the other table (as shown in the following figure):

A partition is split by accumulating the size of the shuffle output in MapId order; a split is cut whenever the accumulated value exceeds a threshold. The new requirement for the shuffle framework is that ShuffleReader must support reading a range of MapIds. Combined with the range-partition requirement of partition coalescing, the ShuffleReader interface evolves to:

def getReader[K, C](
    handle: ShuffleHandle,
    startMapIndex: Int,
    endMapIndex: Int,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C]
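
The splitting rule described above can be sketched as follows. Map output sizes for one skewed partition are accumulated in MapId order, and a split is cut as soon as the running total exceeds the threshold. The names `SkewSplitSketch` and `splitByMapId` are hypothetical, and cutting after the map that crosses the threshold is an assumption of this sketch; Spark's own splitting logic differs in detail.

```scala
import scala.collection.mutable.ArrayBuffer

object SkewSplitSketch {
  /** Splits one skewed partition into half-open MapId ranges
    * [startMapIndex, endMapIndex) by accumulating per-map output sizes. */
  def splitByMapId(mapOutputSizes: Seq[Long], threshold: Long): Seq[(Int, Int)] = {
    val ranges = ArrayBuffer.empty[(Int, Int)]
    var start = 0
    var acc = 0L
    for (i <- mapOutputSizes.indices) {
      acc += mapOutputSizes(i)
      if (acc > threshold) {          // accumulated size crossed the threshold: cut here
        ranges += ((start, i + 1))
        start = i + 1
        acc = 0L
      }
    }
    // Whatever is left over forms the last split.
    if (start < mapOutputSizes.length) ranges += ((start, mapOutputSizes.length))
    ranges.toSeq
  }
}
```

Each resulting pair maps directly onto the `startMapIndex` and `endMapIndex` arguments of the evolved getReader, with `startPartition` and `endPartition` selecting the single skewed partition.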

Review of RSS Architecture

The core design of RSS is push-based shuffle with partition data aggregation. Different mappers push the data of the same partition to the same worker for aggregation, and a reducer reads the aggregated file directly (as shown in the following figure):

In addition to the core design, RSS implements multiple replicas, end-to-end fault tolerance, Primary HA, disk fault tolerance, an adaptive Pusher, rolling upgrades, and other features. Please see [1] for details.

Support of RSS for Partition Coalescing

Partition coalescing requires the shuffle framework to support reading a range of partitions. Because each partition corresponds to one file in RSS, range reads are supported naturally (as shown in the following figure):

Support of RSS for Join Strategy Switching

Join strategy switching requires the shuffle framework to support LocalShuffleReader. Because RSS is remote by nature, data is stored in the RSS cluster and is local only when RSS and the computing cluster are deployed on the same machines. Local reads are therefore not supported at present; the co-deployed scenario will be optimized and supported in the future. Note that although local reads are not supported, the Join rewrite itself is unaffected. The following figure shows that RSS supports the Join rewrite optimization:

Support of RSS for Skew Join Optimization

Among the three AQE scenarios, skew Join optimization is the most difficult for RSS to support. The core design of RSS is partition data aggregation, whose purpose is to turn the random reads of Shuffle Read into sequential reads and thereby improve performance and stability. Multiple mappers push data to RSS workers concurrently, and RSS flushes the data to disk after aggregating it in memory. As a result, data from different mappers is unordered within a partition file (as shown in the following figure):

Skew Join optimization requires reading a range of maps, such as the data of Map1-2. There are two common approaches:

1. Read the complete file and discard the data outside the range.
2. Introduce an index file that records the location and MapId of each block, and read only the data within the range.

Both methods have clear problems. Method 1 causes a large number of redundant disk reads, while Method 2 essentially falls back to random reads and loses the core advantage of RSS. In addition, because it is difficult to predict during Shuffle Write whether a partition will be skewed, the index file becomes an overhead that every partition pays, even for non-skewed data.

We proposed a new design to solve the two problems above: active split and sort on read.

Active Split

A skewed partition is likely to be very large, and in extreme cases it can fill up a disk outright. Large partitions are also fairly common even in non-skewed scenarios. Therefore, from the perspective of disk load balancing, it is necessary to monitor the size of partition files and split them actively. The default threshold is 256 MB.

When a split occurs, RSS assigns a new pair of workers (primary and replica) to the current partition, and subsequent data is pushed to the new workers. To keep the split from affecting running mappers, we proposed a soft split: when a split is triggered, RSS prepares the new workers asynchronously and, once they are ready, hot-updates the PartitionLocation information held by the mappers. The split therefore does not interfere with the mappers' PushData. The following figure shows the whole process:
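
The soft-split behavior can be modeled in a few lines. The sketch below is a simplified, single-threaded model under stated assumptions: the `PartitionState` class, the `epoch` field, and the `pickWorker` callback are hypothetical, only the primary worker is tracked, and the asynchronous preparation is represented by an explicit `onNewWorkerReady` call. It is not the RSS implementation.

```scala
object ActiveSplitSketch {
  val SplitThresholdBytes: Long = 256L * 1024 * 1024 // default threshold from the text

  // Fields are assumptions for illustration; `epoch` counts splits of a partition.
  final case class PartitionLocation(partitionId: Int, epoch: Int, worker: String)

  final class PartitionState(partitionId: Int, firstWorker: String, pickWorker: () => String) {
    private var current = PartitionLocation(partitionId, 0, firstWorker)
    private var pending: Option[PartitionLocation] = None
    private var bytesInSplit = 0L

    /** Location that mappers currently push to. */
    def location: PartitionLocation = current

    /** Records a push. Crossing the threshold only starts preparing a new
      * location; pushes keep going to the old one in the meantime. */
    def onPush(bytes: Long): Unit = {
      bytesInSplit += bytes
      if (bytesInSplit > SplitThresholdBytes && pending.isEmpty)
        pending = Some(PartitionLocation(partitionId, current.epoch + 1, pickWorker()))
    }

    /** Called once the new worker is ready: switch the mappers' location. */
    def onNewWorkerReady(): Unit = pending.foreach { next =>
      current = next
      pending = None
      bytesInSplit = 0L
    }
  }
}
```

The point of the model is that `onPush` never blocks or fails because of a split: the location changes only when `onNewWorkerReady` runs.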

Sort on Read

RSS adopts a Sort on Read strategy to avoid random reads. Specifically, the first range read of a file split triggers sorting, while a non-range read does not. The sorted file is then written back to disk along with its location index, which ensures that subsequent range reads are sequential (as shown in the following figure):
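
A minimal in-memory model of Sort on Read, assuming a file split is just a sequence of blocks tagged with a MapId. The names `SortOnReadSketch`, `FileSplit`, `readAll`, and `readRange` are hypothetical, and the index here stores block positions rather than byte offsets. The first range read sorts once and builds the index; later range reads return one contiguous slice.

```scala
object SortOnReadSketch {
  final case class Block(mapId: Int, payload: String)

  final class FileSplit(initial: Seq[Block]) {
    private var blocks: Vector[Block] = initial.toVector
    // (mapId, position of its first block), ascending by mapId; built lazily.
    private var index: Option[Vector[(Int, Int)]] = None
    private var sorts = 0

    def sortCount: Int = sorts

    /** Non-range read: returns the file as stored and never triggers a sort. */
    def readAll(): Vector[Block] = blocks

    /** Range read of MapIds in [startMapIndex, endMapIndex). */
    def readRange(startMapIndex: Int, endMapIndex: Int): Vector[Block] = {
      val idx = index.getOrElse(sortAndIndex())
      // Position of the first block whose mapId >= bound, or end of file.
      def pos(bound: Int): Int =
        idx.find { case (mapId, _) => mapId >= bound }.map(_._2).getOrElse(blocks.length)
      blocks.slice(pos(startMapIndex), pos(endMapIndex)) // one contiguous slice
    }

    private def sortAndIndex(): Vector[(Int, Int)] = {
      blocks = blocks.sortBy(_.mapId) // stable: blocks of one mapper keep their order
      val built = blocks.zipWithIndex.foldLeft(Vector.empty[(Int, Int)]) {
        case (acc, (b, i)) =>
          if (acc.nonEmpty && acc.last._1 == b.mapId) acc else acc :+ (b.mapId -> i)
      }
      index = Some(built)
      sorts += 1
      built
    }
  }
}
```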

To keep multiple sub-reducers from waiting on the sort of the same file split, we vary the order in which each sub-reducer reads the splits (as shown in the following figure):
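
The text does not say how the read order is varied. One simple possibility, shown here purely as an assumption, is to rotate the split list by the sub-reducer's index so that sub-reducers start on different splits. `SplitOrderSketch` and `readOrder` are hypothetical names.

```scala
object SplitOrderSketch {
  /** Order in which a sub-reducer visits the file splits of a partition.
    * Rotation by a non-negative sub-reducer index is one possible scheme,
    * assumed for illustration. */
  def readOrder(numSplits: Int, subReducerIndex: Int): Seq[Int] =
    if (numSplits <= 0) Seq.empty
    else (0 until numSplits).map(i => (i + subReducerIndex) % numSplits)
}
```

With four splits, sub-reducer 0 starts on split 0 and sub-reducer 1 starts on split 1, so their first range reads trigger sorts of different files instead of queuing on the same one.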

Sort Optimization

Sort on Read effectively avoids redundant and random reads, but each split file (256 MB) must be sorted. This section discusses the implementation and overhead of sorting. Sorting a file involves three steps: reading the file, sorting by MapId, and writing the file. The default RSS block size is 256 KB, so a split contains only about 1,000 blocks. The sort itself is therefore very fast, and the main overhead comes from reading and writing the file. There are three schemes for the entire sorting process:

1. Allocate memory of the file size in advance, read the file as a whole, parse and sort the MapIds, and write the blocks back to the disk in MapId order.
2. Without allocating memory, seek to the location of each block, parse and sort the MapIds, and transfer the blocks of the original file to the new file in MapId order.
3. Allocate a small block of memory (for example, 256 KB), read the entire file in sequence, parse and sort the MapIds, and transfer the blocks of the original file to the new file in MapId order.

From the perspective of I/O, at first glance, scheme 1 uses ample memory and involves no random reads or writes, scheme 2 has random reads and writes, and scheme 3 has random writes. Intuitively, scheme 1 should perform best. However, because of PageCache, the original file is likely to still be cached when scheme 3 writes the new file, so scheme 3 performed better in our tests (as shown in the following figure):

Scheme 3 also requires no additional process memory, so RSS uses scheme 3. We also tested Sort on Read against the index-only method described above, which does not sort the file and reads ranges with random reads (as shown in the following figure):
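
To make scheme 3 concrete, here is a minimal sketch under a hypothetical block layout: each block is a 4-byte MapId, a 4-byte payload length, and then the payload. This layout, and the names `SchemeThreeSketch`, `BlockRef`, `scan`, and `sortFile`, are assumptions for illustration, not the RSS on-disk format. The first pass reads the whole file front to back through a 256 KB buffer and records where each block starts; the second pass uses `FileChannel.transferTo` to copy the blocks into the new file in MapId order without holding them in process memory.

```scala
import java.io.{BufferedInputStream, DataInputStream}
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}
import scala.collection.mutable.ArrayBuffer

object SchemeThreeSketch {
  val HeaderBytes = 8               // assumed header: 4-byte mapId + 4-byte payload length
  val BufferBytes: Int = 256 * 1024 // small, fixed read buffer

  /** Where a block lives: byte offset of its header and total length with header. */
  final case class BlockRef(mapId: Int, offset: Long, length: Long)

  /** Pass 1: read the file sequentially and record every block. */
  def scan(src: Path): Vector[BlockRef] = {
    val in = new DataInputStream(new BufferedInputStream(Files.newInputStream(src), BufferBytes))
    val scratch = new Array[Byte](BufferBytes)
    val refs = ArrayBuffer.empty[BlockRef]
    val size = Files.size(src)
    var pos = 0L
    try {
      while (pos < size) {
        val mapId = in.readInt()
        var remaining = in.readInt()
        refs += BlockRef(mapId, pos, HeaderBytes.toLong + remaining)
        pos += HeaderBytes + remaining
        while (remaining > 0) {     // consume the payload in buffer-sized chunks
          val n = math.min(remaining, scratch.length)
          in.readFully(scratch, 0, n)
          remaining -= n
        }
      }
    } finally in.close()
    refs.toVector
  }

  /** Pass 2: copy blocks to `dst` in MapId order; returns the new location index. */
  def sortFile(src: Path, dst: Path): Vector[BlockRef] = {
    val sorted = scan(src).sortBy(_.mapId) // stable sort over block references only
    val in = FileChannel.open(src, StandardOpenOption.READ)
    val out = FileChannel.open(dst, StandardOpenOption.CREATE,
      StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)
    try {
      var outPos = 0L
      sorted.map { ref =>
        var done = 0L
        while (done < ref.length)   // transferTo may move fewer bytes than requested
          done += in.transferTo(ref.offset + done, ref.length - done, out)
        val moved = ref.copy(offset = outPos)
        outPos += ref.length
        moved
      }
    } finally { in.close(); out.close() }
  }
}
```

Only block references are sorted in memory, which is why roughly 1,000 blocks per split make the sort step negligible next to the file I/O.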

Overall Process

The following figure shows the overall process of RSS support for Join skew optimization:

RSS Throttling

The main purpose of throttling is to prevent the memory of RSS workers from being exhausted. There are usually two ways to throttle:

The client reserves memory on the worker before every push, and the push is triggered only when the reservation succeeds.
The worker applies backpressure.

PushData is a very high-frequency, performance-critical operation, so an additional RPC round trip for every push would be too expensive. We therefore adopted the backpressure strategy. From the perspective of a worker, incoming data has two sources:

Data pushed by the client
Data sent by the primary replica

As shown in the following figure, Worker 2 receives the data of Partition 3 pushed by mappers as well as the replica data of Partition 1 sent by Worker 1, and it forwards the data of Partition 3 to the corresponding secondary replica.

The data pushed from the mappers is released only if the following conditions are met at the same time:

Replication is executed successfully.
The data is written to the disk.

Data pushed from the primary replica is only released if the following condition is met:

The data is written to the disk.
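
The two release rules can be written down as a small predicate. This is a sketch only: the `Buffer` type and its `replicated` and `flushed` flags are hypothetical bookkeeping, not RSS data structures.

```scala
object ReleaseSketch {
  sealed trait Source
  case object FromMapper extends Source  // data pushed by a client
  case object FromPrimary extends Source // replica data sent by the primary

  final case class Buffer(source: Source, replicated: Boolean = false, flushed: Boolean = false) {
    /** Memory may be released only when every condition for the source holds. */
    def releasable: Boolean = source match {
      case FromMapper  => replicated && flushed // both conditions must be met
      case FromPrimary => flushed               // only the disk write matters
    }
  }
}
```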

A throttling strategy must both throttle (reduce the inflow of data) and drain (release memory promptly). Specifically, we defined two high watermarks, at 85% and 95% memory usage, and one low watermark, at 50% memory usage. When the first high watermark is reached, throttling is triggered: the worker stops receiving data pushed by mappers and forces a flush to disk to release memory. Limiting the inflow from mappers alone does not control the traffic from the primary replica, so we defined the second high watermark: when it is reached, the worker also stops receiving data sent by the primary replica. When memory usage falls below the low watermark, the worker returns to the normal state. The following figure shows the whole process:
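
The thresholds above amount to a small state machine, sketched below. The state names and the `next` function are hypothetical. One assumption goes beyond the text: a paused worker stays in its current paused state until usage drops below the low threshold, rather than stepping down from the second level to the first.

```scala
object ThrottleSketch {
  sealed trait State
  case object Normal extends State       // accept all incoming data
  case object PauseMappers extends State // first high threshold reached
  case object PauseAll extends State     // second high threshold reached

  val HighWatermark1 = 0.85
  val HighWatermark2 = 0.95
  val LowWatermark = 0.50

  /** Next state given the current state and memory usage in [0, 1]. */
  def next(state: State, usage: Double): State =
    if (usage < LowWatermark) Normal
    else if (usage >= HighWatermark2) PauseAll
    else if (usage >= HighWatermark1 && state == Normal) PauseMappers
    else state // between thresholds: keep the current state (hysteresis)

  def acceptsMapperPush(state: State): Boolean = state == Normal
  def acceptsReplicaData(state: State): Boolean = state != PauseAll
  def forcesFlush(state: State): Boolean = state != Normal
}
```

Feeding the usage sequence 0.60, 0.86, 0.90, 0.96, 0.70, 0.49 into `next` from `Normal` moves the worker through `Normal`, `PauseMappers`, `PauseMappers`, `PauseAll`, `PauseAll`, and back to `Normal`.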

Performance Testing

We compared the AQE performance of RSS and the native External Shuffle Service (ESS) on Spark 3.2.0. RSS was deployed in mixed mode on the same machines as the computing cluster, so it occupied no additional machines. RSS used 8 GB of memory, only 2.3% of the machine's 352 GB. The test environment is described below:

Test Environment

Hardware:

Header machine group: 1x ecs.g5.4xlarge

Worker machine group: 8x ecs.d2c.24xlarge, 96 CPU, 352 GB, 12x 3700 GB HDD

Spark AQE-Related Configurations:

spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.coalescePartitions.initialPartitionNum 1000
spark.sql.adaptive.skewJoin.enabled true
spark.sql.adaptive.localShuffleReader.enabled false

RSS-Related Configurations:

RSS_PRIMARY_MEMORY=2g
RSS_WORKER_MEMORY=1g
RSS_WORKER_OFFHEAP_MEMORY=7g

TPCDS 10TB Test Set

We ran the 10 TB TPCDS benchmark. End to end (E2E), ESS took 11,734s, while RSS with a single replica and with two replicas took 8,971s and 10,110s, which is 23.5% and 13.8% faster than ESS, respectively (as shown in the following figure). With two replicas, the network bandwidth reached its upper limit, which is the main reason two replicas were slower than a single replica.
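
The percentages follow from the reported run times as the share of ESS time saved. The helper below is only a worked check of that arithmetic; `SpeedupSketch` and `percentFaster` are hypothetical names.

```scala
object SpeedupSketch {
  /** Share of the ESS run time saved by RSS, in percent, rounded to one decimal. */
  def percentFaster(essSeconds: Double, rssSeconds: Double): Double =
    math.round((essSeconds - rssSeconds) / essSeconds * 1000) / 10.0
}
```

`SpeedupSketch.percentFaster(11734, 8971)` gives 23.5 for a single replica, and `SpeedupSketch.percentFaster(11734, 10110)` gives 13.8 for two replicas.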

The time of each query is compared below:

All developers are welcome to participate in the discussion and construction of RSS.

GitHub: https://github.com/alibaba/RemoteShuffleService

Reference

[2] Adaptive Query Execution: Speeding Up Spark SQL at Runtime: https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
