
Alibaba Big Data Practices on Cloud-Native – EMR Spark on ACK

This article discusses the practices and challenges of EMR Spark on Alibaba Cloud Kubernetes.

1. Cloud-Native Challenges and Alibaba Practices

Development Trend of Big Data Technology


Challenges of Cloud-Native Development

Computing and Storage Separation

Building an HCFS-compliant file system on top of an object storage service:

  • Fully compatible with existing HDFS
  • Performance on par with HDFS at lower cost
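One common way to expose object storage through an HCFS-compatible interface is Hadoop's standard file-system plug-in mechanism. Below is a minimal core-site.xml sketch, assuming the Apache Hadoop hadoop-aliyun OSS connector; the endpoint and credentials are placeholders, and EMR's own object-storage file system may use different implementation classes:

```xml
<configuration>
  <!-- Route oss:// paths to the OSS HCFS implementation -->
  <property>
    <name>fs.oss.impl</name>
    <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
  </property>
  <property>
    <name>fs.oss.endpoint</name>
    <value>oss-cn-hangzhou.aliyuncs.com</value>
  </property>
  <!-- Plain-text keys shown for brevity; prefer a credential provider in production -->
  <property>
    <name>fs.oss.accessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.oss.accessKeySecret</name>
    <value>YOUR_ACCESS_KEY_SECRET</value>
  </property>
</configuration>
```

With this in place, jobs can address object storage with `oss://bucket/path` URIs through the same `FileSystem` API they use for HDFS.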

Shuffle, Storage, and Computing Separation

Supporting hybrid, heterogeneous node types on Alibaba Cloud Kubernetes (ACK):

  • Heterogeneous node types may have no local disks
  • The community discussion in [SPARK-25299] on disaggregating shuffle storage, which supports Spark dynamic resource allocation, has become an industry consensus
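Disaggregated shuffle pairs naturally with Spark dynamic resource allocation. A minimal spark-defaults.conf sketch using standard Spark 3.x properties (the shuffle-manager class for a particular remote shuffle service is vendor-specific and omitted here):

```
spark.dynamicAllocation.enabled                   true
spark.dynamicAllocation.shuffleTracking.enabled   true
spark.dynamicAllocation.minExecutors              0
spark.dynamicAllocation.maxExecutors              100
spark.dynamicAllocation.executorIdleTimeout       60s
```

`shuffleTracking` lets Spark decommission idle Executors on Kubernetes without an external shuffle service; with a remote shuffle service, shuffle data lives outside the Executor entirely, so Executors can be released even sooner.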

Cache Solution

Supporting hybrid cloud deployments that span data centers and dedicated-line connections effectively:

  • A cache system must be supported within the container environment.

ACK Scheduling

Resolving scheduling performance bottlenecks:

  • Scheduling performance on par with YARN
  • Multi-level queue management
  • Staggered scheduling
  • Mutual resource awareness between YARN and ACK nodes

Alibaba Practices – EMR on ACK


Overall Solution Introduction


  • Jobs are submitted to different execution platforms via data development clusters/scheduling platforms
  • Staggered scheduling, with policies adjusted to business peak and off-peak periods
  • Cloud-native data lake architecture with powerful elastic scaling capabilities
  • Hybrid scheduling on and off the cloud via dedicated lines
  • ACK manages heterogeneous clusters with high flexibility

2. Spark Containerization Solution




1. Why Do I Need the Remote Shuffle Service?

  • RSS enables Spark jobs to run without attaching cloud disks to Executor pods. Attaching cloud disks is not conducive to scalability or large-scale production.
  • The size of a cloud disk cannot be determined in advance: too large wastes space, while too small causes shuffle failures. RSS is designed specifically for storage-compute separation scenarios.
  • Executors write shuffle data to the RSS system, which manages the shuffle data so that Executors can be recycled once they are idle. [SPARK-25299]
  • It supports dynamic resource allocation well, avoiding the situation where long-tail tasks caused by data skew prevent Executor resources from being released.

2. What Is the Performance, Cost, and Scalability of RSS?

  • RSS is deeply optimized for shuffle and is specially designed for storage and computing separation scenarios and K8s elastic scenarios.
  • In the shuffle fetch phase, the random reads of the reduce stage are converted into sequential reads, which significantly improves job stability and performance.
  • You can use the disk in the original K8s cluster for deployment without adding extra cloud disks for shuffle, which is cost-effective and flexible.

Spark Shuffle


  • Generates numMapper * numReducer shuffle blocks
  • Sequential writes and random reads
  • Spills during writes
  • Single copy; stage recomputation is required on data loss
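The block count and read pattern above can be sketched with a toy model (illustration only, not Spark's actual implementation): each map task writes one block per reducer partition, so a reducer must issue one scattered read per map output.

```python
# Toy model of Spark's sort-based shuffle layout.
# Each map task writes one block per reducer partition, so a shuffle
# produces numMapper * numReducer blocks in total.

def spark_shuffle_blocks(num_mappers, num_reducers):
    # blocks[(m, r)] is the data map task m wrote for reducer r
    return {(m, r): f"data-{m}-{r}"
            for m in range(num_mappers)
            for r in range(num_reducers)}

def reducer_fetch(blocks, reducer, num_mappers):
    # One read per map output file: num_mappers scattered (random)
    # reads per reducer in the fetch phase.
    return [blocks[(m, reducer)] for m in range(num_mappers)]

blocks = spark_shuffle_blocks(num_mappers=4, num_reducers=3)
print(len(blocks))                       # 12 blocks = 4 * 3
print(len(reducer_fetch(blocks, 0, 4)))  # 4 reads for reducer 0
```

At real scale (say 10,000 mappers and 10,000 reducers) this is 100 million tiny blocks, which is why the random-read fetch phase dominates shuffle-heavy jobs.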

EMR Remote Shuffle Service


  • Append writes and sequential reads
  • No spills during writes
  • Two replicas; a write completes once the data reaches the replica's memory
  • Replication between copies goes over the internal network; no public bandwidth is needed
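The append-write layout can be contrasted with the toy model of native shuffle (again, an illustration only): map output for the same reducer partition is appended into a single per-partition file on the shuffle service, so each reducer performs one sequential read instead of one read per mapper.

```python
# Toy model of the remote shuffle service layout.
# All mappers append data for reducer r into one per-partition file,
# so the shuffle produces numReducer files instead of
# numMapper * numReducer blocks.

def rss_partition_files(num_mappers, num_reducers):
    files = {r: [] for r in range(num_reducers)}
    for m in range(num_mappers):
        for r in range(num_reducers):
            files[r].append(f"data-{m}-{r}")  # append write
    return files

files = rss_partition_files(num_mappers=4, num_reducers=3)
print(len(files))     # 3 files, one per reducer partition
print(len(files[0]))  # reducer 0 reads all 4 mappers' data in one pass
```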

RSS TeraSort Benchmark


  • Note: Taking a 10 TB TeraSort as an example, the shuffle volume is about 5.6 TB after compression. With RSS, the performance of jobs at this scale improves significantly because shuffle reads become sequential.

Effects of Spark on ECI





Alibaba EMR
