This article discusses the practices and challenges of EMR Spark on Alibaba Cloud Kubernetes.
1. Cloud-Native Challenges and Alibaba Practices
Development Trend of Big Data Technology
Challenges of Cloud-Native Development
Computing and Storage Separation
Building a Hadoop-compatible file system (HCFS) on top of object storage:
- Fully compatible with existing HDFS
- Performance on par with HDFS at a lower cost
Shuffle, Storage, and Computing Separation
Supporting mixed, heterogeneous instance types on Alibaba Cloud Kubernetes (ACK):
- Some heterogeneous instance types have no local disk
- The community discussion in [SPARK-25299] on supporting Spark dynamic resources, which has become an industry consensus
Supporting hybrid cloud across data centers and dedicated lines efficiently:
- A cache system must be supported within the container.
Resolving scheduling performance bottlenecks:
- Performance on par with YARN
- Multi-level queue management
- Staggered scheduling
- Mutual resource awareness between YARN and ACK nodes
Alibaba Practices – EMR on ACK
Overall Solution Introduction
- Submission to different execution platforms via data development clusters/scheduling platforms
- Staggered scheduling, with policies adjusted to business peak and off-peak periods
- Cloud-native data lake architecture with powerful elastic expansion and contraction capabilities
- Hybrid scheduling on and off the cloud via dedicated lines
- ACK manages heterogeneous clusters with high flexibility
2. Spark Containerization Solution
1. Why Do I Need the Remote Shuffle Service?
- The Remote Shuffle Service (RSS) lets Spark jobs run without attaching cloud disks to Executor Pods. Attaching cloud disks hinders scalability and large-scale production use.
- The size of the cloud disk cannot be determined in advance: if it is too large, space is wasted; if it is too small, the shuffle fails. RSS is designed specifically for storage and computing separation scenarios.
- Executors write shuffle data to the RSS, which manages that data, so an Executor can be reclaimed as soon as it is idle. [SPARK-25299]
- RSS supports dynamic resources well: long-tail tasks caused by data skew no longer prevent Executor resources from being released.
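As a rough illustration of the points above, the following Python sketch assembles Spark properties for a job that delegates shuffle to a remote service and enables dynamic allocation. The property keys are standard Spark configuration names, but the shuffle manager class name is a hypothetical placeholder (the article does not name the EMR RSS class), and whether dynamic allocation is accepted without the node-local shuffle service depends on the Spark version and the RSS integration in use.

```python
def build_spark_conf(shuffle_manager_class, min_executors=1, max_executors=50):
    """Return Spark properties for dynamic allocation with a remote shuffle backend.

    shuffle_manager_class is supplied by the caller; the value used in the
    example below is a hypothetical placeholder, not the real EMR RSS class.
    """
    return {
        # Standard Spark property for plugging in a custom ShuffleManager.
        "spark.shuffle.manager": shuffle_manager_class,
        # Let Spark grow and shrink the number of Executors with the workload.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Shuffle data lives in the remote service, so the node-local
        # external shuffle service is switched off (assumption of this sketch).
        "spark.shuffle.service.enabled": "false",
    }


def to_submit_args(conf):
    """Flatten the properties into spark-submit style '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical class name, used only to make the example runnable.
example_conf = build_spark_conf("com.example.rss.RemoteShuffleManager", 2, 10)
example_args = to_submit_args(example_conf)
```

Because no Executor keeps shuffle files on its own disk in this setup, an idle Executor Pod can be released without losing shuffle output, which is the behavior the list above describes.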
2. What Is the Performance, Cost, and Scalability of RSS?
- RSS is deeply optimized for shuffle and is specially designed for storage and computing separation scenarios and K8s elastic scenarios.
- In the shuffle fetch phase, random reads in the reduce stage are converted into sequential reads, which significantly improves job stability and performance.
- You can use the disk in the original K8s cluster for deployment without adding extra cloud disks for shuffle, which is cost-effective and flexible.
Spark Shuffle
- Generates numMapper * numReducer blocks
- Sequential write and random read
- Spill on Write
- Single copy, stage recalculation required for data loss
EMR Remote Shuffle Service
- Append write and sequential read
- No spill on write
- Two replicas; a write completes once the data has been copied to the replica's memory
- Replicas are synchronized over the intranet, so no public bandwidth is needed
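The contrast between the two designs can be sketched with a toy in-memory model. This is not the EMR RSS implementation: the function names, the modulo partitioner, and the list-based "files" are all assumptions made for illustration. It only shows why the native layout needs one read per mapper for each reducer, while an append-style layout needs a single sequential read per partition and keeps two replicas.

```python
from collections import defaultdict


def partition_of(key, num_reducers):
    # Toy partitioner (assumption): integer keys are assigned by modulo.
    return key % num_reducers


def native_shuffle(map_outputs, num_reducers):
    """Each mapper writes one block per reducer: numMapper * numReducer blocks."""
    blocks = {}
    for m, records in enumerate(map_outputs):
        for r in range(num_reducers):
            blocks[(m, r)] = [k for k in records if partition_of(k, num_reducers) == r]
    return blocks


def native_fetch(blocks, reducer, num_mappers):
    """A reducer fetches its block from every mapper: num_mappers separate reads."""
    data = []
    for m in range(num_mappers):
        data.extend(blocks[(m, reducer)])
    return data, num_mappers


def remote_shuffle(map_outputs, num_reducers, num_replicas=2):
    """Mappers push records; the service appends them to one log per partition,
    and every log is kept in num_replicas in-memory copies."""
    replicas = [defaultdict(list) for _ in range(num_replicas)]
    for records in map_outputs:
        for k in records:
            p = partition_of(k, num_reducers)
            for replica in replicas:
                replica[p].append(k)
    return replicas


def remote_fetch(replicas, reducer):
    """A reducer reads its whole partition from one replica in a single pass."""
    return list(replicas[0][reducer]), 1


map_outputs = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
blocks = native_shuffle(map_outputs, num_reducers=2)
native_data, native_reads = native_fetch(blocks, reducer=0, num_mappers=3)
replicas = remote_shuffle(map_outputs, num_reducers=2)
remote_data, remote_reads = remote_fetch(replicas, reducer=0)
```

In this model both paths deliver the same records to the reducer; only the number of blocks written and the number of reads differ.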
RSS TeraSort Benchmark
Note: Taking a 10 TB TeraSort as an example, the shuffle volume is about 5.6 TB after compression. In the RSS scenario, jobs of this magnitude run significantly faster because shuffle reads become sequential reads.
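To give a feel for the scale involved, the short calculation below applies the numMapper * numReducer block count to the 5.6 TB shuffle volume quoted in the note. The mapper and reducer counts are hypothetical values chosen for illustration (the article does not state them), and terabytes are treated as decimal units.

```python
TB = 10**12  # decimal terabyte (assumption; the article does not specify)


def shuffle_stats(input_bytes, shuffle_bytes, num_mappers, num_reducers):
    """Compare the shuffle layout of native Spark shuffle with a per-partition layout."""
    native_blocks = num_mappers * num_reducers
    return {
        "compression_ratio": shuffle_bytes / input_bytes,
        # Native shuffle: one block per (mapper, reducer) pair, read at random.
        "native_blocks": native_blocks,
        "avg_native_block_bytes": shuffle_bytes / native_blocks,
        # Append-style layout: one sequentially read file per reduce partition.
        "partition_files": num_reducers,
        "avg_partition_bytes": shuffle_bytes / num_reducers,
    }


# Hypothetical task counts, not taken from the benchmark.
stats = shuffle_stats(
    input_bytes=10 * TB,
    shuffle_bytes=5600 * 10**9,
    num_mappers=20_000,
    num_reducers=10_000,
)
```

Under these assumed task counts, the native layout would consist of 200 million blocks averaging about 28 KB each, whereas a per-partition layout would consist of 10,000 files of about 560 MB each, which is why sequential reads pay off at this magnitude.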
Effects of Spark on ECI