
Alibaba Cloud EMR on ACK gives users a new way to build a big data platform: open source big data services are deployed on Alibaba Cloud Container Service for Kubernetes (ACK). ACK handles service deployment and the management of high-performance, scalable containerized applications, so users only need to focus on the big data jobs themselves. Spark, Presto, and Flink jobs can be run on an ACK cluster. The engines are 100% compatible with their open source versions and deliver better performance than those versions.

1. Background

Technology trends

• Separation of storage and computing, and evolution to cloud native

• Online services, AI, and big data workloads run on ACK clusters in a unified way; peak-shift scheduling and co-location of offline and online workloads improve machine utilization

• Unified O&M entry point, unified O&M toolchain, and unified monitoring system

• A shift from cluster-centric to job-centric operation

• Multi-version support, such as running Spark 2.x and Spark 3.x at the same time

Challenges of going cloud native

• Separation of computing and storage: how to build a Hadoop-compatible file system (HCFS) on top of OSS object storage

• The file system must be fully compatible with existing HDFS

• Its performance must be on par with HDFS while reducing cost

• Separation of storage and computing for the compute engine's shuffle data: how to handle the mix of heterogeneous instance types in ACK

• Heterogeneous instance types have no local disks

• The community discussion in [Spark-25299] on supporting Spark dynamic resource allocation has become an industry consensus

• ACK scheduling capability: how to solve scheduling performance bottlenecks

• Scheduling performance must be on par with YARN

• Multi-level queue management is required

• Peak-shift scheduling

• Stagger the peaks and troughs of different workloads with the help of the capabilities K8s provides as a cluster operating system
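
To make the multi-level queue management item above more concrete, here is a minimal sketch of a hierarchical quota check: a job is admitted only if its queue and every ancestor queue still have headroom. The `Queue` class, its fields, and the queue names are hypothetical and purely illustrative; this is not how the ACK scheduler or YARN is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    """Hypothetical queue node: a CPU-core quota plus an optional parent queue."""
    name: str
    quota_cores: int
    parent: Optional["Queue"] = None
    used_cores: int = 0

    def can_admit(self, cores: int) -> bool:
        # Walk from this queue up to the root; every level must have headroom.
        node = self
        while node is not None:
            if node.used_cores + cores > node.quota_cores:
                return False
            node = node.parent
        return True

    def admit(self, cores: int) -> bool:
        # Reserve cores at every level, but only if the whole path has capacity.
        if not self.can_admit(cores):
            return False
        node = self
        while node is not None:
            node.used_cores += cores
            node = node.parent
        return True


# Example hierarchy (names are made up): root -> bigdata -> {etl, adhoc}
root = Queue("root", quota_cores=100)
bigdata = Queue("bigdata", quota_cores=60, parent=root)
etl = Queue("etl", quota_cores=40, parent=bigdata)
adhoc = Queue("adhoc", quota_cores=40, parent=bigdata)
```

In this sketch a request can be rejected even when its own queue has room, because a parent queue is already full; that is the property a multi-level queue hierarchy adds over a single flat queue.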

EMR on ACK advantages

• Remote Shuffle Service provides a storage and computing separation scheme for intermediate shuffle data

• Compute nodes no longer need local disks or cloud disks

• Spark dynamic resource allocation can be enabled; this is presented as the ultimate solution to Spark-25299

• JindoFS provides a data lake acceleration solution for OSS storage

• In block mode, performance improves by more than 15% in the 1 TB TPC-DS scenario

• The scheduling layer supports Scheduler Framework V2

• Scheduling performance is more than 3x that of the community version

• Multi-level queue management is provided

• Enhanced engine capabilities

• In the 10 TB TPC-DS benchmark scenario, EMR Spark performs 3x better than community Spark

• Hudi and Delta Lake offer more functionality than their community versions

• A complete peak-shift scheduling solution
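
As a rough illustration of what enabling Spark dynamic resources looks like when submitting to Kubernetes, the sketch below assembles a `spark-submit` command line using standard open source Spark configuration keys. The function name, its parameters, and the example values are assumptions made for illustration. The settings specific to the Remote Shuffle Service are not listed in this article, so they are not spelled out here; they would be passed through the hypothetical `extra_conf` parameter. Shuffle tracking is used as the stock Spark 3.x fallback when no shuffle service is configured.

```python
from typing import Dict, List, Optional


def build_spark_submit(
    k8s_api_server: str,
    image: str,
    main_class: str,
    app_jar: str,
    min_executors: int = 1,
    max_executors: int = 10,
    extra_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list for a Kubernetes cluster."""
    conf: Dict[str, str] = {
        "spark.kubernetes.container.image": image,
        # Let Spark grow and shrink the executor count with the workload.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Stock Spark 3.x fallback when no shuffle service is available.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }
    # Assumption: service-specific settings (for example a remote shuffle
    # service) are supplied by the caller and may override the defaults.
    conf.update(extra_conf or {})

    cmd = [
        "spark-submit",
        "--master", f"k8s://{k8s_api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd


# Placeholder values only; none of these come from the product itself.
cmd = build_spark_submit(
    "https://example-apiserver:6443",
    "example/spark:3.1.1",
    "com.example.Main",
    "local:///opt/app/example.jar",
)
print(" ".join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues.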

2. EMR containerized architecture

EMR on ACK architecture

• Lightweight control layer that integrates with existing data platforms

• Jobs are submitted to different execution platforms through the data development cluster or scheduling platform

• Peak-shift scheduling, adjusted according to business peak and off-peak strategies

• Cloud native data lake architecture, backed by the strong scale-out and scale-in capability of ACK

• ACK manages clusters of heterogeneous instance types with good flexibility
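
The idea of submitting jobs to different execution platforms according to a peak and off-peak strategy can be sketched as a small routing function. Everything here is an assumption for illustration: the peak windows, the platform names, and the rule that batch jobs go to the shared ACK cluster only outside online peaks. A real scheduling platform would apply its own policy.

```python
from datetime import time
from typing import List, Tuple

# Hypothetical online-business peak windows as (start, end) in local time.
ONLINE_PEAKS: List[Tuple[time, time]] = [
    (time(9, 0), time(12, 0)),
    (time(19, 0), time(23, 0)),
]


def in_peak(now: time, peaks: List[Tuple[time, time]] = ONLINE_PEAKS) -> bool:
    """Return True if `now` falls inside any peak window [start, end)."""
    return any(start <= now < end for start, end in peaks)


def pick_platform(now: time, peaks: List[Tuple[time, time]] = ONLINE_PEAKS) -> str:
    """Route a batch job: shared ACK cluster off-peak, another platform at peak.

    The returned names are placeholders, not real cluster identifiers.
    """
    return "other-platform" if in_peak(now, peaks) else "ack-cluster"
```

For example, `pick_platform(time(2, 0))` routes a night-time batch job to the shared ACK cluster, while a job arriving at 10:30 is sent elsewhere so that it does not compete with online traffic.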

3. Product introduction

Product homepage

Reference link: https://www.aliyun.com/product/emapreduce

EMR on ACK is currently in beta and open for trial.

New cluster

• Region: currently available in Hangzhou, Shanghai, Beijing, Shenzhen, and other regions (more regions are being opened)

• Cluster types: Spark, Shuffle Service, Presto

• Spark - a general-purpose distributed big data processing engine

• Provides ETL, offline batch processing, data modeling, and other capabilities

• Shuffle Service - provides an optimized shuffle service for EMR compute engines

• Removes the dependence on local disks under Kubernetes

• Relieves network and disk I/O bottlenecks in large-scale compute clusters

• Supports a compute-storage separation architecture and can serve multiple EMR clusters

• Presto - an in-memory distributed SQL engine for interactive queries

• Supports multiple data sources

• Suited to complex analysis of petabyte-scale data and to queries across data sources

• Component version: Spark (3.1.1)

• Exclusive nodes:

• Use an existing ACK cluster and dedicate some of its nodes to EMR

• Create a new ACK cluster and select the entire cluster as exclusive nodes

• OSS bucket: used to store jobs, logs, JAR packages, and other information

Cluster management

• Cluster ID/name: click to enter job management

• Cluster status: shows whether the cluster is available

• ACK cluster: the EMR cluster can be associated with an existing ACK cluster

• Configuration: Spark job configuration

• Release: release space

New release of EMR on ACK
