E-MapReduce: Introduction and Scenarios

This article explains what E-MapReduce is, how to manage EMR clusters in Alibaba Cloud E-MapReduce, and the secrets behind the optimized SQL performance of EMR Spark.

What Is E-MapReduce?

Alibaba Cloud Elastic MapReduce, also known as EMR or E-MapReduce, is a fully managed service that lets you create Hadoop clusters for big data applications within minutes. It is built on ECS and uses open-source tools such as Apache Hadoop and Spark (covered in the first article), which form the core of E-MapReduce, to process and analyze large amounts of data quickly through a user-friendly web interface.

Why Choose E-MapReduce?

E-MapReduce takes care of most of the basic tasks required for cluster creation and provisioning, and it provides an integrated framework for managing and using clusters. It exposes the full capabilities of Hadoop and Spark, so you do not need to provision Hadoop from scratch. Because it is based on Spark, you can also stream large volumes of data. It integrates easily with other Alibaba Cloud products such as Elastic Compute Service (ECS) and Object Storage Service (OSS).

Diving into Big Data: EMR Cluster Management

In this tutorial, you will learn how to navigate through the Alibaba E-MapReduce interface, which can make creating and managing clusters easier and can ultimately help you to gain insights from your data.

It's important to understand the process of big data analytics on Alibaba Cloud E-MapReduce. It's equally important to manage the environment that you run everything in. Managing a Hadoop cluster, which includes maintaining high availability, starting and stopping services, and scaling out when more compute is needed, is essential to processing big data smoothly without service interruptions. Alibaba Cloud makes these tasks easier because you can manage everything from the web interface.

This article also covers the various methods for creating an EMR cluster, the services that run when a cluster starts, expanding a cluster, and releasing a cluster, among other things.

Creating a cluster with EMR has the following advantages:

  1. Cluster creation: Quickly deploy different types of clusters, such as Hadoop or Kafka, within minutes, so you can spend more time on processing and analytics.
  2. Scheduled cluster creation: Define execution plans that create and release clusters on a schedule.
  3. Cluster expansion: Quickly add any number of nodes to an existing cluster.
  4. Autoscaling/dynamic expansion: Scale out clusters at scheduled times, so you need not worry about running short of resources.
  5. Automatic service deployment: Easily add, configure, and monitor services running on the cluster.
  6. Job orchestration: Schedule jobs more easily. EMR sends an alert when a job fails so that you can re-execute it manually or configure automatic re-execution.

The Secrets Behind the Optimized SQL Performance of EMR Spark

The Alibaba Cloud EMR team has invested heavily in product design, ease of use, and security to build Alibaba Cloud EMR into a popular big data product. The team has also made long-term, continuous investments in engines that remain fully compatible with open-source software, using advanced engine technology to give our products a technical edge. This gives customers higher cost efficiency when using open-source software stacks. Customers can also migrate their businesses to the cloud smoothly and minimize the cost of operating there.

The outstanding achievements made in TPC-DS Perf by the Alibaba Cloud EMR team prove the team's technical depth and prowess in Spark engine development. A series of articles will be published to introduce the optimizations and ideas that allowed us to perform so well in TPC-DS Perf in 2020. If you are developing a Spark engine or related applications in the community, you can read these articles and tell us what you think. You are also welcome to send us your resume for an opportunity to join the Alibaba Cloud EMR team.

Comparing Open-Source Spark and EMR Spark

After submitting the results, we ran the 99 TPC-DS queries on open-source Spark 2.4.3. You can compare the performance data in the following figures.

Nearly 300% Performance Improvement in the Load Phase

[Figure: Comparing open-source Spark and EMR Spark, load phase]

Nearly 600% Performance Improvement in the PT Phase

[Figure: Comparing open-source Spark and EMR Spark, PT phase]

Note: Spark community edition 2.4.3 hit out-of-memory (OOM) errors on Query 14 and Query 95, so its performance on these two queries could not be measured and they were excluded from our calculations.

The queries that took Spark community edition 2.4.3 more than 200 seconds to execute were singled out for comparison with the corresponding queries executed by EMR Spark.

[Figure: Comparing open-source Spark and EMR Spark, queries that took more than 200 seconds]

Note: Among these queries, Query 78 saw a 300% performance improvement in EMR Spark, which was the smallest improvement of any of them. Query 57 ran nearly 100 times faster.
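The comparison procedure described above (drop queries that failed on either engine, keep only those whose baseline runtime exceeded 200 seconds, then compare runtimes) can be sketched in Python. This is a minimal illustration, not the EMR team's benchmark tooling: the function name `speedup_report`, the query labels, and every timing below are hypothetical. The article does not say how its percentages were computed, so the sketch reports a plain ratio of baseline runtime to optimized runtime.

```python
from typing import Dict, Optional


def speedup_report(
    baseline: Dict[str, Optional[float]],
    optimized: Dict[str, Optional[float]],
    slow_threshold_s: float = 200.0,
) -> Dict[str, float]:
    """Return {query: baseline_time / optimized_time} for slow baseline queries.

    A timing of None means the query did not finish (for example, an OOM
    error); such queries are skipped, mirroring the exclusion in the text.
    """
    report: Dict[str, float] = {}
    for query, base_s in baseline.items():
        opt_s = optimized.get(query)
        if base_s is None or opt_s is None:
            continue  # no valid measurement on one side, so exclude it
        if base_s > slow_threshold_s:
            report[query] = base_s / opt_s
    return report


# Hypothetical runtimes in seconds; these are NOT the published results.
baseline = {"q14": None, "q57": 950.0, "q78": 400.0, "q3": 12.0}
optimized = {"q14": 30.0, "q57": 10.0, "q78": 100.0, "q3": 6.0}

report = speedup_report(baseline, optimized)
for query, ratio in sorted(report.items()):
    print(f"{query}: {ratio:.1f}x faster")
```

With these made-up numbers, `q14` is dropped because its baseline run failed and `q3` is dropped because its baseline finished in under 200 seconds, so only `q57` and `q78` appear in the report.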

Related Products


EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.

Alibaba Cloud Elastic MapReduce (EMR) is a big data processing solution that runs on the Alibaba Cloud platform. EMR is built on Alibaba Cloud ECS instances and is based on open-source Apache Hadoop and Apache Spark. EMR allows you to use the Hadoop and Spark ecosystem components, such as Apache Hive, Apache Kafka, Flink, Druid, and TensorFlow, to analyze and process data. You can use EMR to process data stored in different Alibaba Cloud data storage services, such as Object Storage Service (OSS), Log Service (SLS), and Relational Database Service (RDS).

Related Documentation

Architecture of E-MapReduce

This topic describes the architecture of E-MapReduce (EMR).

The following figure shows the architecture.

[Figure: Architecture of E-MapReduce]

EMR clusters are created based on the Hadoop ecosystem. EMR clusters can exchange data seamlessly with Alibaba Cloud services such as Object Storage Service (OSS) and ApsaraDB Relational Database Service (RDS). This enables you to share and transmit data among multiple systems to meet different business demands.

Use scenarios of E-MapReduce

E-MapReduce (EMR) clusters are suitable for all the scenarios supported by the Hadoop ecosystem and Spark.

EMR is a cluster service based on Hadoop and Spark. You can use the Alibaba Cloud Elastic Compute Service (ECS) instances on which EMR clusters are deployed as if they were your own dedicated physical machines.

Limits of E-MapReduce

This topic describes the limits of E-MapReduce (EMR). When you use EMR, unexpected operations may make clusters unstable or unavailable. Take note of the information in this topic to avoid these issues.

Alibaba Clouder
