Alibaba Cloud Elastic MapReduce, also known as EMR or E-MapReduce, is a fully managed service that lets you create Hadoop clusters for big data applications within minutes. It is built on ECS and uses open source tools such as Apache Hadoop and Spark (covered in the first article), which form the core of E-MapReduce, to process and analyze huge amounts of data quickly through a user-friendly web interface.
E-MapReduce takes care of most of the basic tasks required for cluster creation and provisioning, and it provides an integrated framework for managing and using clusters. It exposes the full capabilities of Hadoop and Spark, so you do not need to provision Hadoop from scratch. Because it is based on Spark, you can also stream large volumes of data. It integrates easily with other Alibaba Cloud products such as Elastic Compute Service (ECS) and Object Storage Service (OSS).
In this tutorial, you will learn how to navigate through the Alibaba E-MapReduce interface, which can make creating and managing clusters easier and can ultimately help you to gain insights from your data.
It's important to understand the process of big data analytics on Alibaba Cloud E-MapReduce. It's equally important to manage the environment that you run everything in. Managing a Hadoop cluster, which includes maintaining high availability, starting and stopping services, and scaling out when more compute is needed, is a mandatory part of processing big data smoothly and without service interruptions. These actions are easier on Alibaba Cloud because you manage everything through the web interface.
This article also covers the various methods for creating an EMR cluster, the services that run when a cluster starts, and how to expand and release a cluster, among other things.
The advantages of creating a cluster using EMR:
The Alibaba Cloud EMR team has invested a great deal of effort and R&D resources in product features, ease of use, and security to develop the popular big data product Alibaba Cloud EMR. The team has also made long-term investments and continuous efforts to develop engines that are fully compatible with open-source software, and it uses sophisticated technologies to give its products a technical edge. This gives customers higher cost efficiency when using open-source software stacks. Customers can also smoothly migrate their businesses to the cloud and minimize the costs of business operations on the cloud.
The outstanding achievements made in TPC-DS Perf by the Alibaba Cloud EMR team prove the team's technical depth and prowess in Spark engine development. A series of articles will be published to introduce the optimizations and ideas that allowed us to perform so well in TPC-DS Perf in 2020. If you are developing a Spark engine or related applications in the community, you can read these articles and tell us what you think. You are also welcome to send us your resume for an opportunity to join the Alibaba Cloud EMR team.
After submitting the results, we used open-source Spark 2.4.3 to test 99 TPC-DS queries. You can compare the performance data in the following figures.
**Nearly 300% Performance Improvement in the Load Phase**
**Nearly 600% Performance Improvement in the PT Phase**
Please Note - The performance of Spark community edition 2.4.3 when executing Query 14 and Query 95 could not be tested due to an out of memory (OOM) error, so these two queries were excluded from our calculations.
The queries that took Spark Community Edition 2.4.3 more than 200 seconds to execute were singled out for comparison with the corresponding queries executed by EMR Spark.
Please Note - Among these queries, Query 78 saw a 300% performance improvement in EMR Spark, which was the smallest improvement of any of these queries. Query 57 ran nearly 100 times faster.
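The comparison procedure described above (drop queries whose baseline run failed, then single out the ones whose baseline took more than 200 seconds) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the team's actual tooling. The `QueryTiming` type, the function names, and every timing in `SAMPLE` are hypothetical placeholders, not published measurements. The article does not say how its percentage figures are derived, so the sketch reports a plain runtime ratio (baseline time divided by EMR time).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class QueryTiming:
    """Runtime of one query on both engines (hypothetical structure)."""
    query: str
    baseline_s: Optional[float]  # None when the baseline run failed, e.g. with OOM
    emr_s: float


def usable(timings: List[QueryTiming]) -> List[QueryTiming]:
    """Exclude queries that could not be measured on the baseline engine."""
    return [t for t in timings if t.baseline_s is not None]


def slow_query_speedups(timings: List[QueryTiming],
                        min_baseline_s: float = 200.0) -> Dict[str, float]:
    """Runtime ratio for queries whose baseline took longer than the threshold."""
    return {
        t.query: t.baseline_s / t.emr_s
        for t in usable(timings)
        if t.baseline_s > min_baseline_s
    }


def total_speedup(timings: List[QueryTiming]) -> float:
    """Ratio of summed baseline time to summed EMR time over usable queries."""
    rows = usable(timings)
    return sum(t.baseline_s for t in rows) / sum(t.emr_s for t in rows)


# Placeholder timings in seconds, invented purely for illustration.
SAMPLE = [
    QueryTiming("A", None, 40.0),    # baseline failed, excluded everywhere
    QueryTiming("B", 400.0, 50.0),   # slow on baseline, compared individually
    QueryTiming("C", 250.0, 100.0),  # slow on baseline, compared individually
    QueryTiming("D", 20.0, 10.0),    # fast on baseline, only counts in the total
]

if __name__ == "__main__":
    print(slow_query_speedups(SAMPLE))
    print(total_speedup(SAMPLE))
```

Keeping the exclusion step in one helper (`usable`) means the per-query comparison and the overall ratio apply the same rule for failed baseline runs.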
EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.
Alibaba Cloud Elastic MapReduce (EMR) is a big data processing solution that runs on the Alibaba Cloud platform. EMR is built on Alibaba Cloud ECS instances and is based on open-source Apache Hadoop and Apache Spark. EMR allows you to use the Hadoop and Spark ecosystem components, such as Apache Hive, Apache Kafka, Flink, Druid, and TensorFlow, to analyze and process data. You can use EMR to process data stored in different Alibaba Cloud data storage services, such as Object Storage Service (OSS), Log Service (SLS), and Relational Database Service (RDS).
This topic describes the architecture of E-MapReduce (EMR).
The following figure shows the architecture.
EMR clusters are created based on the Hadoop ecosystem. EMR clusters can exchange data seamlessly with Alibaba Cloud services such as Object Storage Service (OSS) and ApsaraDB Relational Database Service (RDS). This enables you to share and transmit data among multiple systems to meet different business demands.
E-MapReduce (EMR) clusters are suitable for all the scenarios supported by the Hadoop ecosystem and Spark.
EMR is a cluster service based on Hadoop and Spark. You can use the Alibaba Cloud Elastic Compute Service (ECS) instances on which EMR clusters are deployed as your dedicated physical machines.
When you use E-MapReduce (EMR), cluster instability may occur or clusters may become unavailable due to unexpected operations. Take note of the information in this topic to avoid these issues. This topic describes the limits of EMR.