What is Alibaba Cloud EMR?

This article explains what EMR is and describes its advantages, architecture, and benefits.

What is EMR?

EMR, short for E-MapReduce Service, is a big data processing solution provided by Alibaba Cloud. It is built on Alibaba Cloud Elastic Compute Service (ECS) and developed based on open-source Apache Hadoop and Apache Spark. It allows you to conveniently use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data. EMR can also read data from and write data to other Alibaba Cloud storage and database systems, such as Object Storage Service (OSS) and ApsaraDB RDS.
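As a rough illustration of reading from and writing to OSS, a Spark job on an EMR cluster can address OSS objects with an oss:// path. The sketch below assumes PySpark is available on the cluster and that the cluster is already authorized to access the bucket. The bucket name, object paths, and the helper names oss_uri and count_words are hypothetical.

```python
def oss_uri(bucket: str, key: str) -> str:
    """Build an oss:// URI for an object or prefix in an OSS bucket."""
    return f"oss://{bucket}/{key.lstrip('/')}"


def count_words(input_uri: str, output_uri: str) -> None:
    """Read text from input_uri, count words, and write CSV results to output_uri."""
    # Imported here so that oss_uri() can be used on machines without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.appName("oss-word-count").getOrCreate()

    # Each input line becomes one row with a single "value" column.
    lines = spark.read.text(input_uri)
    words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))
    counts = words.where(col("word") != "").groupBy("word").count()

    # Overwrite any previous results under the output prefix.
    counts.write.mode("overwrite").csv(output_uri)
    spark.stop()


# Example call on a cluster (hypothetical bucket and paths):
# count_words(oss_uri("example-bucket", "input/logs.txt"),
#             oss_uri("example-bucket", "output/word_counts"))
```

The job itself is ordinary Spark code. Only the path scheme changes when the data lives in OSS instead of HDFS.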

SmartData is a storage service for the EMR Jindo engine. SmartData provides centralized storage, caching, and computing optimization for EMR computing engines and extends storage features.

EMR Advantages over Open Source Big Data Ecosystems

If you use an open-source distributed processing system, such as Hadoop or Spark, to process data without using EMR, you must perform all the steps in the following figure.

[Figure: steps required to process data with a self-managed Hadoop or Spark cluster]

In this procedure, only the last three steps are related to your application logic. The first seven steps are all preparations, which are complex and time-consuming. EMR integrates all the required cluster management tools to provide the following features: host selection, environment deployment, cluster building, cluster configuration, cluster running, job configuration, job running, cluster management, and performance monitoring. This frees you from the tedious procurement, preparation, and O&M work required to build clusters, so you can focus on the processing logic of your applications.

EMR also offers different combinations of cluster services to meet your business requirements. For example, to perform daily data measurement and batch computing, you only need to run the Hadoop service in EMR. If you also want to perform stream computing and real-time computing, you can add the Spark service.
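The selection rule in this example can be sketched as a small function. This is only an illustration of the reasoning, not an EMR API: the workload labels and the function name select_services are hypothetical, and Hadoop is assumed to be the base service for every cluster.

```python
# Hypothetical workload labels used only for this sketch.
BATCH_WORKLOADS = {"daily_measurement", "batch"}
STREAMING_WORKLOADS = {"stream", "real_time"}


def select_services(workloads):
    """Return the cluster services to enable for the given workloads."""
    requested = set(workloads)
    unknown = requested - BATCH_WORKLOADS - STREAMING_WORKLOADS
    if unknown:
        raise ValueError(f"unknown workloads: {sorted(unknown)}")

    # Assumption: Hadoop is always present as the base service.
    services = ["Hadoop"]
    # Stream and real-time computing add the Spark service on top.
    if requested & STREAMING_WORKLOADS:
        services.append("Spark")
    return services


print(select_services(["daily_measurement", "batch"]))
print(select_services(["batch", "stream"]))
```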

Composition of EMR Clusters

Clusters are the core user-oriented component of EMR. An EMR cluster is a Hadoop or Spark cluster that is deployed on one or more ECS instances. For example, a Hadoop cluster consists of daemon processes such as NameNode, DataNode, ResourceManager, and NodeManager. These daemon processes run on the ECS instances of the cluster. Each ECS instance corresponds to a node. The NameNode and ResourceManager processes run on master nodes, whereas the DataNode and NodeManager processes run on core and task nodes.

The following figure shows an EMR cluster that consists of one master node and three core and task nodes.

[Figure: EMR cluster with one master node and three core and task nodes]
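The daemon placement described above can be modeled in a few lines. This is a minimal sketch that follows the description in this article. The Node class, the build_cluster helper, the node names, and the split into two core nodes and one task node are assumptions made for illustration.

```python
from dataclasses import dataclass

# Daemon placement as described in this article.
MASTER_DAEMONS = ("NameNode", "ResourceManager")
WORKER_DAEMONS = ("DataNode", "NodeManager")
VALID_ROLES = ("master", "core", "task")


@dataclass(frozen=True)
class Node:
    """One ECS instance in the cluster (hypothetical model)."""
    name: str
    role: str  # "master", "core", or "task"

    @property
    def daemons(self):
        # Master nodes run the coordinating daemons; core and task
        # nodes run the storage and compute daemons.
        return MASTER_DAEMONS if self.role == "master" else WORKER_DAEMONS


def build_cluster(roles):
    """Create one Node per role, numbering nodes within each role."""
    counters = {}
    nodes = []
    for role in roles:
        if role not in VALID_ROLES:
            raise ValueError(f"unknown role: {role}")
        counters[role] = counters.get(role, 0) + 1
        nodes.append(Node(f"{role}-{counters[role]}", role))
    return nodes


# One master node plus three core and task nodes (assumed split).
for node in build_cluster(["master", "core", "core", "task"]):
    print(node.name, node.daemons)
```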

The Architecture of EMR

EMR clusters are created based on the Hadoop ecosystem. EMR clusters can exchange data seamlessly with Alibaba Cloud services such as Object Storage Service (OSS) and ApsaraDB Relational Database Service (RDS). This enables you to share and transmit data among multiple systems to meet different business demands.

[Figure: EMR architecture]

The Benefits of EMR

EMR provides an integrated solution for managing clusters, which frees you from the complexity of cluster management. EMR has the following practical advantages over self-managed clusters:

  • Ease of use: You can simply select the ECS instance specifications, disks, and software, and trigger automated deployment.
  • Low price: You can create a cluster as required and release it if it is no longer needed. You can also dynamically add a node to a cluster.
  • Deep integration: EMR is integrated with other Alibaba Cloud services such as Object Storage Service (OSS), Message Service (MNS), ApsaraDB Relational Database Service (RDS), and MaxCompute. This enables these services to act as the input source or output destination of the Hadoop or Spark compute engine in EMR.
  • Security: EMR is integrated with Resource Access Management (RAM), which allows you to use Alibaba Cloud accounts and RAM users to isolate permissions on services.

Related Blog

Diving into Big Data: EMR Cluster Management

It's important to understand the process of big data analytics on Alibaba Cloud E-MapReduce. It's equally important to manage the environment that you run it in. Managing a Hadoop cluster, which includes maintaining high availability, starting and stopping services, and scaling out to meet computational demands, is essential to processing big data without service interruptions. These actions are easier on Alibaba Cloud because you can manage everything conveniently from the web interface.

This article is intended for people who are new to Alibaba Cloud E-MapReduce and focuses on EMR cluster management. The previous article, Diving into Big Data: Getting Started with OSS and EMR, showed how to create an EMR cluster as an initial step. This article covers the different methods for creating an EMR cluster, the services that run when a cluster starts, and how to expand and release a cluster, among other things.

Drilling into Big Data – Getting started with OSS and EMR

EMR takes care of most of the basic tasks required for cluster creation and provisioning, and it provides an integrated framework for managing and using clusters. It uses the complete capabilities of Hadoop and Spark, so you do not need to provision Hadoop from scratch. Because it is based on Spark, you can also stream large volumes of data. It integrates easily with other Alibaba Cloud products such as Elastic Compute Service (ECS) and OSS.

Related Product

E-MapReduce

EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.

Alibaba Clouder
