Alibaba Cloud E-MapReduce (EMR) is a big data processing solution provided by Alibaba Cloud.

Introduction

EMR is built on Alibaba Cloud Elastic Compute Service (ECS) and developed based on both open source Apache Hadoop and Apache Spark. It allows you to conveniently use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data. EMR can also read data from or write data to other Alibaba Cloud storage systems and database systems, such as Object Storage Service (OSS) and ApsaraDB RDS.

SmartData is a storage service for the EMR Jindo engine. SmartData provides centralized storage, caching, and computing optimization for EMR computing engines and extends storage features.

Advantages over open source big data ecosystems

If you use an open source distributed processing system, such as Hadoop or Spark, to process data without using EMR, you must perform all the steps shown in the following figure.

Figure: Procedure

In this procedure, only the last three steps are related to your application logic. The first seven steps are all preparation work, which is complex and time-consuming. EMR integrates all the required cluster management tools to provide the following features: host selection, environment deployment, cluster building, cluster configuration, cluster running, job configuration, job running, cluster management, and performance monitoring. This frees you from the tedious procurement, preparation, and O&M work required to build clusters, so you only need to focus on the processing logic of your applications.

EMR also offers different combinations of cluster services to meet your business requirements. For example, to perform daily statistics and batch computing, you only need to run the Hadoop service for EMR. If you also want to perform stream computing and real-time computing, you can add the Spark service.
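The service selection described above can be sketched as a small lookup. This is only an illustration of the rule stated in the text, not an EMR API: the function name `services_for` and the workload labels (`"batch"`, `"stream"`, `"real-time"`) are hypothetical.

```python
def services_for(workloads):
    """Return the cluster services suggested by the text for the given workloads.

    `workloads` is a set of hypothetical labels such as "batch", "stream",
    or "real-time". This is not part of any EMR SDK.
    """
    # Batch computing and daily statistics only need the Hadoop service.
    services = {"Hadoop"}
    # Stream computing and real-time computing add the Spark service.
    if workloads & {"stream", "real-time"}:
        services.add("Spark")
    return services


print(sorted(services_for({"batch"})))
print(sorted(services_for({"batch", "stream"})))
```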

Composition of EMR clusters

Clusters are the core user-oriented component of EMR. An EMR cluster is a Hadoop or Spark cluster that is deployed on one or more ECS instances. For example, a Hadoop cluster runs several daemon processes, such as NameNode, DataNode, ResourceManager, and NodeManager. These daemon processes run on the ECS instances of the cluster, and each ECS instance corresponds to a node. The NameNode and ResourceManager processes run on master nodes, whereas the DataNode and NodeManager processes run on core and task nodes.

The following figure shows an EMR cluster that consists of one master node and three core and task nodes.

Figure: Master_Slave
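As a rough illustration of this layout, the following sketch models which daemon processes run on which node. It assumes the placement exactly as described in the text and groups core and task nodes into a single "worker" role, because the text does not distinguish between them. The `Node` class, the `build_cluster` helper, and the node names are hypothetical and are not part of EMR.

```python
from dataclasses import dataclass

# Daemon placement as stated in the text. Core and task nodes are grouped
# into one "worker" role (an assumption made for this sketch).
DAEMONS_BY_ROLE = {
    "master": ("NameNode", "ResourceManager"),
    "worker": ("DataNode", "NodeManager"),
}


@dataclass(frozen=True)
class Node:
    name: str  # hypothetical label for the ECS instance
    role: str  # "master" or "worker"

    @property
    def daemons(self):
        # Look up the daemon processes that run on a node of this role.
        return DAEMONS_BY_ROLE[self.role]


def build_cluster(masters, workers):
    """Create one Node per ECS instance: masters first, then workers."""
    nodes = [Node(f"master-{i + 1}", "master") for i in range(masters)]
    nodes += [Node(f"worker-{i + 1}", "worker") for i in range(workers)]
    return nodes


# The cluster from the figure: one master node and three core and task nodes.
cluster = build_cluster(masters=1, workers=3)
for node in cluster:
    print(node.name, "->", ", ".join(node.daemons))
```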