Alibaba Cloud E-MapReduce (EMR) is a big data processing solution provided by Alibaba Cloud.
EMR is built on top of Alibaba Cloud Elastic Compute Service (ECS) and developed based on open source Apache Hadoop and Apache Spark. It allows you to conveniently use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data. EMR can also read data from or write data to other Alibaba Cloud storage systems and database systems, such as Object Storage Service (OSS) and ApsaraDB RDS.
- For more information about Apache Hadoop, visit the Apache Hadoop official website.
- For more information about Apache Spark, visit the Apache Spark official website.
- For more information about Apache Hive, visit the Apache Hive official website.
- For more information about Apache HBase, visit the Apache HBase official website.
- For more information about SmartData, see SmartData.
Advantages over open source big data ecosystems
If you use an open source distributed processing system, such as Hadoop or Spark, to process data without using EMR, you must perform all the steps in the following figure.
In this procedure, only the last three steps are related to your application logic. The first seven steps are all preparations, which are complex and time-consuming. EMR integrates all the required cluster management tools to provide the following features: host selection, environment deployment, cluster building, cluster configuration, cluster running, job configuration, job running, cluster management, and performance monitoring. This frees you from all the tedious procurement, preparation, and O&M work required to build clusters. You need only to focus on the processing logic of your applications.
EMR also offers different combinations of cluster services to meet your business requirements. For example, to perform daily data measurement and batch computing, you need only to run the Hadoop service for EMR. If you also want to perform stream computing and real-time computing, you can add the Spark service.
Composition of EMR clusters
Clusters are the core user-oriented component of EMR. An EMR cluster is a Hadoop, Flink, Druid, or ZooKeeper cluster that is deployed on one or more ECS instances. For example, a Hadoop cluster consists of some daemon processes, such as NameNode, DataNode, ResourceManager, and NodeManager. These daemon processes run on the ECS instances of the cluster. Each ECS instance corresponds to a node.
The following figure shows a Hadoop cluster that consists of master nodes, core nodes, and task nodes and is associated with a gateway cluster.
- Master nodes: run the following services and processes: NameNode of HDFS, JournalNode of HDFS, ZooKeeper, ResourceManager of YARN, and HMaster of HBase. You can determine whether to create a high-availability (HA) cluster based on the use scenario of the cluster. We recommend that you create a non-HA cluster in a test environment and create an HA cluster in a production environment. In an HA cluster, you can deploy two or three master nodes. If you deploy two master nodes, JournalNode of HDFS and ZooKeeper are deployed on the emr-worker-1 node. We recommend that you deploy three master nodes when you create an HA cluster in a production environment.
- Core nodes: run the DataNode process of HDFS and the NodeManager process of YARN to support data storage and computing. Core nodes do not support auto scaling.
- Task nodes: run the NodeManager process of YARN to support computing. You can flexibly add or remove task nodes based on your business requirements.
- Gateway cluster: is configured with Hadoop client files. You can use a gateway cluster to submit jobs to the Hadoop cluster associated with this gateway cluster. This way, you do not need to log on to the Hadoop cluster, which avoids security issues and environment isolation issues. You must create a Hadoop cluster first, and then create a gateway cluster and associate it with the Hadoop cluster.