Alibaba Cloud E-MapReduce (EMR) is a big data processing solution provided by Alibaba Cloud.
EMR is built on Alibaba Cloud Elastic Compute Service (ECS) and is based on open source Apache Hadoop and Apache Spark. You can conveniently use the peripheral systems in the Hadoop and Spark ecosystems to analyze and process your data. EMR can also read data from and write data to other Alibaba Cloud storage and database services, such as Object Storage Service (OSS) and ApsaraDB RDS.
- For more information about Apache Hadoop, visit the Apache Hadoop official website.
- For more information about Apache Spark, visit the Apache Spark official website.
- For more information about Apache Hive, visit the Apache Hive official website.
- For more information about Apache HBase, visit the Apache HBase official website.
- For more information about SmartData, see SmartData.
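EMR's ability to read from and write to OSS builds on the Hadoop-compatible OSS filesystem connector. As a rough illustration only, the following `core-site.xml` fragment uses property names from the open source `hadoop-aliyun` module of Apache Hadoop; an EMR cluster preconfigures this integration for you, and the endpoint value shown is a placeholder:

```xml
<!-- Illustrative core-site.xml fragment; property names come from the
     open source hadoop-aliyun module, not EMR-specific configuration. -->
<property>
  <name>fs.oss.endpoint</name>
  <value>oss-cn-hangzhou.aliyuncs.com</value>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
```

With such a configuration in place, jobs can address OSS data through `oss://` paths in the same way they address HDFS paths.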
Advantages over open source big data ecosystems
If you build a big data cluster yourself, only the last three steps of the typical workflow relate to your application logic; the first seven are preparations, which are complex and time-consuming. EMR integrates all the required cluster management tools and covers host selection, environment deployment, cluster building, cluster configuration, cluster running, job configuration, job running, cluster management, and performance monitoring. This frees you from the tedious procurement, preparation, and O&M work required to build clusters, so you can focus only on the processing logic of your applications.
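The division of responsibility above can be sketched as a small lookup. This is purely illustrative (the names are taken from the feature list in this section, not from any EMR API):

```python
# A minimal sketch of which workflow steps EMR covers for you
# versus the part you still write yourself (illustrative only).

EMR_PROVIDED = {
    "host selection", "environment deployment", "cluster building",
    "cluster configuration", "cluster running", "job configuration",
    "job running", "cluster management", "performance monitoring",
}

def owner(step: str) -> str:
    """Return who is responsible for a given workflow step."""
    return "EMR" if step in EMR_PROVIDED else "you"

print(owner("environment deployment"))       # EMR
print(owner("application processing logic")) # you
```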
EMR also offers different combinations of cluster services to meet your business requirements. For example, if you need only daily statistics and batch computing, run only the Hadoop service. If you also want to perform stream computing and real-time computing, add the Spark service.
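The workload-to-service mapping described above can be expressed as a small helper. The names here are illustrative assumptions, not an EMR API; they only restate the two examples from the paragraph:

```python
# Hypothetical mapping of workloads to the cluster services they need,
# following the examples in the text (illustrative, not an EMR API).

WORKLOAD_SERVICES = {
    "daily statistics":    {"Hadoop"},
    "batch computing":     {"Hadoop"},
    "stream computing":    {"Hadoop", "Spark"},
    "real-time computing": {"Hadoop", "Spark"},
}

def required_services(workloads):
    """Union of cluster services needed for the given workloads."""
    services = set()
    for workload in workloads:
        services |= WORKLOAD_SERVICES[workload]
    return services

print(sorted(required_services(["batch computing"])))
# ['Hadoop']
print(sorted(required_services(["batch computing", "stream computing"])))
# ['Hadoop', 'Spark']
```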
Composition of EMR clusters
Clusters are the core user-oriented component of EMR. An EMR cluster is a Hadoop or Spark cluster deployed on one or more ECS instances. For example, a Hadoop cluster consists of daemon processes, such as NameNode, DataNode, ResourceManager, and NodeManager, which run on the ECS instances of the cluster. Each ECS instance corresponds to a node. The NameNode and ResourceManager processes run on master nodes, whereas the DataNode and NodeManager processes run on core and task nodes.
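The daemon layout described above can be summarized as a simple data structure. This is an illustrative sketch of the placement stated in this section, not an EMR API:

```python
# Daemon placement for a Hadoop cluster on EMR, as described in the text:
# master nodes run HDFS/YARN control daemons, core and task nodes run
# the worker daemons (illustrative data structure, not an EMR API).

DAEMONS_BY_ROLE = {
    "master": ["NameNode", "ResourceManager"],
    "core":   ["DataNode", "NodeManager"],
    "task":   ["DataNode", "NodeManager"],
}

def daemons_on(role: str) -> list:
    """Daemon processes expected on a node of the given role."""
    return DAEMONS_BY_ROLE[role]

print(daemons_on("master"))  # ['NameNode', 'ResourceManager']
```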