E-MapReduce (EMR) is a big data processing solution provided by Alibaba Cloud.
EMR is built on top of Alibaba Cloud Elastic Compute Service (ECS) and developed based on open source Apache Hadoop and Apache Spark. EMR allows you to use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data in a convenient manner. EMR can also read data from or write data to other Alibaba Cloud storage systems and database systems, such as Object Storage Service (OSS) and ApsaraDB RDS.
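On EMR, access to OSS is typically preconfigured for you. For reference, with the open source Apache `hadoop-aliyun` connector, OSS access in a self-managed setup is configured with a `core-site.xml` fragment along the following lines (a sketch; the endpoint and credential values are placeholders):

```xml
<!-- Hypothetical core-site.xml fragment for the open source hadoop-aliyun
     OSS connector. Endpoint and credentials below are placeholders. -->
<property>
  <name>fs.oss.endpoint</name>
  <value>oss-cn-hangzhou.aliyuncs.com</value>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>YOUR_ACCESS_KEY_SECRET</value>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
```

With such a configuration in place, OSS paths can be addressed with the `oss://` scheme, for example `hadoop fs -ls oss://your-bucket/path`.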
- For more information about Apache Hadoop, visit the Apache Hadoop official website.
- For more information about Apache Spark, visit the Apache Spark official website.
- For more information about Apache Hive, visit the Apache Hive official website.
- For more information about Apache HBase, visit the Apache HBase official website.
- For more information about SmartData and JindoData, see SmartData and Overview.
Advantages over open source big data ecosystems
If you use an open source distributed processing system, such as Hadoop or Spark, to process data without using EMR, you must perform all the steps in the following figure.
In this procedure, only the last three steps are related to your application logic. The first seven steps are preparations, which are complex and time-consuming. EMR integrates the required cluster management tools to cover host selection, environment deployment, cluster building, cluster configuration, cluster running, job configuration, job running, cluster management, and performance monitoring. This frees you from the tedious procurement, preparation, and O&M work required to build clusters, so you can focus only on the processing logic of your applications.
EMR also offers different combinations of cluster services to meet your business requirements. For example, if you only perform daily statistics collection and batch computing, you only need to run the Hadoop service. If you also need stream computing and real-time computing, you can add the Spark service.
Composition of EMR clusters
Clusters are the core user-oriented component of EMR. An EMR cluster is a Hadoop, Flink, Druid, or ZooKeeper cluster that is deployed on one or more ECS instances. For example, a Hadoop cluster consists of daemon processes, such as NameNode, DataNode, ResourceManager, and NodeManager, that run on the ECS instances of the cluster. Each ECS instance corresponds to a node.
The following figure shows a Hadoop cluster that consists of master nodes, core nodes, and task nodes and is associated with a gateway cluster.
- Master nodes: run services and processes such as NameNode and JournalNode of HDFS, ZooKeeper, ResourceManager of YARN, and HMaster of HBase. You can decide whether to create a high availability (HA) cluster based on your usage scenario. We recommend a non-HA cluster for test environments and an HA cluster for production environments. An HA cluster can have two or three master nodes. If you deploy two master nodes, JournalNode of HDFS and ZooKeeper are also deployed on the emr-worker-1 node. For production environments, we recommend that you deploy three master nodes.
- Core nodes: run the DataNode process of HDFS and the NodeManager process of YARN to support data storage and computing. Core nodes do not support auto scaling.
- Task nodes: run the NodeManager process of YARN to support computing. You can add or remove task nodes in a flexible manner based on your business requirements.
- Gateway cluster: is configured with Hadoop client files. You can use a gateway cluster to submit jobs to the Hadoop cluster that is associated with it, without logging on to the Hadoop cluster. This helps avoid security and environment isolation issues. You must create a Hadoop cluster, create a gateway cluster, and then associate the gateway cluster with the Hadoop cluster.
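The node roles above boil down to which daemons each node type runs. The following minimal Python sketch summarizes the (non-HA) mapping from the descriptions above; it is illustrative only, not an EMR API:

```python
# Which daemons run on each node type of a non-HA EMR Hadoop cluster,
# summarized from the node descriptions above. Illustrative only.
NODE_DAEMONS = {
    "master": ["NameNode", "ResourceManager", "HMaster"],
    "core": ["DataNode", "NodeManager"],
    "task": ["NodeManager"],
}

def supports_storage(node_type: str) -> bool:
    """A node participates in HDFS storage only if it runs a DataNode."""
    return "DataNode" in NODE_DAEMONS.get(node_type, [])

print(supports_storage("core"))  # True: core nodes store HDFS data
print(supports_storage("task"))  # False: task nodes only provide compute
```

This is also why task nodes can be added and removed flexibly while core nodes cannot: removing a core node would take HDFS data blocks with it.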
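Because a gateway node carries the Hadoop client files, jobs are submitted from it with the ordinary client commands, and the gateway's client configuration routes them to the associated Hadoop cluster. As a sketch, a hypothetical helper that assembles such a `spark-submit` command line (the class and JAR names are placeholders, not part of EMR):

```python
# Hypothetical helper that assembles a spark-submit command line to run from
# a gateway node. The gateway's client configuration points "yarn" at the
# associated Hadoop cluster. Class and JAR names below are placeholders.
def build_spark_submit(jar: str, main_class: str, master: str = "yarn",
                       deploy_mode: str = "cluster") -> list[str]:
    return [
        "spark-submit",
        "--master", master,          # YARN of the associated Hadoop cluster
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        jar,
    ]

cmd = build_spark_submit("my-app.jar", "com.example.WordCount")
print(" ".join(cmd))
```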
Comparison between Alibaba Cloud EMR clusters and self-managed Hadoop clusters
The following table compares Alibaba Cloud EMR clusters and self-managed Hadoop clusters.
|Comparison item|Alibaba Cloud EMR cluster|Self-managed Hadoop cluster|
|---|---|---|
|Cost|Resources are billed on a subscription or pay-as-you-go basis. You can flexibly adjust the resources in an EMR cluster and tier data storage, which keeps resource utilization high. No additional software license fees are charged.|Resources must be estimated in advance and are relatively fixed, which keeps resource utilization low. If a commercial Hadoop distribution is used, additional license fees are charged.|
|Performance|EMR provides optimized versions of open source components, which deliver significantly better performance.|Open source Hadoop is used as is. You must tune performance yourself based on your business requirements.|
|Usability|EMR Hadoop clusters can be started in minutes to quickly respond to business requirements.|You must purchase servers and deploy Hadoop components. Creating a self-managed cluster can take several weeks.|
|Elasticity|You can temporarily start and release clusters as jobs require. Cluster resources can be scaled dynamically based on cluster load or on a schedule. JindoFS uses a compute-storage separated architecture, so you can scale computing and storage resources independently.|A compute-storage coupled architecture is used. Resources are relatively fixed and cannot be adjusted flexibly.|
|Security|EMR clusters provide multi-tenancy, permission management on tables, columns, and rows, log auditing, and data encryption.|You must configure multi-tenancy yourself, which requires extensive tuning and may still fall short of enterprise requirements.|
|Reliability|EMR clusters are proven in large-scale enterprise environments, are continuously upgraded to track open source versions, and pass professional compatibility tests. They therefore provide a better user experience than self-managed clusters.|You must upgrade open source Hadoop yourself, verify the version compatibility of different components, and fix bugs.|
|Service support|A professional big data team provides after-sales support.|Service support is unavailable, and additional license fees are charged for the Hadoop distribution that you use.|
Comparison between EMR on ECS and EMR on ACK
- In addition to EMR clusters hosted on Elastic Compute Service (ECS) instances, EMR allows you to run Spark and Presto jobs on Container Service for Kubernetes (ACK) clusters. This way, different applications can share one ACK cluster, and computing resources are shared across zones.
- When you run big data jobs, such as Spark and Presto jobs, on ACK clusters, EMR on ACK automatically deploys and manages the services. If EMR on ACK is used together with EMR Shuffle Service, the performance of Spark jobs can be significantly improved.
|Deployment mode|Description|
|---|---|
|EMR on ECS|When you create an EMR cluster, the EMR system deploys components of the open source Hadoop ecosystem on ECS instances based on your configurations and starts the components as services in the cluster. You can perform O&M operations on the services and ECS instances of the EMR cluster in the EMR console. When you run big data jobs, the jobs are submitted to the EMR cluster.|
|EMR on ACK|Before you use EMR on ACK, make sure that an ACK cluster is deployed. You can then create an EMR cluster that deploys big data components on ACK resources and runs them in containers.|
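With EMR on ACK, Spark jobs ultimately run as pods on the ACK cluster. Conceptually, this corresponds to Spark's native Kubernetes scheduler, where the submission master URL uses the `k8s://` scheme. A minimal illustrative sketch follows; the API server address is a placeholder, and in practice EMR on ACK manages these details for you:

```python
# Illustrative only: Spark's native Kubernetes submission uses a k8s:// master
# URL pointing at the Kubernetes API server. EMR on ACK handles deployment for
# you, so you normally do not build this by hand. The address is a placeholder.
def k8s_master_url(apiserver: str) -> str:
    """Prefix an HTTPS API server address with Spark's k8s:// scheme."""
    return f"k8s://{apiserver}"

print(k8s_master_url("https://203.0.113.10:6443"))
```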