Alibaba Cloud Elastic MapReduce (or E-MapReduce) is a big data processing solution that facilitates the processing and analysis of massive amounts of data.
Built on Alibaba Cloud Elastic Compute Service (ECS) and based on open-source Apache Hadoop and Apache Spark, E-MapReduce flexibly manages your data in a wide range of scenarios, such as trend analysis, data warehousing, and online and offline data processing. It also makes it easy for you to import and export data to and from other cloud storage systems and database systems, such as Alibaba Cloud OSS and Alibaba Cloud RDS.
In general, to use a distributed processing system such as Hadoop or Spark, follow these steps:
1. Evaluate the business characteristics.
2. Select a machine type.
3. Purchase a machine.
4. Prepare the hardware environment.
5. Install an operating system.
6. Deploy applications (such as Hadoop and Spark).
7. Start a cluster.
8. Write applications.
9. Run a job.
10. Obtain data or perform another operation.
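The application-side steps (writing an application, running a job, and obtaining the data) can be sketched with a minimal word count in the Hadoop Streaming style. This is a hedged, self-contained illustration in pure Python, not E-MapReduce-specific code; the function names are chosen for this sketch only.

```python
import itertools
from operator import itemgetter

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word (write the application)."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts for each word (run the job)."""
    for word, group in itertools.groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Obtain the resulting data.
    lines = ["to be or not to be"]
    print(dict(reducer(mapper(lines))))
    # → {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

On a real cluster, the map and reduce phases run distributed across the slave nodes; the logic itself, however, stays this simple.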
Steps 1-7 are preliminary tasks and can take considerable time to complete. Steps 8-10, by contrast, concern your application logic. E-MapReduce provides an integrated set of cluster management tools for building, configuring, running, and managing clusters; configuring and running jobs; and selecting hosts, deploying environments, and monitoring performance.
With E-MapReduce, processes such as procurement, preparation, operation, and maintenance are all managed for you, allowing you to focus on the processing logic of your applications. E-MapReduce also provides flexible combination modes, allowing you to select different cluster services according to your needs. For example, if you only need daily statistics or simple batch operations, you can run just the Hadoop services in E-MapReduce. If you later want to implement stream-oriented and real-time computing, you can add Spark services to the cluster.
Structure of E-MapReduce
Clusters are the core component of E-MapReduce. An E-MapReduce cluster is essentially a Hadoop or Spark cluster that consists of multiple Alibaba Cloud ECS instances. In Hadoop, for example, the daemons running on the ECS instances (such as the NameNode, DataNode, ResourceManager, and NodeManager) together form a Hadoop cluster. The nodes that run the NameNode and ResourceManager are known as master nodes, while those that run the DataNode and NodeManager are called slave nodes.
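To illustrate this layout, the sketch below models which role an ECS instance plays based on the Hadoop daemons it runs. The names and structure are hypothetical and for illustration only, not an E-MapReduce API.

```python
# Master daemons coordinate the cluster; slave daemons store data and run tasks.
MASTER_DAEMONS = {"namenode", "resourcemanager"}
SLAVE_DAEMONS = {"datanode", "nodemanager"}

def node_role(daemons):
    """Classify an ECS instance by the Hadoop daemons it runs."""
    if daemons & MASTER_DAEMONS:
        return "master"
    if daemons & SLAVE_DAEMONS:
        return "slave"
    return "unknown"

# A one-master, three-slave cluster, as in the figure below.
cluster = {
    "emr-header-1": {"namenode", "resourcemanager"},
    "emr-worker-1": {"datanode", "nodemanager"},
    "emr-worker-2": {"datanode", "nodemanager"},
    "emr-worker-3": {"datanode", "nodemanager"},
}
print({node: node_role(daemons) for node, daemons in cluster.items()})
```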
The following figure shows an E-MapReduce cluster that consists of one master node and three slave nodes: