An E-MapReduce (EMR) cluster consists of three categories of nodes: master, core, and task.

The three node categories run different processes to perform different tasks.
  • Master node: runs the NameNode process of Hadoop HDFS and the ResourceManager process of Hadoop YARN.
  • Core node: runs the DataNode process of Hadoop HDFS and the NodeManager process of Hadoop YARN.
  • Task node: runs the NodeManager process of Hadoop YARN and performs only computing tasks.
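The process layout above can be checked on any node with the JDK's `jps` tool, which lists running Java processes. The output below is illustrative of a master node, not exact; actual output includes process IDs and varies by cluster configuration:

```shell
# On a master node, list the running JVM processes.
# A master node is expected to show the HDFS NameNode and YARN ResourceManager;
# a core node would instead show DataNode and NodeManager.
jps
```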
Before you create a cluster, you must determine the specifications of Elastic Compute Service (ECS) instances for each node category. Instances of the same category are in the same group. After you create a cluster, you can add ECS instances to the core node group or task node group to scale out the cluster.
Note: In EMR V3.2.0 and later, clusters support task nodes.

Master node

A master node is deployed with the management components of cluster services, such as ResourceManager of Hadoop YARN.

You can access web UIs to view the running status of the services in a cluster. To test or run a job, you can connect to the master node and submit the job from the command line. For more information about how to connect to a master node, see Connect to the master node of an EMR cluster in SSH mode.
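As a sketch, connecting to the master node and submitting a test job from the command line might look like the following. The IP address is a placeholder, and the examples JAR path is an assumption that varies by Hadoop version and installation:

```shell
# Connect to the master node over SSH (replace the placeholder with
# your master node's public IP address).
ssh root@<master-node-public-ip>

# From the master node, submit a sample MapReduce job to YARN.
# The JAR path below is an assumed location; adjust it to your installation.
hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100
```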

Core node

Core nodes in a cluster are managed by the master node. They run the DataNode process of Hadoop HDFS to store all of the cluster's data, and computing service processes such as NodeManager of Hadoop YARN to run computing tasks.

To cope with growing storage and computing workloads, you can scale out core nodes at any time without affecting the running cluster. Core nodes can use different storage media to store data. For more information, see Local disks and Block Storage overview.
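After scaling out core nodes, you can verify from the master node that the new DataNodes have registered with HDFS. These are standard HDFS administration commands, shown here as a sketch:

```shell
# List the DataNodes registered with the NameNode, including the
# configured and remaining capacity of each node.
hdfs dfsadmin -report

# Show the overall capacity and remaining space of the HDFS file system.
hdfs dfs -df -h /
```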

Task node

Task nodes run only computing tasks and cannot be used to store HDFS data. If the core nodes of a cluster provide sufficient computing capabilities, task nodes are not required. If the computing capabilities of the core nodes become insufficient, you can add task nodes to the cluster at any time and run Hadoop MapReduce tasks and Spark executors on them to provide extra computing capabilities.
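Because task nodes run NodeManager, YARN can schedule Spark executors onto them automatically alongside core nodes. A hedged sketch of submitting a Spark job to YARN follows; the SparkPi class is the standard Spark example, but the JAR path is an assumed install location:

```shell
# Submit the standard SparkPi example to YARN in cluster mode.
# YARN places the requested executors on core or task nodes,
# wherever a NodeManager has free resources.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark-current/examples/jars/spark-examples_*.jar 100
```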

Task nodes can be added to or removed from a cluster at any time without affecting the running of the cluster.