Elastic Compute Service: Build a Hadoop environment

Last Updated: May 15, 2026

Deploy distributed or pseudo-distributed Hadoop clusters on ECS Linux instances for big data storage and processing.

Background

Apache Hadoop is a framework for distributed processing of large-scale datasets across computer clusters. It scales from a single server to thousands of machines, each providing local computation and storage. Hadoop detects and handles failures at the application layer, delivering high availability without relying on hardware redundancy.

The core components of Hadoop are Hadoop Distributed File System (HDFS) and MapReduce:

  • HDFS: A distributed file system for storing and reading application data.

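  • MapReduce: A distributed computing framework that splits each job into Map and Reduce phases. In Hadoop 2 and later, including the Hadoop 3.2.4 release used in this topic, YARN (ResourceManager and NodeManager) schedules and runs the tasks across cluster nodes. The JobTracker filled this role only in Hadoop 1.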
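To make the Map and Reduce phases concrete, the following single-machine sketch counts words with standard shell tools. It is only an analogy and not how Hadoop executes jobs: the function name wordcount is hypothetical, tr stands in for the map step, sort stands in for the shuffle that groups identical keys, and uniq -c stands in for the reduce step.

```shell
# Illustrative analogy only: a word count expressed as map -> shuffle -> reduce.
# wordcount is a hypothetical helper, not part of Hadoop.
wordcount() {
  tr -s '[:space:]' '\n' |          # map: emit one word (key) per line
    grep -v '^$' |                  # drop empty records
    sort |                          # shuffle: bring identical keys together
    uniq -c |                       # reduce: count the records for each key
    awk '{ print $2 "\t" $1 }'      # format each result as "word<TAB>count"
}
```

For example, printf 'to be or not to be\n' | wordcount prints one line per distinct word. In a real cluster, the map and reduce steps run in parallel on many nodes and read their input from HDFS.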

Pseudo-distributed mode and fully-distributed mode compare as follows:

Number of nodes

  • Pseudo-distributed mode: Single node. All services run on one machine.

  • Fully-distributed mode: Multiple nodes. Services are distributed across multiple machines.

Resource utilization

  • Pseudo-distributed mode: Uses a single machine's resources.

  • Fully-distributed mode: Leverages computing and storage resources across multiple machines.

Fault tolerance

  • Pseudo-distributed mode: Low. A single point of failure makes the entire cluster unavailable.

  • Fully-distributed mode: High. Supports data replication and high availability configurations.

Scenarios

  • Pseudo-distributed mode: Development and testing, learning and training, and small projects.

  • Fully-distributed mode: Production environments, high availability and fault tolerance scenarios, multi-user and multitasking workloads, and large-scale data storage and processing.

Quick deployment

Click Run Now to open Terraform Explorer, where you can view and run Terraform code to automatically build a Hadoop environment on an ECS instance.

Prerequisites

Your ECS instances must meet the following requirements:

Instance

  • Pseudo-distributed: 1 instance

  • Distributed: 3 or more instances

Note

Add the instances to a deployment set that uses the High Availability strategy to improve availability and simplify cluster management.

Operating system

Linux

Public IP address

The instance is assigned a public IP address or associated with an Elastic IP Address (EIP).

Instance security group

Allow inbound traffic on ports 22, 443, 8088 (Hadoop YARN web UI), and 9870 (Hadoop NameNode web UI).

Note

For distributed deployments, also allow traffic on port 9868 (Hadoop Secondary NameNode web UI).

See Manage security group rules.

Java Development Kit (JDK)

This topic uses Hadoop 3.2.4 and Java 8. For other versions, see Hadoop Java Versions.

  • Hadoop 3.3: Java 8 and Java 11

  • Hadoop 3.0.x to 3.2.x: Java 8

  • Hadoop 2.7.x to 2.10.x: Java 7 and Java 8
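The version compatibility listed above can be encoded as a small lookup, which may be handy in a provisioning script. This is a minimal sketch: the function name supported_java is hypothetical, and it covers only the Hadoop versions named in this topic. Any other version is reported as unknown rather than guessed.

```shell
# Hypothetical helper: print the Java versions that this topic lists for a Hadoop version.
supported_java() {
  case "$1" in
    3.3|3.3.*)                 echo "8 11" ;;
    3.0.*|3.1.*|3.2.*)         echo "8" ;;
    2.7.*|2.8.*|2.9.*|2.10.*)  echo "7 8" ;;
    *)                         echo "unknown"; return 1 ;;   # not covered by this topic
  esac
}
```

For example, supported_java 3.2.4 prints 8, which matches the combination used in this topic.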

Procedure

Distributed

Note

Plan the nodes before deploying Hadoop. This example uses three instances: hadoop001 is the master node, and hadoop002 and hadoop003 are worker nodes.

HDFS

  • hadoop001: NameNode and DataNode

  • hadoop002: DataNode

  • hadoop003: SecondaryNameNode and DataNode

YARN

  • hadoop001: NodeManager

  • hadoop002: ResourceManager and NodeManager

  • hadoop003: NodeManager

Step 1: Install the JDK

Important

Install the JDK on all nodes.

  1. Connect to the ECS instance as a regular user.

    See Connect to a Linux instance using Workbench.

    Important

    The Hadoop community does not recommend running Hadoop as the root user due to security and permission issues. Use a non-root user such as ecs-user.

  2. Download the JDK 1.8 installation package.

    wget https://download.java.net/openjdk/jdk8u41/ri/openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
  3. Decompress the JDK 1.8 installation package.

    tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
  4. Move and rename the JDK installation folder.

    This example renames the JDK installation folder to java8. You can use a different name.

    sudo mv java-se-8u41-ri/ /usr/java8
  5. Configure the Java environment variables.

    If you renamed the JDK installation folder, replace java8 in the following commands with the actual name.

    sudo sh -c "echo 'export JAVA_HOME=/usr/java8' >> /etc/profile"
    sudo sh -c 'echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile'
    source /etc/profile
  6. Verify that the JDK is installed.

    java -version

    The installation succeeded if the command prints Java version information for a 1.8 release.
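The verification in the last step can also be scripted. The sketch below extracts the major Java version from the banner that java -version prints. The function name java_major is hypothetical, and the parsing assumes the usual banner format in which the version appears in double quotes after the word version.

```shell
# Hypothetical helper: read `java -version` output on stdin and print the major version.
# Java 8 and earlier report versions such as 1.8.0_41; Java 9 and later report versions such as 11.0.2.
java_major() {
  # Keep only the quoted version string from the first line that contains one.
  ver=$(sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    "")   return 1 ;;                       # no version string found
    1.*)  echo "$ver" | cut -d. -f2 ;;      # 1.8.0_41 -> 8
    *)    echo "$ver" | cut -d. -f1 ;;      # 11.0.2   -> 11
  esac
}
```

Because java -version writes to standard error, redirect it when calling the helper: java -version 2>&1 | java_major. For the JDK installed in this topic, the result should be 8.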

Step 2: Configure passwordless SSH logon

Important

Perform this operation on all instances.

Set up passwordless SSH logon so nodes can connect without password authentication, simplifying cluster management.

  1. Configure hostnames and host resolution.

    sudo vim /etc/hosts

    For each instance in the cluster, add one line in the format <Primary private IP address> <Hostname> to the /etc/hosts file. For example:

    <Primary private IP address> hadoop001
    <Primary private IP address> hadoop002
    <Primary private IP address> hadoop003
  2. Create a public key and a private key.

    ssh-keygen -t rsa

    Press Enter at each prompt to accept the default key location and an empty passphrase.

  3. Copy the public key to every node, including the local node, by running ssh-copy-id <Hostname> once for each hostname.

    For example, on hadoop001, run the following commands. After each command, enter yes and the password of the corresponding instance.

    ssh-copy-id hadoop001
    ssh-copy-id hadoop002
    ssh-copy-id hadoop003

    Passwordless logon is configured when each command reports that the key was added. To confirm, run ssh <Hostname>. The logon should succeed without a password prompt.
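Passwordless logon can be checked for all nodes at once. The following is a minimal sketch, not part of the original procedure: check_passwordless_ssh and the SSH_CMD override are hypothetical names, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Hypothetical helper: confirm that every listed node accepts key-based logon.
# SSH_CMD can be overridden (for example, for a dry run); it defaults to ssh.
check_passwordless_ssh() {
  failed=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}
```

On each instance, check_passwordless_ssh hadoop001 hadoop002 hadoop003 should print OK for all three hosts. A FAIL line usually means that ssh-copy-id was not run for that host or that the hostname does not resolve.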

Step 3: Install Hadoop

Important

Run the following commands on all instances.

  1. Download the Hadoop installation package.

    wget http://mirrors.cloud.aliyuncs.com/apache/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
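  2. Decompress the Hadoop installation package to /opt/hadoop, and make the current user the owner of the directory so that Hadoop can write its logs and data there without root privileges.

    sudo tar -zxvf hadoop-3.2.4.tar.gz -C /opt/
    sudo mv /opt/hadoop-3.2.4 /opt/hadoop
    sudo chown -R "$(id -un)":"$(id -gn)" /opt/hadoop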
  3. Configure the Hadoop environment variables.

    sudo sh -c "echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile"
    sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/bin' >> /etc/profile"
    sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/sbin' >> /etc/profile"
    source /etc/profile
  4. Modify the yarn-env.sh and hadoop-env.sh configuration files.

    sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/yarn-env.sh'
    sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/hadoop-env.sh'
  5. Verify that Hadoop is installed.

    hadoop version

    The installation succeeded if the command prints Hadoop 3.2.4 followed by build information.
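The export commands in this step append to /etc/profile every time they run, so repeating the step leaves duplicate lines. One way to avoid that is to append a line only when it is missing. The sketch below illustrates the idea. The function name append_once is hypothetical and not part of Hadoop or the original procedure.

```shell
# Hypothetical helper: append a line to a file only if the exact line is not already there,
# so that rerunning the setup does not add duplicate export lines.
append_once() {
  line=$1
  file=$2
  # -q: quiet, -x: match the whole line, -F: treat the line as a fixed string
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example, append_once 'export HADOOP_HOME=/opt/hadoop' /etc/profile adds the line on the first call and does nothing on later calls. Writing to /etc/profile requires root privileges, so the function would have to run in a root shell.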

Step 4: Configure Hadoop

Important

Modify Hadoop configuration files on all nodes.

  1. Modify the Hadoop configuration file core-site.xml.

    1. Open the file for editing.

      sudo vim /opt/hadoop/etc/hadoop/core-site.xml
    2. In the <configuration></configuration> node, insert the following content.

         <!--Specify the NameNode address-->
         <property>
               <name>fs.defaultFS</name>
               <value>hdfs://hadoop001:8020</value>
         </property>
         <!--Specify the directory to store files generated by Hadoop-->
         <property>
               <name>hadoop.tmp.dir</name>
               <value>/opt/hadoop/data</value>
         </property>
         <!--Configure the static user for HDFS web logon as hadoop-->
         <property>
                <name>hadoop.http.staticuser.user</name>
                <value>hadoop</value>
         </property>
  2. Modify the Hadoop configuration file hdfs-site.xml.

    1. Open the file for editing.

      sudo vim /opt/hadoop/etc/hadoop/hdfs-site.xml
    2. In the <configuration></configuration> node, insert the following content.

          <!-- NameNode web endpoint -->
          <property>
              <name>dfs.namenode.http-address</name>  
              <value>hadoop001:9870</value>
          </property>
          <!-- Secondary NameNode web endpoint -->
          <property>
              <name>dfs.namenode.secondary.http-address</name>
              <value>hadoop003:9868</value>
          </property>
  3. Modify the Hadoop configuration file yarn-site.xml.

    1. Open the file for editing.

      sudo vim /opt/hadoop/etc/hadoop/yarn-site.xml
    2. In the <configuration></configuration> node, insert the following content.

          <!-- Auxiliary service that NodeManager runs for the MapReduce shuffle -->
          <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
          </property>

          <!-- Specify the YARN (ResourceManager) address -->
          <property>
              <name>yarn.resourcemanager.hostname</name>
              <value>hadoop002</value>
          </property>

          <!-- Specify the whitelist of environment variables that NodeManager can pass to containers -->
          <property>
              <name>yarn.nodemanager.env-whitelist</name>
              <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
          </property>
  4. Modify the Hadoop configuration file mapred-site.xml.

    1. Open the file for editing.

      sudo vim /opt/hadoop/etc/hadoop/mapred-site.xml
    2. In the <configuration></configuration> node, insert the following content.

          <!-- Tell Hadoop to run MapReduce (MR) on YARN -->
          <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
  5. Modify the Hadoop configuration file workers.

    1. Open the file for editing.

      sudo vim /opt/hadoop/etc/hadoop/workers
    2. In the workers file, insert the instance information.

      hadoop001
      hadoop002
      hadoop003
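Because the same five files must be identical on every node, an alternative to editing them three times is to edit them on one node and copy them to the others. The following is a minimal sketch under stated assumptions: sync_hadoop_conf and the COPY_CMD override are hypothetical names, passwordless SSH from Step 2 is in place, and the logon user is allowed to write to the configuration directory on each target node.

```shell
# Hypothetical helper: copy the edited configuration files from this node to the
# other hosts listed in the workers file. COPY_CMD defaults to scp and can be overridden.
# Usage: sync_hadoop_conf <local hostname> [configuration directory]
sync_hadoop_conf() {
  self=$1
  conf_dir=${2:-/opt/hadoop/etc/hadoop}
  while IFS= read -r host; do
    # Skip blank lines, comment lines, and the local node.
    case "$host" in ''|'#'*) continue ;; esac
    [ "$host" = "$self" ] && continue
    ${COPY_CMD:-scp} \
      "$conf_dir/core-site.xml" "$conf_dir/hdfs-site.xml" \
      "$conf_dir/yarn-site.xml" "$conf_dir/mapred-site.xml" \
      "$conf_dir/workers" "$host:$conf_dir/" </dev/null || return 1
  done < "$conf_dir/workers"
}
```

On hadoop001, sync_hadoop_conf hadoop001 would copy the files to hadoop002 and hadoop003. The local hostname is passed explicitly because this topic configures name resolution in /etc/hosts and does not change the system hostname.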

Step 5: Start Hadoop

  1. Initialize the NameNode.

    Warning

    Run this command on all three instances, and only before the first startup. Formatting again erases the existing HDFS metadata.

    hdfs namenode -format
  2. Start Hadoop.

    Important
    • The Hadoop community does not recommend running Hadoop as the root user due to security and permission issues. Use a non-root user such as ecs-user.

    • If you must run Hadoop as root, understand the access control model and associated risks before modifying the following configuration files.

      Note: Running Hadoop as root introduces serious security risks, including but not limited to data breaches, increased vulnerability to malware that can obtain root privileges, and unexpected permission issues. See the official Hadoop documentation.

    If you run Hadoop as the root user, first modify the following scripts so that root can start Hadoop services. Skip this modification if you use a non-root user.

    In this example, the scripts are in the /opt/hadoop/sbin directory.

    1. In the start-dfs.sh and stop-dfs.sh files, add the following parameters.

      HDFS_DATANODE_USER=root
      HADOOP_SECURE_DN_USER=hdfs
      HDFS_NAMENODE_USER=root
      HDFS_SECONDARYNAMENODE_USER=root

    2. In the start-yarn.sh and stop-yarn.sh files, add the following parameters.

      YARN_RESOURCEMANAGER_USER=root
      HADOOP_SECURE_DN_USER=yarn
      YARN_NODEMANAGER_USER=root

    Then start the HDFS and YARN services.

    1. On hadoop001, run start-dfs.sh to start the HDFS service.

      This script starts NameNode, SecondaryNameNode, and DataNode.

      start-dfs.sh

      HDFS is running if jps lists NameNode and DataNode on hadoop001, DataNode on hadoop002, and SecondaryNameNode and DataNode on hadoop003.

    2. On hadoop002, run start-yarn.sh to start the YARN service.

      This script starts ResourceManager and NodeManager.

      start-yarn.sh

      YARN is running if jps lists ResourceManager on hadoop002 and NodeManager on all three nodes.

  3. View the started processes on all three nodes.

    jps

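    Each node should list the daemons assigned to it in the node plan: NameNode, DataNode, and NodeManager on hadoop001; ResourceManager, NodeManager, and DataNode on hadoop002; SecondaryNameNode, DataNode, and NodeManager on hadoop003.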

  4. In your browser, enter http://<Public IP address of hadoop002 ECS instance>:8088 to access the YARN web UI.

    View cluster resource usage, MapReduce job status, and queue information.

    Important

    Allow inbound traffic on port 8088 in the security group. Otherwise, the web UI is inaccessible. See Add a security group rule.

  5. In your browser, enter http://<Public IP address of hadoop001 ECS instance>:9870 to access the NameNode web UI. Enter http://<Public IP address of hadoop003 ECS instance>:9868 to access the SecondaryNameNode web UI.

    View HDFS status, cluster health, active nodes, and NameNode logs.

    If both pages load, the distributed environment is built.

    Important

    Allow inbound traffic on ports 9870 and 9868 in the security group. Otherwise, the web UIs are inaccessible. See Add a security group rule.
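The process check on each node can be automated by comparing the jps output with the daemons that the node plan assigns to that node. This is a minimal sketch: missing_daemons is a hypothetical helper, and it assumes that jps prints one process per line in the form of a process ID followed by the class name.

```shell
# Hypothetical helper: read `jps` output on stdin and print the expected daemons
# (given as arguments) that are not running. Returns 0 only if none are missing.
missing_daemons() {
  running=$(awk '{ print $2 }')     # keep the class name from each "<pid> <name>" line
  missing=""
  for daemon in "$@"; do
    # -x matches the whole line, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$running" | grep -qx "$daemon" || missing="$missing $daemon"
  done
  echo "${missing# }"
  [ -z "$missing" ]
}
```

On hadoop001, run jps | missing_daemons NameNode DataNode NodeManager. On hadoop002, pass ResourceManager NodeManager DataNode, and on hadoop003, pass SecondaryNameNode DataNode NodeManager. An empty result means that every expected daemon is running.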

Pseudo-distributed

Step 1: Install the JDK

  1. Connect to the ECS instance as a regular user.

    See Connect to a Linux instance using Workbench.

    Important

    The Hadoop community does not recommend running Hadoop as the root user due to security and permission issues. Use a non-root user such as ecs-user.

  2. Download the JDK 1.8 installation package.

    wget https://download.java.net/openjdk/jdk8u41/ri/openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
  3. Decompress the JDK 1.8 installation package.

    tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
  4. Move and rename the JDK installation folder.

    This example renames the JDK installation folder to java8. You can use a different name.

    sudo mv java-se-8u41-ri/ /usr/java8
  5. Configure the Java environment variables.

    If you renamed the JDK installation folder, replace java8 in the following commands with the actual name.

    sudo sh -c "echo 'export JAVA_HOME=/usr/java8' >> /etc/profile"
    sudo sh -c 'echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile'
    source /etc/profile
  6. Verify that the JDK is installed.

    java -version

    The installation succeeded if the command prints Java version information for a 1.8 release.

Step 2: Configure passwordless SSH logon

Note

A single node also requires passwordless SSH logon. Otherwise, starting NameNode and DataNode fails with a permission denied error.

  1. Create a public key and a private key.

    ssh-keygen -t rsa

    Press Enter at each prompt to accept the default key location and an empty passphrase.

  2. Add the public key to the authorized_keys file.

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
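Key-based logon to the local node depends on file permissions as well as on the key itself: with its default StrictModes setting, sshd ignores an authorized_keys file that other users can write to. The sketch below combines both concerns. The function name authorize_own_key is hypothetical, and the key file name id_rsa.pub assumes the default location offered by ssh-keygen -t rsa.

```shell
# Hypothetical helper: authorize this user's own public key for logon to the local node.
# Usage: authorize_own_key [ssh directory]   (defaults to ~/.ssh)
authorize_own_key() {
  ssh_dir=${1:-$HOME/.ssh}
  pub=$ssh_dir/id_rsa.pub
  auth=$ssh_dir/authorized_keys
  [ -f "$pub" ] || { echo "missing $pub" >&2; return 1; }
  touch "$auth"
  # Add the key only if the exact line is not already present.
  grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
  # Restrict access so that sshd accepts the files.
  chmod 700 "$ssh_dir"
  chmod 600 "$auth"
}
```

After authorize_own_key runs, ssh localhost should log on without a password prompt. Running the helper again leaves authorized_keys unchanged.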

Step 3: Install Hadoop

  1. Download the Hadoop installation package.

    wget http://mirrors.cloud.aliyuncs.com/apache/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
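  2. Decompress the Hadoop installation package to /opt/hadoop, and make the current user the owner of the directory so that Hadoop can write its logs and data there without root privileges.

    sudo tar -zxvf hadoop-3.2.4.tar.gz -C /opt/
    sudo mv /opt/hadoop-3.2.4 /opt/hadoop
    sudo chown -R "$(id -un)":"$(id -gn)" /opt/hadoop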
  3. Configure the Hadoop environment variables.

    sudo sh -c "echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile"
    sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/bin' >> /etc/profile"
    sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/sbin' >> /etc/profile"
    source /etc/profile
  4. Modify the yarn-env.sh and hadoop-env.sh configuration files.

    sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/yarn-env.sh'
    sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/hadoop-env.sh'
  5. Verify that Hadoop is installed.

    hadoop version

    The installation succeeded if the command prints Hadoop 3.2.4 followed by build information.

Step 4: Configure Hadoop

  1. Modify the Hadoop configuration file core-site.xml.

    1. Open the file for editing.

      sudo vim /opt/hadoop/etc/hadoop/core-site.xml
    2. In the <configuration></configuration> node, insert the following content.

          <property>
              <name>hadoop.tmp.dir</name>
              <value>file:/opt/hadoop/tmp</value>
              <description>location to store temporary files</description>
          </property>
          <property>
              <name>fs.defaultFS</name>
              <value>hdfs://localhost:9000</value>
          </property>
  2. Modify the Hadoop configuration file hdfs-site.xml.

    1. Open the file for editing.

      sudo vim /opt/hadoop/etc/hadoop/hdfs-site.xml
    2. In the <configuration></configuration> node, insert the following content.

          <property>
              <name>dfs.replication</name>
              <value>1</value>
          </property>
          <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/opt/hadoop/tmp/dfs/name</value>
          </property>
          <property>
              <name>dfs.datanode.data.dir</name>
              <value>file:/opt/hadoop/tmp/dfs/data</value>
          </property>
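After editing the files, it can help to read a property back and confirm that the value is the one you intended. The following sketch does this with awk. The function name get_property is hypothetical, and the parsing assumes that each <name> and <value> element sits on its own line, as in the snippets in this step. It is not a general XML parser.

```shell
# Hypothetical helper: print the <value> of a named property in a Hadoop *-site.xml file.
# Usage: get_property <file> <property name>
get_property() {
  awk -v want="$2" '
    /<name>/  { gsub(/.*<name>|<\/name>.*/, ""); current = $0 }
    /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); if (current == want) { print; exit } }
  ' "$1"
}
```

For example, get_property /opt/hadoop/etc/hadoop/hdfs-site.xml dfs.replication should print 1 for the configuration above. On a node where Hadoop is installed, hdfs getconf -confKey dfs.replication reports the value that Hadoop itself resolves.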

Step 5: Start Hadoop

  1. Initialize the NameNode.

    Run this command only before the first startup. Formatting again erases the existing HDFS metadata.

    hdfs namenode -format
  2. Start Hadoop.

    Important
    • The Hadoop community does not recommend running Hadoop as the root user due to security and permission issues. Use a non-root user such as ecs-user.

    • If you must run Hadoop as root, understand the access control model and associated risks before modifying the following configuration files.

      Note: Running Hadoop as root introduces serious security risks, including but not limited to data breaches, increased vulnerability to malware that can obtain root privileges, and unexpected permission issues. See the official Hadoop documentation.

    If you run Hadoop as the root user, first modify the following scripts so that root can start Hadoop services. Skip this modification if you use a non-root user.

    In this example, the scripts are in the /opt/hadoop/sbin directory.

    1. In the start-dfs.sh and stop-dfs.sh files, add the following parameters.

      HDFS_DATANODE_USER=root
      HADOOP_SECURE_DN_USER=hdfs
      HDFS_NAMENODE_USER=root
      HDFS_SECONDARYNAMENODE_USER=root

    2. In the start-yarn.sh and stop-yarn.sh files, add the following parameters.

      YARN_RESOURCEMANAGER_USER=root
      HADOOP_SECURE_DN_USER=yarn
      YARN_NODEMANAGER_USER=root

    Then start the HDFS and YARN services.

    1. Start the HDFS service.

      This script starts NameNode, SecondaryNameNode, and DataNode.

      start-dfs.sh

      HDFS is running if jps lists NameNode, SecondaryNameNode, and DataNode.

    2. Start the YARN service.

      This script starts ResourceManager and NodeManager.

      start-yarn.sh

      YARN is running if jps lists ResourceManager and NodeManager.

  3. View the started processes.

    jps

    The output should list NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager.

  4. In your browser, enter http://<Public IP address of the ECS instance>:8088 to access the YARN web UI.

    View cluster resource usage, MapReduce job status, and queue information.

    Important

    Allow inbound traffic on port 8088 in the security group. Otherwise, the web UI is inaccessible. See Add a security group rule.

  5. In your browser, enter http://<Public IP address of the ECS instance>:9870 to access the NameNode web UI.

    View HDFS status, cluster health, active nodes, and NameNode logs.

    If the page loads, the pseudo-distributed environment is built.

    Important

    Allow inbound traffic on port 9870 in the security group. Otherwise, the web UI is inaccessible. See Add a security group rule.
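The two web UIs can also be probed from a terminal before you open a browser. The following is a minimal sketch: web_ui_urls, probe_urls, and the HTTP_CMD override are hypothetical names, and the address 203.0.113.10 in the usage note is a documentation placeholder, not a real instance. The curl options used are standard.

```shell
# Hypothetical helper: list the web UI URLs described in this topic for one address.
web_ui_urls() {
  ip=$1
  echo "http://$ip:8088"    # YARN web UI
  echo "http://$ip:9870"    # NameNode web UI
}

# Hypothetical helper: read URLs on stdin and print the HTTP status code for each one.
# HTTP_CMD defaults to curl. A code of 000 means that no HTTP response was received.
probe_urls() {
  while IFS= read -r url; do
    # -s: silent, -o /dev/null: discard the body, -m 5: give up after 5 seconds
    code=$(${HTTP_CMD:-curl} -s -o /dev/null -m 5 -w '%{http_code}' "$url" </dev/null) || code=000
    echo "$code $url"
  done
}
```

For example, web_ui_urls 203.0.113.10 | probe_urls prints one status line per UI. A code of 000 for a running service usually points to a missing security group rule for that port.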

More operations

Create a snapshot-consistent group

For distributed Hadoop, use a snapshot-consistent group to ensure data consistency across the cluster. See Create a snapshot-consistent group.

Hadoop-related operations

For HDFS operations, see Common HDFS commands.
