Build a Hadoop environment - Elastic Compute Service - Alibaba Cloud Documentation Center

Hadoop is an open source, distributed, Java-based software framework that is developed by the Apache Foundation. This topic describes how to quickly build a distributed Hadoop environment and a pseudo-distributed Hadoop environment on Linux Elastic Compute Service (ECS) instances.

Background information

The Apache Hadoop software library is a framework that allows you to process large data sets across clusters of computers in a distributed manner by using simple programming models. The framework can scale up from a single server to thousands of machines that provide local computation and storage capabilities. Hadoop does not rely on hardware to deliver high availability. It is designed to detect and handle failures at the application layer to deliver a highly available service on top of a cluster of computers that are prone to failures.

Hadoop Distributed File System (HDFS) and MapReduce are vital components of Hadoop.

HDFS is a distributed file system that is used for distributed storage of application data and provides access to application data.
MapReduce is a distributed computing framework that distributes computing jobs across servers in a Hadoop cluster. Computing jobs are split into map tasks and reduce tasks. JobTracker schedules the tasks for distributed processing.

Item	Pseudo-distributed mode	Distributed mode
Number of nodes	Single node. A single node hosts all services.	Multiple nodes. Services are distributed across multiple nodes.
Resource utilization	Use the resources of a single machine.	Utilize the computing and storage resources of multiple machines.
Fault tolerance	Low. A single point of failure makes the cluster unavailable.	High. Data replication and high availability configurations are supported.
Scenarios	Development and testing Learning and training Small-scale projects	Production environments Scenarios that require high availability and fault tolerance Multi-user and multi-task environments Large-scale data storage and processing

Prerequisites

An ECS instance that meets the requirements described in the following table is created.

Environment	Requirement
Instance	Pseudo-distributed mode	One instance
Instance	Distributed mode	Three or more instances Note To improve availability, disaster recovery, and Hadoop cluster management, we recommend that you add the instances to a deployment set that uses the High Availability strategy.
Operating system	Linux
Public IP address	Each ECS instance is assigned a public IP address by the system or is associated with an elastic IP address (EIP).
Instance security group	Inbound rules are added to the security groups associated with the ECS instances to open ports 22, 443, 8088, and 9870. Port 8088 is the default web UI port for Hadoop Yet Another Resource Negotiator (YARN). Port 9870 is the default web UI port for Hadoop NameNode. Note If you deploy a distributed Hadoop environment, you must also open port 9868, which is the Hadoop Secondary NameNode custom Web UI port. For more information, see Manage security group rules.
Java Development Kit (JDK) In this topic, Hadoop 3.2.4 and Java 8 are used. To use other Hadoop and Java versions, visit the official Hadoop website for instructions. For more information, see Hadoop Java Versions.	Hadoop version	Java version
	Hadoop 3.3	Java 8 and Java 11
	Hadoop 3.0.x~3.2.x	Java 8
	Hadoop 2.7.x~2.10.x	Java 7 and Java 8

Procedure

Distributed mode

Note

You must plan nodes before you deploy Hadoop. In this example, three instances are used to deploy Hadoop. The hadoop001 node serves as the master node, and the hadoop002 and hadoop003 nodes serve as worker nodes.

Functional component	hadoop001	hadoop002	hadoop003
HDFS	NameNode DataNode	DataNode	SecondaryNameNode DataNode
YARN	NodeManager	ResourceManager NodeManager	NodeManager

Step 1: Install JDK

Important

The JDK environment must be installed on all nodes.

Connect to each ECS instance that you created.
For more information, see Use Workbench to connect to a Linux instance over SSH.
Important
To ensure system security and stability, Apache does not recommend that you start Hadoop as the root user. You may fail to start Hadoop as the root user due to a permissions issue. You can start Hadoop as a non-root user, such as ecs-user.

Run the following command to download the JDK 1.8 installation package:

wget https://download.java.net/openjdk/jdk8u41/ri/openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz

Run the following command to decompress the downloaded JDK 1.8 installation package:
```
tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
```
Run the following command to move and rename the folder to which the JDK installation files are extracted.
In this example, the folder is renamed java8. You can specify a different name for the folder based on your business requirements.
```
sudo mv java-se-8u41-ri/ /usr/java8
```
Run the following commands to configure Java environment variables.
If the name you specified for the folder to which the JDK installation files are extracted is not java8, replace java8 in the following commands with the actual folder name.
```
sudo sh -c "echo 'export JAVA_HOME=/usr/java8' >> /etc/profile"
sudo sh -c 'echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile'
source /etc/profile
```
Run the following command to check whether JDK is installed:
```
java -version
```
The following command output indicates that JDK is installed.

Step 2: Configure password-free SSH logon

Important

Perform this step on all instances.

The password-free SSH logon feature allows nodes to seamlessly connect to each other, without the need to enter a password to verify identity. This allows you to manage and maintain Hadoop clusters in a more convenient and efficient manner.

Configure hostnames and a communication method.
```
sudo vim /etc/hosts
```
Add <Primary private IP-address> <Hostname> of each instance to the /etc/hosts file. Examples:
```
<Primary private IP address> hadoop001
<Primary private IP address> hadoop002
<Primary private IP address> hadoop003
```
Run the following command to create a public key and a private key:
```
ssh-keygen -t rsa
```
Run the ssh-copy-id <Hostname> command to replace the hostname with the actual hostname of each instance. Examples:
Run the ssh-copy-id hadoop001, ssh-copy-id hadoop002, and ssh-copy-id hadoop003 commands in sequence on the hadoop001 node. After you run each command, enter yes and the password of the corresponding ECS instance.
```
ssh-copy-id hadoop001
ssh-copy-id hadoop002
ssh-copy-id hadoop003
```
If the following command output is returned, the password-free logon configuration is successful.

Step 3: Install Hadoop

Important

Run the following commands on all ECS instances.

Run the following command to download the Hadoop installation package:

wget http://mirrors.cloud.aliyuncs.com/apache/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz

Run the following commands to decompress the Hadoop installation package to the /opt/hadoop path:
```
sudo tar -zxvf hadoop-3.2.4.tar.gz -C /opt/
sudo mv /opt/hadoop-3.2.4 /opt/hadoop
```

Run the following commands to configure Hadoop environment variables:

sudo sh -c "echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile"
sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/bin' >> /etc/profile"
sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/sbin' >> /etc/profile"
source /etc/profile

Run the following commands to modify the yarn-env.sh and hadoop-env.sh configuration files:

sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/yarn-env.sh'
sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/hadoop-env.sh'

Run the following command to check whether Hadoop is installed:
```
hadoop version
```
The following command output indicates that Hadoop is installed.

Step 4: Configure Hadoop

Important

Modify the configuration files of Hadoop on all instances.

Modify the core-site.xml configuration file of Hadoop.

Run the following command to open the core-site.xml configuration file:
```
sudo vim /opt/hadoop/etc/hadoop/core-site.xml
```

In the <configuration></configuration> section, add the following content:

!" -- Specify the address of the namenode. -->
   <property>
         <name>fs.defaultFS</name>
         <value>hdfs://hadoop001:8020</value>
   </property>
   <! -- Specify the directory in which files generated when you use Hadoop are stored. -->
   <property>
         <name>hadoop.tmp.dir</name>
         <value>/opt/hadoop/data</value>
   </property>
   <! -- Configure hadoop as the static user for HDFS web page logon. -->
   <property>
          <name>hadoop.http.staticuser.user</name>
          <value>hadoop</value>
   </property>

Modify the hdfs-site.xml configuration file of Hadoop.

Run the following command to open the hdfs-site.xml configuration file:
```
sudo vim /opt/hadoop/etc/hadoop/hdfs-site.xml
```

In the <configuration></configuration> section, add the following content:

    <!-- The NameNode web address.-->
    <property>
        <name>dfs.namenode.http-address</name>  
        <value>hadoop001:9870</value>
    </property>
    <! -- The web address of the secondary NameNode.-->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop003:9868</value>
    </property>

Modify the yarn-site.xml configuration file of Hadoop.

Run the following command to open the yarn-site configuration file:
```
sudo vim /opt/hadoop/etc/hadoop/yarn-site.xml
```

In the <configuration></configuration> section, add the following content:

     <!--Specify the shuffle mode in which NodeManager obtains data.-->
     <property>
	 <name>yarn.nodemanager.aux-services</name>
	 <value>mapreduce_shuffle</value>
     </property>

    <! -- Specify the address of YARN (ResourceManager). -->     
    <property>
	 <name>yarn.resourcemanager.hostname</name>
	 <value>hadoop002</value>
    </property> 
    
    <! -- Specify a whitelist of environment variables that NodeManager allows to pass to containers. -->
    <property>
    	<name>yarn.nodemanager.env-whitelist</name>
    	<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>

Modify the mapred-site.xml configuration file of Hadoop.
1. Run the following command to open the mapred-site.xml configuration file:
```
sudo vim /opt/hadoop/etc/hadoop/mapred-site.xml
```
2. In the <configuration></configuration> section, add the following content:
```
   
   <property>
	 <name>mapreduce.framework.name</name>
	 <value>yarn</value>
   </property>
```
Modify the workers configuration file of Hadoop.
1. Run the following command to open the workers configuration file:
```
sudo vim /opt/hadoop/etc/hadoop/workers
```
2. In the workers file, add the instance information.
```
hadoop001
hadoop002
hadoop003
```

Step 5: Start Hadoop

Run the following command to initialize NameNode:
Warning
The NameNode initialization is required only for the first Hadoop startup. Perform this step on all instances.
```
hadoop namenode -format
```
Start Hadoop.
Important
- To ensure system security and stability, Apache does not recommend that you start Hadoop as the root user. You may fail to start Hadoop as the root user due to a permissions issue. You can start Hadoop as a non-root user, such as ecs-user.
- If you want to start Hadoop as the root user, you must be familiar with the instructions for managing Hadoop permissions and the associated risks and modify the following configuration files.
  Take note that when you start Hadoop as the root user, severe security risks may arise. The security risks include but are not limited to data leaks, vulnerabilities that can be exploited by malware to obtain the highest system permissions, and unexpected permissions issues or operations. For information about Hadoop permissions, see Hadoop in Secure Mode.
Modify configuration files to allow you to start Hadoop as the root user
In most cases, the following configuration files are stored in the /opt/hadoop/sbin directory.
Add the following parameters to the start-dfs.sh and stop-dfs.sh configuration files:
HDFS_DATANODE_USER=root HADOOP_SECURE_DN_USER=hdfs HDFS_NAMENODE_USER=root HDFS_SECONDARYNAMENODE_USER=root
Add the following parameters to the start-yarn.sh and stop-yarn.sh configuration files:
YARN_RESOURCEMANAGER_USER=root HADOOP_SECURE_DN_USER=yarn YARN_NODEMANAGER_USER=root
1. Run the start-dfs.sh command on the hadoop001 node to start HDFS.
  The start-dfs.sh command starts the HDFS service by starting components, such as NameNode, SecondaryNameNode, and DataNode.
```
start-dfs.sh
```
  Run the jps command. The following command output indicates that HDFS is started.
2. Run the start-yarn.sh command on the hadoop002 node to start the YARN service.
  The start-yarn.sh command starts the YARN service by starting components, such as ResourceManager and NodeManager.
```
start-yarn.sh
```
  The following command output indicates that YARN is started.
Run the following command to view the processes that are started:
```
jps
```
The processes shown in the following figure are started.
Enter http://<Public IP address of the hadoop002 node>:8088 in the address bar of a web browser on your on-premises computer to access the web UI of YARN.
The web UI shows information about the entire cluster, including resource usage, status of applications (such as MapReduce jobs), and queue information.
Important
Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 8088. Otherwise, you cannot access the web UI. For information about how to add a security group rule, see Add a security group rule.
Enter http://<Public IP address of the hadoop001 node>:9870 in the address bar of a web browser on your on-premises computer to access the web UI of NameNode. Enter http://<Public IP address of the hadoop003 node>:9868 to access the web UI of SecondaryNameNode.
The web UI shows information about the HDFS file system, including the file system status, cluster health, active nodes, and NameNode logs.
The page shown in the following figure indicates that the distributed Hadoop environment is built.
Important
Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 9870. Otherwise, you cannot access the web UI. For information about how to add a security group rule, see Add a security group rule.

Pseudo-distributed mode

Step 1: Install JDK

Connect to the ECS instance that you created.
For more information, see Use Workbench to connect to a Linux instance over SSH.
Important
To ensure system security and stability, Apache does not recommend that you start Hadoop as the root user. You may fail to start Hadoop as the root user due to a permissions issue. You can start Hadoop as a non-root user, such as ecs-user.

Run the following command to download the JDK 1.8 installation package:

wget https://download.java.net/openjdk/jdk8u41/ri/openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz

Run the following command to decompress the downloaded JDK 1.8 installation package:
```
tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
```
Run the following command to move and rename the folder to which the JDK installation files are extracted.
In this example, the folder is renamed java8. You can specify a different name for the folder based on your business requirements.
```
sudo mv java-se-8u41-ri/ /usr/java8
```
Run the following commands to configure Java environment variables.
If the name you specified for the folder to which the JDK installation files are extracted is not java8, replace java8 in the following commands with the actual folder name.
```
sudo sh -c "echo 'export JAVA_HOME=/usr/java8' >> /etc/profile"
sudo sh -c 'echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile'
source /etc/profile
```
Run the following command to check whether JDK is installed:
```
java -version
```
The following command output indicates that JDK is installed.

Step 2: Configure password-free SSH logon

Note

You must configure SSH password-free logon for the single node. Otherwise, a Permission Denied error occurs when you start NameNode and DataNode.

Run the following command to create a public key and a private key:
```
ssh-keygen -t rsa
```
Run the following commands to add the public key to the authorized_keys file:
```
cd .ssh
cat id_rsa.pub >> authorized_keys
```

Step 3: Install Hadoop

Run the following command to download the Hadoop installation package:

wget http://mirrors.cloud.aliyuncs.com/apache/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz

Run the following commands to decompress the Hadoop installation package to the /opt/hadoop path:
```
sudo tar -zxvf hadoop-3.2.4.tar.gz -C /opt/
sudo mv /opt/hadoop-3.2.4 /opt/hadoop
```

Run the following commands to configure Hadoop environment variables:

sudo sh -c "echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile"
sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/bin' >> /etc/profile"
sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/sbin' >> /etc/profile"
source /etc/profile

Run the following commands to modify the yarn-env.sh and hadoop-env.sh configuration files:

sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/yarn-env.sh'
sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/hadoop-env.sh'

Run the following command to check whether Hadoop is installed:
```
hadoop version
```
The following command output indicates that Hadoop is installed.

Step 4: Configure Hadoop

Modify the core-site.xml configuration file of Hadoop.

Run the following command to open the core-site.xml configuration file:
```
sudo vim /opt/hadoop/etc/hadoop/core-site.xml
```

In the <configuration></configuration> section, add the following content:

    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
        <description>location to store temporary files</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>

Modify the hdfs-site.xml configuration file of Hadoop.

Run the following command to open the hdfs-site.xml configuration file:
```
sudo vim /opt/hadoop/etc/hadoop/hdfs-site.xml
```

In the <configuration></configuration> section, add the following content:

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/tmp/dfs/data</value>
    </property>

Step 5: Start Hadoop

Run the following command to initialize NameNode:
```
hadoop namenode -format
```
Start Hadoop.
Important
- To ensure system security and stability, Apache does not recommend that you start Hadoop as the root user. You may fail to start Hadoop as the root user due to a permissions issue. You can start Hadoop as a non-root user, such as ecs-user.
- If you want to start Hadoop as the root user, you must be familiar with the instructions for managing Hadoop permissions and the associated risks and modify the following configuration files.
  Take note that when you start Hadoop as the root user, severe security risks may arise. The security risks include but are not limited to data leaks, vulnerabilities that can be exploited by malware to obtain the highest system permissions, and unexpected permissions issues or operations. For information about Hadoop permissions, see Hadoop in Secure Mode.
Modify configuration files to allow you to start Hadoop as the root user
In most cases, the following configuration files are stored in the /opt/hadoop/sbin directory.
Add the following parameters to the start-dfs.sh and stop-dfs.sh configuration files:
HDFS_DATANODE_USER=root HADOOP_SECURE_DN_USER=hdfs HDFS_NAMENODE_USER=root HDFS_SECONDARYNAMENODE_USER=root
Add the following parameters to the start-yarn.sh and stop-yarn.sh configuration files:
YARN_RESOURCEMANAGER_USER=root HADOOP_SECURE_DN_USER=yarn YARN_NODEMANAGER_USER=root
1. Run the following command to start the HDFS service.
  The start-dfs.sh command starts the HDFS service by starting components, such as NameNode, SecondaryNameNode, and DataNode.
```
start-dfs.sh
```
  The following command output indicates that the HDFS service is started.
2. Run the following command to start the YARN service.
  The start-yarn.sh command starts the YARN service by starting components, such as ResourceManager, NodeManager, and ApplicationHistoryServer.
```
start-yarn.sh
```
  The following command output indicates that YARN is started.
Run the following command to view the processes that are started:
```
jps
```
The processes shown in the following figure are started.
Enter http://<Public IP address of the ECS instance>:8088 in the address bar of a web browser on your on-premises computer to access the web UI of YARN.
The web UI shows information about the entire cluster, including resource usage, status of applications (such as MapReduce jobs), and queue information.
Important
Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 8088. Otherwise, you cannot access the web UI. For information about how to add a security group rule, see Add a security group rule.
Enter http://<Public IP address of the ECS instance>:9870 in the address bar of a web browser on your on-premises computer to access the web UI of NameNode.
The web UI shows information about the HDFS file system, including the file system status, cluster health, active nodes, and NameNode logs.
The page shown in the following figure indicates that the distributed Hadoop environment is built.
Important
Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 9870. Otherwise, you cannot access the web UI. For information about how to add a security group rule, see Add a security group rule.

Related operations

Create a snapshot-consistent group

If you use distributed Hadoop, we recommend that you create a snapshot-consistent group to ensure data consistency and reliability in the Hadoop cluster and provide a stable environment for subsequent data processing and analysis. For more information, see Create a snapshot-consistent group.

Hadoop-related operations

For information about how to use HDFS components, see Common HDFS commands.

References

For information about how to use the big data environment that is integrated with Alibaba Cloud ECS, see Getting started with EMR.
For information about how to use the data lake development and governance environment, see Getting started with DataWorks.