Building a Hadoop Environment Based on ECS Instances

This article describes how to build a Hadoop pseudo-distributed environment on an Elastic Compute Service (ECS) instance that runs a Linux operating system.

By Alibaba Cloud ECS Team

Prerequisites

1.  An ECS instance has been created. For more information, see Create an instance by using the wizard. In this tutorial, the ECS instance is configured as follows:

• Instance type: ecs.g6.large
• Operating system: CentOS 7.7 64-bit public image
• Network type: Virtual Private Cloud (VPC)
• IP address: a public IP address is assigned

Note: The commands used may vary depending on your instance's operating system and software versions. If your software version or operating system differs from the ones mentioned above, please adjust the commands accordingly.

2.  The ECS instance has been added to a security group with rules that allow inbound traffic on ports 8088 and 50070, which are used by Hadoop. To learn how to add a security group rule, see Add a security group rule.

Background Information

Hadoop is an open-source Apache framework written in Java that enables efficient processing and storage of large datasets across clusters. It allows users to develop distributed programs without needing to understand the underlying infrastructure. Hadoop Distributed File System (HDFS) and MapReduce are essential components of Hadoop.

HDFS is a distributed file system that enables distributed storage and retrieval of application data.

MapReduce is a distributed computing framework that splits computing jobs into map and reduce tasks and distributes them across the servers in a Hadoop cluster. In Hadoop 2.x, which this tutorial installs, these tasks are scheduled by YARN; the JobTracker of Hadoop 1.x has been replaced by the ResourceManager and per-node NodeManagers.

For more information, visit the Apache Hadoop website.
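
To make these components concrete: once the environment described in this article is running, you can round-trip a file through HDFS with the hdfs shell. The following is a minimal sketch; the /demo path is only an example.

# Copy a local file into HDFS, list the directory, and read the file back.
hdfs dfs -mkdir -p /demo
hdfs dfs -put /etc/hosts /demo/hosts
hdfs dfs -ls /demo
hdfs dfs -cat /demo/hosts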

Procedure

Perform the following steps to build a Hadoop pseudo-distributed environment on the ECS instance:

  1. Step 1: Install Java Development Kit (JDK)
  2. Step 2: Install Hadoop
  3. Step 3: Configure Hadoop
  4. Step 4: Configure Password-free SSH Logon
  5. Step 5: Start Hadoop

Step 1: Install Java Development Kit (JDK)

1.  Connect to the created ECS instance. For more information, see Connection methods.

2.  Run the following command to download the JDK 1.8 installation package:

wget https://download.java.net/openjdk/jdk8u41/ri/openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz

3.  Run the following command to decompress the downloaded JDK 1.8 installation package:

tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz

4.  Run the following command to move the extracted JDK folder to /usr and rename it. In this example, the folder is renamed java8. You can specify a different name based on your business requirements.

mv java-se-8u41-ri/ /usr/java8

5.  Run the following commands to configure the Java environment variables. If you specified a name other than java8 in the previous step, replace java8 in the following commands with the actual folder name:

echo 'export JAVA_HOME=/usr/java8' >> /etc/profile
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> /etc/profile
source /etc/profile

6.  Run the following command to check whether Java is installed:

java -version

A command output similar to the following one indicates that JDK 1.8 is installed:

openjdk version "1.8.0_41"
OpenJDK Runtime Environment (build 1.8.0_41-b04)
OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)

Step 2: Install Hadoop

1.  Run the following command to download the Hadoop installation package:

wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz

2.  Run the following commands to decompress the Hadoop installation package and move it to the /opt/hadoop path:

tar -zxvf hadoop-2.10.1.tar.gz -C /opt/
mv /opt/hadoop-2.10.1 /opt/hadoop

3.  Run the following commands to configure the Hadoop environment variables:

echo 'export HADOOP_HOME=/opt/hadoop/' >> /etc/profile
echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> /etc/profile
echo 'export PATH=$PATH:$HADOOP_HOME/sbin' >> /etc/profile
source /etc/profile

4.  Run the following commands to set JAVA_HOME in the yarn-env.sh and hadoop-env.sh configuration files:

echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/yarn-env.sh
echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/hadoop-env.sh

5.  Run the following command to check whether Hadoop is installed:

hadoop version

A command output similar to the following one indicates that Hadoop is installed:

Hadoop 2.10.1
Subversion https://github.com/apache/hadoop -r 1827467c9a56f133025f28557bfc2c562d78e816
Compiled by centos on 2020-09-14T13:17Z
Compiled with protoc 2.5.0
From source with checksum 3114edef868f1f3824e7d0f68be03650
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.10.1.jar

Step 3: Configure Hadoop

1.  Modify the core-site.xml configuration file of Hadoop.

a) Run the following command to open the file in the vim editor:

vim /opt/hadoop/etc/hadoop/core-site.xml

b) Press the i key to enter Insert mode.

c) Add the following content between the <configuration> and </configuration> tags:

<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/tmp</value>
    <description>location to store temporary files</description>
</property>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

d) Press the Esc key to exit Insert mode, and then enter :wq to save and close the file.
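
After you save the file, you can confirm that the values take effect with the hdfs getconf utility. This is a quick check that reads the configuration files under /opt/hadoop/etc/hadoop:

hdfs getconf -confKey fs.defaultFS    # expected: hdfs://localhost:9000
hdfs getconf -confKey hadoop.tmp.dir  # expected: file:/opt/hadoop/tmp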

2.  Modify the Hadoop configuration file hdfs-site.xml.

a) Run the following command to open the file in the vim editor:

vim /opt/hadoop/etc/hadoop/hdfs-site.xml

b) Press the i key to enter Insert mode.

c) Add the following content between the <configuration> and </configuration> tags:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/tmp/dfs/data</value>
</property>

d) Press the Esc key to exit Insert mode, and then enter :wq to save and close the file.
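
dfs.replication is set to 1 because a pseudo-distributed environment runs a single DataNode, so blocks cannot be replicated across nodes. As in the previous step, you can verify the values after saving the file:

hdfs getconf -confKey dfs.replication        # expected: 1
hdfs getconf -confKey dfs.namenode.name.dir  # expected: file:/opt/hadoop/tmp/dfs/name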

Step 4: Configure Password-free SSH Logon

1.  Run the following command to create a public key and a private key:

ssh-keygen -t rsa

A command output similar to the following one indicates that the public and private keys are created:

[root@iZbp1chrrv37a2kts7sydsZ ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:gjWO5mgARst+O5VUaTnGs+LxVhfmCJnQwKfEBTro2oQ root@iZbp1chrrv37a2kts7s****
The key's randomart image is:
+---[RSA 2048]----+
| .  o+Bo=        |
|o o .+.#   o     |
|.= o..B = + .    |
|=.  oO.o o o     |
|Eo..=o* S .      |
|.+.+o. +         |
|. +o. .          |
| .  .            |
|                 |
+----[SHA256]-----+

2.  Run the following commands to append the public key to the authorized_keys file:

cd ~/.ssh
cat id_rsa.pub >> authorized_keys
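
sshd accepts the authorized_keys file only when its permissions are strict, so depending on the defaults of your image you may need to tighten them. The last command below also tests that password-free logon works; enter yes if a host-key prompt appears on the first connection.

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ssh localhost 'echo password-free logon works'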

Step 5: Start Hadoop

1.  Run the following command to initialize the NameNode. (In Hadoop 2.x, hadoop namenode -format is deprecated in favor of hdfs namenode -format, but both work.)

hadoop namenode -format
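
If the format succeeds, the NameNode metadata directory configured in Step 3 is populated. Listing it is a quick sanity check; exact file names may vary by version, but a VERSION file should be present.

ls /opt/hadoop/tmp/dfs/name/current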

2.  Run the following commands in sequence to start Hadoop:

start-dfs.sh

At the prompts that appear, enter yes.

start-yarn.sh

A command output similar to the following one is returned.

[root@iZbp1chrrv37a2kts7s**** .ssh]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn-root-resourcemanager-iZbp1chrrv37a2kts7sydsZ.out
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-iZbp1chrrv37a2kts7sydsZ.out

3.  Run the following command to view the processes that are started:

jps

The following processes are started:

[root@iZbp1chrrv37a2kts7s**** .ssh]# jps
11620 DataNode
11493 NameNode
11782 SecondaryNameNode
11942 ResourceManager
12344 Jps
12047 NodeManager
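
Before moving on, you can optionally run the MapReduce example jar that ships with Hadoop as an end-to-end test. The following sketch counts words in the Hadoop configuration files; the jar path matches the Hadoop 2.10.1 layout installed in Step 2, and the /user/root paths assume that you are logged on as root.

# Create an input directory in HDFS, upload sample files, run wordcount, and view the result.
hdfs dfs -mkdir -p /user/root/input
hdfs dfs -put /opt/hadoop/etc/hadoop/*.xml /user/root/input
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /user/root/input /user/root/output
hdfs dfs -cat /user/root/output/part-r-00000 | head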

4.  Use a browser to access http://<public IP address of the ECS instance>:8088 and http://<public IP address of the ECS instance>:50070. If the Hadoop pseudo-distributed environment is built, the YARN ResourceManager page and the HDFS NameNode page are displayed.

Note: Make sure that security group rules of the ECS instance allow inbound traffic to ports 8088 and 50070 used by Hadoop. Otherwise, the Hadoop pseudo-distributed environment cannot be accessed. For more information, see Add a security group rule.
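
You can also check locally on the instance that both web UIs are listening before you open them in a browser. This is a quick check that assumes the default ports:

curl -sI http://localhost:50070 | head -n 1  # HDFS NameNode web UI
curl -sI http://localhost:8088 | head -n 1   # YARN ResourceManager web UI; a 2xx or 3xx status means it is up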
