
How to Set Up a Single-Node Hadoop File System Cluster on Ubuntu

In this article, we will show you how to set up the Hadoop file system on a single-node cluster running Ubuntu.

Hadoop is a free, open-source, scalable, and fault-tolerant framework written in Java that provides an efficient way to run jobs across the nodes of a cluster. Hadoop has three main components: HDFS, MapReduce, and YARN.

Since Hadoop is written in Java, you will need to install Java on your server first. You can install the default JDK by running the following commands as root:

apt-get update
apt-get install default-jdk -y
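
To confirm that the JDK is available, a quick version check should be enough:

java -version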

Then you can create a dedicated user account for Hadoop and set up SSH key-based authentication for it, as sketched below.
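
The article does not spell out these commands; a minimal sketch, assuming the account is named hadoop and that a passphrase-less key is acceptable for this test setup, looks like this:

adduser hadoop                                   # create the dedicated user
su - hadoop                                      # switch to the hadoop user
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa         # generate a key pair without a passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # allow key-based logins to localhost
chmod 0600 ~/.ssh/authorized_keys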

Next, download the latest version of Hadoop from the official website and extract the downloaded archive.
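
For example, for the 3.1.0 release used in the rest of this article (the exact download URL is an assumption here; pick a current release and mirror from the Apache Hadoop downloads page):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz
tar -xzf hadoop-3.1.0.tar.gz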

Next, move the extracted directory to /opt/hadoop with the following command:

mv hadoop-3.1.0 /opt/hadoop

Next, change the ownership of the hadoop directory using the following command:

chown -R hadoop:hadoop /opt/hadoop/

Next, you will need to set and initialize the Hadoop environment variables (a sketch of a typical set follows the commands below). Then log in as the hadoop user and create the directories for HDFS storage:

mkdir -p /opt/hadoop/hadoopdata/hdfs/namenode
mkdir -p /opt/hadoop/hadoopdata/hdfs/datanode
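
The article does not list the exact environment variables. A typical set, assuming /opt/hadoop as above and the /usr/lib/jvm/default-java path that Ubuntu's default-jdk package provides, can be appended to the hadoop user's ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Reload the variables with source ~/.bashrc, and set the same JAVA_HOME in /opt/hadoop/etc/hadoop/hadoop-env.sh so the Hadoop startup scripts can find Java.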

First, you will need to edit the core-site.xml file. This file defines the default file system URI (the NameNode host and port) along with settings such as the read/write buffer size.

nano /opt/hadoop/etc/hadoop/core-site.xml

Make the following changes:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Save the file, then open the hdfs-site.xml file. This file contains the data replication factor and the local file system paths for the NameNode and DataNode directories.

nano /opt/hadoop/etc/hadoop/hdfs-site.xml

Make the following changes:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/hadoopdata/hdfs/namenode</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>

Save the file, then open the mapred-site.xml file. This file specifies which framework MapReduce jobs run on.

nano /opt/hadoop/etc/hadoop/mapred-site.xml

Make the following changes:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Save the file, then open the yarn-site.xml file:

nano /opt/hadoop/etc/hadoop/yarn-site.xml

Make the following changes:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Save and close the file when you are finished.

Hadoop is now installed and configured. It's time to initialize the HDFS file system. You can do this by formatting the NameNode as the hadoop user:

hdfs namenode -format

Next, change to the /opt/hadoop/sbin directory and start the Hadoop cluster using the following commands:

cd /opt/hadoop/sbin/

start-dfs.sh
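
Since mapred-site.xml and yarn-site.xml were configured above, you will usually also want to start the YARN daemons with the stock script in the same directory:

start-yarn.sh

The YARN ResourceManager web UI is then available on port 8088 by default.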

Next, check which Hadoop daemons are running using the following command:

jps
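
If everything started correctly, jps should list the Hadoop daemons, roughly like the output below (the process IDs will differ, and ResourceManager/NodeManager only appear if you started YARN):

12233 NameNode
12378 DataNode
12562 SecondaryNameNode
12711 ResourceManager
12859 NodeManager
13004 Jps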

Now that Hadoop is running, you can access its various services through a web browser. By default, the Hadoop NameNode web interface listens on port 9870. You can access it by visiting http://192.168.0.104:9870 in your web browser, replacing the IP address with your own server's address.

To test the Hadoop file system cluster, create a directory in HDFS and copy a file from the local file system into it. For details, you can go to How to Setup Hadoop Cluster Ubuntu 16.04.
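
As a quick smoke test (the /test directory and the sample file used here are just placeholders), run the following as the hadoop user:

hdfs dfs -mkdir /test
hdfs dfs -put /etc/hosts /test/
hdfs dfs -ls /test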

Related Blog Posts

Setup a Single-Node Hadoop Cluster Using Docker

Docker is a very popular containerization tool that lets you create containers in which an application runs together with the software and dependencies it needs.

Apache Hadoop is a core big data framework written in Java to store and process Big Data. The storage component of Hadoop is called the Hadoop Distributed File System (HDFS), and the processing component is called MapReduce. In addition, several daemons run inside a Hadoop cluster: the NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.

This article shows you how to set up Docker to be used to launch a single-node Hadoop cluster inside a Docker container on Alibaba Cloud.

Diving into Big Data: Hadoop User Experience

Hadoop User Experience (HUE) is an open-source web interface used for analyzing data with Hadoop ecosystem applications. Hue provides interfaces to interact with HDFS, MapReduce, Hive, and even Impala queries. In this article, we will explore how to access, browse, and interact with files in the Hadoop Distributed File System, and how using these tools can make that work simpler and easier.

Related Documentation

Back up an HDFS to OSS for disaster tolerance

Currently, many data centers are built on Hadoop, and an increasing number of enterprises want to smoothly migrate their services to the cloud.

Object Storage Service (OSS) is the most widely-used storage service on Alibaba Cloud. The OSS data migration tool, ossimport2, allows you to sync files from your local devices or a third-party cloud storage service to OSS. However, ossimport2 cannot read data from Hadoop file systems. As a result, it becomes impossible to make full use of the distributed structure of Hadoop. In addition, this tool only supports local files. Therefore, you must first download files from your Hadoop file system (HDFS) to your local device and then upload them using the tool. This process consumes a great deal of time and energy.

To solve this problem, Alibaba Cloud’s E-MapReduce team developed a Hadoop data migration tool emr-tools. This tool allows you to migrate data from Hadoop directly to OSS.

Configuring HDFS Reader

This topic describes how to configure HDFS Reader. HDFS Reader can read data stored in distributed file systems. At the underlying implementation level, HDFS Reader retrieves the data from the distributed file system, converts it into the Data Integration transport protocol, and passes it to a Writer.

Related Products

Data Integration

Data Integration is an all-in-one data synchronization platform. The platform supports online real-time and offline data exchange between all data sources, networks, and locations.

Data Integration leverages the computing capability of Hadoop clusters to synchronize the HDFS data from clusters to MaxCompute. This is called Mass Cloud Upload. Data Integration can transmit up to 5TB of data per day. The maximum transmission rate is 2 GB/s.

Object Storage Service

Alibaba Cloud Object Storage Service (OSS) is an encrypted, secure, cost-effective, and easy-to-use object storage service that enables you to store, back up, and archive large amounts of data in the cloud, with a guaranteed reliability of 99.999999999%. RESTful APIs allow storage and access to OSS anywhere on the Internet. You can elastically scale the capacity and processing capability, and choose from a variety of storage types to optimize the storage cost.
