Hadoop cluster building tutorial

Date: Oct 25, 2022

Related Tags:1. Setup Hadoop Cluster Ubuntu 16.04
2. Apsara File Storage for HDFS

Abstract: Hadoop is a distributed file system copied by Doug Cutting, the founder of Lucene, based on Google's related content and a basic framework system for analyzing and computing massive data, including MapReduce programs, hdfs systems, etc.! [It is inspired by Map/Reduce and Google File System (GFS) originally developed by Google Lab. ]

Introduction to hadoop

Hadoop is a distributed file system copied by Doug Cutting, the founder of Lucene, based on Google's related content and a basic framework system for analyzing and computing massive data, including MapReduce programs, hdfs systems, etc.! [It is inspired by Map/Reduce and Google File System (GFS) originally developed by Google Lab. ]

Hadoop implements a distributed file system (Hadoop Distributed File System), referred to as HDFS. HDFS has the characteristics of high fault tolerance and is designed to be deployed on low-cost hardware; and it provides high throughput to access application data, which is suitable for those with large data sets. set) application. HDFS relaxes (relax) POSIX requirements and can access data in the file system in the form of streaming (streaming access).

The core design of the Hadoop framework: HDFS and mapreduce

HDFS: Provide storage for massive data

MapReduce: Provides computing cluster for massive data: cluster

LB: load balancing


HA: High Availability

MHA, keepalived, hearebeat

HPC, Hadoop: high-volume computing-assisted storage and computing

What is Distributed: Decentralized

The advantages of Hadoop clusters

Hadoop is a software framework capable of distributed processing of large amounts of data. Hadoop processes data in a reliable, efficient, and scalable manner.

Hadoop is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of working data, ensuring that processing can be redistributed against failed nodes.

Hadoop is efficient because it works in parallel, speeding up processing through parallel processing

Hadoop is also scalable, capable of processing petabytes of data.

Convert PB-level data to G?



Hadoop relies on community services, so it's cheap and anyone can use it.

Hadoop is a distributed computing platform that can be easily architected and used by users. Users can easily develop and run applications that process massive amounts of data on Hadoop. It mainly has the following advantages:

High reliability: people can trust hadoop's ability to store and process data in bits

High scalability: There are many nodes, which is convenient for computing and distributing data.

What is a node?

A node is a term that refers to a class of devices. They can be hosts (pcs), servers, or switches, routers, firewalls, etc. that make up a transport network.

Efficiency: Hadoop can dynamically move data between nodes and ensure the dynamic balance of each node, so the processing speed is very fast.

Fault Tolerance: Hadoop can automatically save multiple copies of data and can automatically redistribute failed tasks.

What does RAID fault tolerance mean? Raid has no fault tolerance? Raid is almost fault-tolerant.

Low cost: Compared with all-in-one computers, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, hadoop is open source, so the software cost of the project will be greatly reduced

Concepts about hadoop

1. Distributed storage:

What are the linux storage?



namespace: In a distributed storage system, the data scattered in different nodes may belong to the same file. In order to organize many files, the files can be placed in different folders, and the folders can be contained level by level. We call this form of organization a namespace. Namespaces manage all files in the entire server cluster. The responsibility of the namespace is not the same as the responsibility of storing the real data. The node responsible for the namespace is called the master node, and the node responsible for storing the real data is called the slave node.

Master-slave node:

The master node is responsible for managing the file structure of the file system, and the slave nodes are responsible for storing real data, collectively known as master-slaves.

When users operate, they should also deal with the master node first, query data on those slave nodes, and then read data from the slave nodes. Sometimes, in order to speed up user access, the entire namespace information is stored in memory. The more files are stored, the more memory space our master node needs.

When opening a file, where is it loaded first?

Answer: memory

Why can't we open a 2T file with a laptop?

Answer: The memory is too small

2, Block

When storing data from a node, some original data files may be large, some may be small, and files of different sizes are not easy to manage, so an independent storage file unit, called a block, can be abstracted.

Question: If my hard disk has 500G, I still have 200G left, but when I create a file, it prompts me that the hard disk space is insufficient?

Answer: In general, it is because the inode number is insufficient

3. Disaster recovery

The data is stored in the cluster, and access may fail due to network reasons or server hardware reasons. It is best to use the replication mechanism to back up the data to multiple servers at the same time, so that the data is safe, and the probability of data loss or access failure just small.

4. Disaster recovery in different places?

Answer: In different regions, build one or more sets of the same application or database to take over immediately after a disaster

In hadoop, the distributed storage system is called HDFS (hadoop distributed file system). Among them, the master node is called the name node (namenode), the slave node is called the data node (datanode)

Distributed computing

When processing data, we will read the data into memory for processing. If we process massive data, for example, the data size is 100GB, we need to count the total number of words in the file. It is almost impossible to load all data into memory, which is called moving data.

So is it possible to put the program code on the server where the data is stored? Because the program code is generally small and almost negligible compared with the original data, the time for transmitting the original data is saved. Now, data is stored in a distributed file system, 100GB of data may be stored on many servers, then the program code can be distributed to these servers, and executed on these servers at the same time, that is, parallel computing, which is also distributed calculate. This greatly reduces the execution time of the program. We refer to the way that the program code executes the computation on the machine where the data node is moved as mobile computation.

What distributed computing needs is the final result, and the program code will produce many results after parallel execution on many machines, so there needs to be a piece of code to summarize these intermediate results. Distributed computing in Hadoop is generally done in two stages.

The first stage is responsible for reading the original data in each data node, performing preliminary processing, and calculating the number of words in the data in each node. Then the processing results are transmitted to the second stage, and the results of each node are aggregated to produce the final result.

In hadoop, the distributed computing part is called MapReduce.

MapReduce is a programming model for parallel operations on large datasets (greater than 1TB). The concepts "Map" and "Reduce", and their main ideas, are both borrowed from functional programming languages ​​and features borrowed from vector programming languages. It greatly facilitates programmers to run their own programs on distributed systems without distributed parallel programming.

Distributed computing role

Master node: job node (jobtracker)

Slave node: task node (tasktracker)

Among the task nodes, the code that runs the first stage is called a map task, and the code that runs the second stage is called a reduce task.


1) hadoop: apache open source distributed framework

2) HDFS: Hadoop's distributed file system

3) NameNode: Hadoop HDFS metadata master node server, responsible for saving datenode file storage metadata information, this server is a single point.

4) obtracker: Hadoop's map/reduce scheduler, which is responsible for communicating with task nodes, distributing calculations and tracking task progress. This server is also a single point.

5) DataNode: Hadoop's data node, responsible for storing data

6) tasktracker: the scheduling degree of hadoop, responsible for the start and execution of map and reduce tasks

hadoop cluster building

1) Environment

Configure IP, close iptables, close selinux, configure hosts

2) Create a normal user

A common user, hadoop, should be created on the three servers, and the configuration password: 123456

[root@chenc01 ~]# useradd -u 8000 hadoop ; echo 123456 | passwd --stdin hadoop
Change the password for user hadoop .
passwd: All authentication tokens have been successfully updated.

3) set namenode

Set up namenode to be able to log in to the other two servers without a key

4) Install jdk

Question: What else can source be used for in the database?

Answer: import

5) Install java/jdk on the other two nodes

6) Install namenode

Hadoop installation directory: /home/hadoop/hadoop-3.13 Use root account to upload hadoop-3.1.3.tar.gz to the server and put it under /home/hadoop!

* create dfs and tmp
*Modify files
Remarks: This is the core configuration of hadoop. Two properties need to be configured here, fs.default.name configures the HDFS system command of hadoop, the location is port 9000 of the host, and hadoop.tmp.dir configures the root location of haddop's tmp directory.
Note: The main configuration file of HDFS, dfs.http.address configures the http access location of hdfs;

The copy of the dfs.replication configuration file, generally not greater than the number of slaves.

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us