Community Blog

Introduction to Hadoop in Big Data Ecosystem

In this tutorial, you will learn about Hadoop's role in the big data ecosystem and how to configure it.

Whenever we talk about big data, it is not uncommon to hear the name Hadoop.

Hadoop is an open-source framework that manages distributed storage and data processing for big data applications running on clusters. It is mainly used for batch processing. The core components of Apache Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce.

Since the data is large, Hadoop splits files into blocks and distributes them across the nodes in a cluster. Each block is replicated on a small number of nodes (three by default), so the data survives the failure of any single node without every node having to hold a complete copy.
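A minimal sketch of this splitting-and-replication idea, using a made-up tiny block size and hypothetical node names (the real HDFS defaults are a 128 MB block size and a replication factor of 3):

```python
# Illustrative sketch, not real HDFS code: split a file into fixed-size
# blocks and place each block on a small subset of the cluster's nodes.
BLOCK_SIZE = 4    # bytes, tiny for illustration; HDFS defaults to 128 MB
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into consecutive blocks of at most block_size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin style."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop!")
layout = place_blocks(blocks, nodes=["node1", "node2", "node3", "node4", "node5"])
print(len(blocks))   # 4 blocks (13 bytes in blocks of up to 4)
print(layout[0])     # ['node1', 'node2', 'node3']
```

Note that each block lives on only three of the five nodes; a real NameNode also considers rack topology and free space when choosing replica locations.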

  1. HDFS - It is the primary storage system used by Hadoop applications. HDFS is a distributed file system that stores files as data blocks and replicates them across other nodes.
  2. MapReduce – MapReduce reads input data from HDFS and first splits it into independent parts, so that processing can run on all the parts simultaneously. This is what we call distributed processing.
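The map and reduce phases can be sketched with a toy word count in plain Python. This runs in a single process; a real MapReduce job distributes the same two phases across the cluster's nodes:

```python
from collections import defaultdict

# Toy word count illustrating the two MapReduce phases on in-memory data.
def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

data = ["big data", "big hadoop data"]
print(reduce_phase(map_phase(data)))  # {'big': 2, 'data': 2, 'hadoop': 1}
```

Because each (word, 1) pair is independent, the map phase can run on every data block in parallel, and the shuffle step routes all pairs with the same key to the same reducer.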

When you want to get data into a big data environment, you can try the following:

  1. Sqoop - The name Sqoop is derived from "SQL + Hadoop", which reflects its purpose: transferring data between Hadoop and relational database servers. When data is structured and arrives in batches, you can use Sqoop as a loading tool to push it into Hadoop.
  2. Apache Flume - A data-flow service used for efficiently collecting, aggregating, and moving large amounts of streaming data into Hadoop.
  3. Kafka - A distributed messaging system used to ingest real-time streaming data for real-time analysis. When data is unstructured and streaming, Kafka and Flume together form the processing pipeline.
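The batch-versus-streaming distinction behind these tools can be illustrated with a small Python sketch. The function names and the event source below are hypothetical stand-ins, not real Sqoop or Kafka APIs:

```python
import time

# Hypothetical sketch contrasting the two ingestion styles described above.
def batch_ingest(table_rows):
    """Sqoop-style: load a complete relational-table snapshot at once."""
    return list(table_rows)

def stream_ingest(event_source, limit):
    """Flume/Kafka-style: consume events one at a time as they arrive."""
    ingested = []
    for event in event_source:
        ingested.append(event)
        if len(ingested) >= limit:
            break
    return ingested

def sensor_events():
    """A stand-in for an unbounded stream of log or sensor events."""
    n = 0
    while True:
        yield {"id": n, "ts": time.time()}
        n += 1

print(len(batch_ingest([{"id": 1}, {"id": 2}])))      # 2
print(len(stream_ingest(sensor_events(), limit=3)))   # 3
```

The batch path has a known, finite input, so it suits periodic bulk loads; the streaming path never "finishes", which is why a long-running pipeline like Flume or Kafka is needed to keep pushing events into Hadoop.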

Hadoop also plays an important part in the storage and processing of data. For more information, see Drilling into Big Data – A Gold Mine of Information (1).

Related Blog Posts

Setup a Single-Node Hadoop Cluster Using Docker

In this article, we will look at how to use Docker to launch a single-node Hadoop cluster inside a container on an Alibaba Cloud ECS instance.

Apache Hadoop is a core framework for storing and processing big data. The storage component of Hadoop is called the Hadoop Distributed File System (usually abbreviated HDFS), and the processing component is called MapReduce. Several daemons run inside a Hadoop cluster: NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.
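As a quick reference, the responsibilities of these daemons can be summarized in a small table; the one-line role descriptions are a summary from general Hadoop knowledge, not from the article itself:

```python
# Summary of the daemons in a Hadoop cluster and what each is responsible for.
HADOOP_DAEMONS = {
    "NameNode":           "HDFS master: tracks file-system metadata and block locations",
    "DataNode":           "HDFS worker: stores and serves the actual data blocks",
    "Secondary NameNode": "periodically merges the NameNode's edit log into a checkpoint",
    "ResourceManager":    "YARN master: allocates cluster resources to applications",
    "NodeManager":        "YARN worker: runs and monitors containers on one node",
}

for daemon, role in HADOOP_DAEMONS.items():
    print(f"{daemon}: {role}")
```

In a single-node Docker setup like the one described above, all five daemons run inside the same container, which is convenient for learning but removes the fault tolerance a multi-node cluster provides.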

How to Connect Tableau to MaxCompute Using HiveServer2 Proxy

Nowadays, customers are using different BI tools to gain new insights into their treasure trove of data, and Tableau is one of the most popular tools on the market for enterprises. In this tutorial, we will demonstrate how to connect Tableau to Alibaba Cloud's most advanced big data platform, MaxCompute.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.

Related Documentation

Use ES-Hadoop on E-MapReduce

ES-Hadoop is a tool provided by Elasticsearch (ES) for connecting ES with the Hadoop ecosystem. It enables users to process data stored in ES with tools such as MapReduce (MR), Spark, and Hive. (ES-Hadoop also supports taking snapshots of ES indices and storing them in HDFS, but that is not discussed in this topic.)

Construct a data warehouse by using OSS and MaxCompute

This topic describes how to use MaxCompute to construct a PB-scale data warehouse based on OSS. By using MaxCompute to analyze the massive amounts of data stored in OSS, you can complete your big data analysis tasks in minutes and explore data value more efficiently.

OSS provides three storage classes: Standard, Infrequent Access, and Archive, which are suitable for different data access scenarios. OSS can be used together with open-source Hadoop ecosystem products and multiple Alibaba Cloud products, such as EMR, BatchCompute, MaxCompute, the machine learning platform PAI, Data Lake Analytics, and Function Compute.

Related Products


EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.


MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.

Alibaba Clouder