
Introduction to Hadoop in the Big Data Ecosystem

In this tutorial, you will learn about Hadoop's role in the big data ecosystem and how to configure it.

Whenever we talk about big data, it is not uncommon to hear the name Hadoop.

Hadoop is an open-source framework that manages distributed storage and data processing for big data applications running on clusters. It is mainly used for batch processing. The core components of Apache Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce.

Since the data is large, Hadoop splits files into blocks and distributes them across the nodes in a cluster. Each block is replicated on several nodes (three by default), so the data survives the failure of any individual node.

  1. HDFS – The primary storage system used by Hadoop applications. HDFS is a distributed file system that stores files as data blocks and replicates them across other nodes.
  2. MapReduce – MapReduce reads data from HDFS and first splits the input into independent chunks, so that processing can run on all the parts simultaneously, which we call distributed processing (see the sketch below).
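To make the model concrete, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output HDFS paths are placeholders supplied on the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each mapper works on one input split (typically one HDFS block),
    // so all splits are processed in parallel across the cluster.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reducer receives all the counts emitted for a given word and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each mapper processes its split independently, and the reducers aggregate the per-word counts afterwards, which is exactly the distributed-processing pattern described above.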

When you want to get data into a big data environment, you can try the following:

  1. Sqoop – The name Sqoop is derived from "SQL + Hadoop," which reflects its purpose: transferring data between Hadoop and relational database servers. When the data is structured and arrives in batches, you can use Sqoop as a loading tool to push it into Hadoop.
  2. Apache Flume – A distributed service for efficiently collecting, aggregating, and pushing large amounts of streaming data into Hadoop.
  3. Kafka – A distributed messaging system used to ingest real-time streaming data for real-time analysis. When data is unstructured and streaming, Kafka and Flume together form the ingestion pipeline (see the producer sketch after this list).
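On the streaming side, here is a minimal sketch of a publisher using the official Kafka Java client. The broker address (localhost:9092) and the topic name (clickstream) are placeholder assumptions; a downstream consumer such as a Flume agent or a stream-processing job would read from the topic.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; point this at your Kafka cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "clickstream" is a hypothetical topic used for illustration.
            producer.send(new ProducerRecord<>("clickstream", "user-42", "page_view"));
        }
    }
}
```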

Hadoop also plays an important part in the storage and processing of this data. For more information, see Drilling into Big Data – A Gold Mine of Information (1).

Related Blog Posts

Setup a Single-Node Hadoop Cluster Using Docker

In this article, we will look at how to use Docker to launch a single-node Hadoop cluster inside a container on an Alibaba Cloud ECS instance.

Apache Hadoop is a core framework for storing and processing big data. The storage component of Hadoop is the Hadoop Distributed File System (usually abbreviated as HDFS), and the processing component is MapReduce. In addition, several daemons run inside a Hadoop cluster: the NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.
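To illustrate how a client interacts with these daemons, the sketch below writes and then reads a file through the HDFS Java API. The NameNode address (hdfs://localhost:9000) is an assumption that matches a typical single-node setup; adjust it to your cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for a single-node cluster; adjust as needed.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/tmp/hello.txt");

            // On write, the client asks the NameNode where to place blocks,
            // then streams the bytes to the DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello from hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // On read, the NameNode supplies block locations and the
            // DataNodes serve the bytes themselves.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```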

How to Connect Tableau to MaxCompute Using HiveServer2 Proxy

Nowadays, customers use various BI tools to gain new insights from their treasure trove of data, and Tableau is one of the most popular tools on the market for enterprises. In this tutorial, we will demonstrate how to connect Tableau to Alibaba Cloud's most advanced big data platform, MaxCompute.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.
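As an illustration, here is a minimal sketch that runs a HiveQL query against HiveServer2 over JDBC. The host, port, credentials, and the web_logs table are placeholder assumptions, not part of the original setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder host, port, database, and credentials.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // "web_logs" is a hypothetical table used for illustration.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

Behind this SQL-like interface, Hive compiles the query into jobs that run over the distributed data, so no low-level MapReduce code has to be written by hand.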

Related Documentation

Use ES-Hadoop on E-MapReduce

ES-Hadoop is a connector, provided by Elasticsearch (ES), that links ES to the Hadoop ecosystem. It enables users to use tools such as MapReduce (MR), Spark, and Hive to process data in ES. (ES-Hadoop also supports taking snapshots of ES indices and storing them in HDFS, which is not discussed in this topic.)
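As a rough sketch of the MapReduce integration, the job below reads documents from an ES index through ES-Hadoop's EsInputFormat. The ES endpoint (es.nodes) and the index name in es.resource are placeholder assumptions; check the ES-Hadoop documentation for the exact configuration keys and resource format supported by your version.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.elasticsearch.hadoop.mr.EsInputFormat;

public class EsReadJob {

    // The mapper receives the document id as the key and the document
    // fields as a MapWritable; here it simply dumps them as text.
    public static class DumpMapper extends Mapper<Text, MapWritable, Text, Text> {
        @Override
        protected void map(Text key, MapWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(value.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder ES endpoint and index; adjust for your cluster.
        conf.set("es.nodes", "localhost:9200");
        conf.set("es.resource", "logs/doc"); // index/type; format depends on ES version

        Job job = Job.getInstance(conf, "read from elasticsearch");
        job.setJarByClass(EsReadJob.class);
        job.setInputFormatClass(EsInputFormat.class);
        job.setMapperClass(DumpMapper.class);
        job.setNumReduceTasks(0); // map-only dump
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```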

Construct a data warehouse by using OSS and MaxCompute

This topic describes how to use MaxCompute to construct a petabyte-scale data warehouse based on OSS. By using MaxCompute to analyze the massive amounts of data stored in OSS, you can complete big data analysis tasks in minutes and extract value from your data more efficiently.

OSS provides three storage classes: Standard, Infrequent Access, and Archive, which are suitable for different data access scenarios. OSS can be used together with open-source Hadoop community products and multiple Alibaba Cloud products, such as EMR, BatchCompute, MaxCompute, the machine learning platform PAI, Data Lake Analytics, and Function Compute.
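As a sketch of how such analysis could be driven programmatically, the example below submits a query through the MaxCompute Java SDK. The endpoint, project, credentials, and the oss_logs table (imagined here as a table over data loaded from OSS) are all placeholder assumptions.

```java
import com.aliyun.odps.Instance;
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.task.SQLTask;

public class MaxComputeQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials and project; supply your own.
        Account account = new AliyunAccount("<access_id>", "<access_key>");
        Odps odps = new Odps(account);
        odps.setEndpoint("http://service.odps.aliyun.com/api"); // assumed endpoint
        odps.setDefaultProject("my_project");

        // "oss_logs" is a hypothetical table used for illustration.
        Instance instance = SQLTask.run(odps, "SELECT COUNT(*) FROM oss_logs;");
        instance.waitForSuccess();

        for (Record record : SQLTask.getResult(instance)) {
            System.out.println(record.get(0));
        }
    }
}
```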

Related Products

E-MapReduce

EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.

MaxCompute

MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.
