A closer look at Hadoop: architecture, component features, and significance for big data
Posted time: Apr 21, 2017 11:24 AM
Today, Apache Hadoop has become the driving force behind the development of the big data industry. Technologies such as Hive and Pig are often mentioned alongside it. But what do they actually do, and why do they have such strange names (Oozie, ZooKeeper, Flume)?
Hadoop delivers the ability to handle big data at lower cost (big data volumes typically range from 10 GB to 100 GB or more, with diverse data types including structured and unstructured data). But how does it differ from earlier products?
Today's enterprise data warehouses and relational databases excel at processing structured data and can store very large volumes of it, but at considerable cost. Their schema requirements limit the kinds of data they can process, and this inertia also hampers agile exploration when facing massive heterogeneous data. The result is often that valuable data sources within the organization are never tapped. This is the biggest difference between Hadoop and traditional data-processing approaches.
This article focuses on the components of the Hadoop system and explains the functions of various components.
MapReduce – core of Hadoop
MapReduce worked quietly in the background alongside the algorithms powering Google's web search engine, and the MapReduce framework has since become the most influential "engine" behind today's big data processing. Beyond Hadoop, you will also find the MapReduce model in MPP column-store databases (such as Sybase IQ and Vertica) and in NoSQL systems (such as MongoDB).
An important innovation of MapReduce is that it decomposes a query over a large data set and processes the pieces on multiple nodes. When the data is too large for a single server, the advantage of distributed computing becomes apparent; combined with commodity Linux servers, this yields an extremely cost-effective alternative to large-scale computing arrays. Yahoo saw Hadoop's future potential in 2006 and hired its founder, Doug Cutting, to develop the technology. By 2008 Hadoop had grown significantly, and on its way from initial development to maturity the project absorbed other components that further improved its ease of use and functionality.
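The decomposition described above can be sketched in a few lines of plain Python. This is an illustrative simulation of the MapReduce programming model, not the Hadoop API: a map function emits key-value pairs, a shuffle step groups them by key, and a reduce function aggregates each group — in a real cluster, each phase runs in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values that share one key.
    return key, sum(values)

lines = ["big data needs big clusters", "hadoop handles big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # "big" appears 3 times across both lines
```

The same three-stage shape — map, shuffle, reduce — underlies every MapReduce job; only the map and reduce functions change from task to task.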
HDFS and MapReduce
We have already discussed MapReduce's ability to distribute big data processing tasks across multiple servers. For distributed computing, each server must be able to access the data — and that is the role played by HDFS (Hadoop Distributed File System).
The combination of HDFS and MapReduce is powerful. During big data processing, the calculation does not terminate when a server in the Hadoop cluster fails; at the same time, HDFS guarantees data redundancy, so a fault or error anywhere in the cluster does not lose data. When the calculation completes, the result is written back to a node in HDFS. HDFS has no strict requirements on stored data formats — data can be unstructured or of any other kind. By contrast, relational databases require data to be structured and a schema to be defined before the data is stored.
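The redundancy guarantee can be pictured concretely: each file is split into fixed-size blocks, and each block is copied to several different nodes (three by default in HDFS), so no single node failure loses data. A toy sketch — the block size, node names, and round-robin placement policy here are invented for illustration, not HDFS's actual placement algorithm:

```python
import itertools

BLOCK_SIZE = 8     # toy value; real HDFS blocks default to 128 MB
REPLICATION = 3    # HDFS keeps 3 copies of each block by default

def split_into_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replicas=REPLICATION):
    # Round-robin placement: each block lands on `replicas` distinct nodes.
    ring = itertools.cycle(nodes)
    return {i: [next(ring) for _ in range(replicas)]
            for i in range(len(blocks))}

data = b"an unstructured log line stored as raw bytes"
blocks = split_into_blocks(data)
layout = place_blocks(blocks, ["node1", "node2", "node3", "node4"])

# Any single node can fail and every block still has surviving copies.
for replicas in layout.values():
    survivors = [n for n in replicas if n != "node2"]
    assert survivors  # at least one copy outlives the failed node
```

Note that HDFS is also format-agnostic in exactly the way this sketch suggests: blocks are opaque bytes, and interpreting them is left to the code that reads them.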
It is the code developers write that makes the data meaningful. Programming Hadoop at the MapReduce level means working with Java APIs and loading data files into HDFS manually.
Pig and Hive
For developers, direct use of the Java APIs can be tedious and error-prone, and it also restricts Hadoop programming to Java programmers. To address this, Hadoop offers two solutions that make Hadoop programming easier.
• Pig is a programming language that simplifies common Hadoop tasks. Pig can load data, express data transformations, and store the final results. Pig's built-in operations make sense of semi-structured data (such as log files), and the language is extensible: custom data types and transformations can be added in Java.
• Hive plays the role of a data warehouse in Hadoop. Hive superimposes structure on data in HDFS and allows the data to be queried with an SQL-like syntax. Like Pig, Hive's core functionality is extensible.
Pig and Hive are often confused. Hive is better suited to data warehouse tasks; it is mainly used with static structures and work that requires frequent analysis. Hive's closeness to SQL makes it an ideal point of integration between Hadoop and other BI tools. Pig gives developers more flexibility over large data sets and allows concise scripts for transforming data streams to be embedded in larger applications. Pig is comparatively lightweight, and its main advantage is that it slashes the amount of code needed compared with using Hadoop's Java APIs directly. Because of this, Pig continues to attract many software developers.
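The code-reduction claim is easy to feel even in miniature: the word-count task classically used to demonstrate MapReduce — explicit map, shuffle, and reduce functions — collapses into a single grouping expression at the level of abstraction Pig Latin's GROUP/COUNT or HiveQL's GROUP BY provides. This is a Python analogy only, not actual Pig or Hive syntax:

```python
from collections import Counter

# One declarative expression replaces hand-written map, shuffle,
# and reduce stages -- the kind of compression Pig and Hive offer
# over raw MapReduce code.
lines = ["big data needs big clusters", "hadoop handles big data"]
counts = Counter(word for line in lines for word in line.split())
print(counts["big"])
```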
Improve data access: HBase, Sqoop and Flume
At its core, Hadoop remains a batch processing system: data is loaded into HDFS, processed, and then retrieved. Computationally this is something of a step backward, because interactive and random access to data is often necessary. HBase, a column-oriented database that runs on top of HDFS, fills this gap. HBase is modeled on Google's BigTable, and the project's goal is to quickly locate and serve the required data among billions of rows hosted in the cluster. HBase uses MapReduce to process its internal mass of data, and both Hive and Pig can be used in combination with HBase: they provide high-level language support for it, which makes statistical processing over HBase data easy.
However, HBase accepts some restrictions in exchange for random access to data. For example, Hive running over HBase is four to five times slower than native Hive on HDFS. HBase can store petabytes of data, while in comparison HDFS's capacity limit reaches 30 PB. HBase is not suited to ad-hoc analysis; it is better suited to serving big data as part of large-scale applications, including logs, counters, and time-series data.
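The difference between HDFS-style scans and HBase-style random access comes down to keeping rows sorted by row key: a point lookup can then use binary search over the keys instead of reading everything. A minimal sketch — the row keys, column names, and table contents below are invented for illustration, and real HBase regions implement this with on-disk structures rather than in-memory lists:

```python
import bisect

# Rows kept sorted by row key, as an HBase region stores them.
row_keys = ["user#0001", "user#0002", "user#0005", "user#0009"]
rows = {
    "user#0001": {"info:name": "ann"},
    "user#0002": {"info:name": "bob"},
    "user#0005": {"info:name": "cho"},
    "user#0009": {"info:name": "dee"},
}

def get(key):
    # Binary search over the sorted keys: O(log n) instead of a full scan.
    i = bisect.bisect_left(row_keys, key)
    if i < len(row_keys) and row_keys[i] == key:
        return rows[key]
    return None

print(get("user#0005"))  # found without scanning every row
```

A batch system answers "summarize all rows" cheaply; this layout answers "fetch this one row" cheaply — which is exactly the trade HBase makes on top of HDFS.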
Data access and output
Sqoop and Flume, among other tools, improve data interoperability. Sqoop's main function is to import data from relational databases into Hadoop; the data can be imported directly into HDFS or into Hive. Flume is designed to import streaming data or log data directly into HDFS.
Hive's user-friendly SQL querying makes it an ideal integration point for multiple databases, with database tools connecting via JDBC or ODBC drivers.
ZooKeeper and Oozie in charge of coordinating workflows
As more and more projects join the Hadoop family and become part of cluster operation, a big data processing system needs a way to coordinate its working members. As the number of computing nodes grows, cluster members must synchronize with one another and learn where to access services and how to obtain configuration. ZooKeeper was born for exactly this.
During task execution, Hadoop sometimes needs to chain multiple Map/Reduce jobs together, and batch dependencies may exist between them. The Oozie component manages such workflows and dependencies, relieving developers of writing custom solutions.
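What Oozie does for chained jobs can be sketched as a dependency-ordering problem: each job lists its prerequisites, and the scheduler runs a job only after everything it depends on has finished. The job names below are invented for illustration, and real Oozie workflows are defined declaratively (in XML) rather than in code:

```python
def run_workflow(deps):
    """Run jobs in dependency order (a simple topological sort)."""
    done, order = set(), []
    def run(job):
        if job in done:
            return
        for prereq in deps[job]:   # finish prerequisites first
            run(prereq)
        done.add(job)
        order.append(job)
    for job in deps:
        run(job)
    return order

# The final report needs both Map/Reduce jobs, which both need the import.
deps = {
    "import": [],
    "mr-clean": ["import"],
    "mr-aggregate": ["import"],
    "report": ["mr-clean", "mr-aggregate"],
}
order = run_workflow(deps)
print(order)
```

(A production scheduler like Oozie also handles failures, retries, and time- or data-triggered starts; the sketch shows only the ordering core.)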
Ambari is the newest project to join the Hadoop family, aiming to bring core functions such as monitoring and management into the Hadoop project. Ambari helps system administrators deploy and configure Hadoop, upgrade clusters, and monitor services; it can also integrate with other system-management tools through its APIs.
Apache Whirr is a set of libraries for running services (including Hadoop) on cloud platforms. Whirr is largely provider-neutral and currently supports both Amazon EC2 and Rackspace.
Machine learning: Mahout
Different organizations have different requirements, which leads to data of varying forms, and analyzing that data calls for a variety of methods. Mahout provides scalable implementations of classic machine-learning algorithms, aiming to help developers create intelligent applications more quickly and easily. Mahout includes many implementations, covering clustering, classification, recommendation and filtering, and frequent itemset mining.
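Clustering — one of the algorithm families listed above — can be illustrated in miniature with k-means: points are assigned to their nearest centroid, each centroid then moves to the mean of its points, and the two steps repeat until stable. This is a toy one-dimensional version with made-up data; Mahout's implementations are distributed versions of such algorithms, built to run over large data sets:

```python
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two obvious groups around 1.0 and 9.5; the centroids converge to them.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
result = kmeans_1d(points, centroids=[0.0, 5.0])
print(result)
```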
In general, Hadoop is applied in distributed environments. Much as with Linux, vendors integrate and test the components of the Apache Hadoop ecosystem and add their own tools and management features.