Whenever we talk about big data, it is common to hear the name Hadoop.
Hadoop is an open-source framework that manages distributed storage and data processing for big data applications running in clusters. It is mainly used for batch processing. The core parts of Apache Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce.
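Because the data is large, Hadoop splits files into blocks and distributes them across the nodes in a cluster. Each block is replicated on several nodes (three by default in HDFS), so no single node needs to hold the whole dataset, and the data survives the loss of a node.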
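As a rough illustration of block splitting and distribution, the following Python sketch divides a file into fixed-size blocks and assigns each block to a few nodes. It assumes a 128 MiB block size and a replication factor of 3 (the usual HDFS defaults); the node names and the round-robin placement are hypothetical simplifications, since real HDFS placement is rack-aware.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed: 128 MiB, the usual HDFS default
REPLICATION = 3                 # assumed: the usual HDFS default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than block_size
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified on purpose: this is not the rack-aware policy HDFS uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MiB file
    for block_id, holders in place_replicas(len(blocks), nodes).items():
        print(block_id, blocks[block_id], holders)
```

In this sketch a 300 MiB file becomes three blocks, and each block lives on three of the five nodes, so every node stores only part of the file.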
When you want to get data into a big data environment, you can try the following:
Hadoop also plays an important part in storing and processing the data. For more information, see Drilling into Big Data – A Gold Mine of Information (1).
In this article, we will look at how to set up Docker and use it to launch a single-node Hadoop cluster inside a Docker container on an Alibaba Cloud ECS instance.
Apache Hadoop is a core big data framework for storing and processing big data. The storage component of Hadoop is the Hadoop Distributed File System (usually abbreviated HDFS), and the processing component is MapReduce. Several daemons run inside a Hadoop cluster, including NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.
Nowadays, customers are using different BI tools to gain new insights into their treasure trove of data. Tableau is one of the most popular of these tools among enterprises. In this tutorial, we will demonstrate how to connect Tableau to MaxCompute, Alibaba Cloud's most advanced big data platform.
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.
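To make the relationship between a SQL-like query and MapReduce more concrete, here is a minimal Python sketch that simulates, in a single process, the map, shuffle, and reduce phases behind an aggregation such as `SELECT country, COUNT(*) FROM page_views GROUP BY country`. The table name `page_views`, the column `country`, and all function names are hypothetical, and the sketch only mirrors the general MapReduce pattern; it is not the execution plan Hive actually generates.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, which implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_country(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    page_views = [  # hypothetical sample rows
        {"country": "CN"},
        {"country": "US"},
        {"country": "CN"},
    ]
    print(count_by_country(page_views))
```

With HiveQL, the one-line query replaces all of this hand-written code, which is the abstraction the paragraph above describes.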
ES-Hadoop is a tool provided by Elasticsearch (ES) for connecting to the Hadoop ecosystem. It enables users to process data in ES with tools such as MapReduce (MR), Spark, and Hive. (ES-Hadoop also supports taking a snapshot of ES indices and storing it in HDFS, which is not discussed in this topic.)
This topic describes how to use MaxCompute to build a petabyte-scale data warehouse based on OSS. By using MaxCompute to analyze the massive data stored in OSS, you can complete your big data analysis tasks in minutes and explore data value more efficiently.
OSS provides three storage classes: Standard, Infrequent Access, and Archive, which are suitable for different data access scenarios. OSS can be used together with Hadoop open-source community products and multiple Alibaba Cloud products, such as EMR, BatchCompute, MaxCompute, machine learning tool PAI, Data Lake Analytics, and Function Compute.
EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.
MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data import solutions and distributed computing models, enabling users to query massive datasets effectively, reduce production costs, and ensure data security.
Alibaba Clouder - January 11, 2018