Challenges of Running Hadoop in the Cloud
Posted: Mar 16, 2017 14:02
Many people worry about the performance of cloud-based Hadoop because virtualization carries a cost, which fuels the biased conclusion that cloud performance must be poorer than performance on physical machines. If 10 physical machines are individually virtualized to run Hadoop, some performance overhead is indeed incurred. A public cloud is a different story, however: the cost of virtualization is borne by the platform provider, which buys machines in bulk and can sell spare resources while still guaranteeing the performance of the virtualized machines.
A platform that sells an 8-core, 32 GB virtual machine must actually deliver that capacity.
With the elasticity of the cloud, the overall cost for businesses falls.
On the other hand, running Hadoop in the cloud also confronts the platform provider with several challenges. These challenges, and the provider's corresponding solutions, are introduced below.
Challenges for cloud-based Hadoop – shuffle
Shuffle can run in push or pull mode. In push mode, data is sent directly to the next node, as in Storm and Flink. In pull mode, data is first stored locally, and the next node then pulls it, as in Hadoop MapReduce and Spark.
The major bottleneck of push mode is the network. In a typical cloud environment the network performs much as it does on-premises and can generally satisfy this need.
The major bottleneck of pull mode is the disk. In a cloud environment, local disks or SSD acceleration can be provided.
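The pull mode described above can be illustrated with a minimal sketch: each mapper hash-partitions its output into local files, and each reducer later pulls its own partition from every mapper's output and merges it. All file names and the word-count-style merge are illustrative, not Hadoop's actual on-disk format.

```python
import os
import tempfile
from collections import defaultdict

NUM_REDUCERS = 2

def map_phase(records, mapper_id, out_dir):
    """Mapper writes its output to local files, one file per reducer
    partition -- the 'store locally first' half of pull-mode shuffle."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % NUM_REDUCERS].append((key, value))
    for part, pairs in partitions.items():
        path = os.path.join(out_dir, f"map{mapper_id}_part{part}.txt")
        with open(path, "w") as f:
            for key, value in pairs:
                f.write(f"{key}\t{value}\n")

def reduce_phase(reducer_id, out_dir, num_mappers):
    """Reducer pulls its partition file from every mapper and merges by key
    (here: summing counts, word-count style)."""
    merged = defaultdict(int)
    for m in range(num_mappers):
        path = os.path.join(out_dir, f"map{m}_part{reducer_id}.txt")
        if not os.path.exists(path):
            continue  # this mapper produced nothing for this partition
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                merged[key] += int(value)
    return dict(merged)

out_dir = tempfile.mkdtemp()
map_phase([("a", 1), ("b", 1)], mapper_id=0, out_dir=out_dir)
map_phase([("a", 1), ("c", 1)], mapper_id=1, out_dir=out_dir)
totals = {}
for r in range(NUM_REDUCERS):
    totals.update(reduce_phase(r, out_dir, num_mappers=2))
print(totals)  # {'a': 2, 'b': 1, 'c': 1} (order may vary)
```

The sketch makes the bottleneck concrete: every intermediate record hits the local disk before any reducer sees it, which is why pull-mode shuffle is disk-bound.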
According to a report from the Spark community, in many scenarios such as machine learning the CPU, rather than the disk, is now the bottleneck.
Challenge for cloud-based Hadoop – data localization
Data localization means moving computation to the node where the data resides. When computing and storage are separated and data is stored in OSS, the data must be pulled remotely from OSS, which is generally assumed to cause a performance problem.
Network bandwidth, however, has been improving rapidly:
Comparing 2009 with 2016, bandwidth increased by roughly 100 times; most impressively, domestic home bandwidth grew from 4 Mbps to 100 Mbps, and 4G is gaining popularity. Nowadays I generally watch movies online instead of downloading them to my computer. Many data centers have 100 Gbps point-to-point bandwidth. Disks themselves have not improved much in throughput, but compression can reduce the amount of data to read. Moreover, sensitivity to locality varies by workload: ETL jobs run for only a few hours at night, so performance sensitivity is low; machine learning caches data in RAM; and in stream computing the data is dynamic by nature.
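The bandwidth argument above can be checked with quick arithmetic: the time to move a fixed amount of data at the link speeds mentioned. The 1 TB figure and the link labels are illustrative assumptions, not measurements from the post.

```python
# Rough arithmetic for the bandwidth comparison above: hours needed to
# move 1 TB at different link speeds (illustrative, ignoring protocol
# overhead and contention).
TB = 1_000_000_000_000  # bytes

def transfer_hours(size_bytes, link_mbps):
    """Hours to transfer size_bytes over a link of link_mbps megabits/s."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / 3600

for label, mbps in [("4 Mbps (2009 home link)", 4),
                    ("100 Mbps (2016 home link)", 100),
                    ("100 Gbps (data-center link)", 100_000)]:
    print(f"{label:<28} {transfer_hours(TB, mbps):10.2f} h")
```

At 4 Mbps moving a terabyte takes weeks; at 100 Gbps it takes about 80 seconds, which is why remote reads from object storage stopped being the obvious bottleneck.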
In general, as bandwidth increases and business scenarios become more real-time and diversified, data localization becomes unnecessary.
Challenges for cloud-based Hadoop – automated operation
Basic functions such as job management, job composition, monitoring, and alerting are straightforward. But Hadoop itself is complicated, and any fault in it will affect running jobs.
These faults include, but are not limited to:
• Master failures
• Log cleanup and rotation
• Node failures and automatic recovery
• DataNode disconnection handling
• NodeManager disconnection handling
• Job monitoring and alerting
• Overload monitoring and alerting
• Node data rebalancing
• Single-node capacity expansion
• Automatic version upgrades
• Backup of critical data
• HBase and other component metric monitoring and alerting
• Storm and other component metric monitoring and alerting
We need automatic diagnosis of these problems, and to solve them with the participation of both users and the platform.
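Several of the faults listed above (node failures, DataNode/NodeManager disconnections) reduce to the same primitive: noticing that a node has stopped sending heartbeats. A minimal sketch, with hypothetical node names and an arbitrary 30-second timeout:

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds of silence before a node is flagged (assumed value)

def find_dead_nodes(last_heartbeat, now):
    """Return nodes whose most recent heartbeat is older than the timeout --
    the check behind 'node failures and automatic recovery'."""
    return [node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT]

now = time.time()
heartbeats = {
    "datanode-1": now - 5,     # healthy
    "datanode-2": now - 120,   # silent for two minutes -> flag it
    "nodemanager-3": now - 2,  # healthy
}
dead = find_dead_nodes(heartbeats, now)
print(dead)  # ['datanode-2']
```

A real platform would feed the flagged list into alerting and automatic recovery (restart, decommission, rebalance) rather than just printing it.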
Challenges for cloud-based Hadoop – suggestions from experts
• Analyze resource usage to judge whether capacity expansion is needed
• Rate Hive SQL statements and suggest the optimal way of writing them
• Analyze storage to see whether compression is needed and whether there are too many small files that should be merged; analyze access records to see whether cold data can be archived
• Analyze job statistics at runtime to see whether a job's map time is too short, whether there is data skew in MapReduce, and whether any individual job needs parameter tuning
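The small-file check in the list above can be sketched as a scan over a directory tree (standing in for an HDFS listing), flagging files below a threshold. The 64 MB cutoff is an assumption chosen to sit below a typical HDFS block size; the demo files are synthetic.

```python
import os
import tempfile

SMALL_FILE_THRESHOLD = 64 * 1024 * 1024  # 64 MB, below a typical HDFS block size

def small_file_report(root):
    """Scan a directory tree and report how many files fall below the
    threshold and could be merged into larger ones."""
    total, small, small_bytes = 0, 0, 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            total += 1
            if size < SMALL_FILE_THRESHOLD:
                small += 1
                small_bytes += size
    return {"files": total, "small": small, "small_bytes": small_bytes}

# Demo on a temp dir containing two tiny files.
root = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(root, f"part-{i}"), "w") as f:
        f.write("x" * 100)
report = small_file_report(root)
print(report)  # {'files': 2, 'small': 2, 'small_bytes': 200}
```

The same shape of scan, run against HDFS metadata instead of a local directory, is what lets the platform recommend merging small files or enabling compression.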
Such a system mainly optimizes storage, jobs, performance and so on; it is generally not an enterprise-internal system, but one built in the cloud to help enterprises.