Deployment structure of Hadoop over cloud
Posted time: Mar 13, 2017 14:05
After ten years of development, Hadoop has become the de facto standard for big data technology. In 2012 in particular, Hadoop introduced YARN, which allows third-party computing engines (such as Spark) to run on it and has made YARN a standard part of the basic big data processing stack. In the past two years, many big data startups have emerged. In cutting-edge fields such as genetic engineering, Industry 4.0, artificial intelligence and unmanned operation, big data sits at the core.
Cloud computing is also developing fast. The cloud offers many benefits, of which flexibility is a major one. Costs can be spread evenly, which attracts more and more enterprises to the cloud, and many startups in particular are cloud-based. Making good use of the cloud has become a priority for technical departments in many companies, because the cloud is a new means of improving competitiveness and holding ground against competitors.
At the very beginning, cloud was not closely associated with big data. In some big companies, the big data and cloud business lines are run by different departments and are almost independent of each other. Later, it was found that business systems, even when cloud-based, still see low utilization at night and high utilization in the daytime, while big data workloads show exactly the opposite pattern. People therefore started to consider whether the cloud could be combined with big data. This led to the Sahara project in OpenStack and to E-MapReduce from Alibaba Cloud, a cloud provider.
This series of articles covers the deployment structure of cloud-based Hadoop, the advantages of cloud-based Hadoop, the challenges for cloud-based Hadoop, and best practices for cloud-based Hadoop. Reading these articles may require basic knowledge of Hadoop.
Deployment structure of cloud-based Hadoop
Cloud-based Hadoop deployment is flexible. Hadoop clusters can be deployed according to different business objectives. General cloud-based deployment structures are summarized as follows.
The traditional deployment mode is shown above. Because offline (on-premises) machines are relatively fixed, DataNodes and NodeManagers are usually deployed together on general nodes.
Classic mode 2
Nodes are generally classified as Master Nodes, Core Nodes, and Task Nodes. This deployment structure is comparatively flexible: the NodeManager is deployed on the Task Nodes, so if you need more computing capacity you can add Task Nodes. Task Nodes are stateless, which makes shrinking the cluster simple. (This can avoid the mismatch between computing and storage capacity that often occurs in offline deployments and wastes resources.)
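The role split above can be sketched as a small Python model. This is only an illustration: the class names are hypothetical, and the daemon layout for Master and Core Nodes (NameNode and ResourceManager on the master, DataNode plus NodeManager on core nodes) is an assumption, since the text only states that Task Nodes run the NodeManager.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str  # "master", "core", or "task"

    @property
    def daemons(self):
        # Assumed daemon layout per role; only the "task" entry comes from the text.
        return {
            "master": ["NameNode", "ResourceManager"],
            "core": ["DataNode", "NodeManager"],
            "task": ["NodeManager"],
        }[self.role]


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)

    def add_task_nodes(self, count):
        # Adding task nodes raises computing capacity only.
        start = sum(1 for n in self.nodes if n.role == "task")
        for i in range(start, start + count):
            self.nodes.append(Node(f"task-{i}", "task"))

    def remove_task_nodes(self, count):
        # Task nodes hold no HDFS blocks, so no data has to be moved first.
        removed = 0
        for node in reversed(list(self.nodes)):
            if removed == count:
                break
            if node.role == "task":
                self.nodes.remove(node)
                removed += 1
        return removed

    def count(self, daemon):
        # Number of nodes that run the given daemon.
        return sum(daemon in n.daemons for n in self.nodes)
```

In this model, growing or shrinking the set of Task Nodes changes the NodeManager count while the DataNode count (the storage side) stays the same.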
Storage and computing separation
In this mode, data is stored in OSS, and Hadoop clusters are started when needed to analyze it. The major advantage of this mode is that Hadoop clusters can be released after use to save costs. E-MapReduce will also provide an on-demand charging mode.
In some common business scenarios, the offline analysis that users need can be completed within a limited period at night; in such cases, it is appropriate to start a cluster at midnight to analyze the data stored in OSS.
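The cost argument can be made concrete with a rough calculation. The sketch below uses made-up numbers (10 nodes, a price of 0.5 per node-hour, a 4-hour nightly job); they are assumptions for illustration, not E-MapReduce prices. It also ignores OSS storage fees and any price difference between billing modes.

```python
def monthly_cost(nodes, price_per_node_hour, hours_per_day, days=30):
    # Compute cost only; storage in OSS is billed separately and not modeled here.
    return nodes * price_per_node_hour * hours_per_day * days


# Hypothetical cluster: 10 nodes at 0.5 (any currency) per node-hour.
always_on = monthly_cost(nodes=10, price_per_node_hour=0.5, hours_per_day=24)
nightly = monthly_cost(nodes=10, price_per_node_hour=0.5, hours_per_day=4)

# Fraction of the compute bill saved by releasing the cluster after the nightly job.
saving = 1 - nightly / always_on
print(f"always on: {always_on}, nightly only: {nightly}, saving: {saving:.0%}")
```

Under these assumptions the saving is simply the share of the day the cluster does not run, so the shorter the nightly job, the stronger the case for a temporary cluster.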
Tip: with HADOOP-12756, Hadoop supports reading data in Alibaba Cloud OSS.
The practice shown in the first figure is to provide an OSSFileSystem. The practice in the second figure is for the underlying HDFS to access data in OSS through a proxy, making HDFS stateless, so that HDFS offers some features similar to those of Alluxio and becomes completely transparent to the layer above it. (Not implemented yet.)
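For the first practice (an OSS file system implementation), the cluster has to be told how to reach OSS. The sketch below generates the relevant core-site.xml properties from Python. The property names and the implementation class are the ones used by the hadoop-aliyun module as far as I know; check them against the documentation of your Hadoop version. The endpoint and credentials shown are placeholders, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_core_site(endpoint, access_key_id, access_key_secret):
    # Property names as documented for the hadoop-aliyun module (verify for your version).
    props = {
        "fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem",
        "fs.oss.endpoint": endpoint,
        "fs.oss.accessKeyId": access_key_id,
        "fs.oss.accessKeySecret": access_key_secret,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values; replace with your own region endpoint and credentials.
xml_text = build_core_site("oss-cn-hangzhou.aliyuncs.com", "YOUR_KEY_ID", "YOUR_KEY_SECRET")
print(xml_text)
```

With such a configuration in place, jobs would address data with paths of the form oss://your-bucket/path (the bucket name here is hypothetical) instead of hdfs:// paths.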
Cloud data sharing
When you have many clusters and tables, metadata can be stored in RDS (MySQL) so that the clusters share it: you can keep permanent clusters running, or just start a temporary cluster for data analysis at night and then release it.
User data can be stored in OSS or in the permanent cluster.
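As a minimal sketch of shared metadata, assume the metadata in question is a Hive metastore backed by the MySQL instance in RDS (the text does not name Hive, so this is an assumption). Every cluster, permanent or temporary, is then given the same connection settings. The javax.jdo.option.* keys are standard Hive metastore properties; the host, database name, user and helper function are hypothetical.

```python
def metastore_properties(rds_host, database, user, password, port=3306):
    # Standard Hive metastore JDBC settings, pointed at a shared MySQL instance.
    return {
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{rds_host}:{port}/{database}",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


# Hypothetical RDS instance shared by all clusters.
shared = dict(rds_host="rds.example.internal", database="hivemeta",
              user="hive", password="change-me")

permanent_cluster = metastore_properties(**shared)
nightly_cluster = metastore_properties(**shared)

# Both clusters see the same table definitions because they read the same metastore.
print(permanent_cluster == nightly_cluster)
```

Because the table definitions live outside any single cluster, releasing the temporary cluster does not lose them.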
Deploying in a private network is mainly for security. Your business and big data systems are all kept in a private network that is not reachable from outside by default, and it can be opened to external access by technical means when needed.
Hybrid cloud mode
You may currently have many specialized systems that have not yet been moved to the cloud, such as CRM, ERP, and Oracle. If you still want to use big data services on the cloud for data analysis, consider a hybrid cloud solution that uploads data directly to the cloud over a leased line.
The deployment structures above can be combined to meet customers' needs.
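When sizing the leased line, a rough transfer-time estimate helps decide whether a nightly upload fits the available window. The sketch below is back-of-the-envelope arithmetic only: the data volume, link speed and the 80% effective-throughput factor are all assumptions, and the function name is hypothetical.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    # Decimal units: 1 GB = 8 * 10^9 bits, 1 Mbps = 10^6 bits per second.
    bits = data_gb * 8 * 1000**3
    # Assume only a fraction of the nominal bandwidth is usable in practice.
    effective_bps = link_mbps * 1000**2 * efficiency
    return bits / effective_bps / 3600


# Hypothetical daily export of 500 GB over a 1 Gbps leased line.
hours = transfer_hours(data_gb=500, link_mbps=1000)
print(f"{hours:.2f} hours")
```

If the estimate exceeds the upload window, the usual options are a faster line, incremental uploads, or compressing data before transfer.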