Ten major problems of E-MapReduce (Hadoop) - cluster plan
Created#More Posted time:Mar 8, 2017 13:26 PM
I strongly recommend E-MapReduce if you have demands for big data solutions. Give it a try.
Cluster plan considerations
All who are using or planning to use Hadoop will definitely encounter the cluster plan issue: what scale should I choose for my cluster? Is there is criterion to follow? This article will introduce the cluster plan for you.
In cloud-based E-MapReduce, you can freely configure the assortment. At present, there are two CPU-memory ratios available: 1:2 and 1:4. A single machine is equipped with four disks. Different disks vary in performance, such as basic cloud disks, ultra cloud disks and SSD cloud disks, and vary in prices naturally. The size of a single disk ranges from 40 GB to 32 TB.
Companies for which money is not a problem can skip this article – just purchase the most expensive and advanced configuration, and your demands will surely be met. But if you are a start-up or your company sticks to economizing on expenditures, you're welcome to join us in studying this article.
• Separation of offline and online resources, mainly separation of online streaming computation (SparkStreaming\Storm), and storage service (HBase) with offline computation. Because the two have different objectives: the online computation pursues a quick QPS response time, while the offline computation pursues a high throughput.
• HBase should be hosted on SSD cloud disks to use the HDFS provided by EMR directly, because HBase requires a low latency.
• Cold data should be stored in OSS as much as possible.
• Small files should be merged as much as possible and data should be stored in OSS.
• Storage and computing resources for offline computation should be separated as much as possible. If OSS is not able to provide the required performance (too many small files), the ephemeral disk will be needed.
• In the peak periods, start the EMR on-demand cluster to analyze data. When the wave crest passes, release the cluster to save costs.
• For Spark, try to configure the cpu:memory ratio to 1cpu:4g.
General steps for evaluating a cluster scale required:
Evaluate the data size -> Test a small cluster for its quantified performance -> Finalize on the cluster specification.
Typical offline scenarios
A user increases 100GB data every day, that is, 3TB every month. After compression, the data size is 1TB (suppose the compression ratio is 30%). All the data is stored in HDFS. There wasn't much data analysis a year ago, but the user wants to store the data. The computation is dominated by offline machine learning and ETL. Hive and Spark are mainly used, with 5-10 concurrent jobs. The user will have around 12TB data a year and may require 36TB of disk space in HDFS. In general, ETL and machine learning are comparatively CPU-consuming. At present, E-MapReduce jobs are submitted from the master which can have a larger size.
• The user’s requirement is to store 12TB of physical data, that is, 36TB of disk space in HDFS.
• The computing requirements are hard to estimate and dependent on actual running conditions. In general, it is evaluated by the user based on the run time and scale together. You can first run a basic case to evaluate the run time of a small cluster. Then you can tune up the machine scale linearly. Suppose the user requires the configuration of 20slave 8cpu 32g for running:
o A machine of 2master 8cpu 32g configuration should be prepared with a 350GB ultra cloud disk (350GB can ensure the maximum disk IO);
o Four ultra cloud disks of 20slave 8cpu 32g 450g configuration;
o All the data of a year ago or earlier is stored in OSS and E-MapReduce will connect to OSS directly to get the data for analysis as needed.
In general, the cluster may not be appropriate when the business changes. At this time, you need to re-adjust the cluster size. The most common way is to increase nodes to re-create a cluster with a new specification (It is better that you purchase a cluster of the subscription billing mode so that you can create another cluster as needed when the cluster is about to expire).
This section is easier to plan and the basic disk can be neglected. CPU takes the dominant role.
First test and then scale up.
If the streaming computation is purely used for UV statistics, the CPU and memory ratio should be 1:2. If you need to store data temporarily in the memory, the ratio should be 1:4.
Take the Spark Streaming with temporary data storage for example:
• 1master 4cpu 8g 350*1 ultra cloud disk;
• 2slave 4cpu 16g 100*4 ultra cloud disks. Nodes can be scaled based on actual needs later.
HBase storage service
SSD cloud disk is recommended for this section considering the IOPS.
The streaming computing CPU and memory should be allocated in the 1:4 ratio and the slave specification can be larger.
At the beginning, you can follow the below configuration:
• 2master 4cpu 8g 250*1 SSD cloud disk
• 2slave 16cpu 64g 250g*4 SSD cloud disks. Nodes can be scaled based on actual needs later.
Separated storage and computation for offline computation
Offline calculation can achieve separated storage and computation. For example, you can store all the data in the OSS for analysis by stateless E-MapReduce. In this case, E-MapReduce is purely responsible for computation with no assortment of storage and computation involved to adapt to the business. This mode is the most flexible. Later I will talk about separated storage and computation in a dedicated article.
The cluster specification is eventually subject to the evaluation of users based on their own business features. The above are just some general principles. We welcome suggestions from E-MapReduce and Hadoop users.