Big Data Made Simpler with E-MapReduce – Part 2

Part 2 of this 2-part series discusses E-MapReduce cluster management, how it works in real-world scenarios, and its main usage scenarios.

By Shantanu Kaushik

In Part 1 of this 2-part series, we discussed how Alibaba Cloud E-MapReduce provides extensive elasticity and operational superiority in analyzing and processing data. We also discussed the prominent features and benefits of Alibaba Cloud E-MapReduce. In this article, we will discuss E-MapReduce cluster management and explain how it works in real-world scenarios. We will also discuss the various usage scenarios related to Alibaba Cloud EMR and the primary benefits of using EMR over the open-source big data ecosystem.

System Components

Alibaba Cloud E-MapReduce provides support for multiple components:

  • Hadoop is a big data processing platform that supports petabytes of storage capacity and comparably large-scale compute capability.
  • Storm is a real-time compute engine that processes big data in real time, returning results with latencies as low as milliseconds.
  • ZooKeeper is an open-source coordination service for distributed architectures. It helps keep your distributed applications operating consistently.
  • Spark is a state-of-the-art distributed computing framework that supports real-time and offline computing scenarios. It also provides machine learning functionality and SQL query support.
  • Hive is an offline data processing system built on Hadoop. It supports structured table management on top of the HDFS file system, with a query syntax similar to SQL that simplifies data analysis and processing.
  • Flink is another distributed engine that supports both batch processing and stream processing of data.
  • Kafka is a well-known open-source, high-throughput, and reliable distributed publish-subscribe messaging system.
  • Hue, Oozie, and Druid – Alibaba Cloud EMR uses numerous open-source tools, including Hue for a web-based management interface, Oozie for job scheduling, and Druid for real-time big data analysis.

Advantages over Other Big Data Ecosystems

Now that we know Alibaba Cloud E-MapReduce supports component-level integration, let’s take a look at the advantages it offers over the open-source big data ecosystems that are widely available for enterprises to deploy and apply.

[Figure 1: Steps involved in building and running a big data cluster]

The image above shows all the steps involved in getting your application logic running on a big data cluster. Alibaba Cloud EMR focuses on the last three steps to provide highly integrated, seamless cluster management. The first seven steps are preparations, but the last three steps are the most complex and time-consuming.

Alibaba Cloud EMR integrates a plethora of features that are required for cluster management. Some of the prime features are:

  • Host Selection
  • Cluster Building
  • Environment Deployment
  • Cluster Configuration
  • Cluster Execution
  • Job Configuration
  • Job Execution
  • Cluster Management
  • Performance Monitoring

Alibaba Cloud EMR is a highly efficient and self-sustained solution that frees you from all the tedious procurement, preparation, and O&M work required to build clusters. You only need to work on processing the logic of your applications.

Alibaba Cloud EMR functions with different combinations of cluster services to help you meet your business requirements. After running the Hadoop service for Alibaba Cloud EMR, you can perform:

  • Daily Data Measurement
  • Batch Computing

If you also include Spark, you can additionally perform:

  • Stream Computing
  • Real-Time Computing

EMR Composition

EMR clusters are the core user-facing components. They are based on Hadoop or Spark and are deployed on one or more ECS instances. Consider a Hadoop cluster made up of processes that run on the cluster's ECS instances, where each ECS instance corresponds to a node. Alibaba Cloud EMR decides whether each process should run on the master node or on the core and task nodes, depending on the priority and resource allocation of the task. The chart below shows how EMR clusters are managed. Here, one master node has three worker nodes (core and task nodes), which handle multiple tasks simultaneously depending on their priority and required resources.

[Figure 2: EMR cluster with one master node and three core and task nodes]
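
The node layout described above can be sketched in a few lines of Python. This is only an illustration, not EMR's API: the `ROLE_DAEMONS` mapping, the `Node` class, and the helper names are hypothetical, and the assignment of daemons to roles assumes a typical Hadoop layout (NameNode and ResourceManager on the master, DataNode and NodeManager on core nodes, NodeManager only on task nodes) that the article itself does not spell out.

```python
from dataclasses import dataclass

# Assumed, typical Hadoop placement of daemons per node role.
# The article does not list which processes run where.
ROLE_DAEMONS = {
    "master": ["NameNode", "ResourceManager"],
    "core": ["DataNode", "NodeManager"],   # storage + compute
    "task": ["NodeManager"],               # compute only
}


@dataclass
class Node:
    name: str
    role: str  # "master", "core", or "task"

    @property
    def daemons(self):
        return ROLE_DAEMONS[self.role]


def build_cluster(core_count, task_count):
    """Return one master node plus the requested core and task nodes."""
    nodes = [Node("master-1", "master")]
    nodes += [Node(f"core-{i}", "core") for i in range(1, core_count + 1)]
    nodes += [Node(f"task-{i}", "task") for i in range(1, task_count + 1)]
    return nodes


def nodes_running(cluster, daemon):
    """List the names of nodes that host the given daemon."""
    return [node.name for node in cluster if daemon in node.daemons]


# One master with three worker nodes, matching the scenario in the text.
cluster = build_cluster(core_count=2, task_count=1)
for node in cluster:
    print(node.name, node.role, node.daemons)
```

Keeping storage daemons off the task nodes is what would make them safe to add and remove freely in this model, since no HDFS blocks live on them.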

Usage Scenarios – Elastic MapReduce

Data Integration

Let’s take a look at an architectural overlay of this scenario on the chart below:

[Figure 3: Data integration architecture]

Alibaba Cloud E-MapReduce supports multiple data integration points using:

  • Open-Source Tools
  • Real-Time Data Integration Tools
  • Offline Tools
  • Alibaba Cloud In-House Developed Tools

If we look at the architectural flow, we can notice that Alibaba Cloud EMR uses multiple services, such as Alibaba Cloud MaxCompute, for data integration.

EMR also uses Data Transmission Service (DTS) to accept data from database clusters running on various database platforms. It uses Object Storage Service (OSS) as its Hadoop file system storage, in place of the Hadoop Distributed File System (HDFS), and Log Service to record logs.

Offline Computing

[Figure 4: Offline computing]

Cost-effectiveness is a big factor when working with Big Data technologies. Alibaba Cloud EMR supports multiple compute engines that include:

  • MapReduce (MR)
  • Hive
  • Pig
  • Spark
  • SparkSQL
  • Tez

EMR supports reading data from multiple data sources, such as Object Storage Service (OSS), MaxCompute, Kafka, and HDFS, for offline data processing. It then writes the compute results to downstream software in a variety of formats.
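
To make the batch model behind the MapReduce engine concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases in plain Python. It is not Hadoop or EMR code; the function names and the `sample_log` input are made up for illustration, and a real job would read its input from a source such as OSS or HDFS rather than a list.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input standing in for a batch of log lines.
sample_log = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(sample_log)
print(counts)
```

On a cluster, the map and reduce functions run in parallel across nodes and the shuffle moves data over the network; the logic per record stays the same.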

Data Analysis Using Ad-Hoc Queries

[Figure 5: Data analysis using ad-hoc queries]

Alibaba Cloud EMR is a highly flexible and scalable platform. It creates Hadoop clusters easily and efficiently to enable flexible and fast data analysis, and it automatically releases the clusters after the data processing finishes. This elasticity matters when processing huge amounts of data because it keeps costs down: clusters do not sit idle once the work is done. You are also free to adjust the number of compute nodes within a cluster to match the processing priority of a task.
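
As a rough illustration of adjusting the number of compute nodes, the sketch below estimates how many nodes a job would need to finish by a deadline. Everything here is an assumption for the sake of the example: the function name, the per-node throughput figure, and the node limits are hypothetical and are not EMR parameters or measured values.

```python
import math


def task_nodes_needed(data_gb, gb_per_node_hour, deadline_hours,
                      min_nodes=1, max_nodes=50):
    """Estimate the compute nodes needed to process data_gb before the deadline.

    Assumes throughput scales linearly with node count, which is a
    simplification; real jobs rarely scale perfectly.
    """
    if gb_per_node_hour <= 0 or deadline_hours <= 0:
        raise ValueError("throughput and deadline must be positive")
    required = math.ceil(data_gb / (gb_per_node_hour * deadline_hours))
    # Clamp to the allowed cluster size.
    return max(min_nodes, min(required, max_nodes))


# A higher-priority job gets a tighter deadline and therefore more nodes.
print(task_nodes_needed(1200, gb_per_node_hour=100, deadline_hours=4))
print(task_nodes_needed(1200, gb_per_node_hour=100, deadline_hours=1))
```

Shortening the deadline is one simple way to express higher priority: the same data volume then maps to a larger node count, up to the configured maximum.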

Streaming Data with EMR

In this scenario, Alibaba Cloud EMR supports real-time computing with a flexible, reliable, and stable system. You can connect multiple real-time data sources and efficiently analyze and process their data using compute engines like Spark Streaming, Flink, and Storm.

Let’s take a look at the chart below to understand this scenario:

[Figure 6: Streaming data with EMR]
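
Stream engines such as Spark Streaming and Flink commonly aggregate events over fixed time windows. The sketch below shows that idea, a tumbling-window count, in plain Python. It does not use any of those engines' APIs; the function name and the sample events are hypothetical, and a real stream would be unbounded rather than a finite list.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per key within fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns a dict mapping each window's start time to a Counter of keys.
    """
    windows = {}
    for timestamp, key in events:
        # Integer division snaps each timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return windows


# Hypothetical click-stream events: (seconds since start, event type).
events = [(0, "click"), (3, "view"), (9, "click"), (10, "click"), (17, "view")]
windows = tumbling_window_counts(events, window_seconds=10)
for start in sorted(windows):
    print(start, dict(windows[start]))
```

A production engine adds what this sketch leaves out, such as handling late or out-of-order events and distributing the state across nodes.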

Wrapping Up

Alibaba Cloud solutions are built on core computing practices, with an emphasis on flexibility, availability, and high performance. Meeting these basic requirements lets a solution work optimally and deliver productive results.

Upcoming Articles

  1. Discovering and Securing Sensitive Information – Alibaba Cloud