Data is everywhere. Phenomena such as the Internet of Things (IoT) and widespread digitization have unleashed a tsunami of information on the world, and enterprises are struggling to keep up. E-MapReduce, from Alibaba Cloud, is a set of cloud-based tools for processing vast amounts of data.
Some fifty-nine percent of organizations believe they lack the capabilities to generate meaningful business insights from their data.
The result? Many organizations pool this information in vast "data lakes" because they don't know what to do with it. They know it has value, but how to extract that value is not yet clear.
Consequently, the data languishes in its natural state. Anecdotally, more than eighty percent of enterprise data is unstructured, making it particularly difficult to process and manage. And yet, the amount of information stored in the world's IT systems is doubling about every two years, so you need to act now before you drown in data.
Worldwide Big Data revenues are forecast to reach USD 210 billion in 2020, representing an increase of more than seventy-five percent over a five-year period. As such, the wasted financial opportunity of data lakes is significant.
If you want a slice of this (quite sizable) pie, then you need a system to pan for the gold hidden in your data lake.
This is where Big Data processing steps in. It allows you to process and analyze your stored and real-time data.
So, cloud providers such as Alibaba Cloud are offering organizations the ability to create and manage container clusters quickly, cheaply and securely.
A cluster is a collection of cloud resources, which are required to run containers. Containers allow you to seamlessly move your software and information from one computing environment to another. They are more lightweight and use fewer resources than virtual machines, making them an ideal solution in Big Data scenarios.
Alibaba Cloud Elastic MapReduce, or E-MapReduce, is a rich framework to manage and process Big Data. It runs on the Alibaba Cloud platform and is built on the Alibaba Cloud Elastic Compute Service (ECS). Apache's open source frameworks Hadoop and Spark also form the core of this framework.
Hadoop carries out distributed processing of large data sets across clusters of computers that can span anything from a few nodes to a few thousand nodes. It stores large data sets using its storage component, known as the Hadoop Distributed File System (HDFS), and processes them with a component called MapReduce, which maps each input record to key-value pairs and then reduces the values for each key to a result.
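To make the map and reduce steps more concrete, here is a minimal single-machine sketch of the classic word-count pattern in plain Python. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example, and a real Hadoop job would distribute each phase across the nodes of a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    sample = ["big data is big", "data lakes hold data"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can in principle be spread across many machines, which is what lets the pattern scale to large clusters.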
Spark is a data processing tool that operates on distributed data collections, but it doesn't store your data. Spark is usually much faster than MapReduce because it can keep the working data set in memory across operations, while MapReduce works in separate steps and writes intermediate results to disk. As a result, Spark can be around ten times faster than MapReduce for batch processing and up to 100 times faster for in-memory analytics.
E-MapReduce utilizes the best parts of these two established frameworks, then adds Alibaba Cloud's expertise to optimize its processing capabilities. For example, E-MapReduce's Spark-based features make it particularly suitable for streaming large volumes of data.
E-MapReduce also provides a proprietary, integrated set of cluster management tools that cover host selection, environment deployment, cluster building, cluster configuration, cluster running, job configuration, job running, and performance monitoring.
So, it manages many of the low-level tasks for cluster creation and provisioning, allowing you to focus on the processing logic of your application.
E-MapReduce consists of an "agent layer" at the base (where small, independent programs run in parallel to process inputs and create some output, like an outgoing network message), with the HDFS and Tachyon file systems sitting directly above it. Above those sit the full Hadoop ecosystem, along with Spark and a wide variety of Apache tools. The top layer is the web-based user-administration interface, which makes it easy to use and manage the underlying tools and systems.
What this means is that if you can do it using Hadoop, Spark, or their associated tools, you can do it in E-MapReduce. And you can do it much, much more easily than you could if you had to set up and provision Hadoop or Spark from scratch.
E-MapReduce allows seamless data exchange between cloud services such as the Alibaba Cloud Object Storage Service (OSS) and ApsaraDB for RDS, so that the user can share and transfer data between multiple systems to meet access needs for diverse types of businesses.
Consequently, E-MapReduce integrates very easily with other Big Data-oriented elements of Alibaba Cloud. It can work with Alibaba Elastic Compute Service (ECS) apps, and it can process data stored in OSS. It can also send data to the MaxCompute large-scale data warehousing platform, and take the output from the platform for further processing.
And, because E-MapReduce is based on Hadoop and Spark, you can effectively use the storage and computation space it provides as if it were a self-contained system running on its own host, rather than a standard cloud-computing storage system. This significantly reduces the time needed to provision the infrastructure it is running on.
Let's take a minute to understand the impact this could have on your organization. Let's say you work in the oil and gas industry and are monitoring a pipeline. Using a network of sensors along the line and Big Data analytics, you can link multiple forms of data to predict equipment, machine and even sensor failures in real time.
Not only could this prevent catastrophic failures across the pipeline, it could also help you to proactively carry out the necessary maintenance work on the line. The cost savings are significant, as even a small breakdown in this system could cause millions of dollars of tangible and intangible losses.
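As a rough illustration of the kind of check such a monitoring job might run, the sketch below flags sensor readings that deviate sharply from the readings just before them. This is not how E-MapReduce itself detects failures: the rolling z-score rule, the function name `flag_anomalies`, the window size, the threshold, and the sample pressure values are all assumptions chosen for this example.

```python
from collections import deque
from statistics import mean, stdev


def flag_anomalies(readings, window=5, threshold=3.0):
    """Return the indices of readings that lie more than `threshold`
    standard deviations from the mean of the preceding `window` readings."""
    recent = deque(maxlen=window)  # sliding window of earlier readings
    flagged = []
    for index, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Skip the check when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                flagged.append(index)
        recent.append(value)
    return flagged


if __name__ == "__main__":
    # Hypothetical pressure readings from a single pipeline sensor.
    pressure = [50.0, 50.2, 49.9, 50.1, 50.0, 50.1, 75.0, 50.0]
    print(flag_anomalies(pressure))
```

In a production setting the same rule would be applied per sensor over a continuous stream rather than to a small in-memory list, and would more likely be one input to a predictive model than a complete solution.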
The same potential could be realized across other industries and scenarios. However, there are scalability hurdles to overcome, as on-premises data storage can struggle to keep up with the increasing demands Big Data will place on your infrastructure.
Big Data stored in a cloud-based database can help your business with its decision-making processes. The previous example already highlighted one key benefit: the ability to proactively adapt your systems in real time.
There are also a host of additional benefits. For example, E-MapReduce provides an elastic and on-demand self-service, so you can expand your storage or service with the click of a button. Your information is also readily available over the network and you can pool your resources to provide a more cost-effective solution.
E-MapReduce can also help you to remain competitive. Enterprises are demanding ever more flexibility and control over costs and are continuing their shift away from on-premises solutions. Fifty percent of organizations are predicted to embrace a cloud-first strategy for data, Big Data, and analytics in 2018.
Analyzing massive amounts of data in real time will also allow you to further streamline your business and make better decisions across the enterprise.
It's a world away from the data lake that could overspill and leave your organization drowning in data.
You can find out more about E-MapReduce and how it could help your business realize the true value of its data here.