
E-MapReduce Service

A Big Data service that uses Apache Hadoop and Spark to process and analyze data


Overview

Alibaba Cloud Elastic MapReduce (E-MapReduce) is a big data processing solution for quickly processing huge amounts of data. Built on open-source Apache Hadoop and Apache Spark, E-MapReduce flexibly handles big data use cases such as trend analysis, data warehousing, and analysis of continuous data streams.

E-MapReduce simplifies big data processing, making it easy, fast, scalable, and cost-effective to provision distributed Hadoop clusters and process your data. This helps you streamline your business with better decisions based on massive-scale data analysis completed in real time.


Benefits

Usability

  • Lets you select the required ECS model (CPU, memory) and disks, plus the software to be deployed automatically

  • Creates E-MapReduce Hadoop clusters on demand within minutes, and releases a cluster once its offline job is complete

  • Adds nodes dynamically as and when needed

  • Facilitates the provisioning, configuration, and tuning of Hadoop clusters

Cost-effective

  • Saves the extra overhead of managing the underlying instances

  • Lets you pay on demand for every instance that you use

Flexible

  • Permits you to scale the number of instances up or down to match your requirements

Easy to integrate

  • Integrates seamlessly with other Alibaba Cloud products, which can serve as input sources or output destinations for the Hadoop/Spark compute engines

Product Details

Alibaba Cloud E-MapReduce offers a fully managed service for easily analyzing data through a user-friendly web interface. E-MapReduce allows you to launch Hadoop clusters within minutes for massive data processing. The product simplifies complex big data processing by performing data-intensive tasks for applications in machine learning, data mining, financial analysis, and data warehousing.

The resources provisioned for a workload are released automatically when the processing task completes, so you pay only for the resources you consume. E-MapReduce uses Apache Storm, Spark, Hue, Hive, MapReduce, and other solutions as backend services.

The product integrates easily with Alibaba Cloud services such as Elastic Compute Service (ECS), Resource Access Management (RAM), ApsaraDB for RDS, and ApsaraDB for Redis. You can also develop and run custom applications on E-MapReduce to suit your business requirements.


Features

Highly Elastic

  • Quickly provisions as many instances as needed, and then releases those instances once the job is complete

  • Lets you deploy multiple new Hadoop clusters as well as resize existing clusters as required

  • Supports scaling up of Hadoop clusters as and when needed

Flexible Cluster Configuration

  • Allows you to freely select the ECS model and its configuration, including CPU, memory, and disks

  • Permits selection of the required number of Master nodes (NameNode and ResourceManager) and Core nodes (DataNode and NodeManager)

Cost-effective

  • Provides flexible payment options based on the cluster payment type: Subscription or Pay-As-You-Go

Supports Multiple Data Stores and Databases

  • Leverages object storage through Alibaba Cloud OSS

  • Supports Alibaba Cloud RDS (MySQL), Table Store, and ApsaraDB for Redis as per your architecture requirements

Integration With 3rd-party Tools

Supports integration with

  • Frameworks: Apache Spark, MapReduce, Apache Pig

  • Tools: Apache Sqoop, Spark SQL

  • Data Storage: Apache HDFS, HBase

  • Supports machine learning, orchestration of processes, stream processing and graph analytics

  • You can also perform offline data processing, ad hoc data analysis, live streaming, etc.

  • Ensures efficient processing of massive data while reducing data processing cost and time

Secure

  • Lets you isolate service permissions between the primary account and sub-accounts through easy integration with Alibaba Cloud RAM

  • Ensures security through configurable firewall settings for Alibaba Cloud ECS instances

  • Offers security configurations for encryption of data stored and processed using E-MapReduce

Flexible Execution of Jobs

  • Lets you efficiently connect jobs (Hive, Pig, Apache Spark, etc.), execute them, and process the results for detailed analysis

  • Allows you to schedule regular workloads in an automated manner

Pricing

Alibaba Cloud E-MapReduce pricing is in addition to the price of the ECS instances it runs on. You pay only for the E-MapReduce instances that you use, on an on-demand basis. The cost of E-MapReduce includes the following components:

Cost of ECS instances

When you purchase an E-MapReduce Hadoop cluster, the underlying Alibaba Cloud ECS instances are purchased automatically, so you do not need to buy or prepare ECS instances in advance. If you are entitled to an ECS discount, you enjoy the same discount when ECS is purchased through E-MapReduce.

Cost of E-MapReduce services

E-MapReduce provides multi-dimensional management services for Hadoop clusters, including console pages for display and control, OpenAPI and SDK support, monitoring, operation and maintenance tools, and automatic server-side backend operation and maintenance. See the details below.

Cost for external network traffic of the Master nodes

In a created cluster, 8 Mbps of public network bandwidth is opened for the Master node (in an HA Hadoop cluster, both Master nodes get the 8 Mbps bandwidth). This traffic is billed on a Pay-As-You-Go basis and is not included in the cost of the Hadoop cluster. Only outbound traffic is charged, on an hourly basis; inbound traffic is free of charge. For example, if you use 10 GB of outbound public traffic in an hour, the charge is 10 GB multiplied by the per-GB price for that hour. Traffic fees differ slightly between regions.
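
As a hedged illustration of that arithmetic (the per-GB price below is a placeholder, not a published rate), the hourly traffic charge works out like this:

```python
# Illustrative traffic-cost calculation; the per-GB price is a
# hypothetical placeholder, not a published Alibaba Cloud rate.
PRICE_PER_GB = 0.08  # USD per GB of outbound public traffic (assumed)

def hourly_traffic_cost(outbound_gb: float, price_per_gb: float = PRICE_PER_GB) -> float:
    """Outbound traffic is billed hourly; inbound traffic is free of charge."""
    return outbound_gb * price_per_gb

# The example from the text: 10 GB of outbound public traffic in one hour.
print("Charge for the hour: $%.2f" % hourly_traffic_cost(10))
```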

China North 1, China East 1, China East 2, China South 1

E-MapReduce instance type | CPU | Memory (GB) | Monthly subscription (USD/month) | Pay-As-You-Go (USD/hour)
ecs.s3.large | 4 | 8 | 162.8 | 0.748
ecs.s3.large | 4 | 8 | 24.84 | 0.1144
ecs.m1.medium | 4 | 16 | 38.04 | 0.1984
ecs.m1.xlarge | 8 | 32 | 77.32 | 0.3992
ecs.c1.large | 8 | 16 | 50.96 | 0.2304
ecs.c2.large | 16 | 32 | 103.16 | 0.462
ecs.c2.xlarge | 16 | 64 | 155.88 | 0.7996
ecs.n1.medium | 2 | 4 | 11.96 | 0.0532
ecs.n1.large | 4 | 8 | 25.16 | 0.1084
ecs.n1.xlarge | 8 | 16 | 51.52 | 0.218
ecs.n1.3xlarge | 16 | 32 | 104.24 | 0.4376
ecs.n1.7xlarge | 32 | 64 | 209.68 | 0.8772
ecs.n2.large | 4 | 16 | 38.08 | 0.162
ecs.n2.xlarge | 8 | 32 | 77.4 | 0.326
ecs.n2.3xlarge | 16 | 64 | 156 | 0.6532
ecs.n2.7xlarge | 32 | 128 | 313.16 | 1.3084

Asia Pacific SE 1 (Singapore)

E-MapReduce instance type | CPU | Memory (GB) | Monthly subscription (USD/month) | Pay-As-You-Go (USD/hour)
ecs.s2.large | 2 | 4 | 22 | 0.0428
ecs.s3.large | 4 | 8 | 43.96 | 0.0856
ecs.m1.medium | 4 | 16 | 61.52 | 0.1196
ecs.m1.xlarge | 8 | 32 | 123.04 | 0.2388
ecs.c1.large | 8 | 16 | 87.88 | 0.1708
ecs.c2.large | 16 | 32 | 175.76 | 0.3412
ecs.c2.xlarge | 16 | 64 | n/a | n/a
ecs.n1.medium | 2 | 4 | 24.44 | 0.0476
ecs.n1.large | 4 | 8 | 48.84 | 0.0948
ecs.n1.xlarge | 8 | 16 | 97.64 | 0.1896
ecs.n1.3xlarge | 16 | 32 | 195.28 | 0.3792
ecs.n1.7xlarge | 32 | 64 | 390.56 | 0.758
ecs.n2.large | 4 | 16 | 68.36 | 0.1328
ecs.n2.xlarge | 8 | 32 | 136.72 | 0.2656
ecs.n2.3xlarge | 16 | 64 | 273.4 | 0.5308
ecs.n2.7xlarge | 32 | 128 | 546.76 | 1.0612

US West 1 (Silicon Valley)

E-MapReduce instance type | CPU | Memory (GB) | Monthly subscription (USD/month) | Pay-As-You-Go (USD/hour)
ecs.s2.large | 2 | 4 | 18.68 | 0.0364
ecs.s3.large | 4 | 8 | 22.96 | 0.0728
ecs.m1.medium | 4 | 16 | 54.92 | 0.1068
ecs.m1.xlarge | 8 | 32 | 109.84 | 0.2136
ecs.c1.large | 8 | 16 | 47.12 | 0.1452
ecs.c2.large | 16 | 32 | 95.44 | 0.29
ecs.c2.xlarge | 16 | 64 | 219.68 | 0.4264
ecs.n1.medium | 2 | 4 | 20.76 | 0.0404
ecs.n1.large | 4 | 8 | 41.52 | 0.0808
ecs.n1.xlarge | 8 | 16 | 83 | 0.1612
ecs.n1.3xlarge | 16 | 32 | 166 | 0.3224
ecs.n1.7xlarge | 32 | 64 | 331.96 | 0.6444
ecs.n2.large | 4 | 16 | 61.04 | 0.1188
ecs.n2.xlarge | 8 | 32 | 122.04 | 0.2372
ecs.n2.3xlarge | 16 | 64 | 244.08 | 0.474
ecs.n2.7xlarge | 32 | 128 | 488.16 | 0.9476

Scenarios

Offline Data Processing

For easy management and processing of petabytes of structured or unstructured application data, such as stored logs and database records.

Recommended Configuration

E-MapReduce + OSS + ECS + HBase + ODPS + ApsaraDB for RDS + ApsaraDB for Redis

[Diagram: offline data processing architecture]

To process and analyze massive amounts of data, such as application logs used to predict user activity or the weather, E-MapReduce streams data from relational and non-relational databases into a provided datastore such as HDFS, HBase, or ODPS. The stored data can then be analyzed using MapReduce, Apache Spark, or Apache Hive provided in the EMR service. The analyzed result is uploaded to OSS, where it is accessed by the web application and shown on the web page. E-MapReduce also facilitates batch processing at your required time and stores the processed results in different storage systems.
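
A minimal PySpark sketch of this flow is given below. The bucket names, paths, and tab-separated log format are hypothetical placeholders, and the oss:// paths assume the OSS integration that E-MapReduce provides.

```python
# Minimal sketch of the offline flow: read logs from OSS, aggregate,
# write results back to OSS for the web application to display.
# Bucket names, paths, and the log format are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="offline-log-analysis")

# Raw application logs previously streamed into OSS (or HDFS).
logs = sc.textFile("oss://my-bucket/app-logs/2017-01-01/")

# Count events per user as a stand-in for "predicting user activity".
user_counts = (logs
               .map(lambda line: line.split("\t"))   # assumed tab-separated
               .filter(lambda fields: len(fields) > 1)
               .map(lambda fields: (fields[0], 1))   # fields[0] = user id (assumed)
               .reduceByKey(lambda a, b: a + b))

# Upload the analyzed result to OSS, where the web page reads it.
user_counts.saveAsTextFile("oss://my-bucket/results/user-activity/")
```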

Ad-hoc Data Analysis

For managing ad-hoc queries through an interactive web interface (Hue or Apache Zeppelin) and easily viewing the processed results.

Recommended Configuration

E-MapReduce + OSS + ODPS + RDS (MySQL) + ApsaraDB for Redis + ECS (MongoDB)

[Diagram: ad-hoc data analysis architecture]

To respond instantly to ad-hoc queries, E-MapReduce integrates with Apache Zeppelin and Hue to offer a user-friendly web interface, so you can easily run and manage Hive or Spark SQL queries without running them manually through CLI tools. Depending on the structure of the processed data, it can be stored on any of the managed Alibaba Cloud services, including ApsaraDB for RDS, ApsaraDB for Redis, or ECS.
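
As a hedged sketch, the kind of ad-hoc query you would type into a Zeppelin or Hue paragraph looks like this when expressed against the Spark 1.x HiveContext API (the table and column names are hypothetical):

```python
# Hypothetical ad-hoc aggregation over a Hive table; in practice the SQL
# would be submitted interactively from the Zeppelin or Hue web interface.
from pyspark import SparkContext
from pyspark.sql import HiveContext  # Spark 1.x API

sc = SparkContext(appName="adhoc-analysis")
sqlContext = HiveContext(sc)

top_pages = sqlContext.sql("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs            -- hypothetical table
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```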

Massive Data Online Service

For easy management, processing, and analysis of huge amounts of data, whether arriving in real time through multiple ingestion channels or already stored in a datastore.

Recommended Configuration

E-MapReduce + ECS + ODPS + RDS (MySQL) + ApsaraDB for Redis

[Diagram: massive data online service architecture]

Data, whether real-time streaming data or previously stored data, is kept in the EMR data layer using HBase and Hadoop HDFS. The processed data is then accessed by a custom application, which takes input from the EMR data layer and presents it in a human-readable format on a custom dashboard.
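
A minimal sketch of such a custom application follows, assuming it reads the HBase data layer through HBase's Thrift gateway with the happybase client; the host, table, and column names are placeholders.

```python
# Hypothetical dashboard backend reading the EMR HBase data layer.
# Assumes an HBase Thrift server is reachable from the application;
# host, table, and column names are placeholders.
import happybase

connection = happybase.Connection("emr-master-host")  # placeholder host
table = connection.table("processed_metrics")         # placeholder table

# Pull the latest rows and convert them into plain dicts that a
# dashboard can render in a human-readable format.
for row_key, columns in table.scan(limit=10):
    metrics = {k.decode(): v.decode() for k, v in columns.items()}
    print(row_key.decode(), metrics)
```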

Real-time Streaming

For real-time data processing of live data streams originating from different sources such as Twitter feeds, IoT sensors and user activities on e-commerce websites.

Recommended Configuration

E-MapReduce + OSS + Log Service + MNS + RDS (MySQL) + ApsaraDB for Redis

[Diagram: real-time streaming architecture]

Alibaba Cloud E-MapReduce plugs easily into other Alibaba Cloud services such as Log Service, ONS, and MNS, which act as ingestion channels for real-time data streams. The data is streamed and processed using Apache Flume or Kafka in combination with Apache Storm and complex algorithms (Kafka is usually preferred with Apache Storm to provide the data pipeline). The final processed data is then stored in HDFS, HBase, or another big data storage service in real time.
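
The scenario above centers on Storm; as a hedged, Python-flavored stand-in for the same Kafka-fed pipeline, here is a minimal Spark Streaming consumer (Spark 1.x API; the broker address and topic name are placeholders):

```python
# Minimal Kafka -> Spark Streaming sketch (Spark 1.x API). The broker
# address and topic are placeholders; the scenario's Storm topology
# would sit at this same point in the pipeline.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="realtime-stream")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

# Count events per key in each batch; the full scenario would write
# the results on to HDFS/HBase rather than print them.
counts = stream.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```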


Getting Started

Easily set up, provision, and manage your Hadoop clusters for massive data processing using the Management Console, CLI, and APIs.

Using E-MapReduce through Management Console

Using the Alibaba Cloud Management Console, you can create and configure Hadoop clusters to match your processing requirements. You can also specify the number and types of ECS instances, as well as the applications (Spark, Hive, Hue, etc.) to be provisioned in your cluster, directly from the console.

E-MapReduce Console

E-MapReduce API Reference

You can use the E-MapReduce APIs to efficiently provision and manage Hadoop clusters.

APIs
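
A hedged sketch of an API call with the Alibaba Cloud Python SDK follows. The core AcsClient is real; the exact EMR request module and its setters are assumptions based on the SDK's usual layout, so check the API reference for the authoritative names.

```python
# Hedged sketch: list E-MapReduce clusters via the Alibaba Cloud SDK.
# The aliyunsdkemr request class and its setter are assumptions based
# on the SDK's naming conventions; credentials/region are placeholders.
from aliyunsdkcore.client import AcsClient
from aliyunsdkemr.request.v20160408.ListClustersRequest import ListClustersRequest

client = AcsClient("<access-key-id>", "<access-key-secret>", "cn-hangzhou")

request = ListClustersRequest()
request.set_PageSize(10)  # assumed setter, following SDK conventions

# Returns raw JSON describing your clusters.
response = client.do_action_with_exception(request)
print(response)
```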

Resources

Below are the links to the documentation, SDKs and other related services.


FAQs

1. What’s the difference between a job and an execution plan?

In Alibaba Cloud E-MapReduce, two steps are required to run a job:

a. Create a job:

In E-MapReduce, creating a “job” actually means creating a “job running configuration” that can then be executed in a single click. The configuration contains the JAR package to run, the input and output addresses of the data, and some running parameters. Such a configuration, given a custom name, is what E-MapReduce calls a “job”. To define resource provisioning or scheduling for a running job, an execution plan is required.

b. Create an execution plan:

An execution plan is the bond that associates jobs with a Hadoop cluster. Through an execution plan, you can combine multiple jobs into a job sequence and prepare a cluster for them (either automatically creating a temporary cluster or associating an existing one). The execution plan can also schedule the job sequence to run periodically and release the cluster automatically after the task completes. You can view execution results and logs in the execution record list.

2. How do I view a job log?

The E-MapReduce system is configured to upload a job’s running logs to OSS, organized by job ID (the path is set by users when they create the cluster). You can then view the job logs directly on the web page.

3. How do I view job logs on OSS?

You can search for all the log files on OSS directly and download them. To view job logs:

  • a. Go to the execution plan page, find the corresponding execution plan, and click “Running Log” to enter the running log page

  • b. Find the specific execution log on the running log page, such as the last execution log, and click the corresponding “Execution Cluster” to view the ID of the execution cluster

  • c. Search for the cluster ID directory under the OSS://mybucket/emr/spark directory

  • d. Under OSS://mybucket/emr/spark/cluster ID/jobs there is one directory per execution ID of the job, and each directory stores the running log files of that job (see the sketch below)
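
For example, under the assumption that the logs follow the layout above, a short oss2 script can enumerate them (endpoint, bucket, and IDs are placeholders):

```python
# Hedged sketch: list job log files under the OSS layout described
# above, using the oss2 SDK. Endpoint, bucket, and IDs are placeholders.
import oss2

auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "http://oss-cn-hangzhou.aliyuncs.com", "mybucket")

cluster_id = "C-XXXXXXXXXXXX"  # taken from the "Execution Cluster" column
prefix = "emr/spark/%s/jobs/" % cluster_id

# One sub-directory per job execution ID, each holding running logs.
for obj in oss2.ObjectIterator(bucket, prefix=prefix):
    print(obj.key)
```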

4. What are the timing policies of the cluster, execution plan and running job?

There are three timing policies as follows:

a. The timing policy of the cluster:

In the cluster list, you can see the running time of every cluster, calculated as: Running time = time when the cluster is released − time when the cluster is created. The clock starts as soon as the cluster begins to be created and runs until the end of the cluster’s lifecycle.

b. The timing policy of the execution plan:

In the running log list of the execution plan, you can see the running time of every execution, and the timing policy falls into two situations:

  • i. If the execution plan runs on demand, each execution involves creating a cluster, submitting the jobs, and releasing the cluster. So for an on-demand execution plan: Running time = time taken to create the cluster + total time for running all the jobs in the execution plan + time taken to release the cluster

  • ii. If the execution plan is associated with an existing cluster, the execution cycle involves no cluster creation or release: Running time = total time for running all the jobs in the execution plan

c. The timing policy of the job:

The jobs here are the jobs assigned to an execution plan. Click “View Job List” to the right of an execution plan’s running log to see them. The running time of each job is calculated as: Running time = actual time when the job finishes − actual time when the job starts. In other words, a job’s running time is the total time it takes to reach completion.

5. Can I view job logs on the worker nodes in E-MapReduce?

Yes. Here is the log location: Execution Plan List > Running Log > Execution Log > View Job List > Job List > View Job Worker Instance. Note that the “Save Log” option must be enabled when the cluster is created in order to view logs on worker nodes.

6. Why is there no data in an external table created in Hive?

Hive does not automatically mount the partition directories of the specified directory, so they need to be created manually.

7. Why does a Spark Streaming job stop after running for a while, for no apparent reason?

First, check whether your Spark version is earlier than 1.6. Spark 1.6 fixed a memory leak bug that caused containers to overuse memory, which kills the job. Also check whether your code is well optimized for memory usage.

8. Why is the job still in “Running” status in E-MapReduce Console when the Spark Streaming job has actually ended?

Check whether the Spark Streaming job runs in “yarn-client” mode. If so, it is recommended to change it to “yarn-cluster” mode.

9. How do I pass the AccessKeyId and AccessKeySecret parameters so that jobs can read/write OSS data?

The simplest way is to use the complete OSS URI. Please refer to: Development Manual > Development Preparation.
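
As a hedged illustration only (the exact URI scheme is described in the Development Manual; the form below is an assumption, and all values are placeholders), embedding the credentials in the OSS URI looks roughly like this:

```python
# Assumed form of a "complete OSS URI" with embedded credentials:
#   oss://<AccessKeyId>:<AccessKeySecret>@<bucket>.<endpoint>/<path>
# All values are placeholders; see the Development Manual for the
# authoritative format.
ACCESS_KEY_ID = "<access-key-id>"
ACCESS_KEY_SECRET = "<access-key-secret>"

input_uri = ("oss://%s:%s@mybucket.oss-cn-hangzhou.aliyuncs.com/data/input"
             % (ACCESS_KEY_ID, ACCESS_KEY_SECRET))

# The URI is then passed straight to the job, e.g. sc.textFile(input_uri)
# in a Spark job, instead of configuring the keys separately.
print(input_uri)
```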

10. Why am I getting an "Error: Could not find or load main class" while running EMR?

In the job configuration, check whether the path protocol header of the job’s JAR package is “ossref”. If not, change it to “ossref”.

11. When Spark SQL is connected to RDS, a ConnectionException appears. Why?

Check whether the RDS database address is an intranet address. If not, switch the database address to an intranet address in the RDS Console.

12. When Spark SQL is connected to RDS, the error “Invalid authorization specification, message from server: ip not in whitelist” appears. Why?

Check the RDS whitelist settings and add the intranet addresses of the cluster machines to the RDS whitelist.

13. How are cluster machines divided in E-MapReduce?

An E-MapReduce cluster contains one master node and multiple slave (worker) nodes. All data storage and computing tasks are handled by the slave nodes, not the master node. For example, in a cluster of three 4-core 8 GB machines, one machine serves as the master node and the other two as slave nodes, so the available computing resources are only two 4-core 8 GB machines.

14. How do I avoid memory overuse when Spark is connected to Flume?

Check whether the data receiving mode is Push-based. If not, change it to the Push-based mode for receiving data.