Set Up Elastic MapReduce to Run Big Data Applications
In this tutorial, we’ll show you how to get started with the Elastic MapReduce (E-MapReduce) Service from Alibaba Cloud.
What Is E-MapReduce?
E-MapReduce takes the grunt work out of building a Hadoop cluster solution for your big data applications. We’ll show you how easy it is to set up a cluster and run a Spark job on it with E-MapReduce.
Before you begin:
• You’ll need an Alibaba Cloud account - we’ll assume you already have one.
• You’ll need Alibaba Cloud Resource Access Management (RAM) authorization and an Access Key - we’ll show you this.
• You’ll need to buy and enable the Alibaba Cloud Object Storage Service (OSS) - we’ll show you this too.
• Whenever we need to pick a region in this tutorial, we’ll use EU Central 1 (Frankfurt).
• Anything else that comes up, we’ll show you as we go along.
Set up a Cluster
Log in to your Alibaba Cloud account and go to the console.
Scroll down and choose the E-MapReduce Service under the Analysis options.
To use the E-MapReduce service, you must have a default EMR role. If you haven’t set this up already, you will see a warning.
In that case, click Go to RAM and set one up by clicking Confirm Authorization Policy.
You now have a role set in Resource Access Management (RAM).
Now, let’s make sure we have an Access Key. Hover over your user name in the menu bar and click accesskeys in the drop-down menu.
Ignore the Security Tips pop-up window and click Continue to manage AccessKey.
Then create an access key.
Now we’ll set up Object Storage Service (OSS), which we also need for E-MapReduce. Go to the Products page, select Object Storage Service, and click Buy Now.
Agree to the terms and enable the service.
Wait while the purchase completes.
Once OSS is set up, go to the Object Storage console. Create a Bucket for E-MapReduce to use.
Fill in the Bucket details.
Now we’ll set up the log configuration on the Bucket. Click through to the Bucket you just created.
Go to the Basic Settings tab.
Scroll down and enable Log Storage.
You now have an OSS Bucket ready for E-MapReduce with the logs set up.
Go back to the console and select E-MapReduce.
This time, you should go directly to the E-MapReduce Console.
Creating the Cluster
Before clicking Create Cluster, make sure you are in the correct region. For the purposes of this tutorial, we have chosen Frankfurt.
The cluster configuration settings page will come up.
Keep the settings as they are for now. Click Next to go to the Hardware settings tab where you have to set up the following services that the cluster requires:
• A Virtual Private Cloud (VPC)
• A Virtual switch (VSwitch)
• A Security group
First, let’s set up the VPC and VSwitch.
Click Create VPC/Subnet (VSwitch), then click Create VPC. Make sure you are in the correct region.
Continue to Manage VSwitch.
Give the VSwitch a name and add the zone and CIDR settings. Then click Create VSwitch.
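When you choose the CIDR settings, the VSwitch block has to fall inside the CIDR block of the VPC it belongs to. As a rough sanity check before typing values into the console, the sketch below uses Python's standard ipaddress module. The two CIDR values and the function name are hypothetical examples, not values taken from this tutorial.

```python
import ipaddress


def vswitch_fits_vpc(vpc_cidr: str, vswitch_cidr: str) -> bool:
    """Return True if the VSwitch CIDR block lies entirely inside the VPC CIDR block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    vswitch = ipaddress.ip_network(vswitch_cidr)
    return vswitch.subnet_of(vpc)


# Hypothetical example values, chosen only for illustration
VPC_CIDR = "192.168.0.0/16"
VSWITCH_CIDR = "192.168.1.0/24"

if __name__ == "__main__":
    # True when the VSwitch range is a subset of the VPC range
    print(vswitch_fits_vpc(VPC_CIDR, VSWITCH_CIDR))
```

If the check returns False, pick a VSwitch range that sits inside the VPC range before continuing.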
Now go back to the cluster Hardware settings and add the VPC/VSwitch details you just created.
If you slide the New security group button to green, the system will create a default security group for you. Then click Next to go through to the Basic configuration tab.
On this tab, you can give the cluster a name, set the log path (the one we set up earlier in OSS), and click to authorize the instance roles. You must also set a password for the cluster.
To set the log path, click the folder at the end of the log path entry and input the details.
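The console's folder picker fills in the log path for you, but it can help to know roughly what the result looks like. The sketch below assumes the path takes the form oss://bucket/folder and that bucket names follow the usual OSS naming rule (3 to 63 characters, lowercase letters, digits and hyphens, starting and ending with a letter or digit). The function name and the example bucket and folder are hypothetical.

```python
import re

# Assumed OSS bucket naming rule: 3-63 characters, lowercase letters, digits
# and hyphens only, starting and ending with a lowercase letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def build_log_path(bucket: str, folder: str) -> str:
    """Compose an oss:// style log path from a bucket name and a folder."""
    if not _BUCKET_RE.fullmatch(bucket):
        raise ValueError(f"not a valid bucket name: {bucket!r}")
    folder = folder.strip("/")  # tolerate leading or trailing slashes
    return f"oss://{bucket}/{folder}" if folder else f"oss://{bucket}/"


if __name__ == "__main__":
    # Hypothetical bucket and folder names
    print(build_log_path("emr-tutorial-logs", "/logs/"))
```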
Don’t forget to click to authorize the instance roles. Don’t worry if you find yourself on another page unexpectedly. Just go back to the Basic configuration tab in the cluster configuration settings, where you will see the authorization in progress. Click Finished.
Create the login password and click OK to create the cluster.
Once you have clicked OK, you are sent to the E-MapReduce Console, where you will see the cluster being created. The creation process might take a while, so be patient.
Once the cluster is created, it will have an Idle status showing in the console.
Your cluster is now ready for work!
Create a Job
As an example, we will run the SparkPi calculation on the cluster. To do this, we must first create a job. On the E-MapReduce Console, select Jobs and click Create job.
Enter the job configuration settings. Give the job a name and select Spark as the job type. Then, add the following parameters:
--class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 512m --num-executors 1 --executor-memory 1g --executor-cores 2 /usr/lib/spark-current/examples/jars/spark-examples*.jar 10
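These parameters tell Spark to run the SparkPi class from the spark-examples jar on YARN in client mode, using a 512 MB driver and a single executor with 1 GB of memory and 2 cores. The trailing 10 is passed to SparkPi itself and sets how many partitions the calculation is split into.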
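To see how the parameter string breaks down, here is a small sketch in Python that separates the Spark options from the jar path and the arguments passed on to the application. It relies on the fact that every option in this particular string takes exactly one value; it is not a general parser for Spark parameters, and the function name is our own.

```python
import shlex

# The job parameters from this tutorial, copied verbatim
PARAMS = (
    "--class org.apache.spark.examples.SparkPi --master yarn-client "
    "--driver-memory 512m --num-executors 1 --executor-memory 1g "
    "--executor-cores 2 "
    "/usr/lib/spark-current/examples/jars/spark-examples*.jar 10"
)


def split_params(params: str):
    """Split the string into (options, jar, application arguments)."""
    tokens = shlex.split(params)
    options = {}
    i = 0
    # Assumption: each "--option" here is followed by exactly one value
    while i < len(tokens) and tokens[i].startswith("--"):
        options[tokens[i]] = tokens[i + 1]
        i += 2
    jar, app_args = tokens[i], tokens[i + 1:]
    return options, jar, app_args


if __name__ == "__main__":
    options, jar, app_args = split_params(PARAMS)
    for name, value in options.items():
        print(f"{name:18} {value}")
    print("jar:", jar)
    print("application arguments:", app_args)
```

Everything before the jar path configures Spark; everything after it is handed to the SparkPi program.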
Your SparkPi job is now ready to run on the cluster.
Create an Execution Plan and Run Your Job
To run a job on the cluster, you have to set up an Execution plan. Go to Execution plan and click Create Execution plan.
Choose to use an existing cluster (i.e., the one we just set up).
Select the cluster and click Next.
You will see a list of jobs to choose from. Select the job(s) you want to run.
Move the job(s) to the configured job column and click Next.
Give the Execution plan a name and click OK.
Back in the Execution plan page, click Run now.
Go ahead and execute. The Execution plan will run all the jobs in its job list.
We can see that the Execution plan is running. Let’s check the logs to see whether it ran the job successfully. Click More, then choose Running log from the drop-down menu.
On the Running log page, click View job list.
You will see a list of the jobs running in the Execution plan. For this example, stdout should print the result of the SparkPi calculation. Click stdout.
Spark has correctly approximated Pi with our cluster.
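For intuition about what the job computes: the SparkPi example that ships with Spark estimates Pi with a Monte Carlo method, picking random points in a square and counting how many land inside the inscribed circle. The sketch below does the same thing on a single machine in plain Python, without Spark. The function name, sample count and seed are our own choices for illustration.

```python
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate Pi by sampling random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count the points that fall inside the unit circle
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle covers Pi/4 of the square's area
    return 4.0 * inside / samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(1_000_000))
```

On the cluster, Spark spreads this sampling across partitions (the trailing 10 in the job parameters) and adds up the counts, so the estimate varies slightly from run to run.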
That’s it! The cluster has done its job and is now idle again.
If you find you don’t need your cluster anymore, simply go over to the cluster details page in the E-MapReduce Console and release it.
We have shown you how to set up the Alibaba Cloud E-MapReduce service for running big data applications, starting with a few minor preparatory steps that you only have to do once.
First, we activated an OSS Bucket for our cluster to use. Then we built a cluster and configured it to run an example job, the SparkPi calculation.
Head over to your console in Alibaba Cloud and use this tutorial as a guide to get started. We have shown you how quick and easy it is to set up a cluster and run an application on it with Alibaba Cloud E-MapReduce.
Now you can build your own big data applications with Alibaba Cloud without needing the command line or worrying about the low-level configuration details.
Be sure to check online for more whitepapers, blogs, and tutorials on other Alibaba Cloud products.