Quick start: Create and use a Data Lake cluster - E-MapReduce

Overview

This quick start shows you how to:

Quickly create a Data Lake cluster.
Submit and run a WordCount job by using a cluster client.
Understand the core features of Alibaba Cloud EMR and the basic usage of the Hadoop ecosystem.

Prerequisites

You have created an Alibaba Cloud account and completed real-name verification.
You have granted the default EMR and ECS roles to the E-MapReduce service account. For more information, see Role authorization.

Precautions

You are responsible for managing and configuring the runtime environment for your code.

Procedure

Step 1: Create a cluster

Go to the Create Cluster page.
1. Log on to the EMR on ECS console.
2. In the top navigation bar, select a region and a resource group based on your business requirements.
  - Region: Specifies the region in which to create the cluster. The region cannot be changed after the cluster is created.
  - Resource group: Displays all resources in your account by default.
3. In the upper-left corner, click Create Cluster.

On the Create Cluster page, configure parameters for the cluster.

Section	Parameter	Example	Description
Software Configuration	Region	China (Hangzhou)	The physical location of the ECS instances for the cluster nodes. Important: You cannot change the region after a cluster is created. Select the region carefully.
	Business Scenario	Data Lake	Select a scenario to allow EMR to automatically configure default components, services, and resources. This simplifies cluster setup and provides an environment tailored to the specified use case.
	Product Version	EMR-5.18.1	Select the latest EMR version.
	High Service Availability	Disabled	This feature is disabled by default. If you enable High Service Availability, EMR distributes the master nodes across different underlying hardware to reduce the risk of failure.
	Optional Services	HADOOP-COMMON, OSS-HDFS, YARN, Hive, Spark3, Tez, Knox, and OpenLDAP.	Select services based on your business requirements. By default, EMR starts the service processes for your selected services. Note: To access the web UIs of services from the console, you must also select the Knox and OpenLDAP services.
	Collect Service Operational Logs	Enabled	Specifies whether to enable log collection for all services. By default, this switch is turned on to collect the service operational logs of your cluster. The logs are used only for cluster diagnostics. After you create a cluster, you can modify the Collection Status of Service Operational Logs parameter on the Basic Information tab. Important: If you turn off this switch, EMR cluster health checks and service-related technical support are limited. For more information about how to disable log collection and the impact of doing so, see How do I stop collecting service logs?
	Metadata	Built-in MySQL	Stores metadata in the built-in MySQL database. Important: The built-in MySQL database allows you to quickly set up a test environment but is not recommended for production environments. For production environments, use a self-managed ApsaraDB RDS instance or Data Lake Formation (DLF) for unified metadata management based on your business requirements.
	Root Storage Directory of Cluster	oss://******.cn-hangzhou.oss-dls.aliyuncs.com	The root storage directory of cluster data. This parameter is required only if you select the OSS-HDFS service. Note: Before you use the OSS-HDFS service, make sure that the OSS-HDFS service is available in the region in which you want to create a cluster. If the OSS-HDFS service is unavailable in the region, you can change the region or use HDFS instead of OSS-HDFS. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions. You can select the OSS-HDFS service when you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster that runs EMR V5.12.1 or a later minor version, or EMR V3.46.1 or a later minor version.
Hardware Configuration	Billing Method	Pay-as-you-go	For testing, use the Pay-as-you-go billing method. After your tests are successful, you can release the test cluster and create a new cluster that uses the Subscription billing method for production.
	Zone	Zone I	You cannot change the zone after a cluster is created. Select the zone carefully.
	VPC	vpc_Hangzhou/vpc-bp1f4epmkvncimpgs****	Select a VPC in the current region. If no VPC is available, click Create VPC to create one. After you create the VPC, click the Refresh icon to select it.
	vSwitch	vsw_i/vsw-bp1e2f5fhaplp0g6p****	Select a vSwitch in the specified zone of the selected VPC. If no vSwitch is available in the zone, you must create one.
	Default Security Group	sg_seurity/sg-bp1ddw7sm2risw****	You can select an existing security group or create a new one. Important: EMR does not support enterprise security groups created in the ECS console.
	Node Group	Turn on the Assign Public Network IP switch for the master node group. You can retain the default values for the other parameters.	You can configure the master, core, and task node groups based on your business requirements. For more information, see Configure hardware and a network.
Basic Configuration	Cluster Name	Emr-DataLake	The cluster name, which must be 1 to 64 characters long and can contain letters, digits, hyphens (-), underscores (_), and Chinese characters.
	Identity Credentials	Password	Allows you to remotely log on to the cluster's master node. Note: If you want to use password-free authentication, you can select Key Pair. For more information, see Manage SSH key pairs.
	Password and Confirm Password	A custom password.	Record the password. You will need it to log on to the cluster.

Click Confirm.

On the EMR on ECS page, the cluster is ready when its Status changes to Running. For more information about cluster parameters, see Create a cluster.

Step 2: Prepare data

After you create the cluster, you can run a data analysis test by using the pre-installed WordCount sample program. You can also upload and run your own big data applications. This topic uses the WordCount program to demonstrate the process, from data preparation to job submission.

Connect to the cluster over SSH. For more information, see Log on to a cluster.
Prepare the data file.

Create a text file named wordcount.txt as the input data for the WordCount job. The file must contain the following content:
```
hello world
hello wordcount
```
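One way to create this file from the shell is a here-document. This is a minimal sketch: it writes wordcount.txt to the current directory and overwrites any existing file with that name.

```shell
# Write the two sample lines to wordcount.txt in the current directory.
# The quoted 'EOF' delimiter keeps the shell from expanding anything in the body.
cat > wordcount.txt <<'EOF'
hello world
hello wordcount
EOF

# Print the file to confirm its content.
cat wordcount.txt
```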
Upload the data file.

Note
You can upload the data file to the HDFS, OSS, or OSS-HDFS service of the cluster based on your business requirements. This topic uses the OSS-HDFS service as an example. To upload a file to OSS, see Simple upload.
1. Run the following command to create a directory named input:
```
hadoop fs -mkdir oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/
```
2. Run the following command to upload the wordcount.txt file from the current local directory to the input directory in OSS-HDFS:
```
hadoop fs -put wordcount.txt oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/
```
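Because the same OSS-HDFS endpoint appears in every command in this topic, it can help to build it once from shell variables. The following is a sketch under stated assumptions: `examplebucket` is a hypothetical bucket name, the variable names are illustrative, and the `-p` and `-f` flags (create parent directories, overwrite an existing file) are optional additions to the commands above. The Hadoop commands run only if the `hadoop` CLI is on the PATH.

```shell
# Hypothetical placeholder values; replace them with your own bucket and region ID.
BUCKET="examplebucket"
REGION="cn-hangzhou"

# Build the OSS-HDFS root and the input directory used in this topic.
OSS_ROOT="oss://${BUCKET}.${REGION}.oss-dls.aliyuncs.com"
INPUT_DIR="${OSS_ROOT}/input/"
echo "Input directory: ${INPUT_DIR}"

# Run the upload only where the hadoop CLI is available (for example, on the master node).
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir -p "${INPUT_DIR}"               # -p: no error if the directory already exists
  hadoop fs -put -f wordcount.txt "${INPUT_DIR}"   # -f: overwrite a previous upload
  hadoop fs -ls "${INPUT_DIR}"                     # confirm that wordcount.txt is listed
fi
```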

Step 3: Submit a job

You can use the WordCount program to analyze text data and count word frequencies.

Run the following command to submit the WordCount job:

hadoop jar /opt/apps/HDFS/hadoop-3.2.1-1.2.16-alinux3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount -D mapreduce.job.reduces=1 "oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/wordcount.txt" "oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/"

The following table describes the parameters in the command.

Parameter	Description
`/opt/apps/.../hadoop-mapreduce-examples-3.2.1.jar`	The sample program package that comes with Hadoop. This package includes multiple classic MapReduce sample programs. In this example, `hadoop-mapreduce-examples-3.2.1.jar` is the name of the JAR file in your cluster, and 3.2.1 is the version number. The version number is typically 3.2.1 for clusters of the EMR 5.x series and 2.8.5 for clusters of the EMR 3.x series.
`-D mapreduce.job.reduces`	Sets the number of reducers for the MapReduce job. By default, Hadoop automatically determines the number of reducers based on the input data size. If you do not specify the number of reducers, multiple output files, such as `part-r-00000` and `part-r-00001`, may be generated. By setting this parameter to 1, you can ensure that only one output file named `part-r-00000` is generated.
`oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/wordcount.txt`	The input path for the WordCount job. This is the path to the data file that you uploaded to OSS-HDFS. Replace `<yourBucketname>` with the name of your OSS bucket and `cn-hangzhou` with the ID of the region in which the bucket resides.
`oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/`	The output path where the WordCount job stores its results.
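
Because the JAR path and version number differ between EMR versions, you may prefer to look up the JAR rather than type its path. The following sketch assumes that Hadoop is installed under /opt/apps/HDFS, as in the command above. The function name `find_examples_jar` is hypothetical, and the search root can be overridden if your cluster uses a different layout.

```shell
# Hypothetical helper: print the path of the first MapReduce examples JAR
# found under the given root (defaults to /opt/apps/HDFS, the prefix used
# in the command above). Source JARs are skipped.
find_examples_jar() {
  find "${1:-/opt/apps/HDFS}" -type f \
    -path '*/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar' \
    ! -name '*sources*' 2>/dev/null | sort | head -n 1
}

EXAMPLES_JAR="$(find_examples_jar)"
if [ -n "$EXAMPLES_JAR" ]; then
  echo "Examples JAR: $EXAMPLES_JAR"
else
  echo "No examples JAR found; check the installation path on your cluster." >&2
fi
```

You can then pass `"$EXAMPLES_JAR"` to `hadoop jar` in place of the hard-coded path. If you rerun the job, note that MapReduce does not write to an output directory that already exists, so remove the previous output first, for example with `hadoop fs -rm -r` on the output path.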

Step 4: View the results

Job output

You can use a Hadoop shell command to view the job output.

Connect to the cluster over SSH. For more information, see Log on to a cluster.

Run the following command to view the job output:

hadoop fs -cat oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/part-r-00000

The following output is returned:

hello	2
wordcount	1
world	1
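
To sanity-check the result without the cluster, the same counts can be reproduced locally with standard Unix tools. This is only an illustrative sketch, not how the MapReduce job is implemented: the function name `local_wordcount` is hypothetical, and the pipeline splits on whitespace and sorts keys in byte order to approximate the job's output format.

```shell
# Hypothetical helper: read text on stdin and print "word<TAB>count" lines.
local_wordcount() {
  tr -s '[:space:]' '\n' |   # put each whitespace-separated token on its own line
    sed '/^$/d' |            # drop empty lines caused by leading whitespace
    LC_ALL=C sort |          # byte-order sort so that equal words are adjacent
    uniq -c |                # count each distinct word
    awk '{ printf "%s\t%s\n", $2, $1 }'   # reorder to word<TAB>count
}

# Feed in the same two lines as wordcount.txt.
printf 'hello world\nhello wordcount\n' | local_wordcount
```

For the sample input, this prints the same three lines as the job output shown above.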

Job history

YARN is Hadoop's resource management framework for scheduling and managing cluster tasks. You can use the YARN UI to view the status and history of jobs to review their execution details.

Open port 8443. For more information, see Manage security groups.
Add a user. For more information, see OpenLDAP user management.

When you access the YARN UI page, you must use the username and password of a Knox account.
On the EMR on ECS page, find your cluster and click Services in the Actions column.
Click the Access Links and Ports tab.
In the YARN UI row, click the link in the Public URL column.

Authenticate with your user credentials to access the YARN UI page.
On the All Applications page, click the ID of a job to view its details.

The top of this page displays Cluster Metrics (such as Apps Submitted, Apps Running, and Memory Used) and Cluster Nodes Metrics. The bottom of the page lists applications with columns such as ID, Name, Application Type, State, and FinalStatus.
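
If you are already logged on to the master node, the YARN command-line client offers a rough alternative to the web UI for checking past jobs. The sketch below assumes that the `yarn` CLI is on the PATH of the node. The function name `list_finished_apps` is hypothetical.

```shell
# Hypothetical helper: list the YARN applications that have finished.
list_finished_apps() {
  yarn application -list -appStates FINISHED
}

# Call the helper only where the yarn CLI exists (for example, on a cluster node).
if command -v yarn >/dev/null 2>&1; then
  list_finished_apps
else
  echo "yarn CLI not found; run this on a cluster node." >&2
fi
```

The listing includes the application ID, name, type, and final state, which correspond to the columns shown on the All Applications page.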

(Optional) Step 5: Release the cluster

If you no longer need the cluster, release it to avoid incurring further charges. After you confirm the release, the system performs the following operations:

Forcibly terminates all jobs on the cluster.
Terminates and releases all ECS instances.

The time required to release a cluster depends on the cluster size. Smaller clusters release faster. The release process typically completes within a few seconds and does not exceed 5 minutes.

Important

You can release Pay-as-you-go clusters at any time. Subscription clusters can be released only after their subscriptions expire.
Before you release a cluster, ensure the cluster is in the Initializing, Running, or Idle state.

On the EMR on ECS page, find the cluster that you want to release, and choose Release from the menu in the Actions column.

Alternatively, click the cluster name. On the Basic Information tab, choose All operations > Release in the upper-right corner.
In the dialog box that appears, click OK.

FAQ

For common issues with Alibaba Cloud EMR, see FAQ.
