This topic describes how to use the E-MapReduce (EMR) console to quickly create a Data Lake cluster based on the open-source Hadoop ecosystem and submit a classic WordCount job by using a cluster client. WordCount is a fundamental distributed computing task in Hadoop that counts word occurrences in large text files. It is widely used in data analysis, data mining, and other scenarios.
Overview
This quick start shows you how to:
-
Quickly create a Data Lake cluster.
-
Submit and run a WordCount job by using a cluster client.
-
Understand the core features of Alibaba Cloud EMR and the basic usage of the Hadoop ecosystem.
Prerequisites
-
You have created an Alibaba Cloud account and completed real-name verification.
-
You have granted the default EMR and ECS roles to the E-MapReduce service account. For more information, see Role authorization.
Precautions
You are responsible for managing and configuring the runtime environment for your code.
Procedure
Step 1: Create a cluster
-
Go to the Create Cluster page.
-
Log on to the EMR on ECS console.
-
In the top navigation bar, select a region and a resource group based on your business requirements.
-
Region: specifies the region in which to create the cluster. The region cannot be changed after the cluster is created.
-
Resource group: displays all resources in your account by default.
-
-
In the upper-left corner, click Create Cluster.
-
-
On the Create Cluster page, configure parameters for the cluster.
| Section | Parameter | Example | Description |
| --- | --- | --- | --- |
| Software Configuration | Region | China (Hangzhou) | The physical location of the ECS instances for the cluster nodes. Important: You cannot change the region after a cluster is created. Select the region carefully. |
| Software Configuration | Business Scenario | Data Lake | Select a scenario to allow EMR to automatically configure default components, services, and resources. This simplifies cluster setup and provides an environment tailored to the specified use case. |
| Software Configuration | Product Version | EMR-5.18.1 | Select the latest EMR version. |
| Software Configuration | High Service Availability | Disabled | This feature is disabled by default. If you enable High Service Availability, EMR distributes the master nodes across different underlying hardware to reduce the risk of failure. |
| Software Configuration | Optional Services | HADOOP-COMMON, OSS-HDFS, YARN, Hive, Spark3, Tez, Knox, and OpenLDAP | Select services based on your business requirements. By default, EMR starts the service processes for your selected services. Note: To access the web UIs of services from the console, you must also select the Knox and OpenLDAP services. |
| Software Configuration | Collect Service Operational Logs | Enabled | Specifies whether to enable log collection for all services. By default, this switch is turned on to collect the service operational logs of your cluster. The logs are used only for cluster diagnostics. After you create a cluster, you can modify the Collection Status of Service Operational Logs parameter on the Basic Information tab. Important: If you turn off this switch, the EMR cluster health check and service-related technical support are limited. For more information about how to disable log collection and the impact of doing so, see How do I stop collecting service logs? |
| Software Configuration | Metadata | Built-in MySQL | Stores metadata in the built-in MySQL database. Important: The built-in MySQL database allows you to quickly set up a test environment but is not recommended for production environments. For production environments, use a self-managed ApsaraDB RDS instance or Data Lake Formation (DLF) for unified metadata management based on your business requirements. |
| Software Configuration | Root Storage Directory of Cluster | oss://******.cn-hangzhou.oss-dls.aliyuncs.com | The root storage directory of cluster data. This parameter is required only if you select the OSS-HDFS service. Note: Before you use the OSS-HDFS service, make sure that it is available in the region in which you want to create the cluster. If it is unavailable in that region, change the region or use HDFS instead of OSS-HDFS. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions. You can select the OSS-HDFS service when you create a DataLake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster of EMR V5.12.1, EMR V3.46.1, or a minor version later than EMR V5.12.1 or EMR V3.46.1. |
| Hardware Configuration | Billing Method | Pay-as-you-go | For testing, use the Pay-as-you-go billing method. After your tests are successful, you can release the test cluster and create a new cluster that uses the Subscription billing method for production. |
| Hardware Configuration | Zone | Zone I | You cannot change the zone after a cluster is created. Select the zone carefully. |
| Hardware Configuration | VPC | vpc_Hangzhou/vpc-bp1f4epmkvncimpgs**** | Select a VPC in the current region. If no VPC is available, click Create VPC to create one. After you create the VPC, click the Refresh icon to select it. |
| Hardware Configuration | vSwitch | vsw_i/vsw-bp1e2f5fhaplp0g6p**** | Select a vSwitch in the specified zone of the selected VPC. If no vSwitch is available in the zone, you must create one. |
| Hardware Configuration | Default Security Group | sg_seurity/sg-bp1ddw7sm2risw**** | You can select an existing security group or create a new one. Important: EMR does not support enterprise security groups created in the ECS console. |
| Hardware Configuration | Node Group | Turn on the Assign Public Network IP switch for the master node group. You can retain the default values for the other parameters. | You can configure the master, core, and task node groups based on your business requirements. For more information, see Configure hardware and a network. |
| Basic Configuration | Cluster Name | Emr-Data Lake | The cluster name, which must be 1 to 64 characters long and can contain letters, digits, hyphens (-), underscores (_), and Chinese characters. |
| Basic Configuration | Identity Credentials | Password | Allows you to remotely log on to the cluster's master node. Note: If you want to use password-free authentication, you can select Key Pair. For more information, see Manage SSH key pairs. |
| Basic Configuration | Password and Confirm Password | A custom password. | Record the password. You will need it to log on to the cluster. |
-
Click Confirm.
On the EMR on ECS page, the cluster is ready when its Status changes to Running. For more information about cluster parameters, see Create a cluster.
Step 2: Prepare data
After you create the cluster, you can run a data analysis test by using the pre-installed WordCount sample program. You can also upload and run your own big data applications. This topic uses the WordCount program to demonstrate the process, from data preparation to job submission.
-
Connect to the cluster over SSH. For more information, see Log on to a cluster.
-
Prepare the data file.
Create a text file named wordcount.txt as the input data for the WordCount job. The file must contain the following content:
hello world
hello wordcount
-
Upload the data file.
Note: You can upload the data file to the HDFS, OSS, or OSS-HDFS service of the cluster based on your business requirements. This topic uses the OSS-HDFS service as an example. To upload a file to OSS, see Simple upload.
-
Run the following command to create a directory named input:
hadoop fs -mkdir oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/
-
Run the following command to upload the wordcount.txt file from the current local directory to the input directory in OSS-HDFS:
hadoop fs -put wordcount.txt oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/
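The data preparation steps in this section can be collected into one script. The following is a minimal sketch rather than an official EMR tool. The BUCKET value is a placeholder, the -p and -f flags are additions that make the script safe to rerun, and the hadoop calls run only when the client is found on PATH (that is, on a cluster node).

```shell
#!/bin/sh
set -eu

# Placeholder (assumption): replace with your own OSS-HDFS bucket name.
BUCKET="yourBucketname"
INPUT_DIR="oss://${BUCKET}.cn-hangzhou.oss-dls.aliyuncs.com/input/"

# Create the two-line sample input used in this topic.
printf 'hello world\nhello wordcount\n' > wordcount.txt

# Upload only when the hadoop client is available.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir -p "${INPUT_DIR}"              # -p: no error if the directory already exists
  hadoop fs -put -f wordcount.txt "${INPUT_DIR}"  # -f: overwrite an existing copy
  hadoop fs -ls "${INPUT_DIR}"                    # confirm that the file is listed
else
  echo "hadoop client not found; created wordcount.txt locally only" >&2
fi
```

Listing the directory at the end is a quick way to confirm the upload before you submit the job.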
Step 3: Submit a job
You can use the WordCount program to analyze text data and count word frequencies.
Run the following command to submit the WordCount job:
hadoop jar /opt/apps/HDFS/hadoop-3.2.1-1.2.16-alinux3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount -D mapreduce.job.reduces=1 "oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/wordcount.txt" "oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/"
The following table describes the parameters in the command.
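| Parameter | Description |
| --- | --- |
| hadoop-mapreduce-examples-3.2.1.jar | The sample program package that comes with Hadoop. This package includes multiple classic MapReduce sample programs. In this example, the wordcount program is run. |
| -D mapreduce.job.reduces=1 | Sets the number of reducers for the MapReduce job. By default, Hadoop automatically determines the number of reducers based on the input data size. If you do not specify the number of reducers, multiple output files may be generated. |
| oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/wordcount.txt | The input path for the WordCount job. This is the path to the data file in OSS. Replace <yourBucketname> with the name of your bucket. |
| oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/ | The output path where the WordCount job stores its results. |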
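MapReduce jobs fail if the output directory already exists, so a second run of the same command needs a clean output path. The following sketch, under stated assumptions, assembles the command from the parameters described above and removes an existing output directory before submitting. BUCKET is a placeholder, the JAR path is copied from the command in this topic and may differ on other EMR versions, and the cleanup step is an addition that deletes data, so use it only on test output.

```shell
#!/bin/sh
set -eu

# Placeholder (assumption): replace with your own OSS-HDFS bucket name.
BUCKET="yourBucketname"
BASE="oss://${BUCKET}.cn-hangzhou.oss-dls.aliyuncs.com"
INPUT="${BASE}/input/wordcount.txt"
OUTPUT="${BASE}/output/"
# JAR path taken from the command in this topic; verify it on your cluster.
EXAMPLES_JAR="/opt/apps/HDFS/hadoop-3.2.1-1.2.16-alinux3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar"

# Store the command as positional parameters so each argument stays one word.
set -- hadoop jar "${EXAMPLES_JAR}" wordcount -D mapreduce.job.reduces=1 "${INPUT}" "${OUTPUT}"

if command -v hadoop >/dev/null 2>&1; then
  # Remove a leftover output directory from an earlier run, if any.
  if hadoop fs -test -d "${OUTPUT}"; then
    hadoop fs -rm -r "${OUTPUT}"
  fi
  "$@"
else
  # Off-cluster: print the command instead of running it.
  echo "$*"
fi
```

Keeping the bucket name in one variable avoids editing the input and output paths separately.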
Step 4: View the results
Job output
You can use a Hadoop shell command to view the job output.
-
Connect to the cluster over SSH. For more information, see Log on to a cluster.
-
Run the following command to view the job output:
hadoop fs -cat oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/part-r-00000
The following output is returned:
hello 2
wordcount 1
world 1
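To cross-check the job output without the cluster, the same counts can be reproduced locally with standard Unix tools. This is an illustrative sketch, not part of EMR. It assumes the sample input from Step 2, splits on whitespace as the Hadoop WordCount example does, and sorts in byte order to match the order of keys in the single-reducer output.

```shell
#!/bin/sh
set -eu
export LC_ALL=C   # byte-order sorting, so the result order is deterministic

# Recreate the sample input from Step 2.
printf 'hello world\nhello wordcount\n' > wordcount.txt

# Put one word per line, count each distinct word, and print "word<TAB>count".
EXPECTED="$(tr -s '[:space:]' '\n' < wordcount.txt | sort | uniq -c | awk '{print $2 "\t" $1}')"
printf '%s\n' "${EXPECTED}"
```

For this input, the script prints hello 2, wordcount 1, and world 1 on separate lines, which should match the content of part-r-00000.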
Job history
YARN is Hadoop's resource management framework for scheduling and managing cluster tasks. You can use the YARN UI to view the status and history of jobs to review their execution details.
-
Open port 8443. For more information, see Manage security groups.
-
Add a user. For more information, see OpenLDAP user management.
When you access the YARN UI page, you must use the username and password of a Knox account.
-
On the EMR on ECS page, find your cluster and click Services in the Actions column.
-
Click the Access Links and Ports tab.
-
In the YARN UI row, click the link in the Public URL column.
Authenticate with your user credentials to access the YARN UI page.
-
On the All Applications page, click the ID of a job to view its details.
The top of this page displays Cluster Metrics (such as Apps Submitted, Apps Running, and Memory Used) and Cluster Nodes Metrics. The bottom of the page lists applications with columns such as ID, Name, Application Type, State, and FinalStatus.
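As an alternative to the YARN UI, job history can also be inspected from the command line on a cluster node. The sketch below wraps two standard YARN CLI commands in helper functions. The function names are hypothetical, and the yarn logs command returns output only if log aggregation is enabled on the cluster, which is an assumption here.

```shell
#!/bin/sh
set -eu

# List finished YARN applications, or report that the client is unavailable.
list_finished_apps() {
  if command -v yarn >/dev/null 2>&1; then
    yarn application -list -appStates FINISHED
  else
    echo "yarn client not found; run this on a cluster node" >&2
    return 1
  fi
}

# Print the aggregated logs of one application.
# $1: application ID as shown in the ID column, for example application_<timestamp>_<sequence>.
show_app_logs() {
  yarn logs -applicationId "$1"
}

list_finished_apps || true
```

Pass an ID from the list to show_app_logs to read the logs of the WordCount job.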
(Optional) Step 5: Release the cluster
If you no longer need the cluster, release it to avoid incurring further charges. After you confirm the release, the system performs the following operations:
-
Forcibly terminates all jobs on the cluster.
-
Terminates and releases all ECS instances.
The time required to release a cluster depends on the cluster size. Smaller clusters release faster. The release process typically completes within a few seconds and does not exceed 5 minutes.
-
You can release Pay-as-you-go clusters at any time. Subscription clusters can be released only after their subscriptions expire.
-
Before you release a cluster, ensure the cluster is in the Initializing, Running, or Idle state.
-
On the EMR on ECS page, find the cluster to release and choose Release from the menu in the Actions column.
Alternatively, click the cluster name. On the Basic Information tab, choose Release from the menu in the upper-right corner.
-
In the dialog box that appears, click OK.
Related documentation
-
Paths of commonly used files: Find the installation paths of commonly used files.
-
API overview: Use API operations to manage clusters and services.
FAQ
For common issues with Alibaba Cloud EMR, see FAQ.