E-MapReduce:Quick start: Create and use a Data Lake cluster

Last Updated: Jun 20, 2026

This topic describes how to use the E-MapReduce (EMR) console to quickly create a Data Lake cluster based on the open-source Hadoop ecosystem and submit a classic WordCount job by using a cluster client. WordCount is a fundamental distributed computing task in Hadoop that counts word occurrences in large text files. It is widely used in data analysis, data mining, and other scenarios.

Overview

This quick start shows you how to:

  • Quickly create a Data Lake cluster.

  • Submit and run a WordCount job by using a cluster client.

  • Understand the core features of Alibaba Cloud EMR and the basic usage of the Hadoop ecosystem.

Prerequisites

  • You have created an Alibaba Cloud account and completed real-name verification.

  • You have granted the default EMR and ECS roles to the E-MapReduce service account. For more information, see Role authorization.

Precautions

You are responsible for managing and configuring the runtime environment for your code.

Procedure

Step 1: Create a cluster

  1. Go to the Create Cluster page.

    1. Log on to the EMR on ECS console.

    2. In the top navigation bar, select a region and a resource group based on your business requirements.

      • Region: specifies the region in which to create the cluster. The region cannot be changed after the cluster is created.

      • Resource group: by default, all resources in your account are displayed.

    3. In the upper-left corner, click Create Cluster.

  2. On the Create Cluster page, configure parameters for the cluster.

    | Section | Parameter | Example | Description |
    | --- | --- | --- | --- |
    | Software Configuration | Region | China (Hangzhou) | The physical location of the ECS instances for the cluster nodes. Important: You cannot change the region after a cluster is created. Select the region carefully. |
    | Software Configuration | Business Scenario | Data Lake | Select a scenario to allow EMR to automatically configure default components, services, and resources. This simplifies cluster setup and provides an environment tailored to the specified use case. |
    | Software Configuration | Product Version | EMR-5.18.1 | Select the latest EMR version. |
    | Software Configuration | High Service Availability | Disabled | This feature is disabled by default. If you enable High Service Availability, EMR distributes the master nodes across different underlying hardware to reduce the risk of failure. |
    | Software Configuration | Optional Services | HADOOP-COMMON, OSS-HDFS, YARN, Hive, Spark3, Tez, Knox, and OpenLDAP | Select services based on your business requirements. By default, EMR starts the service processes for your selected services. Note: To access the web UIs of services from the console, you must also select the Knox and OpenLDAP services. |
    | Software Configuration | Collect Service Operational Logs | Enabled | Specifies whether to enable log collection for all services. By default, this switch is turned on to collect the service operational logs of your cluster. The logs are used only for cluster diagnostics. After you create a cluster, you can modify the Collection Status of Service Operational Logs parameter on the Basic Information tab. Important: If you turn off this switch, the EMR cluster health check and service-related technical support are limited. For more information about how to disable log collection and the impact of doing so, see How do I stop collecting service logs? |
    | Software Configuration | Metadata | Built-in MySQL | Stores metadata in the built-in MySQL database. Important: The built-in MySQL database allows you to quickly set up a test environment but is not recommended for production environments. For production environments, use a self-managed ApsaraDB RDS instance or Data Lake Formation (DLF) for unified metadata management based on your business requirements. |
    | Software Configuration | Root Storage Directory of Cluster | oss://******.cn-hangzhou.oss-dls.aliyuncs.com | The root storage directory of cluster data. This parameter is required only if you select the OSS-HDFS service. Note: Before you use the OSS-HDFS service, make sure that it is available in the region in which you want to create the cluster. If it is unavailable in that region, change the region or use HDFS instead of OSS-HDFS. For more information about the regions in which OSS-HDFS is available, see Enable OSS-HDFS and grant access permissions. You can select the OSS-HDFS service when you create a Data Lake cluster in the new data lake scenario, a Dataflow cluster, a DataServing cluster, or a custom cluster that runs EMR V5.12.1 or a later minor version, or EMR V3.46.1 or a later minor version. |
    | Hardware Configuration | Billing Method | Pay-as-you-go | For testing, use the Pay-as-you-go billing method. After your tests are successful, you can release the test cluster and create a new cluster that uses the Subscription billing method for production. |
    | Hardware Configuration | Zone | Zone I | You cannot change the zone after a cluster is created. Select the zone carefully. |
    | Hardware Configuration | VPC | vpc_Hangzhou/vpc-bp1f4epmkvncimpgs**** | Select a VPC in the current region. If no VPC is available, click Create VPC to create one. After you create the VPC, click the Refresh icon to select it. |
    | Hardware Configuration | vSwitch | vsw_i/vsw-bp1e2f5fhaplp0g6p**** | Select a vSwitch in the specified zone of the selected VPC. If no vSwitch is available in the zone, you must create one. |
    | Hardware Configuration | Default Security Group | sg_seurity/sg-bp1ddw7sm2risw**** | You can select an existing security group or create a new one. Important: EMR does not support enterprise security groups created in the ECS console. |
    | Hardware Configuration | Node Group | Turn on the Assign Public Network IP switch for the master node group. You can retain the default values for the other parameters. | You can configure the master, core, and task node groups based on your business requirements. For more information, see Configure hardware and a network. |
    | Basic Configuration | Cluster Name | Emr-DataLake | The cluster name, which must be 1 to 64 characters long and can contain letters, digits, hyphens (-), underscores (_), and Chinese characters. |
    | Basic Configuration | Identity Credentials | Password | Allows you to remotely log on to the cluster's master node. Note: If you want to use password-free authentication, you can select Key Pair. For more information, see Manage SSH key pairs. |
    | Basic Configuration | Password and Confirm Password | A custom password. | Record the password. You will need it to log on to the cluster. |

  3. Click Confirm.

    On the EMR on ECS page, the cluster is ready when its Status changes to Running. For more information about cluster parameters, see Create a cluster.

Step 2: Prepare data

After you create the cluster, you can run a data analysis test by using the pre-installed WordCount sample program. You can also upload and run your own big data applications. This topic uses the WordCount program to demonstrate the process, from data preparation to job submission.

  1. Connect to the cluster over SSH. For more information, see Log on to a cluster.

  2. Prepare the data file.

    Create a text file named wordcount.txt as the input data for the WordCount job. The file must contain the following content:

    hello world
    hello wordcount
  3. Upload the data file.

    Note

    You can upload the data file to the HDFS, OSS, or OSS-HDFS service of the cluster based on your business requirements. This topic uses the OSS-HDFS service as an example. To upload a file to OSS, see Simple upload.

    1. Run the following command to create a directory named input:

      hadoop fs -mkdir oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/
    2. Run the following command to upload the wordcount.txt file from the current local directory to the input directory in OSS-HDFS:

      
      hadoop fs -put wordcount.txt oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/
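
The data preparation steps above can be combined into one script. The following is a minimal sketch, not an official EMR tool: BUCKET and REGION are hypothetical environment variables introduced for illustration, and the upload part runs only when the hadoop command is available and BUCKET is set.

```shell
#!/usr/bin/env bash
# Sketch: create the sample input and, when possible, upload it to OSS-HDFS.
# BUCKET and REGION are hypothetical variables, not EMR settings.
BUCKET="${BUCKET:-}"
REGION="${REGION:-cn-hangzhou}"

# Write the two-line input file into the current directory.
printf 'hello world\nhello wordcount\n' > wordcount.txt

if command -v hadoop >/dev/null 2>&1 && [ -n "$BUCKET" ]; then
  INPUT_DIR="oss://${BUCKET}.${REGION}.oss-dls.aliyuncs.com/input"
  # -p: do not fail if the directory already exists.
  hadoop fs -mkdir -p "$INPUT_DIR"
  # -f: overwrite a copy left over from an earlier run.
  hadoop fs -put -f wordcount.txt "$INPUT_DIR/"
  # List the directory to confirm that the file arrived.
  hadoop fs -ls "$INPUT_DIR/"
else
  echo "hadoop not found or BUCKET not set; created wordcount.txt only"
fi
```

The -p and -f options make the script safe to run more than once, which the plain mkdir and put commands are not.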

Step 3: Submit a job

You can use the WordCount program to analyze text data and count word frequencies.

Run the following command to submit the WordCount job:

hadoop jar /opt/apps/HDFS/hadoop-3.2.1-1.2.16-alinux3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount -D mapreduce.job.reduces=1 "oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/wordcount.txt" "oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/"

The following table describes the parameters in the command.

| Parameter | Description |
| --- | --- |
| /opt/apps/.../hadoop-mapreduce-examples-3.2.1.jar | The sample program package that comes with Hadoop. This package includes multiple classic MapReduce sample programs. In this example, hadoop-mapreduce-examples-3.2.1.jar is the name of the JAR file in your cluster, and 3.2.1 is the version number. The version number is typically 3.2.1 for clusters of the EMR 5.x series and 2.8.5 for clusters of the EMR 3.x series. |
| -D mapreduce.job.reduces | Sets the number of reducers for the MapReduce job. If you omit this parameter, the job uses the mapreduce.job.reduces value from the cluster configuration, and a value greater than 1 can produce multiple output files, such as part-r-00000 and part-r-00001. Setting this parameter to 1 ensures that only one output file, named part-r-00000, is generated. |
| oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/input/wordcount.txt | The input path for the WordCount job. This is the path to the data file in OSS-HDFS. Replace <yourBucketname> with the name of your OSS bucket and cn-hangzhou with the ID of the region in which the bucket resides. |
| oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/ | The output path where the WordCount job stores its results. The directory must not exist before the job runs. Otherwise, the job fails. |
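
The submit command can also be assembled from variables so that the bucket, region, and reducer count are set in one place. The following is a sketch under stated assumptions: oss_hdfs_uri, BUCKET, REGION, and REDUCES are hypothetical names, and the JAR lookup assumes the /opt/apps/HDFS/... directory layout shown in the command above, which may differ on your cluster.

```shell
#!/usr/bin/env bash
# Sketch: submit WordCount with the paths built from variables.
# oss_hdfs_uri, BUCKET, REGION, and REDUCES are hypothetical names.

# Print an OSS-HDFS URI for a bucket, a region ID, and a relative path.
oss_hdfs_uri() {
  printf 'oss://%s.%s.oss-dls.aliyuncs.com/%s\n' "$1" "$2" "$3"
}

BUCKET="${BUCKET:-}"
REGION="${REGION:-cn-hangzhou}"
REDUCES="${REDUCES:-1}"

# Assumed directory layout, inferred from the JAR path used in this topic.
JAR="$(ls /opt/apps/HDFS/*/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar 2>/dev/null | head -n 1 || true)"

if [ -z "$BUCKET" ] || [ -z "$JAR" ] || ! command -v hadoop >/dev/null 2>&1; then
  echo "Skipping submission: set BUCKET and run this on a cluster node."
else
  INPUT="$(oss_hdfs_uri "$BUCKET" "$REGION" input/wordcount.txt)"
  OUTPUT="$(oss_hdfs_uri "$BUCKET" "$REGION" output/)"
  # MapReduce will not write into an output directory that already exists.
  if hadoop fs -test -e "$OUTPUT"; then
    echo "Output path already exists: $OUTPUT" >&2
  else
    hadoop jar "$JAR" wordcount -D mapreduce.job.reduces="$REDUCES" "$INPUT" "$OUTPUT"
  fi
fi
```

Checking the output path first gives a short, readable message instead of a job failure when the script is run a second time.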

Step 4: View the results

Job output

You can use a Hadoop shell command to view the job output.

  1. Connect to the cluster over SSH. For more information, see Log on to a cluster.

  2. Run the following command to view the job output:

    hadoop fs -cat oss://<yourBucketname>.cn-hangzhou.oss-dls.aliyuncs.com/output/part-r-00000

    The following output is returned:

    hello	2
    wordcount	1
    world	1
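
To sanity-check this result without Hadoop, the same counts can be reproduced with standard Unix tools. This is only an illustrative sketch: it recreates the two-line sample file locally and splits words on whitespace, and expected.txt is a hypothetical file name.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the expected counts locally, without Hadoop.
printf 'hello world\nhello wordcount\n' > wordcount.txt

# Put one word on each line, sort, count duplicates, then print "word<TAB>count".
tr -s '[:space:]' '\n' < wordcount.txt \
  | LC_ALL=C sort \
  | uniq -c \
  | awk '{ printf "%s\t%s\n", $2, $1 }' > expected.txt

cat expected.txt
```

On a cluster node, piping the output of the hadoop fs -cat command shown above into diff - expected.txt should report no differences for this sample input.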

Job history

YARN is Hadoop's resource management framework for scheduling and managing cluster tasks. You can use the YARN UI to view the status and history of jobs to review their execution details.

  1. Open port 8443. For more information, see Manage security groups.

  2. Add a user. For more information, see OpenLDAP user management.

    When you access the YARN UI page, you must use the username and password of a Knox account.

  3. On the EMR on ECS page, find your cluster and click Services in the Actions column.

  4. Click the Access Links and Ports tab.

  5. In the YARN UI row, click the link in the Public URL column.

    Authenticate with your user credentials to access the YARN UI page.

  6. On the All Applications page, click the ID of a job to view its details.

    The top of this page displays Cluster Metrics (such as Apps Submitted, Apps Running, and Memory Used) and Cluster Nodes Metrics. The bottom of the page lists applications with columns such as ID, Name, Application Type, State, and FinalStatus.
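
The same job history is also reachable from the command line over SSH, which avoids opening port 8443. The following is a minimal sketch that uses the standard yarn CLI; list_recent_apps is a hypothetical helper name, and the function only prints a hint when the yarn command is not installed.

```shell
#!/usr/bin/env bash
# Sketch: query job history from the command line instead of the YARN UI.
# list_recent_apps is a hypothetical helper name.
list_recent_apps() {
  if ! command -v yarn >/dev/null 2>&1; then
    echo "yarn CLI not found; run this on a cluster node"
    return 0
  fi
  # List applications that have already ended, whatever their outcome.
  yarn application -list -appStates FINISHED,FAILED,KILLED
}

list_recent_apps

# To inspect one application, pass an ID printed by the list command:
#   yarn application -status <applicationId>
#   yarn logs -applicationId <applicationId>   # requires log aggregation
```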

(Optional) Step 5: Release the cluster

If you no longer need the cluster, release it to avoid incurring further charges. After you confirm the release, the system performs the following operations:

  1. Forcibly terminates all jobs on the cluster.

  2. Terminates and releases all ECS instances.

The time required to release a cluster depends on the cluster size. Smaller clusters release faster. The release process typically completes within a few seconds and does not exceed 5 minutes.

Important
  • You can release Pay-as-you-go clusters at any time. Subscription clusters can be released only after their subscriptions expire.

  • Before you release a cluster, ensure the cluster is in the Initializing, Running, or Idle state.

  1. On the EMR on ECS page, find the cluster to release, and choose more > Release in the Actions column.

    Alternatively, click the cluster name. On the Basic Information tab, choose All operations > Release in the upper-right corner.

  2. In the dialog box that appears, click OK.

Related documentation

FAQ

For common issues with Alibaba Cloud EMR, see FAQ.