E-MapReduce: Prepare the environment

Last Updated: Mar 26, 2026

This tutorial walks you through setting up the E-MapReduce (EMR) and DataWorks environment required by the user profile analysis tutorial series. By the end of this tutorial, you will have:

  1. An EMR cluster configured for DataWorks integration.

  2. A DataWorks workspace in the China (Shanghai) region.

  3. A serverless resource group with public network access.

  4. The EMR cluster registered in DataWorks and ready to run tasks.

The resources you create in this tutorial run in a live environment and incur charges. To avoid unnecessary costs, delete the resources after you complete the tutorial series.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud account with permissions to create EMR clusters, DataWorks workspaces, and VPC resources.

  • DataWorks activated. For activation steps, see Prepare an environment.

  • Reviewed the user profile analysis tutorial introduction to understand the overall workflow.

Notes

  • The basic user information and website access logs used in this tutorial are provided as test data.

  • All data in this tutorial is manually generated mock data and is intended only for experimental operations in DataWorks.

  • For data manipulation, this tutorial uses Data Development (DataStudio) (Old Version).

Set up the EMR cluster

Create an EMR cluster that DataWorks can connect to for running data processing tasks.

  1. Follow the steps in Create a cluster to create a new cluster. Use the following configuration:

    Important

    Before creating the cluster, check Best practices for configuring DataWorks on EMR clusters to confirm which cluster configurations DataWorks supports.

    | Parameter | Value |
    | --- | --- |
    | Region | China (Shanghai) |
    | Business Scenario | Data Lake |
    | Product Version | Latest version |
    | Optional Services | Select at minimum: Hive component, OSS-HDFS component (both required) |
    | Metadata | DLF Unified Metadata |
    | Cluster Storage Root Path | Select an OSS-HDFS instance. If the list is empty, click Create OSS-HDFS Instance to create one. |
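
The configuration above can be written down as a small pre-flight checklist before you open the console. The following is a minimal Python sketch, not a call to any EMR API: the dictionary keys (`region`, `business_scenario`, `metadata`, `services`, `storage_root_path`) and the service labels `"Hive"` and `"OSS-HDFS"` are illustrative names chosen for this example, and the planned values are entered by hand.

```python
# Illustrative pre-flight check. The keys and labels below are assumptions
# made for this sketch, not identifiers from the EMR console or API.
REQUIRED = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "metadata": "DLF Unified Metadata",
}
REQUIRED_SERVICES = {"Hive", "OSS-HDFS"}


def check_cluster_plan(plan):
    """Return a list of problems; an empty list means the plan matches the table."""
    problems = []
    for key, expected in REQUIRED.items():
        actual = plan.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    # Both Hive and OSS-HDFS must be selected under Optional Services.
    missing = REQUIRED_SERVICES - set(plan.get("services", []))
    if missing:
        problems.append("missing services: " + ", ".join(sorted(missing)))
    # The storage root path must point at some OSS-HDFS instance.
    if not plan.get("storage_root_path"):
        problems.append("storage_root_path: select an OSS-HDFS instance")
    return problems


if __name__ == "__main__":
    plan = {
        "region": "China (Shanghai)",
        "business_scenario": "Data Lake",
        "metadata": "DLF Unified Metadata",
        "services": ["Hive", "OSS-HDFS"],
        "storage_root_path": "my-oss-hdfs-instance",  # hypothetical placeholder
    }
    for problem in check_cluster_plan(plan):
        print(problem)
```

An empty result only means the plan agrees with the table in this tutorial. It does not replace the compatibility check described in Best practices for configuring DataWorks on EMR clusters.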

Set up the DataWorks environment

Step 1: Create a workspace

Skip this step if you already have a workspace in the China (Shanghai) region.

  1. Log in to the DataWorks console. In the upper-left corner, switch the region to China (Shanghai).

  2. In the left-side navigation pane, click Workspace, then click Create Workspace. Create a standard mode workspace, which isolates the production and development environments. For details, see Create a workspace.

Step 2: Create a serverless resource group

The tutorial uses a serverless resource group for data synchronization and scheduling. Serverless resource groups do not support cross-region operations, so create one in China (Shanghai).

Purchase the resource group

  1. Log in to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group.

  2. Click Create Resource Group. On the purchase page, set Region And Zone to China (Shanghai), enter a name for the resource group, and complete the purchase following the prompts. For billing details, see Serverless resource group billing.

Configure the resource group

  1. On the Resource Group page, find the resource group you created and click Associate Workspace in the Actions column. Associate it with the workspace you created in Step 1.

  2. Enable public network access for the resource group. The test data for this tutorial is retrieved over the public internet. By default, the resource group has no public network access. Set up an Internet NAT gateway for the Virtual Private Cloud (VPC) associated with the resource group and assign an elastic IP address (EIP) to enable internet connectivity.

    1. Log in to the VPC console and go to the Internet NAT Gateway page. Select the China (Shanghai) region.

    2. Click Create Internet NAT Gateway and configure the following parameters:

      | Parameter | Value |
      | --- | --- |
      | Region | China (Shanghai) |
      | VPC | The VPC associated with your resource group. To find it: in the DataWorks console, go to Resource Group > find your resource group > Network Settings in the Actions column > VPC Binding tab > Data Scheduling & Data Integration section. |
      | Associate vSwitch | The vSwitch associated with your resource group (same location as the VPC above) |
      | Access Mode | SNAT-enabled Mode |
      | EIP | Purchase EIP |
      | Create Service-Linked Role | Click Create Service-Linked Role. Required the first time you create an Internet NAT gateway. |

      Leave all other parameters at their default values.

    3. Click Buy Now. On the confirmation page, accept the terms of service and click Activate Now.

For more details, see Create and use a serverless resource group.
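
Once the NAT gateway and EIP are in place, you may want to confirm that outbound connections actually succeed. The following is a minimal Python sketch of a TCP reachability check that uses only the standard library. It assumes you can run Python somewhere that shares the VPC's outbound route, which this tutorial does not cover. The host `example.com` is a placeholder, not the actual source of the test data. The `connect` parameter exists only so the function can be exercised without a network.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) opens within the timeout."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # "example.com" is a placeholder; substitute a host you expect to reach.
    print("public network reachable:", can_reach("example.com"))
```

A `False` result from inside the VPC suggests rechecking the VPC, vSwitch, and EIP settings of the NAT gateway. A `True` result shows only that one TCP connection succeeded, not that every data source is reachable.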

Step 3: Register the EMR cluster and initialize the resource group

Register the EMR cluster with DataWorks so that DataWorks can run tasks on the cluster.

Navigate to the EMR cluster registration page

  1. Log in to the DataWorks console. Switch to the China (Shanghai) region. In the left-side navigation pane, click More > Management Center. Select your workspace from the drop-down list and click Go To Management Center.

  2. In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box, click E-MapReduce. The Register EMR Cluster page appears.

Register the cluster

  1. On the Register EMR Cluster page, set the following parameters:

    | Parameter | Value |
    | --- | --- |
    | Cluster Alibaba Cloud Account | Current Alibaba Cloud Account |
    | Cluster Type | Data Lake (datalake) |
    | Default Access Identity | Cluster Account: Hadoop |
    | Pass Proxy User Information | Pass |

Initialize the resource group

  1. On the Cluster Management page, find the registered cluster and click Resource Group Initialization in the upper-right corner.

  2. Click Initialize next to the resource group that needs initialization.

  3. After initialization completes, click Confirm.

    Important

    Make sure initialization succeeds before proceeding. If it fails, check the error message and run network connectivity diagnostics as suggested. Failed initialization causes subsequent tasks to fail.

For step-by-step registration instructions, see Register an EMR cluster to DataWorks.
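
Because each step in this guide depends on the ones before it, it can help to track progress as an ordered checklist. The following is a minimal Python sketch under that assumption. The step labels are invented for this example to mirror the order of this guide. They are not DataWorks or EMR identifiers, and nothing here queries either service.

```python
# Step labels are assumptions made for this sketch; they follow the order
# of this guide and do not correspond to any DataWorks or EMR API.
SETUP_STEPS = [
    "EMR cluster created",
    "workspace created",
    "serverless resource group purchased",
    "resource group associated with workspace",
    "public network access enabled",
    "EMR cluster registered",
    "resource group initialized",
]


def next_pending_step(completed):
    """Return the first step, in guide order, that is not yet done, or None."""
    done = set(completed)
    unknown = done - set(SETUP_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    for step in SETUP_STEPS:
        if step not in done:
            return step
    return None


if __name__ == "__main__":
    finished = SETUP_STEPS[:5]  # everything up to enabling public access
    print("next step:", next_pending_step(finished))
```

When `next_pending_step` returns `None`, every step recorded in the list has been marked done by hand. Successful resource group initialization still has to be confirmed in the console.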

What's next

With the environment ready, proceed to the next tutorial to synchronize user profile data and website access logs to Object Storage Service (OSS), create Apache Hive tables, and query the data using EMR Hive nodes. See Synchronize data.