Configure an EMR cluster and a DataWorks workspace for user profile analysis

This tutorial walks you through a user profile analysis experiment using DataWorks for data synchronization, processing, and quality monitoring. Before you start, complete the setup described on this page: create an E-MapReduce (EMR) cluster for compute, a DataWorks workspace for orchestration, and a serverless resource group for scheduling and data synchronization.

Read the Experiment introduction before you begin. It explains the full experiment flow and helps you understand how the pieces fit together.

What you'll do

In this setup phase, you will:

Create an EMR cluster — the compute engine that runs data processing tasks.
Create a DataWorks workspace — the environment where you build and manage data pipelines.
Create a serverless resource group — handles scheduling and data synchronization, and requires Internet access to retrieve the tutorial's test data.
Register the EMR cluster to DataWorks and initialize the resource group — connects the compute layer to the orchestration layer so tasks can run.
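
The order of these steps matters because some depend on others. The sketch below models that ordering as a small dependency graph and derives one valid sequence from it. The step names and the dependency edges are illustrative assumptions drawn from this page, not identifiers used by DataWorks or EMR.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be finished first.
# Hypothetical names; the edges are one reading of this page, not an official spec.
SETUP_DEPENDENCIES = {
    "create_emr_cluster": set(),
    "create_workspace": set(),
    "purchase_resource_group": set(),
    "associate_resource_group": {"create_workspace", "purchase_resource_group"},
    "enable_internet_access": {"purchase_resource_group"},
    "register_emr_cluster": {"create_emr_cluster", "create_workspace"},
    "initialize_resource_group": {"associate_resource_group", "register_emr_cluster"},
}


def setup_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def print_setup_order():
    """Print one valid order, numbered from 1."""
    for number, step in enumerate(setup_order(SETUP_DEPENDENCIES), start=1):
        print(f"{number}. {step}")
```

Several orders satisfy the graph. The only hard constraints are that registration follows cluster and workspace creation, and that initialization comes after both association and registration.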

Prerequisites

Before you begin, make sure you have:

An Alibaba Cloud account with permissions to create EMR clusters, DataWorks workspaces, serverless resource groups, VPCs, and Internet NAT gateways.
Access to the China (Shanghai) region — all resources in this tutorial are created in this region.
A VPC and vSwitch in China (Shanghai) — required when setting up the serverless resource group. If you don't have one, you can create it from the VPC console during the setup steps.

Background

To formulate effective business strategies, enterprises need basic user profile data, such as geographical and social attributes, derived from how users behave on their websites. With this data, they can run profile analysis on a schedule and manage website traffic in a more targeted way.

Usage notes

This tutorial provides sample user information and website access data for immediate use. All data is artificially generated mock data.
This tutorial uses DataStudio (legacy version).

Set up an EMR environment

Create an EMR cluster

The EMR cluster is the compute engine for this tutorial. DataWorks submits and manages data processing tasks on this cluster, so you must register it to DataWorks after creation.

When you create the cluster, configure the following parameters in the Software Configuration step:

Parameter	Value
Region	China (Shanghai)
Business Scenario	Data Lake
Product Version	Latest version
Optional Services	Select Hive and OSS-HDFS
Metadata	DLF Unified Metadata
Root Storage Directory of Cluster	Select an OSS-HDFS bucket. If none is available, click Create OSS-HDFS Bucket.

For detailed steps, see Step 1: Create a cluster.
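
If you record your planned cluster settings before opening the console, the parameter table above can double as a checklist. The following is a minimal sketch of such a check. The dictionary keys mirror the console labels in the table, and the function name `check_emr_plan` is hypothetical. None of these names are parameter names of an EMR API, and the sketch does not contact Alibaba Cloud.

```python
# Values the tutorial requires, keyed by the console label shown in the table.
REQUIRED_EMR_SETTINGS = {
    "Region": "China (Shanghai)",
    "Business Scenario": "Data Lake",
    "Metadata": "DLF Unified Metadata",
}

# Services that must be selected under Optional Services.
REQUIRED_SERVICES = {"Hive", "OSS-HDFS"}


def check_emr_plan(plan):
    """Return a list of mismatches; an empty list means the plan matches the table."""
    problems = []
    for label, expected in REQUIRED_EMR_SETTINGS.items():
        actual = plan.get(label)
        if actual != expected:
            problems.append(f"{label}: expected {expected!r}, got {actual!r}")

    missing = REQUIRED_SERVICES - set(plan.get("Optional Services", []))
    if missing:
        problems.append("Optional Services: missing " + ", ".join(sorted(missing)))

    # The table asks for an OSS-HDFS bucket; here we only check that one is named.
    if not plan.get("Root Storage Directory of Cluster"):
        problems.append("Root Storage Directory of Cluster: no OSS-HDFS bucket selected")
    return problems
```

An empty result means the recorded plan agrees with the table. It says nothing about whether the cluster was actually created with those settings.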

Important

DataWorks support for EMR clusters varies by cluster configuration. Before you create an EMR cluster and develop tasks in DataWorks, read Best practices for EMR cluster configuration in DataWorks.

Set up a DataWorks environment

Activate DataWorks before developing tasks. For more information, see Prepare an environment.

Step 1: Create a workspace

The DataWorks workspace is where you build and manage data pipelines. In standard mode, the development environment is isolated from the production environment, so you can test tasks safely before publishing them.

If a workspace already exists in the China (Shanghai) region, skip this step and use the existing workspace.

Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region.
In the left-side navigation pane, click Workspace. On the Workspaces page, click Create Workspace to create a workspace in standard mode. For more information, see Create a workspace.

Step 2: Create a serverless resource group

The serverless resource group handles data synchronization and scheduling in this tutorial. You need to purchase it, associate it with your workspace, and enable Internet access so it can retrieve the test data.

Purchase a serverless resource group.
1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.
2. Click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify a resource group name, configure the remaining parameters as prompted, and complete payment. For billing details, see Serverless resource groups.
Note
If no virtual private cloud (VPC) or vSwitch exists in the current region, click the link in the parameter description to create one in the VPC console. For more information, see What is VPC?
Associate the serverless resource group with the DataWorks workspace. A resource group can be used in tasks only after it is associated with a workspace.
1. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group.
2. On the Resource Groups page, find the resource group you purchased and click Associate Workspace in the Actions column.
3. In the Associate Workspace panel, find the target workspace and click Associate in the Actions column.

Enable the serverless resource group to access the Internet. By default, a serverless resource group cannot access the Internet. The test data in this tutorial is hosted online, so you must configure an Internet NAT gateway and an elastic IP address (EIP) for the VPC associated with the resource group.

Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.

Click Create Internet NAT Gateway and configure the parameters. The following table describes the key parameters. Retain default values for all other parameters.

Parameter	Value
Region	China (Shanghai)
Network and Zone	Select the VPC and vSwitch associated with the resource group. To find them: in the DataWorks console, go to Resource Group → find your resource group → click Network Settings in the Actions column → on the VPC Binding tab, view the VPC and vSwitch in the Data Scheduling & Data Integration section. For more information, see What is VPC?
EIP	Select Purchase EIP
Service-linked Role	If this is your first time creating a NAT gateway, click Create Service-linked Role

Click Buy Now. On the Confirm page, read the terms of service, select the Terms of Service check box, and click Activate Now.

For more information, see Use serverless resource groups.
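
To sanity-check outbound Internet access through the NAT gateway, one option is a simple TCP probe run from a machine in the same VPC and vSwitch, for example an ECS instance. This is an assumption about your setup, not a step from this tutorial. The helper name `can_reach` and the host in the usage comment are placeholders, and the probe does not test the resource group itself.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (placeholder host; substitute an endpoint you expect to be reachable):
# print(can_reach("example.com"))
```

A `True` result shows only that the machine running the probe can open an outbound connection. The network connectivity diagnosis in the DataWorks console remains the way to check the resource group directly.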

Step 3: Register the EMR cluster to DataWorks and initialize the resource group

Registering the EMR cluster to DataWorks lets DataWorks submit compute tasks to the cluster. Initializing the resource group establishes the network connection between the resource group and the cluster so that tasks can run successfully.

Go to the Register EMR Cluster page.
1. Go to the SettingCenter page. Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select your workspace from the drop-down list and click Go to Management Center.
2. In the left-side navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster. Select E-MapReduce as the cluster type. The Register EMR Cluster page appears.
Register the EMR cluster. On the Register EMR Cluster page, configure the following key parameters:
Parameter	Value
Alibaba Cloud Account to Which Cluster Belongs	Current Alibaba Cloud Account
Cluster Type	Data Lake
Default Access Identity	Cluster Account: hadoop
Pass Proxy User Information	Pass
Initialize the resource group.
1. On the Clusters page in SettingCenter, find the EMR cluster you registered and click Initialize Resource Group in the cluster information section.
2. In the Initialize Resource Group dialog box, find your resource group and click Initialization.
3. After initialization completes, click OK.
Important
Make sure the initialization succeeds before proceeding. If it fails, review the failure message and run a network connectivity diagnosis as prompted. Tasks using an uninitialized resource group will fail.

For more information, see DataStudio (legacy version): Associate an EMR computing resource.

Next steps

Your environment is ready. In the next tutorial, you will synchronize basic user information and website access logs to OSS, then create a table in an EMR Hive node to query the synchronized data. For more information, see Synchronize data.