This tutorial walks you through setting up the E-MapReduce (EMR) and DataWorks environment needed for the user profile analysis tutorial series. By the end of this guide, you will have:
An EMR cluster configured for DataWorks integration.
A DataWorks workspace in the China (Shanghai) region.
A serverless resource group with public network access.
The EMR cluster registered in DataWorks and ready to run tasks.
The resources you create in this tutorial run in a live environment and incur charges. To avoid unnecessary costs, delete the resources after you complete the tutorial series.
Prerequisites
Before you begin, make sure you have:
An Alibaba Cloud account with permissions to create EMR clusters, DataWorks workspaces, and VPC resources.
DataWorks activated. For activation steps, see Prepare an environment.
Reviewed the user profile analysis tutorial introduction to understand the overall workflow.
Notes
The basic user information and website access logs used in this tutorial are manually generated mock data, provided only for testing. Use them solely for experimental operations in DataWorks.
For data manipulation, this tutorial uses Data Development (DataStudio) (Old Version).
Set up the EMR cluster
Create an EMR cluster that DataWorks can connect to for running data processing tasks.
Follow the steps in Create a cluster to create a new cluster. Use the following configuration:
Important: Before creating the cluster, check Best practices for configuring DataWorks on EMR clusters to confirm which cluster configurations DataWorks supports.
Region: China (Shanghai)
Business Scenario: Data Lake
Product Version: Latest version
Optional Services: Select at minimum the Hive and OSS-HDFS components (both are required).
Metadata: DLF Unified Metadata
Cluster Storage Root Path: Select an OSS-HDFS instance. If the list is empty, click Create OSS-HDFS Instance to create one.
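The cluster settings above can be captured as a small pre-flight checklist before you click through the console. This is an illustrative Python sketch: the field names are informal labels of my own, not EMR API parameters, and the storage path is a placeholder.

```python
# Pre-flight checklist for the EMR cluster settings in this tutorial.
# Field names are informal labels, not actual EMR API parameters.

REQUIRED_SERVICES = {"Hive", "OSS-HDFS"}

cluster_config = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "product_version": "latest",
    "optional_services": {"Hive", "OSS-HDFS"},
    "metadata": "DLF Unified Metadata",
    "storage_root": "oss-hdfs://<your-instance>",  # placeholder OSS-HDFS instance
}

def check_cluster_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the checklist passes."""
    problems = []
    missing = REQUIRED_SERVICES - cfg.get("optional_services", set())
    if missing:
        problems.append(f"missing required services: {sorted(missing)}")
    if cfg.get("region") != "China (Shanghai)":
        problems.append("cluster must be in China (Shanghai) to match the workspace")
    return problems

print(check_cluster_config(cluster_config))  # → []
```

The check encodes the two constraints that matter later in the tutorial: the Hive and OSS-HDFS components must both be selected, and the cluster must sit in the same region as the DataWorks workspace.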
Set up the DataWorks environment
Step 1: Create a workspace
Skip this step if you already have a workspace in the China (Shanghai) region.
Log in to the DataWorks console. In the upper-left corner, switch the region to China (Shanghai).
In the left-side navigation pane, click Workspace, then click Create Workspace. Create a standard mode workspace, which isolates the production and development environments. For details, see Create a workspace.
Step 2: Create a serverless resource group
The tutorial uses a serverless resource group for data synchronization and scheduling. Serverless resource groups do not support cross-region operations, so create one in China (Shanghai).
Purchase the resource group
Log in to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group.
Click Create Resource Group. On the purchase page, set Region And Zone to China (Shanghai), enter a name for the resource group, and complete the purchase following the prompts. For billing details, see Serverless resource group billing.
Configure the resource group
On the Resource Group page, find the resource group you created and click Associate Workspace in the Actions column. Associate it with the workspace you created in Step 1.
Enable public network access for the resource group. The test data for this tutorial is retrieved over the public internet, but by default the resource group has no public network access. To enable internet connectivity, set up an Internet NAT gateway for the Virtual Private Cloud (VPC) associated with the resource group and assign an elastic IP address (EIP) to it.
Log in to the VPC console and go to the Internet NAT Gateway page. Select the China (Shanghai) region.
Click Create Internet NAT Gateway and configure the following parameters:
Region: China (Shanghai)
VPC: The VPC associated with your resource group. To find it in the DataWorks console, go to Resource Group, find your resource group, click Network Settings in the Actions column, open the VPC Binding tab, and check the Data Scheduling & Data Integration section.
Associate vSwitch: The vSwitch associated with your resource group (listed in the same location as the VPC above).
Access Mode: SNAT-enabled Mode
EIP: Purchase EIP
Create Service-Linked Role: Click Create Service-Linked Role. This is required only the first time you create an Internet NAT gateway.
Leave all other parameters at their default values.
Click Buy Now. On the confirmation page, accept the terms of service and click Activate Now.
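The NAT gateway settings above can be sketched as plain data with one sanity check. The keys mirror the console field labels, not VPC OpenAPI parameter names, and the IDs are placeholders; the check captures the point of this step, namely that SNAT mode only gives the resource group internet access when an EIP is attached.

```python
# Sketch of the Internet NAT gateway settings above. Keys mirror the
# console field labels, not VPC OpenAPI parameter names; IDs are placeholders.

nat_gateway = {
    "region": "China (Shanghai)",
    "vpc_id": "vpc-xxxx",       # VPC bound to the resource group (placeholder)
    "vswitch_id": "vsw-xxxx",   # vSwitch bound to the resource group (placeholder)
    "access_mode": "SNAT-enabled Mode",
    "eip": "Purchase EIP",
}

def check_internet_access(gw: dict) -> bool:
    """SNAT mode only provides internet access when an EIP is attached."""
    return gw["access_mode"] == "SNAT-enabled Mode" and bool(gw.get("eip"))

print(check_internet_access(nat_gateway))  # → True
```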
For more details, see Create and use a serverless resource group.
Step 3: Register the EMR cluster and initialize the resource group
Register the EMR cluster with DataWorks so it can run tasks on the cluster.
Navigate to the EMR cluster registration page
Log in to the DataWorks console. Switch to the China (Shanghai) region. In the left-side navigation pane, click More > Management Center. Select your workspace from the drop-down list and click Go To Management Center.
In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the dialog box, click E-MapReduce. The Register EMR Cluster page appears.
Register the cluster
On the Register EMR Cluster page, enter the cluster details. Set the following parameters:
Cluster Alibaba Cloud Account: Current Alibaba Cloud Account
Cluster Type: Data Lake (datalake)
Default Access Identity: Cluster Account: Hadoop
Pass Proxy User Information: Pass
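As a reference, the registration settings above can be written down as a small sketch with a check that the cluster type matches the Data Lake cluster created earlier. The keys mirror the console labels and are not DataWorks API parameter names.

```python
# Sketch of the EMR cluster registration settings above. Keys mirror the
# console labels, not DataWorks API parameter names.

registration = {
    "cluster_alibaba_cloud_account": "Current Alibaba Cloud Account",
    "cluster_type": "Data Lake (datalake)",
    "default_access_identity": "Cluster Account: Hadoop",
    "pass_proxy_user_information": True,
}

def matches_emr_cluster(reg: dict, scenario: str = "Data Lake") -> bool:
    """The registered cluster type must match the EMR business scenario."""
    return reg["cluster_type"].startswith(scenario)

print(matches_emr_cluster(registration))  # → True
```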
Initialize the resource group
On the Cluster Management page, find the registered cluster and click Resource Group Initialization in the upper-right corner.
Click Initialize next to the resource group that needs initialization.
After initialization completes, click Confirm.
Important: Make sure initialization succeeds before proceeding. If it fails, check the error message and run the suggested network connectivity diagnostics. A failed initialization causes subsequent tasks to fail.
For step-by-step registration instructions, see Register an EMR cluster to DataWorks.
What's next
With the environment ready, proceed to the next tutorial to synchronize user profile data and website access logs to Object Storage Service (OSS), create Apache Hive tables, and query the data using EMR Hive nodes. See Synchronize data.