Build a User Profile Analysis Pipeline with EMR Serverless Spark

This tutorial walks you through setting up an OSS bucket, an EMR Serverless Spark workspace, and a DataWorks workspace so you can run the user profile analysis tutorial end-to-end. By the end of this page, you will have:

Created an OSS bucket with OSS-HDFS enabled to store user information and website access logs.
Created an EMR Serverless Spark workspace to process data.
Created a DataWorks workspace and a serverless resource group with Internet access for data synchronization and scheduling.

Prerequisites

Before you begin, ensure that you have:

Read Experiment introduction to understand the full scope of the tutorial. This step is required before starting any environment setup.
Activated DataWorks. See Activate DataWorks.
An Alibaba Cloud account with permissions to create OSS buckets, EMR Serverless Spark workspaces, and DataWorks resources in the China (Shanghai) region.

The test data used in this tutorial (basic user information and website access logs) is provided for you. All data is manually generated mock data for experimental use in DataWorks only. This tutorial uses Data Development (Data Studio) (New Version) for data transformation.

Create an OSS bucket

Create an OSS bucket in the China (Shanghai) region to store user information and website access logs for data modeling and analysis.

Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, click Create Bucket.

In the Create Bucket panel, configure the following parameters and click OK.

Parameter	Description
Bucket Name	Set to `dw-spark-demo` for this tutorial.
Region	Select China (Shanghai).
OSS-HDFS	Turn on this switch.

Go back to the Buckets page, find the bucket, and then click the bucket name to go to the Objects page.
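The bucket name and region chosen above are reused when later steps refer to data in the bucket, so it can help to check them up front. The following is a minimal sketch, not part of the official tutorial. It validates a name against the OSS bucket naming rules (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit) and builds an OSS-HDFS path. The helper names `is_valid_bucket_name` and `oss_hdfs_uri`, the `user_info/` directory, and the `<bucket>.<region-id>.oss-dls.aliyuncs.com` endpoint pattern are assumptions for illustration; confirm the actual OSS-HDFS endpoint on the bucket's Overview page.

```python
import re

# OSS bucket naming rules: 3-63 characters, lowercase letters, digits, and
# hyphens only, starting and ending with a lowercase letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` satisfies the OSS bucket naming rules."""
    return _BUCKET_NAME.fullmatch(name) is not None


def oss_hdfs_uri(bucket: str, region_id: str, path: str = "") -> str:
    """Build an OSS-HDFS URI. The endpoint pattern is an assumption."""
    if not is_valid_bucket_name(bucket):
        raise ValueError(f"invalid bucket name: {bucket!r}")
    return f"oss://{bucket}.{region_id}.oss-dls.aliyuncs.com/{path.lstrip('/')}"


# "cn-shanghai" is the region ID for China (Shanghai).
# "user_info/" is a hypothetical directory used only for this example.
print(is_valid_bucket_name("dw-spark-demo"))
print(oss_hdfs_uri("dw-spark-demo", "cn-shanghai", "user_info/"))
```

If you use a bucket name other than `dw-spark-demo`, substitute it wherever later steps reference the bucket.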

Create an EMR Serverless Spark workspace

EMR Serverless Spark processes the data in this tutorial. If you already have an EMR Serverless Spark workspace in the China (Shanghai) region, skip this section.

Go to Create a workspace and configure the following parameters:

Parameter	Description
Region	Select China (Shanghai).
Billing Method	Select Pay-as-you-go.
Workspace Name	Enter a custom name.
DLF for Metadata Storage	Select a Data Lake Formation (DLF) catalog to associate with the workspace. To isolate metadata between workspaces, select different catalogs for each workspace.
Workspace Directory	Select an OSS bucket directory to store job log files.

EMR Serverless Spark offers two editions. Basic Edition provides core features with efficient compute engines. Professional Edition includes all Basic Edition features plus advanced capabilities and performance enhancements, which makes it better suited to large-scale extract, transform, and load (ETL) tasks.

Set up a DataWorks environment

Step 1: Create a DataWorks workspace

If you already have a DataWorks workspace in the China (Shanghai) region with Participate in Public Preview of DataStudio of New Version turned on, skip this step and use that workspace.

Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Workspace.
On the Workspaces page, click Create Workspace. Create the workspace in standard mode and turn on Participate in Public Preview of DataStudio of New Version. Standard mode isolates the development environment from the production environment.

As of February 18, 2025, the first time you activate DataWorks and create a workspace in the China (Shanghai) region, the new-version Data Studio is activated by default.

For details, see Create a workspace.

Step 2: Create a serverless resource group

A serverless resource group handles data synchronization and scheduling. Complete the following three parts in order: purchase the resource group, associate it with your workspace, and enable Internet access.

Serverless resource groups do not support cross-region operations. Use the China (Shanghai) region throughout.
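Because every resource in this tutorial must live in the same region, a quick consistency check can catch a mismatch before it causes an association or network-binding failure. The sketch below is only an illustration: the `planned_resources` mapping is a hypothetical record that you would fill in by hand from the consoles, and `find_region_mismatches` is a helper name invented here. It does not call any Alibaba Cloud API.

```python
from typing import Dict, List

REQUIRED_REGION = "cn-shanghai"  # region ID for China (Shanghai)

# Hypothetical, manually maintained record of where each resource was created.
planned_resources: Dict[str, str] = {
    "oss_bucket": "cn-shanghai",
    "emr_serverless_spark_workspace": "cn-shanghai",
    "dataworks_workspace": "cn-shanghai",
    "serverless_resource_group": "cn-shanghai",
    "vpc": "cn-shanghai",
}


def find_region_mismatches(
    resources: Dict[str, str], required: str = REQUIRED_REGION
) -> List[str]:
    """Return the sorted names of resources that are not in the required region."""
    return sorted(name for name, region in resources.items() if region != required)


mismatches = find_region_mismatches(planned_resources)
if mismatches:
    print(f"Recreate in {REQUIRED_REGION}: {', '.join(mismatches)}")
else:
    print(f"All resources are in {REQUIRED_REGION}")
```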

Purchase a serverless resource group

Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group.
On the Resource Groups page, click Create Resource Group. On the purchase page, set Region and Zone to China (Shanghai), enter a resource group name, configure any remaining parameters, and complete the purchase. For billing details, see Billing of serverless resource groups.

If no virtual private cloud (VPC) or vSwitch exists in the current region, use the link in the parameter description to create one in the VPC console. See What is a VPC?

Associate the resource group with your workspace

On the Resource Groups page, find the resource group you purchased and click Associate Workspace in the Actions column. In the Associate Workspace panel, find your workspace and click Associate.

The resource group is not usable until it is associated with a workspace.

Enable Internet access for the resource group

The test data is retrieved over the Internet, but serverless resource groups cannot access the Internet by default. Configure an Internet NAT gateway with a Source Network Address Translation (SNAT) rule and an elastic IP address (EIP) for the VPC associated with the resource group.

Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.

Click Create Internet NAT Gateway and configure the following parameters. Keep the default values for any parameters not listed here.

Parameter	Description
Region	Select China (Shanghai).
VPC	Select the VPC associated with the resource group. To find it, go to the DataWorks console, click Resource Group, find the resource group, and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab, note the VPC and vSwitch values. See also What is a VPC?
Associate vSwitch	Select the vSwitch associated with the resource group.
Access Mode	Select SNAT-enabled Mode.
EIP	Select Purchase EIP.
Service-linked Role	Click Create Service-linked Role if this is your first time creating a NAT gateway.

Click Buy Now. On the Confirm page, read and accept the terms of service, then click Activate Now.
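After the NAT gateway is active, you may want to confirm that outbound connections actually succeed. The following is a minimal sketch of a TCP reachability check, assuming you have some way to run a Python script on the resource group (for example, from a task scheduled on it). The `can_reach` helper and the `example.com` host are placeholders introduced here, not part of the tutorial. A successful TCP connection only shows that outbound traffic is routed; it does not verify the data source itself.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # example.com is a placeholder; substitute a public endpoint you need to reach.
    print("Internet reachable:", can_reach("example.com"))
```

If the check returns False, review the SNAT configuration and confirm that the NAT gateway uses the same VPC and vSwitch as the resource group.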

For full details, see Create and use a serverless resource group.

What to do next

Your environments are ready. In the next tutorial, you will synchronize basic user information and website access logs to OSS, and create a table in a Spark SQL node to query the synchronized data. See Synchronize data.