DataWorks EMR Serverless Spark tutorial

Learn how to use DataWorks with EMR Serverless Spark for big data development through a user profile analysis case study covering Data Integration, Data Studio, and Operation Center.

Case introduction

To develop effective business strategies, you can obtain basic profile data about website users from their website activities. This data includes geographical and social attributes. You can analyze this profile data by time and location to enable fine-grained operations on website traffic. This case uses DataWorks with EMR Serverless Spark to complete data synchronization, data processing, data management, and data consumption.

Note

Before you begin, read Tutorial objectives and design for an overview of the end-to-end workflow in this user profile analysis case.

Data Studio

This tutorial uses the new Data Studio platform in DataWorks. Make sure that the new Data Studio is enabled for your workspace. To enable it:

When you create a workspace, select Use Data Studio (New Version).
To upgrade from an older version of Data Studio, click Upgrading at the top of the interface and follow the on-screen instructions.

After February 18, 2025, DataWorks enables the new Data Studio by default for any Alibaba Cloud account that creates its first workspace in the following regions:

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)

Procedure

Step 1: Prepare environments

Create the required EMR Serverless Spark and DataWorks workspaces, and configure resource group and network settings.

Step 2: Synchronize data

Configure a DataWorks synchronization task to copy the sample data provided in the tutorial (basic user information and website access logs) to a Spark computing resource, then query the synchronized data.

Step 3: Process data

Use EMR Spark SQL nodes in DataWorks to process the synchronized user information and access logs into user profile data.

Step 4: Monitor data quality

Set up table monitors to detect and intercept dirty data before it affects downstream processes.

Step 5: Manage data

After the analysis completes, view the generated data tables in EMR and their lineage in Data Map.

Step 6: Consume data

Use the data analytics module to visualize the processed data and extract key insights into business trends.
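
As a rough illustration of the logic behind Steps 3 and 4, the following sketch joins access logs to basic user information, counts page views per user, region, and day, and flags log rows that fail a simple quality rule. It uses plain Python rather than the EMR Spark SQL nodes and table monitors that the tutorial configures. The sample rows, the column names (uid, gender, age_range, region, dt, pv), and the function names are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical sample rows; the tutorial's real tables and columns may differ.
USER_INFO = [
    {"uid": "u1", "gender": "F", "age_range": "20-30"},
    {"uid": "u2", "gender": "M", "age_range": "30-40"},
]
ACCESS_LOGS = [
    {"uid": "u1", "region": "Hangzhou", "dt": "2025-02-18"},
    {"uid": "u1", "region": "Hangzhou", "dt": "2025-02-18"},
    {"uid": "u2", "region": "Shanghai", "dt": "2025-02-18"},
    {"uid": None, "region": "Beijing", "dt": "2025-02-18"},  # dirty: no uid
]


def find_dirty_rows(logs):
    """Return log rows that lack a user ID (a stand-in quality rule)."""
    return [row for row in logs if not row.get("uid")]


def build_profiles(users, logs):
    """Join logs to users and count page views per user, region, and day."""
    users_by_id = {user["uid"]: user for user in users}
    page_views = defaultdict(int)
    for row in logs:
        # Rows without a matching user are skipped, like an inner join.
        if row.get("uid") in users_by_id:
            page_views[(row["uid"], row["region"], row["dt"])] += 1
    return [
        {**users_by_id[uid], "region": region, "dt": dt, "pv": pv}
        for (uid, region, dt), pv in sorted(page_views.items())
    ]
```

Calling find_dirty_rows(ACCESS_LOGS) before build_profiles(USER_INFO, ACCESS_LOGS) mirrors the idea of intercepting dirty data ahead of downstream processing. In the tutorial itself, the equivalent logic runs as Spark SQL on the synchronized tables.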