Learn how to use DataWorks with EMR Serverless Spark for big data development through a user profile analysis case study covering Data Integration, Data Studio, and Operation Center.
Case introduction
To develop effective business strategies, you can obtain basic profile data about website users from their website activities. This data includes geographical and social attributes. You can analyze this profile data by time and location to enable fine-grained operations on website traffic. This case uses DataWorks with EMR Serverless Spark to complete data synchronization, data processing, data management, and data consumption.
Before you begin, read Tutorial objectives and design for an overview of the end-to-end workflow in this user profile analysis case.
Data Studio
This tutorial uses the new Data Studio platform in DataWorks. Make sure that the new Data Studio is enabled for your workspace. To enable it:
- When you create a workspace, select Use Data Studio (New Version).
- To upgrade from an older version of Data Studio, click Upgrading at the top of the interface and follow the on-screen instructions.
- After February 18, 2025, DataWorks enables the new Data Studio by default for any Alibaba Cloud account that creates its first workspace in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)
Procedure
- Step 1: Prepare environments
  Create the required EMR Serverless Spark and DataWorks workspaces, and configure resource group and network settings.
- Step 2: Synchronize data
  Configure a DataWorks synchronization task to sync the basic user information and website access logs provided in the tutorial to a Spark computing resource, then query the data.
- Step 3: Process data
  Process the synchronized basic user information and access log data with EMR Spark SQL nodes in DataWorks to generate user profile data.
- Step 4: Monitor data quality
  Set up table monitors to detect and intercept dirty data before it affects downstream processes.
- Step 5: Manage data
  After the analysis completes, view the generated data tables in EMR and their lineage in Data Map.
- Step 6: Consume data
  Use the data analytics module to visualize the processed data and extract key insights into business trends.
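To make the logic of Steps 3 and 4 concrete, the following is a minimal plain-Python sketch: it joins access logs to basic user information, drops dirty rows, and counts visits by location, day, and one user attribute. It is an illustration only. The tutorial performs this work with EMR Spark SQL nodes and DataWorks table monitors, and every field name, sample record, and function name here (`USERS`, `LOGS`, `is_clean`, `build_profiles`) is a hypothetical stand-in, not the tutorial's actual schema or API.

```python
from collections import Counter

# Hypothetical sample records; field names are illustrative assumptions,
# not the schema used by the tutorial's tables.
USERS = [
    {"uid": "u1", "gender": "F", "age_range": "20-30"},
    {"uid": "u2", "gender": "M", "age_range": "30-40"},
]

LOGS = [
    {"uid": "u1", "region": "Hangzhou", "day": "2025-01-01"},
    {"uid": "u1", "region": "Hangzhou", "day": "2025-01-01"},
    {"uid": "u2", "region": "Beijing", "day": "2025-01-02"},
    {"uid": None, "region": "Beijing", "day": "2025-01-02"},  # dirty row: missing uid
]


def is_clean(row):
    """Quality rule (rough analogue of Step 4): no field may be empty."""
    return all(value not in (None, "") for value in row.values())


def build_profiles(users, logs):
    """Join logs to user attributes and count visits per (region, day, gender)."""
    users_by_id = {user["uid"]: user for user in users}
    visits = Counter()
    for row in logs:
        if not is_clean(row):
            continue  # intercept dirty data before it reaches the aggregate
        user = users_by_id.get(row["uid"])
        if user is None:
            continue  # log line with no matching user record
        visits[(row["region"], row["day"], user["gender"])] += 1
    return visits


profiles = build_profiles(USERS, LOGS)
```

In this sketch the dirty row is excluded before aggregation, which mirrors the intent of intercepting bad data before it affects downstream processes. In the tutorial, the equivalent join and aggregation would be written as Spark SQL, and the quality rule would be configured as a table monitor rather than coded by hand.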