This tutorial shows you how to use DataWorks together with E-MapReduce (EMR) for big data development and analysis. It walks through a user persona analysis case to demonstrate the capabilities of DataWorks in Data Integration, Data Development, and Operation Center.
Case description
To create better business strategies, you need to obtain basic profile data, such as geographical and social attributes, from user website behavior. This data allows for scheduled persona analysis and fine-grained management of website traffic. You can use the DataWorks and EMR product portfolio to perform data synchronization, data transformation, data management, and data consumption.
Before you begin, read Experiment introduction to familiarize yourself with the end-to-end process of the user persona analysis case. Knowing the overall flow makes the later steps of this tutorial easier to follow.
Data development platform
This tutorial uses the classic DataWorks DataStudio platform. When you create a workspace, do not select Use The New Data Studio. If you use an existing workspace, make sure that this option is not enabled for it.
After February 18, 2025, the new Data Studio is enabled by default the first time you create a workspace with an Alibaba Cloud account that has DataWorks activated, if the workspace is in one of the following regions:
China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)
If the new Data Studio is enabled by default in your workspace, see the Experience the new Data Studio tutorial.
Procedure
Create the EMR cluster and DataWorks workspace for this tutorial. Then, configure the resource group network.
In DataWorks, configure a data synchronization task to sync the provided user information and website log data to Object Storage Service (OSS). Then create an EMR foreign table that parses the data in OSS, which makes the data available in the attached EMR computing resource. You can then query the synchronized data.
Use an EMR Hive node in DataWorks to transform the user information and access log tables that were synced to EMR into the target user persona data.
Configure data quality monitoring for the tables generated during data transformation. This helps detect and block dirty data early to prevent it from affecting downstream processes.
After the user persona analysis workflow completes, data tables are created in EMR. Use Data Map to view the data lineage between these tables.
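The transformation and data quality steps in the procedure can be illustrated outside DataWorks with a small Python sketch. In the tutorial itself this work is done by EMR Hive nodes and DataWorks data quality rules, not by Python. The sample rows, the field names (`uid`, `region`, `gender`, `age_range`, `pv`), and the functions `build_persona` and `check_quality` are all hypothetical and chosen only for illustration; the actual tables and rules may differ.

```python
from collections import defaultdict

# Hypothetical sample rows; the real tutorial tables and columns may differ.
user_info = [
    {"uid": "u001", "gender": "F", "age_range": "20-30"},
    {"uid": "u002", "gender": "M", "age_range": "30-40"},
]
access_log = [
    {"uid": "u001", "region": "Hangzhou", "url": "/home"},
    {"uid": "u001", "region": "Hangzhou", "url": "/cart"},
    {"uid": "u002", "region": "Beijing", "url": "/home"},
    {"uid": None, "region": "Shanghai", "url": "/home"},  # dirty row: no uid
]


def build_persona(users, logs):
    """Count page views per (uid, region), then join with user attributes."""
    page_views = defaultdict(int)
    for row in logs:
        if row["uid"] is None:
            continue  # drop log rows that cannot be attributed to a user
        page_views[(row["uid"], row["region"])] += 1

    attrs = {u["uid"]: u for u in users}
    persona = []
    for (uid, region), pv in sorted(page_views.items()):
        user = attrs.get(uid)
        if user is None:
            continue  # inner join: skip users with no profile record
        persona.append({
            "uid": uid,
            "region": region,
            "gender": user["gender"],
            "age_range": user["age_range"],
            "pv": pv,
        })
    return persona


def check_quality(rows):
    """Return a list of rule violations; an empty list means all rules pass."""
    problems = []
    if not rows:
        problems.append("table is empty")
    if any(r["uid"] is None for r in rows):
        problems.append("null uid found")
    keys = [(r["uid"], r["region"]) for r in rows]
    if len(keys) != len(set(keys)):
        problems.append("duplicate (uid, region) rows")
    return problems


persona = build_persona(user_info, access_log)
violations = check_quality(persona)
if violations:
    # Mimics blocking downstream steps when dirty data is detected.
    raise ValueError(f"quality check failed: {violations}")
```

The sketch keeps the transformation and the quality gate as separate functions, mirroring how the procedure separates data transformation from data quality monitoring so that dirty data is stopped before downstream steps run.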
Consume data
After the user persona analysis is complete, use the DataAnalysis module to visualize the transformed data. This helps you quickly extract key information and understand business trends.