This tutorial shows you how to use DataWorks together with EMR Serverless Spark for big data development and analysis. It walks through a user persona analysis case to demonstrate the Data Integration, Data Development, and Operation Center capabilities of DataWorks.
Tutorial overview
To build better business strategies, you need basic user persona data derived from website behavior, such as geographic location and social status. You can then run user persona analysis on a schedule to support fine-grained website traffic operations. In this tutorial, you use DataWorks and EMR Serverless Spark to synchronize, transform, manage, and consume this data.
Before you start, read Experiment introduction to familiarize yourself with the entire process of the user persona analysis case.
Data development platform
This tutorial uses the previous version of Data Development (DataStudio) in DataWorks. Make sure that your workspace does not use the New Version of Data Development (Data Studio).
When you create a workspace, do not select Use the New Version of Data Development (Data Studio).
After February 18, 2025, the new version of Data Development is enabled by default when an Alibaba Cloud account activates DataWorks for the first time and creates a workspace in one of the following regions:
China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)
If the new version of Data Development is enabled by default for your workspace, see Get started with the new version of Data Development.
Procedure
Create the Spark project and DataWorks workspace required for this tutorial. Then, complete the network configuration for the resource group.
Configure a data synchronization pipeline in DataWorks. Synchronize the user information and website log data provided in this tutorial to the Spark computing resource. Then, query the synchronized data.
Use an EMR Spark SQL node in DataWorks to transform the data in the user information table and access log table that were synchronized to Spark. This produces the target user persona data.
Configure Data Quality monitoring rules for the tables generated by the data transformation. The rules help you detect and block dirty data early so that it does not affect downstream tasks.
After the user persona analysis workflow finishes running, the corresponding data tables are created in Spark. In the Data Map module, you can view the generated tables and their table lineage.
Consume data
After the user persona analysis is complete, you can use the DataAnalysis module to visualize the transformed data. This helps you quickly extract key information and understand business trends.
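To make the transformation and data quality steps of the procedure more concrete, the following is a minimal sketch and not the tutorial's actual code. It uses Python's built-in sqlite3 module as a local stand-in for an EMR Spark SQL node, so the SQL dialect differs from Spark SQL. The table names (user_info, access_log, user_profile), the column names (uid, gender, region, url, pv), and the two quality rules are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_user_profile(conn):
    """Join the access log to user info and count page views per user and region.

    Table and column names are hypothetical; the tutorial's real schemas may differ.
    """
    conn.executescript("""
        DROP TABLE IF EXISTS user_profile;
        CREATE TABLE user_profile AS
        SELECT u.uid,
               u.gender,
               l.region,
               COUNT(*) AS pv
        FROM access_log AS l
        JOIN user_info AS u ON u.uid = l.uid
        GROUP BY u.uid, u.gender, l.region;
    """)


def check_quality(conn):
    """Return a list of rule violations for user_profile; an empty list means it passes.

    The two rules (non-empty table, no NULL uid) are illustrative assumptions,
    not the rules that the Data Quality module ships with.
    """
    problems = []
    row_count = conn.execute("SELECT COUNT(*) FROM user_profile").fetchone()[0]
    if row_count == 0:
        problems.append("table is empty")
    null_uids = conn.execute(
        "SELECT COUNT(*) FROM user_profile WHERE uid IS NULL"
    ).fetchone()[0]
    if null_uids:
        problems.append(f"{null_uids} rows have a NULL uid")
    return problems


def demo():
    """Load a few sample rows, run the transformation, and check the result."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE user_info (uid TEXT, gender TEXT);
        CREATE TABLE access_log (uid TEXT, region TEXT, url TEXT);
        INSERT INTO user_info VALUES ('u1', 'F'), ('u2', 'M');
        INSERT INTO access_log VALUES
            ('u1', 'Hangzhou', '/home'),
            ('u1', 'Hangzhou', '/cart'),
            ('u2', 'Beijing', '/home');
    """)
    build_user_profile(conn)
    rows = conn.execute(
        "SELECT uid, gender, region, pv FROM user_profile ORDER BY uid"
    ).fetchall()
    return rows, check_quality(conn)


if __name__ == "__main__":
    profile_rows, violations = demo()
    print(profile_rows)
    print(violations)
```

In the sketch, the quality check runs only after the transformation has produced its output table. This mirrors the order of the procedure, where monitoring rules are attached to the tables generated by the data transformation.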