This tutorial shows how to use DataWorks and Spark for big data development and analysis. A user profile analysis case study demonstrates the capabilities of DataWorks in Data Integration, DataStudio, and Operation Center.
Case introduction
To develop effective business strategies, you can obtain basic profile data about website users from their website activities. This data includes geographical and social attributes. You can analyze this profile data by time and location to enable fine-grained operations on website traffic. This case uses DataWorks with EMR Serverless Spark to complete data synchronization, data processing, data management, and data consumption.
Before you start, read Tutorial objectives and design to understand the overall flow of the user profile analysis.
Data Studio
This tutorial uses the new version of Data Studio in DataWorks. Make sure that it is enabled for your workspace. You can enable it in either of the following ways:
When you create a workspace, select Use Data Studio (New Version).
To upgrade from the old DataStudio version, click the Upgrade button at the top of the interface. Then, follow the on-screen instructions to complete the upgrade.
After February 18, 2025, the new DataStudio is enabled by default when an Alibaba Cloud account enables DataWorks and creates a workspace for the first time in the following regions:
China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)
Procedure
Create the Spark project and DataWorks workspace required for this tutorial. Then, configure the related resource group and network configurations.
Configure a data synchronization task in DataWorks to synchronize the user information and access log data provided in this tutorial to a Spark computing resource. Then, query the synchronized data.
Use EMR Spark SQL nodes in DataWorks to process data in the user information table and access log table that are synchronized to Spark. This process generates the target user profile data.
Configure data quality monitoring rules for the tables generated by data processing. These rules help you identify and block dirty data early, before it spreads to downstream tasks.
After the user profile analysis task is complete, data tables are created in EMR. View the generated data tables and their table lineage in the Data Map module.
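The processing and quality monitoring steps above can be pictured with a small local sketch. It is plain Python rather than the EMR Spark SQL that the tutorial runs, and the sample rows, the column names (uid, region, time, dt, pv), and the function names are all hypothetical, because this overview does not show the real table schemas. The sketch joins access logs to user information, counts page views by date and region, and then applies a simple not-null rule before the result is passed on.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; the tutorial's real schemas are not shown in this overview.
user_info = [
    {"uid": "u1", "gender": "F", "region": "Hangzhou"},
    {"uid": "u2", "gender": "M", "region": "Beijing"},
]
access_log = [
    {"uid": "u1", "time": "2025-02-18 10:15:00", "url": "/home"},
    {"uid": "u1", "time": "2025-02-18 11:02:00", "url": "/cart"},
    {"uid": "u2", "time": "2025-02-19 09:30:00", "url": "/home"},
    {"uid": "u9", "time": "2025-02-19 09:31:00", "url": "/home"},  # no matching user
]


def build_profile(users, logs):
    """Join logs to users on uid, then count page views per (date, region)."""
    by_uid = {u["uid"]: u for u in users}
    views = Counter()
    for row in logs:
        user = by_uid.get(row["uid"])
        if user is None:
            continue  # inner-join semantics: drop logs without a user record
        day = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S").date().isoformat()
        views[(day, user["region"])] += 1
    return [{"dt": d, "region": r, "pv": n} for (d, r), n in sorted(views.items())]


def check_quality(rows, required=("dt", "region", "pv")):
    """Return a list of rule violations; an empty list means the table passes."""
    problems = []
    if not rows:
        problems.append("table is empty")
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) in (None, ""):
                problems.append(f"row {i}: {col} is null")
    return problems


profile = build_profile(user_info, access_log)
violations = check_quality(profile)
if violations:
    # Mirrors the idea of blocking dirty data before it reaches downstream tasks.
    raise ValueError(f"blocking downstream tasks: {violations}")
print(profile)
```

In the tutorial itself, the join and aggregation would be expressed in EMR Spark SQL nodes and the check would be configured as a data quality monitoring rule. The sketch only illustrates the order of operations: process first, validate second, and stop if validation fails.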
Consume data
After the user profile analysis is complete, use the DataAnalysis module to visualize the processed data. This helps you quickly extract key information and gain insights into business trends.