This tutorial shows how to use DataWorks and Spark for big data development and analysis. It uses a user profile analysis case to demonstrate the capabilities of DataWorks in Data Integration, Data Studio, and Operation Center.
Case introduction
To develop effective business strategies, you can derive basic profile data about website users, such as their geographical and social attributes, from their activity on the website. You can then analyze this profile data by time and location to manage website traffic at a finer granularity. This case uses DataWorks with EMR Serverless Spark to complete data synchronization, data processing, data management, and data consumption.
Before you begin, read Experiment introduction to familiarize yourself with the entire process of the user profile analysis case so that you can complete this tutorial.
Data development platform
This tutorial uses the new version of Data Studio in DataWorks. Make sure that Data Studio is enabled in your workspace. You can enable it in one of the following ways:
Turn on Participate in Public Preview of Data Studio when you create a workspace.
In the top navigation bar of the old-version DataStudio page, click Upgrade Data Studio and follow the prompts to upgrade to the new version of Data Studio.
Since February 19, 2025, Data Studio has been enabled by default when you activate DataWorks and create a workspace for the first time with your Alibaba Cloud account in one of the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Hong Kong), Singapore, Indonesia (Jakarta), and Germany (Frankfurt).
Procedure
Create the Spark project and DataWorks workspace required for this tutorial. Then, configure the related resource group and network settings.
Configure a data synchronization task in DataWorks to synchronize the user information and access log data provided in this tutorial to a Spark computing resource. Then, query the synchronized data.
Use EMR Spark SQL nodes in DataWorks to process data in the user information table and access log table that are synchronized to Spark. This process generates the target user profile data.
Configure data quality monitoring rules for the tables generated from data processing. This helps you identify and block dirty data early to prevent its impact from spreading.
After the user profile analysis task is complete, data tables are created in EMR. View the generated data tables and their table lineage in the Data Map module.
Consume data
After the user profile analysis is complete, use the DataAnalysis module to visualize the processed data. This helps you quickly extract key information and gain insights into business trends.
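The data processing and data quality steps above can be pictured with a small sketch. It is plain Python rather than the EMR Spark SQL nodes that the tutorial actually uses, and it does not call any DataWorks or Spark API. The record types, field names (`uid`, `region`, `day`, `gender`, `age_range`, `pv`), and the quality rule are all assumptions made for illustration. The real table schemas are defined in the later steps of the tutorial.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UserInfo:
    # Hypothetical user information record (field names are assumptions).
    uid: str
    gender: str
    age_range: str


@dataclass(frozen=True)
class AccessLog:
    # Hypothetical access log record (field names are assumptions).
    uid: str
    region: str
    day: str  # access date as an ISO string


def build_profiles(users, logs):
    """Inner-join logs to users on uid and count page views per (uid, region, day)."""
    users_by_id = {u.uid: u for u in users}
    # Logs whose uid has no matching user record are dropped, as in an inner join.
    page_views = Counter(
        (log.uid, log.region, log.day) for log in logs if log.uid in users_by_id
    )
    return [
        {
            "uid": uid,
            "region": region,
            "day": day,
            "gender": users_by_id[uid].gender,
            "age_range": users_by_id[uid].age_range,
            "pv": pv,
        }
        for (uid, region, day), pv in sorted(page_views.items())
    ]


def find_dirty_rows(profiles):
    """Illustrative quality rule: flag rows with an empty region or a non-positive pv."""
    return [row for row in profiles if not row["region"] or row["pv"] <= 0]


# Made-up sample data, used only to exercise the two functions.
sample_users = [UserInfo("u1", "F", "20-30"), UserInfo("u2", "M", "30-40")]
sample_logs = [
    AccessLog("u1", "Hangzhou", "2025-02-19"),
    AccessLog("u1", "Hangzhou", "2025-02-19"),
    AccessLog("u2", "", "2025-02-19"),       # missing region: treated as dirty data
    AccessLog("u3", "Beijing", "2025-02-19"),  # unknown user: dropped by the join
]
sample_profiles = build_profiles(sample_users, sample_logs)
sample_dirty = find_dirty_rows(sample_profiles)
```

In the tutorial itself, the equivalent join and aggregation would run as SQL in EMR Spark SQL nodes, and the check would be expressed as a data quality monitoring rule on the output table rather than as a Python function.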