Use old-version DataWorks

This tutorial shows how to use DataWorks with EMR Serverless Spark for big data development and analytics. Through a user profile analysis case, you work with the Data Integration, Data Development, and Operation Center modules of DataWorks.

Case introduction

To develop effective business strategies, you can obtain basic profile data of website users from their website activities. This data includes geographical and social attributes. You can analyze profile data by time and location to enable fine-grained operations on website traffic. This case uses DataWorks and E-MapReduce (EMR) Serverless Spark to complete data synchronization, data transformation, data management, and data consumption.

Note

Before you start the steps in this tutorial, read Case objectives and design to understand the workflow of the user profile analysis case.

Data development platform

This case uses DataStudio (old version). Make sure your workspace is not participating in the public preview of the new DataStudio.

When you create a workspace, do not select Participate in Public Preview of DataStudio.
After February 18, 2025, the new DataStudio is enabled by default when you use an Alibaba Cloud account to activate DataWorks and create a workspace for the first time in one of the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).
If the new DataStudio is already enabled by default for you, see the tutorial in Use the new DataStudio.

Procedure

Step 1: Prepare the environment
Create an EMR Serverless Spark workspace and a DataWorks workspace that are required for the tutorial, and configure the resource group and network settings.
Step 2: Synchronize data
Configure a data synchronization task in DataWorks to synchronize the basic user information and website access logs provided in the tutorial to a Spark computing resource, and query the synchronized data.
Step 3: Process data
Use EMR Spark SQL nodes in DataWorks to process the basic user information table and the access log table that were synchronized to Spark, and produce the user profile data.
Step 4: Monitor data quality
Configure monitoring for the tables generated by data processing so that dirty data is identified and intercepted early, before its impact spreads to downstream tasks.
Step 5: Manage data
Data tables are generated in Spark after a user profile analysis task is complete. You can view the generated data tables and data lineages between the tables in Data Map.
Step 6: Consume data
After the user profile analysis is complete, use the data analytics module to visualize the processed data, allowing you to quickly extract key information and gain insights into business trends behind the data.
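
To make the idea behind Steps 3 and 4 concrete, the following is a minimal sketch in plain Python, not the tutorial's actual Spark SQL or DataWorks rule configuration. It joins access log rows to basic user information, counts page views per user, region, and day, and then flags output rows with missing values as dirty data. The table contents, the column names (uid, gender, age_range, region, dt, pv), and the helper functions build_profiles and find_dirty are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical sample rows. The real tables and columns are defined in the
# later steps of the tutorial, not here.
users = [
    {"uid": "u1", "gender": "F", "age_range": "20-30"},
    {"uid": "u2", "gender": "M", "age_range": "30-40"},
]
access_logs = [
    {"uid": "u1", "region": "Hangzhou", "dt": "2025-01-01"},
    {"uid": "u1", "region": "Hangzhou", "dt": "2025-01-01"},
    {"uid": "u2", "region": "Shanghai", "dt": "2025-01-01"},
    {"uid": "u9", "region": "Beijing", "dt": "2025-01-01"},  # no matching user
]


def build_profiles(users, logs):
    """Step 3 analogue: left-join logs to users and count page views (pv)
    per (uid, region, dt)."""
    by_uid = {u["uid"]: u for u in users}
    views = Counter((r["uid"], r["region"], r["dt"]) for r in logs)
    profiles = []
    for (uid, region, dt), pv in views.items():
        user = by_uid.get(uid, {})  # empty dict when the user is unknown
        profiles.append({
            "uid": uid,
            "gender": user.get("gender"),
            "age_range": user.get("age_range"),
            "region": region,
            "dt": dt,
            "pv": pv,
        })
    return profiles


def find_dirty(rows, required):
    """Step 4 analogue: return rows in which any required column is missing."""
    return [row for row in rows if any(row.get(col) is None for col in required)]


profiles = build_profiles(users, access_logs)
dirty = find_dirty(profiles, required=("uid", "gender", "region"))
```

In this sketch the log row for u9 has no matching user, so its profile row lacks gender and is reported by find_dirty. In the tutorial itself, the equivalent check is configured as a monitor on the output tables rather than written by hand.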