This tutorial demonstrates how to use the DataWorks and EMR product portfolio for data development and analysis. It uses a user profile analysis example to showcase DataWorks features, including Data Integration, Data Studio, and Operation Center.
Tutorial overview
To create better business strategies, you need to obtain basic profile data, such as geographical and social attributes, from user website behavior. This data enables scheduled profile analysis and fine-grained website traffic operations. You can use the DataWorks and EMR product portfolio to perform data synchronization, data transformation, data management, and data consumption.
Before you start the steps in this tutorial, read Case objectives and design to understand the workflow of the user profile analysis case.
Data Studio
This tutorial uses Data Studio (new version). Make sure that Data Studio (new version) is enabled for your workspace. You can enable it in one of the following ways:
When you create a workspace, select Participate in the public preview of DataStudio.
To upgrade from Data Studio (legacy version) to the new version, click the Upgrade button at the top of the legacy Data Studio interface. Then, follow the on-screen instructions to complete the upgrade.
Starting February 18, 2025, Data Studio (new version) is enabled by default when you use an Alibaba Cloud account to enable DataWorks and create a workspace for the first time in one of the following regions:
China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)
Procedure
Create the EMR cluster and DataWorks workspace required for this tutorial. Then, complete the network configurations for the resource group.
In DataWorks, configure a data synchronization task to synchronize user information and website log data to OSS. Then, use an EMR foreign table to parse the data in OSS, synchronize it to the attached EMR computing resource, and query the synchronized data.
Use an EMR Hive node in DataWorks to transform the data in the user information and access log tables that are synchronized to EMR. This process generates the target user profile data.
Configure Data Quality monitoring rules for the tables generated from data transformation. This helps detect and block dirty data early and prevents its impact from spreading.
After the user profile analysis task flow is complete, the corresponding data tables are created in EMR. You can then view the data lineage between these tables in Data Map.
Consume data
After the user profile analysis is complete, you can use the DataAnalysis module to visualize the processed data. This helps you quickly extract key information and gain insights into the business trends behind the data.
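The data flow in the procedure above can be pictured with a small local sketch. It is not DataWorks or EMR code: in the tutorial, these steps run as a data synchronization task, EMR Hive nodes, and Data Quality monitoring rules. The pipe-delimited log format, the field names (uid, pv, gender, region), the sample records, and the helper functions below are all hypothetical. They only illustrate the idea of parsing raw logs, joining them with user information to build profile rows, and blocking the output when a quality rule fails.

```python
from collections import Counter

# Hypothetical raw inputs; the real tutorial data and its format may differ.
RAW_LOGS = [
    "u001|2024-06-01 10:00:00|/index.html",
    "u001|2024-06-01 10:05:00|/cart.html",
    "u002|2024-06-01 11:00:00|/index.html",
    "broken line without delimiters",
]
USER_INFO = {
    "u001": {"gender": "F", "region": "Hangzhou"},
    "u002": {"gender": "M", "region": "Shanghai"},
}


def parse_log_line(line):
    """Split one pipe-delimited log line; return None for malformed lines."""
    parts = line.split("|")
    if len(parts) != 3:
        return None
    uid, ts, url = parts
    return {"uid": uid, "ts": ts, "url": url}


def build_profiles(raw_logs, user_info):
    """Count page views per user and attach basic user attributes."""
    parsed = [rec for rec in map(parse_log_line, raw_logs) if rec is not None]
    views = Counter(rec["uid"] for rec in parsed)
    profiles = []
    for uid, pv in sorted(views.items()):
        attrs = user_info.get(uid, {})
        profiles.append({
            "uid": uid,
            "pv": pv,
            "gender": attrs.get("gender"),
            "region": attrs.get("region"),
        })
    return profiles


def check_quality(profiles):
    """Return a list of rule violations; an empty list means the data passes."""
    problems = []
    if not profiles:
        problems.append("profile table is empty")
    for row in profiles:
        if row["region"] is None:
            problems.append(f"missing region for {row['uid']}")
    return problems


profiles = build_profiles(RAW_LOGS, USER_INFO)
violations = check_quality(profiles)
if violations:
    # Mirrors the idea of a blocking rule: stop before dirty data spreads.
    raise ValueError(f"blocked by quality rules: {violations}")
```

The quality check runs before anything downstream reads the profile rows, which is the same ordering the procedure describes: transform first, validate the generated tables, and only then consume the data.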