This topic describes how to use DataWorks together with E-MapReduce (EMR) for data warehouse development and analysis. It walks through a user profile analysis case study so that you can try out DataWorks services such as Data Integration, Data Studio, and Operation Center.
Experiment introduction
To develop effective business management strategies, you need basic profile data about website users, derived from their activities on the website. This profile data includes the geographical and social attributes of the users. Analyzing the profile data by time and location lets you manage website traffic at a finer granularity. In this experiment, you use DataWorks together with EMR to complete data synchronization, data processing, data management, and data consumption.
Before you start, read Experiment introduction so that you understand the end-to-end process of the user profile analysis experiment and can complete this tutorial.
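The idea of analyzing profile data by time and location can be illustrated with a minimal Python sketch. It is not part of the tutorial itself: the record layout and the field names (`uid`, `region`, `time`) are hypothetical, and in the tutorial the equivalent aggregation would be done in EMR Hive nodes rather than in Python.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log records; the field names are illustrative only
# and do not come from the tutorial's sample data.
access_logs = [
    {"uid": "u001", "region": "Hangzhou", "time": "2024-06-01 08:15:00"},
    {"uid": "u002", "region": "Beijing", "time": "2024-06-01 09:40:00"},
    {"uid": "u001", "region": "Hangzhou", "time": "2024-06-01 21:05:00"},
    {"uid": "u003", "region": "Hangzhou", "time": "2024-06-02 10:30:00"},
]


def count_visits_by_day_and_region(logs):
    """Count visits for each (day, region) pair."""
    counts = Counter()
    for record in logs:
        # Reduce the timestamp to a calendar day so visits group by date.
        day = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S").date().isoformat()
        counts[(day, record["region"])] += 1
    return counts


visits = count_visits_by_day_and_region(access_logs)
for (day, region), n in sorted(visits.items()):
    print(day, region, n)
```

Grouping on a (day, region) key is the simplest form of the time-and-location breakdown; a real profile would typically join these counts with user attributes first.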
Procedure
Step 1: Prepare the environment
Create an EMR cluster and a DataWorks workspace that are required for the tutorial, and configure the environment.
Step 2: Synchronize data
Configure a data synchronization task in DataWorks to synchronize the basic user information and website access logs provided in the tutorial to an Object Storage Service (OSS) data source. Then use EMR Hive nodes to create tables so that you can query the synchronized data.
Step 3: Process data
Use EMR Hive nodes in DataWorks to process the basic user information table and the access log table that were synchronized to OSS, and obtain the desired user profile data.
Step 4: Configure a monitor
In DataWorks Data Quality, configure a monitor for the dwd_log_info_di_emr table that is generated after the synchronized data is processed.
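Conceptually, a monitor evaluates one or more rules against a table after it is produced. The following Python sketch illustrates that idea only; it does not use the DataWorks Data Quality API. The rule (a partition must contain at least one row), the partition name `dt=20240601`, and the function names are all assumptions made for illustration.

```python
def check_row_count(row_count, minimum=1):
    """Return True when the partition holds at least `minimum` rows."""
    return row_count >= minimum


def evaluate_monitor(table, partition, row_count):
    """Apply the hypothetical row-count rule and report the outcome."""
    status = "passed" if check_row_count(row_count) else "failed"
    return {"table": table, "partition": partition, "status": status}


# Hypothetical partition name and row count; an empty partition fails the rule.
result = evaluate_monitor("dwd_log_info_di_emr", "dt=20240601", 0)
print(result)
```

A rule of this kind catches the case where an upstream task succeeds but writes no data, which would otherwise pass silently into downstream tables.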