This topic uses a simple user profile analysis experiment to show how to use DataWorks and E-MapReduce (EMR) to synchronize data, develop data, and perform O&M on nodes. It covers the background information, the workflow design, the DataWorks services involved in the experiment, and the experiment data.
Experiment design
Background information
To help your enterprise better formulate operational strategies, you must obtain basic user profile data from user behavior on websites, such as the geographical and social attributes of users, to periodically perform a user profile analysis. This can help you perform fine-grained operations on website traffic. To do this, you must use DataWorks to perform the following operations:
Synchronize data.
Process data.
Manage data.
Consume data.
Workflow design
For this experiment, you can complete user profile analysis by using DataWorks and EMR. The procedure contains the following steps:
In Data Integration, extract basic user information and website access logs of users from different data sources to a compute engine.
Process and split the website access logs of users in the compute engine into fields that can be analyzed.
Aggregate the basic user information and the processed website access logs of users in the compute engine.
Further process the data to generate a basic user profile.
DataWorks services involved
The following table describes the use of different DataWorks services in each step of the experiment.
Step | Operation | Phase-specific objective |
Synchronize data | Configure a synchronization node to synchronize the basic user information that is stored in ApsaraDB RDS for MySQL and website access logs of users that are stored in Object Storage Service (OSS) to EMR. | Learn how to perform the following operations:
|
Process data | In DataStudio, split the website access logs of users into fields that can be analyzed by using methods such as functions or regular expressions, and aggregate the processed website access logs of users and the basic user information to generate a basic user profile. | Learn how to perform the following operations:
|
Configure data quality monitoring rules | Data Quality allows you to monitor the quality of data in tables that are generated by scheduling nodes. You can configure monitoring rules for tables to monitor the quality of data in the tables. | Configure data quality monitoring rules for the table generated by a DataWorks node to quickly identify the dirty data that is generated during the change in source data and prevent the dirty data from affecting descendant nodes. |
Experiment data
Structure of log data for the experiment
Before you perform the operations in this experiment, make sure that you are familiar with the existing business data, the data format, and the basic user profile data structure that is required for business background analysis.
The following code shows the raw log data that is stored in the OSS object user_log.txt:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent"$http_referer" "$http_user_agent" [unknown_content];
The following table describes the valid information that is obtained from the raw log data.
Field | Description |
$remote_addr | The IP address of the client that sends the request. |
$remote_user | The username that is used to log on to the client. |
$time_local | The local time of the server. |
$request | The HTTP request, including the request type, request URL, and HTTP version number. |
$status | The status code that is returned by the server. |
$body_bytes_sent | The number of bytes returned to the client, not including the number of bytes of the header. |
$http_referer | The source URL of the request. |
$http_user_agent | The information about the client that sends the request, such as the browser used. |
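As an illustration of how a log line in this format can be split into the fields listed above, the following Python sketch uses a regular expression with named groups. It is not the processing code of the experiment, which runs in the EMR compute engine. The function name parse_log_line and the sample line are hypothetical, and the pattern tolerates an optional space between $body_bytes_sent and the quoted $http_referer.

```python
import re

# Named groups mirror the fields in the table above.
# Assumption: the trailing "[unknown_content];" part is ignored.
LOG_PATTERN = re.compile(
    r'^(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-)\s*'
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # $request holds the request type, URL, and HTTP version.
    parts = fields["request"].split()
    if len(parts) == 3:
        fields["method"], fields["url"], fields["protocol"] = parts
    else:
        fields["method"] = fields["url"] = fields["protocol"] = None
    return fields


# Hypothetical sample line, made up for this sketch.
sample = (
    '192.0.2.10 - alice [10/Oct/2023:13:55:36 +0800] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://example.com/start" "Mozilla/5.0 (Windows NT 10.0)" '
    '[unknown_content];'
)
parsed = parse_log_line(sample)
```

All values are kept as strings. Lines that do not match return None, so they can be counted or routed to a dirty-data check instead of being silently dropped.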
Structure of user information data for the experiment
Structure of user information data stored in ApsaraDB RDS for MySQL (ods_user_info_d)
Field | Description |
uid | The name of the user. |
gender | The gender. |
age_range | The age range. |
zodiac | The zodiac sign. |
Structure of final data obtained in the experiment
Confirm the schema of the final data table as described in the following table based on the valid data that you obtain after you analyze raw data and your business requirements.
Field | Description |
uid | The name of the user. |
region | The region. |
device | The terminal type. |
pv | The number of page views. |
gender | The gender. |
age_range | The age range. |
zodiac | The zodiac sign. |
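To make the target schema concrete, the following Python sketch shows one way the processed log records and the basic user information could be combined into rows of this shape. It is a minimal in-memory illustration, not the method used in the experiment. It assumes that each processed log record already carries uid, region, and device values, and that pv is counted per combination of uid, region, and device. The function name build_user_profile and the sample records are hypothetical.

```python
from collections import Counter


def build_user_profile(log_records, user_info):
    """Count page views per (uid, region, device) and attach user attributes."""
    # Assumption: one log record equals one page view.
    pv_counts = Counter(
        (rec["uid"], rec["region"], rec["device"]) for rec in log_records
    )
    info_by_uid = {user["uid"]: user for user in user_info}

    profile = []
    for (uid, region, device), pv in sorted(pv_counts.items()):
        # Users without basic information keep None in the attribute fields.
        info = info_by_uid.get(uid, {})
        profile.append({
            "uid": uid,
            "region": region,
            "device": device,
            "pv": pv,
            "gender": info.get("gender"),
            "age_range": info.get("age_range"),
            "zodiac": info.get("zodiac"),
        })
    return profile


# Hypothetical sample data, made up for this sketch.
logs = [
    {"uid": "u001", "region": "Hangzhou", "device": "android"},
    {"uid": "u001", "region": "Hangzhou", "device": "android"},
    {"uid": "u002", "region": "Beijing", "device": "windows_pc"},
]
users = [
    {"uid": "u001", "gender": "female", "age_range": "30-40", "zodiac": "Leo"},
]
rows = build_user_profile(logs, users)
```

In this sketch, a user who appears in the logs but not in the user information table still produces a row, which corresponds to a left join from the log data to the user data. Whether the experiment keeps or drops such rows is not stated in this topic.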
What to do next
Operation | Description | References |
Manage metadata | In Data Map, view and manage the metadata of the source table. | |
Consume data |
|