This tutorial walks you through a user profile analysis experiment using DataWorks and E-MapReduce (EMR). By the end of this tutorial, you will know how to:
- Synchronize data from multiple sources — ApsaraDB RDS for MySQL and Object Storage Service (OSS) — to EMR.
- Process and split raw web access logs into analyzable fields.
- Aggregate user information with processed log data to generate a user profile.
- Monitor data quality with automated rules to catch dirty data before it reaches downstream nodes.
Prerequisites
Before you begin, make sure you have:
- An Alibaba Cloud account with access to DataWorks and EMR.
- Basic familiarity with SQL and data pipeline concepts.
Experiment design
Background
To support fine-grained operations on website traffic, this experiment builds a periodic user profile analysis pipeline. The pipeline covers geographical and social attributes of users and uses DataWorks to:
- Synchronize data from different sources.
- Process and transform raw data.
- Manage data quality.
- Make data available for consumption.
How it works
The pipeline runs in four steps:
- Synchronize: Data Integration extracts basic user information from ApsaraDB RDS for MySQL and web access logs from OSS, then loads both into EMR.
- Process: Split the raw access logs into structured fields using functions and regular expressions. Aggregate the split logs with user information in EMR.
- Transform: Further process the aggregated data to produce a complete user profile.
- Monitor: Apply Data Quality rules to detect dirty data in scheduling node outputs and prevent it from propagating to descendant nodes.
DataWorks services involved
The following table maps each pipeline step to the DataWorks service used and the skills you will practice.
| Step | Operation | What you will learn |
|---|---|---|
| Synchronize data | Configure a synchronization node to load user information from ApsaraDB RDS for MySQL and access logs from OSS into EMR | Synchronize data from different sources to EMR; create a table for the related data source; trigger a node manually; view node logs |
| Process data | In DataStudio, split access logs into fields using functions or regular expressions, then aggregate logs and user information to generate a user profile | Create and configure nodes in a DataWorks workflow; run a workflow |
| Configure data quality monitoring rules | In Data Quality, configure monitoring rules for tables generated by scheduling nodes to detect dirty data | Set up data quality rules to identify dirty data caused by source data changes and prevent it from affecting descendant nodes |
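The monitoring step above can be sketched in code. The check below is a minimal illustrative stand-in for DataWorks Data Quality rules, not the product's actual API: `check_table_quality`, the `max_null_ratio` threshold, and the sample rows are all assumptions made for this sketch. It shows the kind of rule the tutorial configures — block descendant nodes when a node's output table is empty or too many rows have a missing primary key.

```python
def check_table_quality(rows, key_field="uid", max_null_ratio=0.01):
    """Illustrative checks in the spirit of Data Quality monitoring rules:
    fail if the output table is empty, or if the share of rows with a
    NULL/empty key field exceeds the allowed threshold."""
    if not rows:
        return False, "table is empty"
    null_count = sum(1 for row in rows if not row.get(key_field))
    ratio = null_count / len(rows)
    if ratio > max_null_ratio:
        return False, f"null ratio for {key_field} is {ratio:.2%}"
    return True, "ok"

# A clean batch passes; an empty batch is treated as dirty output.
ok_clean, _ = check_table_quality([{"uid": "u001"}, {"uid": "u002"}])
ok_empty, reason = check_table_quality([])
```

In DataWorks itself you would attach such rules to the tables produced by scheduling nodes, so a failed check stops downstream nodes from consuming dirty data.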
Experiment data
Access log data
The experiment uses raw web access logs stored in the OSS object user_log.txt. Each log line follows this format:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];
The following table describes each field extracted from the raw log.
| Field | Description |
|---|---|
| $remote_addr | IP address of the client |
| $remote_user | Username used to log in to the client |
| $time_local | Local time on the server |
| $request | HTTP request, including the request type, URL, and HTTP version |
| $status | HTTP status code returned by the server |
| $body_bytes_sent | Number of bytes sent to the client, excluding header bytes |
| $http_referer | Source URL of the request |
| $http_user_agent | Client information, such as the browser type |
User information data
Basic user information is stored in ApsaraDB RDS for MySQL in the table ods_user_info_d.
| Field | Description |
|---|---|
| uid | User name |
| gender | Gender |
| age_range | Age range |
| zodiac | Zodiac sign |
Final output data
After processing and aggregating the two data sources, the pipeline produces a result table with the following schema.
| Field | Description |
|---|---|
| uid | User name |
| region | Region |
| device | Terminal type |
| pv | Page views |
| gender | Gender |
| age_range | Age range |
| zodiac | Zodiac sign |
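The aggregation that produces this schema can be sketched with plain SQL. The snippet below uses an in-memory SQLite database as a stand-in for EMR, with simplified sample rows: the `logs` table (with `region` and `device` already derived from the raw fields) and the sample values are assumptions for this sketch, while `ods_user_info_d` and the output columns follow the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Simplified stand-ins for the EMR tables (not the real DDL).
cur.execute("CREATE TABLE logs (uid TEXT, region TEXT, device TEXT)")
cur.execute("CREATE TABLE ods_user_info_d "
            "(uid TEXT, gender TEXT, age_range TEXT, zodiac TEXT)")
cur.executemany("INSERT INTO logs VALUES (?,?,?)",
                [("u001", "hangzhou", "android"),
                 ("u001", "hangzhou", "android"),
                 ("u002", "beijing", "ios")])
cur.executemany("INSERT INTO ods_user_info_d VALUES (?,?,?,?)",
                [("u001", "female", "20-25", "Pisces"),
                 ("u002", "male", "30-35", "Leo")])
# Count page views per user/region/device, then join user attributes
# to produce the profile schema described above.
cur.execute("""
    SELECT l.uid, l.region, l.device, COUNT(*) AS pv,
           u.gender, u.age_range, u.zodiac
    FROM logs l
    JOIN ods_user_info_d u ON u.uid = l.uid
    GROUP BY l.uid, l.region, l.device
""")
profile = cur.fetchall()
```

Each output row carries the user's traffic metrics (`region`, `device`, `pv`) alongside the social attributes joined in from `ods_user_info_d`.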
What's next
After completing the experiment, you can extend the pipeline with the following operations.
| Operation | Description | References |
|---|---|---|
| Manage metadata | In Data Map, view and manage the metadata of the source table | Manage data |
| Consume data | In DataAnalysis, run SQL queries against the result table to analyze geographical distribution of users and city rankings. Use the API feature in DataService Studio to expose the result table as an API. | Visualize data on a dashboard and Use an API to provide data services |
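As a taste of the consumption step, the query below ranks regions by total page views against a hypothetical result table named `user_profile` (the table name and sample rows are assumptions; in the product you would run the equivalent SQL in DataAnalysis). SQLite again stands in for the real query engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical result table mirroring the final output schema.
cur.execute("""CREATE TABLE user_profile (
    uid TEXT, region TEXT, device TEXT, pv INTEGER,
    gender TEXT, age_range TEXT, zodiac TEXT)""")
cur.executemany(
    "INSERT INTO user_profile VALUES (?,?,?,?,?,?,?)",
    [("u001", "hangzhou", "android", 12, "female", "20-25", "Pisces"),
     ("u002", "beijing", "ios", 3, "male", "30-35", "Leo"),
     ("u003", "hangzhou", "ios", 7, "male", "20-25", "Aries")])
# Geographical distribution: total page views per region, best first.
cur.execute("""
    SELECT region, SUM(pv) AS total_pv
    FROM user_profile
    GROUP BY region
    ORDER BY total_pv DESC
""")
ranking = cur.fetchall()
```

The same result table could also be exposed through DataService Studio as an API instead of being queried directly.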