This tutorial walks you through a user profile analysis experiment using DataWorks and E-MapReduce (EMR). By the end of this tutorial, you will know how to:
- Synchronize data from multiple sources — ApsaraDB RDS for MySQL and Object Storage Service (OSS) — to EMR.
- Process and split raw web access logs into analyzable fields.
- Aggregate user information with processed log data to generate a user profile.
- Monitor data quality with automated rules to catch dirty data before it reaches downstream nodes.
Prerequisites
Before you begin, make sure you have:
- An Alibaba Cloud account with access to DataWorks and EMR.
- Basic familiarity with SQL and data pipeline concepts.
Experiment design
Background
To support fine-grained operations on website traffic, this experiment builds a periodic user profile analysis pipeline. The pipeline covers geographical and social attributes of users and uses DataWorks to:
- Synchronize data from different sources.
- Process and transform raw data.
- Manage data quality.
- Make data available for consumption.
How it works
The pipeline runs in four steps:
- Synchronize: Data Integration extracts basic user information from ApsaraDB RDS for MySQL and web access logs from OSS, then loads both into EMR.
- Process: Split the raw access logs into structured fields using functions and regular expressions. Aggregate the split logs with user information in EMR.
- Transform: Further process the aggregated data to produce a complete user profile.
- Monitor: Apply Data Quality rules to detect dirty data in scheduling node outputs and prevent it from propagating to descendant nodes.
DataWorks services involved
The following table maps each pipeline step to the DataWorks service used and the skills you will practice.
| Step | Operation | What you will learn |
|---|---|---|
| Synchronize data | Configure a synchronization node to load user information from ApsaraDB RDS for MySQL and access logs from OSS into EMR | Synchronize data from different sources to EMR; create a table for the related data source; trigger a node manually; view node logs |
| Process data | In DataStudio, split access logs into fields using functions or regular expressions, then aggregate logs and user information to generate a user profile | Create and configure nodes in a DataWorks workflow; run a workflow |
| Configure data quality monitoring rules | In Data Quality, configure monitoring rules for tables generated by scheduling nodes to detect dirty data | Set up data quality rules to identify dirty data caused by source data changes and prevent it from affecting descendant nodes |
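The monitoring step above can be sketched in code. The check below is a minimal illustrative stand-in for DataWorks Data Quality rules, not the product's actual API: `check_table_quality`, the `max_null_ratio` threshold, and the sample rows are all assumptions made for this sketch. It shows the kind of rule the tutorial configures — block descendant nodes when a node's output table is empty or too many rows have a missing primary key.

```python
def check_table_quality(rows, key_field="uid", max_null_ratio=0.01):
    """Illustrative checks in the spirit of Data Quality monitoring rules:
    fail if the output table is empty, or if the share of rows with a
    NULL/empty key field exceeds the allowed threshold."""
    if not rows:
        return False, "table is empty"
    null_count = sum(1 for row in rows if not row.get(key_field))
    ratio = null_count / len(rows)
    if ratio > max_null_ratio:
        return False, f"null ratio for {key_field} is {ratio:.2%}"
    return True, "ok"

# A clean batch passes; an empty batch is treated as dirty output.
ok_clean, _ = check_table_quality([{"uid": "u001"}, {"uid": "u002"}])
ok_empty, reason = check_table_quality([])
```

In DataWorks itself you would attach such rules to the tables produced by scheduling nodes, so a failed check stops downstream nodes from consuming dirty data.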
Experiment data
Access log data
The experiment uses raw web access logs stored in the OSS object user_log.txt. Each log line follows this format:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];
The following table describes each field extracted from the raw log.
| Field | Description |
|---|---|
| $remote_addr | IP address of the client |
| $remote_user | Username used to log in to the client |
| $time_local | Local time on the server |
| $request | HTTP request, including the request type, URL, and HTTP version |
| $status | HTTP status code returned by the server |
| $body_bytes_sent | Number of bytes sent to the client, excluding header bytes |
| $http_referer | Source URL of the request |
| $http_user_agent | Client information, such as the browser type |
User information data
Basic user information is stored in ApsaraDB RDS for MySQL in the table ods_user_info_d.
| Field | Description |
|---|---|
| uid | User name |
| gender | Gender |
| age_range | Age range |
| zodiac | Zodiac sign |
Final output data
After processing and aggregating the two data sources, the pipeline produces a result table with the following schema.
| Field | Description |
|---|---|
| uid | User name |
| region | Region |
| device | Terminal type |
| pv | Page views |
| gender | Gender |
| age_range | Age range |
| zodiac | Zodiac sign |
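The aggregation that produces this schema can be sketched with plain SQL. The snippet below uses an in-memory SQLite database as a stand-in for EMR, with simplified sample rows: the `logs` table (with `region` and `device` already derived from the raw fields) and the sample values are assumptions for this sketch, while `ods_user_info_d` and the output columns follow the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Simplified stand-ins for the EMR tables (not the real DDL).
cur.execute("CREATE TABLE logs (uid TEXT, region TEXT, device TEXT)")
cur.execute("CREATE TABLE ods_user_info_d "
            "(uid TEXT, gender TEXT, age_range TEXT, zodiac TEXT)")
cur.executemany("INSERT INTO logs VALUES (?,?,?)",
                [("u001", "hangzhou", "android"),
                 ("u001", "hangzhou", "android"),
                 ("u002", "beijing", "ios")])
cur.executemany("INSERT INTO ods_user_info_d VALUES (?,?,?,?)",
                [("u001", "female", "20-25", "Pisces"),
                 ("u002", "male", "30-35", "Leo")])
# Count page views per user/region/device, then join user attributes
# to produce the profile schema described above.
cur.execute("""
    SELECT l.uid, l.region, l.device, COUNT(*) AS pv,
           u.gender, u.age_range, u.zodiac
    FROM logs l
    JOIN ods_user_info_d u ON u.uid = l.uid
    GROUP BY l.uid, l.region, l.device
""")
profile = cur.fetchall()
```

Each output row carries the user's traffic metrics (`region`, `device`, `pv`) alongside the social attributes joined in from `ods_user_info_d`.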
What's next
After completing the experiment, you can extend the pipeline with the following operations.
| Operation | Description | References |
|---|---|---|
| Manage metadata | In Data Map, view and manage the metadata of the source table | Manage data |
| Consume data | In DataAnalysis, run SQL queries against the result table to analyze geographical distribution of users and city rankings. Use the API feature in DataService Studio to expose the result table as an API. | Visualize data on a dashboard and Use an API to provide data services |
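As a taste of the consumption step, the query below ranks regions by total page views against a hypothetical result table named `user_profile` (the table name and sample rows are assumptions; in the product you would run the equivalent SQL in DataAnalysis). SQLite again stands in for the real query engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical result table mirroring the final output schema.
cur.execute("""CREATE TABLE user_profile (
    uid TEXT, region TEXT, device TEXT, pv INTEGER,
    gender TEXT, age_range TEXT, zodiac TEXT)""")
cur.executemany(
    "INSERT INTO user_profile VALUES (?,?,?,?,?,?,?)",
    [("u001", "hangzhou", "android", 12, "female", "20-25", "Pisces"),
     ("u002", "beijing", "ios", 3, "male", "30-35", "Leo"),
     ("u003", "hangzhou", "ios", 7, "male", "20-25", "Aries")])
# Geographical distribution: total page views per region, best first.
cur.execute("""
    SELECT region, SUM(pv) AS total_pv
    FROM user_profile
    GROUP BY region
    ORDER BY total_pv DESC
""")
ranking = cur.fetchall()
```

The same result table could also be exposed through DataService Studio as an API instead of being queried directly.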