
E-MapReduce: Experiment overview

Last Updated: Mar 26, 2026

This tutorial walks you through a user profile analysis experiment using DataWorks and E-MapReduce (EMR). By the end of this tutorial, you will know how to:

  • Synchronize data from multiple sources — ApsaraDB RDS for MySQL and Object Storage Service (OSS) — to EMR.

  • Process and split raw web access logs into analyzable fields.

  • Aggregate user information with processed log data to generate a user profile.

  • Monitor data quality with automated rules to catch dirty data before it reaches downstream nodes.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud account with access to DataWorks and EMR.

  • Basic familiarity with SQL and data pipeline concepts.

Experiment design

Background

To support fine-grained operations on website traffic, this experiment builds a periodic user profile analysis pipeline. The pipeline covers geographical and social attributes of users and uses DataWorks to:

  • Synchronize data from different sources.

  • Process and transform raw data.

  • Manage data quality.

  • Make data available for consumption.

How it works

The pipeline runs in four steps:

  1. Synchronize: Data Integration extracts basic user information from ApsaraDB RDS for MySQL and web access logs from OSS, then loads both into EMR.

  2. Process: Split the raw access logs into structured fields using functions and regular expressions. Aggregate the split logs with user information in EMR.

  3. Transform: Further process the aggregated data to produce a complete user profile.

  4. Monitor: Apply Data Quality rules to detect dirty data in scheduling node outputs and prevent it from propagating to descendant nodes.

DataWorks services involved

The following table maps each pipeline step to the DataWorks service used and the skills you will practice.

| Step | Operation | What you will learn |
| --- | --- | --- |
| Synchronize data | Configure a synchronization node to load user information from ApsaraDB RDS for MySQL and access logs from OSS into EMR | Synchronize data from different sources to EMR; create a table for the related data source; trigger a node manually; view node logs |
| Process data | In DataStudio, split access logs into fields using functions or regular expressions, then aggregate logs and user information to generate a user profile | Create and configure nodes in a DataWorks workflow; run a workflow |
| Configure data quality monitoring rules | In Data Quality, configure monitoring rules for tables generated by scheduling nodes to detect dirty data | Set up data quality rules to identify dirty data caused by source data changes and prevent it from affecting descendant nodes |
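The dirty-data check that Data Quality rules perform can be illustrated with a minimal sketch. This is a hypothetical example, not the DataWorks API: it flags a node's output when the share of rows with a missing key field exceeds a threshold, which is the kind of condition a monitoring rule would block downstream nodes on. The field name `uid` and the 5% threshold are assumptions for illustration.

```python
# Hypothetical sketch of a dirty-data check similar to a Data Quality rule:
# flag an output table when too many rows have a missing or empty key field.
# The key field name and threshold are illustrative assumptions.

def dirty_row_ratio(rows, key="uid"):
    """Return the fraction of rows whose key field is missing or empty."""
    if not rows:
        return 1.0  # an empty output is itself treated as suspicious
    dirty = sum(1 for r in rows if not r.get(key))
    return dirty / len(rows)

def passes_quality_check(rows, threshold=0.05):
    """True if dirty rows stay within the threshold; False blocks downstream."""
    return dirty_row_ratio(rows) <= threshold

# One of four rows has an empty uid: 25% dirty, which fails a 5% threshold.
rows = [{"uid": "u001"}, {"uid": ""}, {"uid": "u003"}, {"uid": "u004"}]
ok = passes_quality_check(rows)
```

In the actual experiment, this logic is expressed declaratively as a monitoring rule attached to the scheduling node's output table rather than as code you write yourself.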

Experiment data

Access log data

The experiment uses raw web access logs stored in the OSS object user_log.txt. Each log line follows this format:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];

The following table describes each field extracted from the raw log.

| Field | Description |
| --- | --- |
| $remote_addr | IP address of the client |
| $remote_user | Username used to log in to the client |
| $time_local | Local time on the server |
| $request | HTTP request, including the request type, URL, and HTTP version |
| $status | HTTP status code returned by the server |
| $body_bytes_sent | Number of bytes sent to the client, excluding header bytes |
| $http_referer | Source URL of the request |
| $http_user_agent | Client information, such as the browser type |
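In the experiment, the split is done with functions or regular expressions in EMR SQL, but the idea can be sketched in a few lines of Python. The pattern below mirrors the documented log format; the sample log line and its values are illustrative, not taken from user_log.txt.

```python
import re

# Sketch: split one raw access-log line into the fields described above.
# Named groups follow the documented format; the sample line is made up.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)

def parse_log_line(line: str) -> dict:
    """Return a dict of named fields, or an empty dict for a dirty line."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else {}

sample = ('192.168.1.10 - frank [10/Oct/2025:13:55:36 +0800] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"https://example.com" "Mozilla/5.0"')
fields = parse_log_line(sample)
```

Lines that do not match the pattern come back empty, which is exactly the kind of dirty data the Data Quality rules in the last pipeline step are meant to catch.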

User information data

Basic user information is stored in ApsaraDB RDS for MySQL in the table ods_user_info_d.

| Field | Description |
| --- | --- |
| uid | User name |
| gender | Gender |
| age_range | Age range |
| zodiac | Zodiac sign |

Final output data

After processing and aggregating the two data sources, the pipeline produces a result table with the following schema.

| Field | Description |
| --- | --- |
| uid | User name |
| region | Region |
| device | Terminal type |
| pv | Page views |
| gender | Gender |
| age_range | Age range |
| zodiac | Zodiac sign |
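The final aggregation amounts to joining per-user log aggregates with the basic user information on uid. In the experiment this is an SQL join in EMR; the sketch below shows the same shape in Python with made-up sample rows, so the field names match the result schema but none of the values come from the experiment data.

```python
# Hypothetical sketch of the final step: left-join per-user log aggregates
# (region, device, pv) with user information on uid to produce rows that
# match the result-table schema. All sample values are illustrative.
log_aggregates = [
    {"uid": "u001", "region": "Hangzhou", "device": "mobile", "pv": 12},
    {"uid": "u002", "region": "Beijing", "device": "desktop", "pv": 7},
]
user_info = {
    "u001": {"gender": "female", "age_range": "20-25", "zodiac": "Leo"},
    "u002": {"gender": "male", "age_range": "30-35", "zodiac": "Aries"},
}

def build_user_profile(logs, users):
    """Left-join log aggregates with user info on uid."""
    profile = []
    for row in logs:
        info = users.get(row["uid"], {})
        profile.append({**row,
                        "gender": info.get("gender"),
                        "age_range": info.get("age_range"),
                        "zodiac": info.get("zodiac")})
    return profile

result = build_user_profile(log_aggregates, user_info)
```

A left join is the natural choice here: every user who appears in the access logs keeps a profile row even if their record is missing from ods_user_info_d, with the missing attributes left empty.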

What's next

After completing the experiment, you can extend the pipeline with the following operations.

| Operation | Description | References |
| --- | --- | --- |
| Manage metadata | In Data Map, view and manage the metadata of the source table | Manage data |
| Consume data | In DataAnalysis, run SQL queries against the result table to analyze the geographical distribution of users and city rankings. Use the API feature in DataService Studio to expose the result table as an API. | Visualize data on a dashboard; Use an API to provide data services |