
E-MapReduce:Experiment introduction

Last Updated: Mar 07, 2025

This topic walks you through a simple user profile analysis experiment to describe how to use DataWorks and E-MapReduce (EMR) to synchronize data, develop data, and perform O&M on nodes. In this topic, you can learn about the background information, workflow design, DataWorks services involved in the experiment, and experiment data.

Experiment design

Background information

To help your enterprise formulate better operational strategies, you must periodically perform a user profile analysis based on user behavior on your websites, such as the geographical and social attributes of users. This helps you perform fine-grained operations on website traffic. To do this, use DataWorks to perform the following operations:

  • Synchronize data.

  • Process data.

  • Manage data.

  • Consume data.

Workflow design

For this experiment, you can complete user profile analysis by using DataWorks and EMR. The procedure contains the following steps:

  1. In Data Integration, extract basic user information and website access logs of users from different data sources to a compute engine.

  2. Process and split the website access logs of users in the compute engine into fields that can be analyzed.

  3. Aggregate the basic user information and the processed website access logs of users in the compute engine.

  4. Further process the data to generate a basic user profile.
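The four steps above can be sketched end to end as a minimal Python simulation. The record layouts and sample rows below are invented for illustration; in the actual experiment these steps run as Data Integration and EMR nodes:

```python
from collections import Counter

# Step 1 (simulated): records extracted from the two data sources.
user_info = {"user01": {"gender": "female", "age_range": "20-30", "zodiac": "Leo"}}
raw_logs = [
    {"uid": "user01", "region": "Hangzhou", "device": "iOS"},
    {"uid": "user01", "region": "Hangzhou", "device": "iOS"},
]

# Steps 2-4: split/clean the logs, join them with the basic user
# information, and aggregate page views (pv) per user into a profile.
pv = Counter(log["uid"] for log in raw_logs)
profile = []
for uid, count in pv.items():
    first = next(log for log in raw_logs if log["uid"] == uid)
    profile.append({
        "uid": uid,
        "region": first["region"],
        "device": first["device"],
        "pv": count,
        **user_info[uid],
    })
print(profile[0]["pv"])  # 2
```

The result rows carry the same fields (uid, region, device, pv, gender, age_range, zodiac) as the final table described later in this topic.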

DataWorks services involved

The following table describes the use of different DataWorks services in each step of the experiment.

| Step | Operation | Phase-specific objective |
| --- | --- | --- |
| Synchronize data | Configure a synchronization node to synchronize the basic user information that is stored in ApsaraDB RDS for MySQL and the website access logs of users that are stored in Object Storage Service (OSS) to EMR. | Learn how to synchronize data from different data sources to EMR, create a table for the related data source, quickly trigger a node, and view node logs. |
| Process data | In DataStudio, split the website access logs of users into fields that can be analyzed by using methods such as functions or regular expressions, and aggregate the processed logs with the basic user information to generate a basic user profile. | Learn how to create and configure nodes in a DataWorks workflow and run the workflow. |
| Configure data quality monitoring rules | Data Quality allows you to monitor the quality of data in tables that are generated by scheduling nodes. Configure monitoring rules for the table that is generated by a DataWorks node. | Quickly identify the dirty data that is generated when the source data changes, and prevent the dirty data from affecting descendant nodes. |
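The kind of check such a monitoring rule performs can be sketched in Python. The field name, threshold, and sample rows below are assumptions for illustration, not DataWorks API calls:

```python
def check_null_rate(rows, field, max_null_rate=0.05):
    """Fail the check if too many rows miss a required field (dirty data)."""
    nulls = sum(1 for row in rows if row.get(field) in (None, ""))
    rate = nulls / len(rows) if rows else 1.0
    return rate <= max_null_rate

# One of four rows has an empty uid, so the 5% threshold is exceeded.
rows = [{"uid": "user01"}, {"uid": ""}, {"uid": "user02"}, {"uid": "user03"}]
print(check_null_rate(rows, "uid"))  # False
```

A failing check of this kind is what lets the scheduler block descendant nodes before dirty data propagates.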

Experiment data

Structure of log data for the experiment

Before you perform the operations in this experiment, make sure that you are familiar with the existing business data, the data format, and the basic user profile data structure that is required for business background analysis.

  • The following code shows the raw log data that is stored in the OSS object user_log.txt:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];

    The following table describes the valid information that is obtained from the raw log data.

| Field | Description |
| --- | --- |
| $remote_addr | The IP address of the client that sends the request. |
| $remote_user | The username that is used to log on to the client. |
| $time_local | The local time of the server. |
| $request | The HTTP request, including the request type, the request URL, and the HTTP version number. |
| $status | The status code that is returned by the server. |
| $body_bytes_sent | The number of bytes returned to the client, excluding the bytes of the response header. |
| $http_referer | The source URL of the request. |
| $http_user_agent | The information about the client that sends the request, such as the browser that is used. |
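Step 2 of the workflow splits each raw log line into the fields described above. A minimal sketch with a regular expression, assuming an invented sample line and ignoring the trailing [unknown_content] part:

```python
import re

# One named group per field in the raw log format described above.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

line = ('192.168.0.1 - user01 [10/Jan/2025:08:00:00 +0800] '
        '"GET /index.html HTTP/1.1" 200 512 "https://example.com" "Mozilla/5.0"')
fields = LOG_PATTERN.match(line).groupdict()
print(fields["status"], fields["request"])  # 200 GET /index.html HTTP/1.1
```

In the experiment itself, this splitting is done inside the compute engine (for example, with functions or regular expressions in a node), not on your local machine.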

Structure of user information data for the experiment

Structure of user information data stored in ApsaraDB RDS for MySQL (ods_user_info_d)

| Field | Description |
| --- | --- |
| uid | The name of the user. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |

Structure of final data obtained in the experiment

Based on the valid data obtained from the raw data analysis and your business requirements, confirm that the final data table uses the schema described in the following table.

| Field | Description |
| --- | --- |
| uid | The name of the user. |
| region | The region. |
| device | The terminal type. |
| pv | The number of page views. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |

What to do next

| Operation | Description | References |
| --- | --- | --- |
| Manage metadata | In Data Map, view and manage the metadata of the source table. | Manage data |
| Consume data | In DataAnalysis, execute SQL statements to query and analyze data in the final result table. For example, you can analyze the geographical distribution of users and the rankings of cities by the number of registered users. You can also use the API feature of DataService Studio to create APIs based on the final result table. |  |
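The first analysis mentioned above, the geographical distribution of users, can be sketched over rows shaped like the final result table. The sample rows are invented for illustration; in the experiment you would run an equivalent SQL query in DataAnalysis:

```python
from collections import Counter

# Rows shaped like the final result table (invented sample data).
result_table = [
    {"uid": "user01", "region": "Hangzhou", "pv": 12},
    {"uid": "user02", "region": "Beijing", "pv": 3},
    {"uid": "user03", "region": "Hangzhou", "pv": 7},
]

# Geographical distribution: number of users per region, highest first.
ranking = Counter(row["region"] for row in result_table).most_common()
print(ranking)  # [('Hangzhou', 2), ('Beijing', 1)]
```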