DataWorks tutorial: website user profile analysis - DataWorks

Walk through a website user profile analysis to learn core DataWorks tasks: data synchronization, processing, management, and consumption.

Objectives

Expected outcome

Complete common DataWorks tasks independently, including data synchronization, data development, and task O&M.

Target audience

Developers, data analysts, and product operations personnel who need to extract and analyze data warehouse data.

Design overview

Extract basic user profiles from website behavioral data—geographical attributes, social attributes, and more—to drive periodic task scheduling and fine-grained traffic operations in DataWorks.

Services involved

This case involves the following services.

Service category	Service name	Description
Database	ApsaraDB RDS for MySQL	Stores basic user information.
Database	Object Storage Service (OSS)	Stores log information.
Compute engine	MaxCompute	Processes raw data and stores results. Choose one of: MaxCompute, EMR, EMR Serverless StarRocks, or EMR Serverless Spark.
	EMR Serverless StarRocks
	E-MapReduce (EMR)
	EMR Serverless Spark
Data mid-end	DataWorks	Serves as the data mid-end for synchronization, processing, quality monitoring, consumption, and scheduling.

Important

Databases and DataWorks are shared across all compute engine paths. You only need to associate the desired compute engine with your DataWorks workspace.
If you use EMR or EMR Serverless Spark, prepare an OSS data source to receive basic user information and log information. If you use EMR Serverless StarRocks, prepare an OSS data source to store the .jar package for registering a StarRocks function. Ensure the OSS data sources have sufficient storage space and that you have the required permissions.

Architecture

Add databases as data sources and associate compute engines as computing resources in your DataWorks workspace. Then process, manage, and consume data to obtain user geographical and social attributes.

Workflow

In this case, you can select the appropriate website user profile analysis process based on compute engines that you use. Four paths are available: User profile analysis (MaxCompute), User profile analysis (StarRocks), User profile analysis (EMR), and User profile analysis (Spark). Each path includes the following steps:

Use Data Integration to synchronize user information and access logs from data sources to a compute engine.
Split access logs into analyzable fields in the compute engine.
Aggregate user information with the processed access logs.
Process the aggregated data to produce user profiles.

Operations

The following table describes the operations that are involved in this case.

Step	Operation	Phased objective
Synchronize data	Synchronize MySQL user information and OSS access logs to computing resources. MaxCompute and Spark: Synchronize raw data directly to computing resources via Data Integration. EMR and Spark: Store synchronized raw data in the prepared OSS object, then use EMR and Spark tables to read it.	You learn to: Synchronize data from various sources to MaxCompute, EMR, StarRocks, or Spark. Create tables for data sources. Trigger tasks manually. View task logs.
Process data	Use Data Studio to split log data into analyzable fields with functions and regular expressions, then aggregate the fields with user information to produce profile data.	You learn to: Create and configure workflow tasks. Run a workflow.
Manage data	Use Data Map to view and manage source table metadata. Monitor dirty data from source changes and stop related tasks if errors occur to prevent downstream impact.	Search for and view metadata and details of data source tables in DataWorks. Configure data quality monitoring rules for task output tables to detect dirty data and prevent it from affecting downstream tasks.
Consume data	Use Data Analysis to run SQL queries on the final result table. For example, analyze user geographical distribution and city-level registration rankings. Use the DataService Studio API management feature to create API services from the final result table.	Present data visually and create APIs in DataWorks.

Case data

These data structures are used in subsequent synchronization, processing, and management steps.

Log data structure

Familiarize yourself with the existing business data, data format, and target user profile structure before proceeding.

Raw log data format in the OSS file user_log.txt:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent"$http_referer" "$http_user_agent" [unknown_content];

The log data contains the following fields.

Field name	Field description
$remote_addr	The IP address of the client that sends the request.
$remote_user	The username that is used to log on to the client.
$time_local	The local time of the server.
$request	The HTTP request. An HTTP request consists of the request type, request URL, and HTTP version number.
$status	The status code that is returned by the server.
$body_bytes_sent	The number of bytes returned to the client. The number of bytes of the header is not included in the field value.
$http_referer	The source URL of the request.
$http_user_agent	The information about the client that sends the request, such as the browser that is used.

User information data structure

The following table lists the table structure of the MySQL user information data table ods_user_info_d.

Field name	Field description
uid	The username.
gender	The gender.
age_range	The age range.
zodiac	The zodiac sign.

Output data structure

Final table structure after raw data analysis. Adjust fields based on your business requirements.

Field name	Field description
uid	The username.
region	The region.
device	The terminal type.
pv	The number of page views.
gender	The gender.
age_range	The age range.
Zodiac	The zodiac sign.