Walk through a website user profile analysis to learn core DataWorks tasks: data synchronization, processing, management, and consumption.
Objectives
Expected outcome
Complete common DataWorks tasks independently, including data synchronization, data development, and task O&M.
Target audience
Developers, data analysts, and product operations personnel who need to extract and analyze data warehouse data.
Design overview
Extract basic user profiles from website behavioral data—geographical attributes, social attributes, and more—to drive periodic task scheduling and fine-grained traffic operations in DataWorks.
Services involved
This case involves the following services.
|
Service category |
Service name |
Description |
|
Database |
Stores basic user information. |
|
|
Stores log information. |
||
|
Compute engine |
Processes raw data and stores results. Choose one of: MaxCompute, EMR, EMR Serverless StarRocks, or EMR Serverless Spark. |
|
|
Data mid-end |
Serves as the data mid-end for synchronization, processing, quality monitoring, consumption, and scheduling. |
-
Databases and DataWorks are shared across all compute engine paths. You only need to associate the desired compute engine with your DataWorks workspace.
-
If you use EMR or EMR Serverless Spark, prepare an OSS data source to receive basic user information and log information. If you use EMR Serverless StarRocks, prepare an OSS data source to store the
.jarpackage for registering a StarRocks function. Ensure the OSS data sources have sufficient storage space and that you have the required permissions.
Architecture
Add databases as data sources and associate compute engines as computing resources in your DataWorks workspace. Then process, manage, and consume data to obtain user geographical and social attributes.
Workflow
In this case, you can select the appropriate website user profile analysis process based on compute engines that you use. Four paths are available: User profile analysis (MaxCompute), User profile analysis (StarRocks), User profile analysis (EMR), and User profile analysis (Spark). Each path includes the following steps:
-
Use Data Integration to synchronize user information and access logs from data sources to a compute engine.
-
Split access logs into analyzable fields in the compute engine.
-
Aggregate user information with the processed access logs.
-
Process the aggregated data to produce user profiles.
Operations
The following table describes the operations that are involved in this case.
|
Step |
Operation |
Phased objective |
|
Synchronize data |
Synchronize MySQL user information and OSS access logs to computing resources.
|
You learn to:
|
|
Process data |
Use Data Studio to split log data into analyzable fields with functions and regular expressions, then aggregate the fields with user information to produce profile data. |
You learn to:
|
|
Manage data |
Use Data Map to view and manage source table metadata. Monitor dirty data from source changes and stop related tasks if errors occur to prevent downstream impact. |
|
|
Consume data |
|
Present data visually and create APIs in DataWorks. |
Case data
These data structures are used in subsequent synchronization, processing, and management steps.
Log data structure
Familiarize yourself with the existing business data, data format, and target user profile structure before proceeding.
Raw log data format in the OSS file user_log.txt:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent"$http_referer" "$http_user_agent" [unknown_content];
The log data contains the following fields.
|
Field name |
Field description |
|
$remote_addr |
The IP address of the client that sends the request. |
|
$remote_user |
The username that is used to log on to the client. |
|
$time_local |
The local time of the server. |
|
$request |
The HTTP request. An HTTP request consists of the request type, request URL, and HTTP version number. |
|
$status |
The status code that is returned by the server. |
|
$body_bytes_sent |
The number of bytes returned to the client. The number of bytes of the header is not included in the field value. |
|
$http_referer |
The source URL of the request. |
|
$http_user_agent |
The information about the client that sends the request, such as the browser that is used. |
User information data structure
The following table lists the table structure of the MySQL user information data table ods_user_info_d.
|
Field name |
Field description |
|
uid |
The username. |
|
gender |
The gender. |
|
age_range |
The age range. |
|
zodiac |
The zodiac sign. |
Output data structure
Final table structure after raw data analysis. Adjust fields based on your business requirements.
|
Field name |
Field description |
|
uid |
The username. |
|
region |
The region. |
|
device |
The terminal type. |
|
pv |
The number of page views. |
|
gender |
The gender. |
|
age_range |
The age range. |
|
Zodiac |
The zodiac sign. |