DataWorks: Tutorial: User profile analysis

Last Updated: Jun 02, 2026

Walk through a website user profile analysis to learn core DataWorks tasks: data synchronization, processing, management, and consumption.

Objectives

Expected outcome

Complete common DataWorks tasks independently, including data synchronization, data development, and task O&M.

Target audience

Developers, data analysts, and product operations personnel who need to extract and analyze data warehouse data.

Design overview

Use DataWorks to extract basic user profiles, such as geographical and social attributes, from website behavioral data, and schedule the profile tasks periodically to support fine-grained traffic operations.

Services involved

This case involves the following services.

| Service category | Service name | Description |
| --- | --- | --- |
| Database | ApsaraDB RDS for MySQL | Stores basic user information. |
| Database | Object Storage Service (OSS) | Stores log information. |
| Compute engine | MaxCompute, EMR Serverless StarRocks, E-MapReduce (EMR), or EMR Serverless Spark | Processes raw data and stores results. Choose one of the four engines. |
| Data mid-end | DataWorks | Serves as the data mid-end for synchronization, processing, quality monitoring, consumption, and scheduling. |

Important
  • Databases and DataWorks are shared across all compute engine paths. You only need to associate the desired compute engine with your DataWorks workspace.

  • If you use EMR or EMR Serverless Spark, prepare an OSS data source to receive basic user information and log information. If you use EMR Serverless StarRocks, prepare an OSS data source to store the .jar package for registering a StarRocks function. Ensure the OSS data sources have sufficient storage space and that you have the required permissions.

Architecture

Add databases as data sources and associate compute engines as computing resources in your DataWorks workspace. Then process, manage, and consume data to obtain user geographical and social attributes.

Workflow

Select the user profile analysis path that matches your compute engine. Four paths are available: User profile analysis (MaxCompute), User profile analysis (StarRocks), User profile analysis (EMR), and User profile analysis (Spark). Each path includes the following steps:

  1. Use Data Integration to synchronize user information and access logs from data sources to a compute engine.

  2. Split access logs into analyzable fields in the compute engine.

  3. Aggregate user information with the processed access logs.

  4. Process the aggregated data to produce user profiles.
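The four steps above can be sketched in plain Python on in-memory data. This is only an illustration of the data flow, not how DataWorks runs it: the sample rows, the pipe-delimited log lines, and the function names `split_logs`, `join_with_users`, and `build_profiles` are all assumptions made for the example. Real logs use the format described under Case data.

```python
from collections import Counter

# Step 1 (stand-in): rows as they might look after synchronization.
# Real data would arrive through Data Integration; these literals are made up.
user_info = [
    {"uid": "u001", "gender": "female", "age_range": "20-30", "zodiac": "Leo"},
    {"uid": "u002", "gender": "male", "age_range": "30-40", "zodiac": "Aries"},
]
raw_logs = [
    "u001|Hangzhou|android",  # hypothetical simplified line: uid|region|device
    "u001|Hangzhou|android",
    "u002|Beijing|iphone",
]


def split_logs(lines):
    """Step 2: split each raw line into named fields."""
    return [dict(zip(("uid", "region", "device"), line.split("|"))) for line in lines]


def join_with_users(log_rows, users):
    """Step 3: attach user attributes to every log row (inner join on uid)."""
    by_uid = {u["uid"]: u for u in users}
    return [{**row, **by_uid[row["uid"]]} for row in log_rows if row["uid"] in by_uid]


def build_profiles(joined_rows):
    """Step 4: count page views per (uid, region, device) and emit one row each."""
    pv = Counter((r["uid"], r["region"], r["device"]) for r in joined_rows)
    attrs = {r["uid"]: r for r in joined_rows}
    return [
        {
            "uid": uid,
            "region": region,
            "device": device,
            "pv": count,
            "gender": attrs[uid]["gender"],
            "age_range": attrs[uid]["age_range"],
            "zodiac": attrs[uid]["zodiac"],
        }
        for (uid, region, device), count in sorted(pv.items())
    ]


profiles = build_profiles(join_with_users(split_logs(raw_logs), user_info))
for row in profiles:
    print(row)
```

In the tutorial itself, each stage is a scheduled task on the chosen compute engine rather than a local function call.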

Operations

The following table describes the operations in this case.

Step

Operation

Phased objective

Synchronize data

Synchronize MySQL user information and OSS access logs to computing resources.

  • MaxCompute and StarRocks: Synchronize raw data directly to computing resources via Data Integration.

  • EMR and Spark: Store synchronized raw data in the prepared OSS object, then use EMR and Spark tables to read it.

You learn to:

  • Synchronize data from various sources to MaxCompute, EMR, StarRocks, or Spark.

  • Create tables for data sources.

  • Trigger tasks manually.

  • View task logs.

Process data

Use Data Studio to split log data into analyzable fields with functions and regular expressions, then aggregate the fields with user information to produce profile data.

You learn to:

  • Create and configure workflow tasks.

  • Run a workflow.

Manage data

Use Data Map to view and manage source table metadata. Monitor data quality to detect dirty data caused by source changes, and stop the affected tasks when errors occur so that downstream tasks are not impacted.

You learn to:

  • Search for and view metadata and details of data source tables in DataWorks.

  • Configure data quality monitoring rules for task output tables to detect dirty data and prevent it from affecting downstream tasks.

Consume data

  • Use Data Analysis to run SQL queries on the final result table. For example, analyze user geographical distribution and city-level registration rankings.

  • Use the DataService Studio API management feature to create API services from the final result table.

You learn to present data visually and create APIs in DataWorks.
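The kind of query described for the Consume data step, such as ranking regions by user count, can be tried locally before running it in Data Analysis. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine. The table name `user_profile` and the sample rows are hypothetical, and the SQL dialect of your compute engine may differ from SQLite.

```python
import sqlite3

# Hypothetical sample rows; column order follows the output data structure
# described in this tutorial (uid, region, device, pv, gender, age_range, zodiac).
ROWS = [
    ("u001", "Hangzhou", "android", 5, "female", "20-30", "Leo"),
    ("u002", "Hangzhou", "iphone", 3, "male", "30-40", "Aries"),
    ("u003", "Beijing", "android", 7, "female", "20-30", "Virgo"),
]


def region_ranking(rows):
    """Return (region, user_count, total_pv) tuples, largest user_count first."""
    conn = sqlite3.connect(":memory:")
    try:
        # "user_profile" is an assumed name for the final result table.
        conn.execute(
            "CREATE TABLE user_profile ("
            "uid TEXT, region TEXT, device TEXT, pv INTEGER, "
            "gender TEXT, age_range TEXT, zodiac TEXT)"
        )
        conn.executemany(
            "INSERT INTO user_profile VALUES (?, ?, ?, ?, ?, ?, ?)", rows
        )
        return conn.execute(
            "SELECT region, COUNT(DISTINCT uid) AS user_count, SUM(pv) AS total_pv "
            "FROM user_profile "
            "GROUP BY region "
            "ORDER BY user_count DESC, region"
        ).fetchall()
    finally:
        conn.close()


ranking = region_ranking(ROWS)
for region, user_count, total_pv in ranking:
    print(region, user_count, total_pv)
```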

Case data

These data structures are used in subsequent synchronization, processing, and management steps.

Log data structure

Familiarize yourself with the existing business data, data format, and target user profile structure before proceeding.

Raw log data format in the OSS file user_log.txt:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];

The log data contains the following fields.

| Field name | Field description |
| --- | --- |
| $remote_addr | The IP address of the client that sends the request. |
| $remote_user | The username that is used to log on to the client. |
| $time_local | The local time of the server. |
| $request | The HTTP request. An HTTP request consists of the request type, request URL, and HTTP version number. |
| $status | The status code that is returned by the server. |
| $body_bytes_sent | The number of bytes returned to the client. The number of bytes of the header is not included in the field value. |
| $http_referer | The source URL of the request. |
| $http_user_agent | The information about the client that sends the request, such as the browser that is used. |
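Splitting a raw log line into these fields is typically done with a regular expression. The following is a minimal sketch, assuming the field order shown in the log format above. The function name `parse_log_line` and the sample line are made up for illustration, and the pattern ignores the trailing `[unknown_content];` part.

```python
import re
from typing import Optional

# Named groups follow the field order of the documented log format.
# The whitespace before the referer is optional so that lines written
# without it still match.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<body_bytes_sent>\d+)\s*'
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # $request holds the request type, URL, and HTTP version separated by spaces.
    parts = fields["request"].split(" ")
    if len(parts) == 3:
        fields["method"], fields["url"], fields["protocol"] = parts
    return fields


# Hypothetical sample line, not taken from the tutorial data set.
SAMPLE = (
    '192.0.2.10 - - [10/Oct/2024:13:55:36 +0800] "GET /index.html HTTP/1.1" '
    '200 512 "https://example.com/" "Mozilla/5.0 (iPhone)" [unknown_content];'
)

record = parse_log_line(SAMPLE)
print(record)
```

Lines that do not match return `None`, which makes it easy to count or set aside malformed records instead of failing the whole run.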

User information data structure

The MySQL user information table ods_user_info_d has the following structure.

| Field name | Field description |
| --- | --- |
| uid | The username. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |

Output data structure

The final table produced by analyzing the raw data has the following structure. Adjust the fields based on your business requirements.

| Field name | Field description |
| --- | --- |
| uid | The username. |
| region | The region. |
| device | The terminal type. |
| pv | The number of page views. |
| gender | The gender. |
| age_range | The age range. |
| zodiac | The zodiac sign. |
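Data quality rules on an output table with this structure usually check for dirty data such as missing fields or invalid counts. The sketch below shows that idea in plain Python. It is not the DataWorks Data Quality API: the function `find_dirty_rows` and the three rules are assumptions chosen for illustration, and the field names use the lowercase spelling from the user information table.

```python
# Field names of the output structure described above.
EXPECTED_FIELDS = ("uid", "region", "device", "pv", "gender", "age_range", "zodiac")


def find_dirty_rows(rows):
    """Return (index, reason) pairs for rows that break the assumed rules."""
    problems = []
    for index, row in enumerate(rows):
        missing = [name for name in EXPECTED_FIELDS if name not in row]
        if missing:
            problems.append((index, "missing fields: " + ", ".join(missing)))
        elif not row["uid"]:
            problems.append((index, "empty uid"))
        elif not isinstance(row["pv"], int) or row["pv"] < 0:
            problems.append((index, "pv must be a non-negative integer"))
    return problems


# Hypothetical rows: one clean, one with an empty uid, one with a negative pv.
sample_rows = [
    {"uid": "u001", "region": "Hangzhou", "device": "android", "pv": 5,
     "gender": "female", "age_range": "20-30", "zodiac": "Leo"},
    {"uid": "", "region": "Beijing", "device": "iphone", "pv": 2,
     "gender": "male", "age_range": "30-40", "zodiac": "Aries"},
    {"uid": "u003", "region": "Beijing", "device": "android", "pv": -1,
     "gender": "female", "age_range": "20-30", "zodiac": "Virgo"},
]

dirty = find_dirty_rows(sample_rows)
for index, reason in dirty:
    print(index, reason)
```

A scheduled task could stop or raise an alert when the returned list is not empty, which mirrors the goal of keeping dirty data away from downstream tasks.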