This topic walks you through a simple user profile analysis experiment to describe the main features of DataWorks and the tasks that you can run in DataWorks. You can perform the experiment with any of the following service portfolios: DataWorks and E-MapReduce (EMR), DataWorks and MaxCompute, DataWorks and StarRocks, or DataWorks and Spark.
Experiment objectives
Expectations
After you perform the experiment, you will better understand the main features of DataWorks.
After you perform the experiment, you will be able to develop and run common data-related tasks such as data synchronization, data development, and task O&M.
Intended audiences
Development engineers, data analysts, product operations engineers, and other engineers who need to query data warehouses, analyze data, and gain insights from it.
Experiment design
Background information
To help your enterprise formulate better operational strategies, you need to derive basic user profile data, such as the geographical and social attributes of users, from user behavior on websites, and periodically perform a user profile analysis. This helps you perform fine-grained operations on website traffic. To do this, use DataWorks to perform the following operations:
Synchronize data.
Process data.
Manage data.
Consume data.
Services involved
A user profile analysis requires databases that store raw data, services for computing and storage, and a development platform. This subsection describes the services that are required in this experiment.
Services used to store raw data
ApsaraDB RDS for MySQL
In this experiment, ApsaraDB RDS for MySQL is used to store user information. Basic information about the data source of user information is provided by default.
For information about ApsaraDB RDS for MySQL, see What is ApsaraDB RDS for MySQL?
Object Storage Service (OSS)
In this experiment, OSS is used to store log information. Basic information about the data source of log information is provided by default.
For information about OSS, see What is OSS?
Important: If you use the service portfolio of DataWorks and EMR or DataWorks and StarRocks to perform the experiment, you must prepare the OSS data source that is used to receive user information and log data for the experiment, or the OSS data source that is used to store a JAR package required to register a function in StarRocks.
Services used for computing and storage
EMR
In this experiment, EMR on ECS is used to process raw data to generate the required data. After the processing, the data is stored.
For information about EMR on ECS, see What is EMR on ECS?
MaxCompute
In this experiment, MaxCompute is used to process raw data to generate the required data. After the processing, the data is stored.
For information about MaxCompute, see What is MaxCompute?
EMR Serverless StarRocks
In this experiment, EMR Serverless StarRocks is used to process raw data to generate the required data. After the processing, the data is stored.
EMR Serverless Spark
In this experiment, EMR Serverless Spark is used to process raw data to generate the required data. After the processing, the data is stored.
For information about EMR Serverless Spark, see What is EMR Serverless Spark?
Important: This experiment is performed based on DataWorks, by using EMR, MaxCompute, EMR Serverless StarRocks, or EMR Serverless Spark as the data source. You can perform the experiment as long as you activate one of these services.
Service used for development and scheduling
DataWorks
In this experiment, DataWorks serves as a data mid-end used to synchronize, process, monitor the quality of, and consume raw data.
For information about DataWorks, see What is DataWorks?
Workflow design
For this experiment, you can select different user profile analysis experiment procedures based on the compute engines that you have. The procedures for MaxCompute, EMR, StarRocks, and Spark compute engines are provided. The procedures contain the following steps:
Use Data Integration to extract basic information and access log information of users from different data sources to a compute engine.
Process and split the access log information in the compute engine into fields that can be analyzed.
Aggregate the basic information and the processed access log information in the compute engine.
Perform further processing to generate a basic user profile.
DataWorks services involved
The following table describes the use of different DataWorks services in each step of the experiment.
Step | Operation | Phase-specific objective |
Synchronize data | Configure a synchronization task to synchronize basic user information that is stored in ApsaraDB RDS for MySQL and user access log information that is stored in OSS to MaxCompute, EMR, or StarRocks. | Learn how to perform the following operations: |
Process data | Use DataStudio to split the user access log information into fields that can be analyzed by using methods such as functions or regular expressions, and aggregate the processed user access log information and the basic user information to generate a basic user profile. | Learn how to perform the following operations: |
Manage data | Use Data Map to view and manage the metadata of the source table. Monitor the dirty data that is generated during the change in source data. If an error occurs, stop the running of the related task to prevent negative impacts caused by the error. | |
Consume data | | Present data in a visualized manner and create APIs based on DataWorks. |
Experiment data
Structure of log data for the experiment
Before you perform the operations in this experiment, make sure that you are familiar with the existing business data, the data format, and the basic user profile data structure that is required for business background analysis.
The following code shows the raw log data that is stored in the OSS object user_log.txt:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent"$http_referer" "$http_user_agent" [unknown_content];
The following table describes the valid information that is obtained from the raw log data.
Field | Description |
$remote_addr | The IP address of the client that sends the request. |
$remote_user | The username that is used to log on to the client. |
$time_local | The local time of the server. |
$request | The HTTP request, including the request type, request URL, and HTTP version number. |
$status | The status code that is returned by the server. |
$body_bytes_sent | The number of bytes returned to the client, not including the number of bytes of the header. |
$http_referer | The source URL of the request. |
$http_user_agent | The information about the client that sends the request, such as the browser used. |
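As an illustration of how a log entry in this format could be split into the fields described above, the following Python sketch uses a regular expression. It is only a sketch under stated assumptions, not the processing logic of the experiment: the function name `parse_log_line`, the pattern `LOG_PATTERN`, and the sample log line are hypothetical, and the pattern assumes that `$remote_addr` and `$remote_user` contain no spaces and that quoted fields contain no double quotation marks.

```python
import re
from typing import Optional

# Hypothetical pattern for the log format shown above.
# `\s*` tolerates zero or more spaces between $body_bytes_sent and "$http_referer".
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-)\s*'
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # $request holds the request type, URL, and HTTP version separated by spaces.
    parts = fields["request"].split(" ")
    if len(parts) == 3:
        fields["method"], fields["url"], fields["protocol"] = parts
    return fields


# Made-up sample line for illustration only; it is not taken from user_log.txt.
sample = (
    '203.0.113.7 - - [10/Feb/2024:12:00:01 +0800] '
    '"GET /index.html HTTP/1.1" 200 512 '
    '"http://example.com/" "Mozilla/5.0 (iPhone)" [extra];'
)
record = parse_log_line(sample)
```

Lines that do not match the pattern return `None`, so a caller can count or route them separately instead of failing the whole run.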
Structure of user information data for the experiment
Structure of user information data stored in ApsaraDB RDS for MySQL (ods_user_info_d)
Field | Description |
uid | The name of the user. |
gender | The gender. |
age_range | The age range. |
zodiac | The zodiac sign. |
Structure of final data obtained in the experiment
Based on your business requirements and the valid data that you obtain after you analyze the raw data, confirm the schema of the final data table. The following table describes the schema.
Field | Description |
uid | The name of the user. |
region | The region. |
device | The terminal type. |
pv | The number of page views. |
gender | The gender. |
age_range | The age range. |
zodiac | The zodiac sign. |
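The relationship between the processed log data, the user information, and this final schema can be sketched in Python as a count of page views followed by a join on uid. This is a minimal illustration, not the implementation used in the experiment, which runs in a compute engine. The function `build_user_profile` and all sample values are hypothetical, and the sketch assumes that each processed log record already carries a `uid`, a `region`, and a `device` value.

```python
from collections import Counter


def build_user_profile(log_records, user_info):
    """Count page views per (uid, region, device) and attach user attributes.

    log_records: iterable of dicts with the keys uid, region, and device.
    user_info:   dict that maps uid to a dict with gender, age_range, and zodiac.
    Records whose uid has no user information are dropped (inner join).
    """
    pv_counts = Counter(
        (r["uid"], r["region"], r["device"]) for r in log_records
    )
    profile = []
    for (uid, region, device), pv in sorted(pv_counts.items()):
        info = user_info.get(uid)
        if info is None:
            continue
        profile.append({
            "uid": uid,
            "region": region,
            "device": device,
            "pv": pv,
            "gender": info["gender"],
            "age_range": info["age_range"],
            "zodiac": info["zodiac"],
        })
    return profile


# Hypothetical sample data for illustration only.
log_records = [
    {"uid": "u001", "region": "Hangzhou", "device": "iphone"},
    {"uid": "u001", "region": "Hangzhou", "device": "iphone"},
    {"uid": "u002", "region": "Beijing", "device": "windows_pc"},
    {"uid": "u999", "region": "Shanghai", "device": "android"},
]
user_info = {
    "u001": {"gender": "female", "age_range": "20-30", "zodiac": "Leo"},
    "u002": {"gender": "male", "age_range": "30-40", "zodiac": "Aries"},
}
profile = build_user_profile(log_records, user_info)
```

The inner join is a design choice of this sketch: a record without matching user information cannot fill the gender, age_range, and zodiac fields, so it is left out rather than emitted with empty values.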
Consultation
If you have questions when you perform the experiment, join the DingTalk group for consultation.