
DataWorks:Requirement analysis

Last Updated: Aug 22, 2023

To ensure that you can complete this experiment, we recommend that you familiarize yourself with the business background of the experiment and complete the preparations.

Precautions

  • The basic user information and website access logs of users that are required for tests in this experiment are provided.

  • All of the data in this experiment is manually mocked and can be used only for experimental operations in DataWorks.

Business background

To develop effective business management strategies, you must build basic profile data of website users based on their activities on your websites. The basic profile data includes the geographical and social attributes of the website users. With this data, you can analyze user behavior by time and location and perform refined operations on website traffic.

Step 1: Familiarize yourself with existing business data

Before you perform the operations in this experiment, make sure that you are familiar with existing business data, the format of data in the project, and the basic user profile data schema that is required for business background analysis.

Object Storage Service (OSS) object user_log.txt that stores raw log data

The following code block shows the raw log data that is stored in the OSS object user_log.txt:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];

The following table describes the valid information that is obtained from the raw log data.

Field            | Description
-----------------|------------------------------------------------------------
$remote_addr     | The IP address of the client that sends the request.
$remote_user     | The username that is used to log on to the client.
$time_local      | The local time of the server.
$request         | The HTTP request, which consists of the request type, request URL, and HTTP version number.
$status          | The status code that is returned by the server.
$body_bytes_sent | The number of bytes that are returned to the client, excluding the bytes of the response header.
$http_referer    | The source URL of the request.
$http_user_agent | The information about the client that sends the request, such as the browser that is used.
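As a rough illustration of how these fields can be extracted later in the experiment, the following hypothetical ODPS SQL query splits one raw log line with regular expressions. It assumes that each raw line is synchronized into a single STRING column named col of the MaxCompute table ods_raw_log_d (introduced in Step 3) and that the table is partitioned by dt with the ${bizdate} scheduling parameter; the exact patterns depend on your real log format.

-- Hypothetical sketch: extract individual fields from one raw log line.
SELECT
  regexp_extract(col, '^([^ ]+) ', 1)       AS ip          -- $remote_addr
, regexp_extract(col, '\\[([^\\]]+)\\]', 1) AS time_local  -- $time_local
, regexp_extract(col, '"[^ ]+ ([^ "]+)', 1) AS request_url -- URL part of $request
, regexp_extract(col, '" (\\d{3}) ', 1)     AS status      -- $status
FROM ods_raw_log_d
WHERE dt = '${bizdate}';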

MySQL table ods_user_info_d that stores user information

Field     | Description
----------|--------------------
uid       | The username.
gender    | The gender.
age_range | The age range.
zodiac    | The zodiac sign.
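For reference, a hypothetical MySQL DDL that matches these fields could look as follows; the column types and the primary key are assumptions for illustration and are not taken from the experiment data.

-- Hypothetical MySQL DDL; types and key are illustrative assumptions.
CREATE TABLE ods_user_info_d (
    uid       VARCHAR(64) NOT NULL COMMENT 'The username',
    gender    VARCHAR(16)          COMMENT 'The gender',
    age_range VARCHAR(16)          COMMENT 'The age range',
    zodiac    VARCHAR(16)          COMMENT 'The zodiac sign',
    PRIMARY KEY (uid)
);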

Step 2: Analyze and generate the desired basic user profile data schema

Based on the valid information that you obtain from the raw data analysis and on your business requirements, confirm the final table schema that is described in the following table.

Field     | Description
----------|--------------------------
uid       | The username.
region    | The region.
device    | The terminal type.
pv        | The number of page views.
gender    | The gender.
age_range | The age range.
zodiac    | The zodiac sign.
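A minimal MaxCompute DDL sketch of this target table is shown below. The STRING and BIGINT types, the dt partition column, and the 30-day lifecycle are assumptions for illustration, not requirements of the experiment.

-- Minimal sketch of the target table; partition and lifecycle are assumed.
CREATE TABLE IF NOT EXISTS ads_user_info_1d (
    uid       STRING COMMENT 'The username'
  , region    STRING COMMENT 'The region'
  , device    STRING COMMENT 'The terminal type'
  , pv        BIGINT COMMENT 'The number of page views'
  , gender    STRING COMMENT 'The gender'
  , age_range STRING COMMENT 'The age range'
  , zodiac    STRING COMMENT 'The zodiac sign'
)
PARTITIONED BY (dt STRING)
LIFECYCLE 30;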

Step 3: Design a data flow and a workflow

  1. Specify table names based on naming conventions:

    Specify table names by referring to naming conventions. The table names in this experiment follow common data warehouse layering conventions (ODS, DWD, DWS, and ADS). The following tables are required:

    • Basic user information table: ods_user_info_d

    • Raw data table that stores website access logs: ods_raw_log_d

    • Fact table that stores website access logs: dwd_log_info_di

    • Wide table that stores website access information about users: dws_user_info_all_di

    • User profile data table: ads_user_info_1d

  2. Design a data flow based on specifications:

    In this experiment, batch synchronization nodes in Data Integration synchronize the basic user information and the website access logs of users to MaxCompute tables, and ODPS SQL nodes then use the MaxCompute compute engine to process the synchronized data layer by layer (ODS → DWD → DWS → ADS) into the final user profile data. A minimal SQL sketch of this layered processing is provided after the note that follows this list.

  3. Specify node names based on naming conventions:

    To make it easier to associate a node with its output table, each node in this experiment is named after its output table by default. We recommend that you follow the same convention in actual data development.

  4. Design a workflow based on specifications:

    • Design the workflow based on the data flow: In DataWorks, dependencies between ancestor and descendant nodes follow the one-to-one mapping between node names and output table names, so the workflow structure mirrors the data flow. For more information, see Configure same-cycle scheduling dependencies.

    • Use a zero load node to manage the workflow: The batch synchronization node ods_raw_log_d for OSS and the batch synchronization node ods_user_info_d for MySQL write raw business data to the data warehouse. Because no data lineage exists yet on the data warehouse side, the two nodes do not share a root node. In this case, you must add a zero load node named WorkShop_Start to manage the user profile analysis workflow to which the nodes belong.

      In this experiment, the zero load node WorkShop_Start is scheduled to run at 00:15 every day, which triggers the entire user profile analysis workflow at that time.

Note

A node can start to run only after all of its ancestor nodes finish running. DataWorks does not actually run a zero load node; instead, it directly marks the node as successfully run. If you configure the zero load node WorkShop_Start as the ancestor node of the batch synchronization nodes ods_raw_log_d and ods_user_info_d, the zero load node can be used to manage the entire user profile analysis workflow.
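The following ODPS SQL fragments are a minimal sketch of the layered processing in this data flow, not the actual node code that you develop later in the experiment. The dt partitions, the ${bizdate} scheduling parameter, and the exact columns of the intermediate tables (for example, that dwd_log_info_di carries uid, region, and device columns) are assumptions for illustration.

-- Node dws_user_info_all_di: join the cleansed logs with the basic user
-- information into a wide table.
INSERT OVERWRITE TABLE dws_user_info_all_di PARTITION (dt='${bizdate}')
SELECT u.uid, u.gender, u.age_range, u.zodiac, l.region, l.device
FROM dwd_log_info_di l
JOIN ods_user_info_d u
  ON l.uid = u.uid AND l.dt = u.dt
WHERE l.dt = '${bizdate}';

-- Node ads_user_info_1d: aggregate the wide table into per-user profile data.
INSERT OVERWRITE TABLE ads_user_info_1d PARTITION (dt='${bizdate}')
SELECT uid
     , MAX(region)    AS region
     , MAX(device)    AS device
     , COUNT(*)       AS pv
     , MAX(gender)    AS gender
     , MAX(age_range) AS age_range
     , MAX(zodiac)    AS zodiac
FROM dws_user_info_all_di
WHERE dt = '${bizdate}'
GROUP BY uid;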

Step 4: Summarize requirements

Phase                   | Object               | Description
------------------------|----------------------|------------
Data collection         | ods_user_info_d      | Synchronize the basic user information from the MySQL table ods_user_info_d to the MaxCompute table ods_user_info_d.
Data collection         | ods_raw_log_d        | Synchronize the website access logs of users from the OSS object user_log.txt to the MaxCompute table ods_raw_log_d.
Data processing         | dwd_log_info_di      | Use methods such as functions and regular expressions to split the log data in the MaxCompute table ods_raw_log_d into analyzable fields. For example, use a function to convert IP addresses into regions. Then, write the processed data to the dwd_log_info_di table.
Data processing         | dws_user_info_all_di | Join the MaxCompute table ods_user_info_d, which stores the basic user information, with the MaxCompute table dwd_log_info_di, which stores the preliminarily processed log data, into the dws_user_info_all_di wide table that stores the website access information of users.
Data processing         | ads_user_info_1d     | Aggregate the dws_user_info_all_di wide table to generate the final user profile data.
Data quality monitoring | ods_raw_log_d        | Configure a rule that checks every day whether the number of rows synchronized to the raw log table is 0. This rule helps prevent invalid data from being processed downstream.
Data quality monitoring | ods_user_info_d      | Configure a strong rule that checks every day whether the number of rows synchronized to the user information table is 0, and a weak rule that checks every day whether the business primary key in the table is unique. These rules help prevent invalid data from being processed downstream.
Data quality monitoring | ads_user_info_1d     | Configure a weak rule that monitors the daily fluctuation of the number of rows in the table, and a strong rule that checks every day whether the business primary key in the table is unique. These rules help you observe the fluctuation of daily unique visitors (UVs) and learn about the application status at the earliest opportunity.
Data visualization      | ads_user_info_1d     | After the data is processed, use the DataAnalysis service to display the analysis results of the user profile data. See the sample visualizations below.

Sample data visualization:

  • Numbers of registered members in different provinces and cities

  • Distribution of page views of members in different age ranges

  • Distribution of page views of members by gender

  • Numbers of page views of members counted by gender and zodiac sign
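In DataWorks, the preceding monitoring rules are configured in the Data Quality service rather than written as SQL. Purely to illustrate what the rules verify, the following hypothetical ad hoc queries perform the same two checks on one day's partition; the dt partition and the ${bizdate} parameter are assumptions carried over from the sketches above.

-- Table-not-empty check: the monitored partition must contain rows.
SELECT COUNT(*) AS row_cnt
FROM ods_user_info_d
WHERE dt = '${bizdate}';

-- Primary key uniqueness check: this query must return no rows.
SELECT uid, COUNT(*) AS dup_cnt
FROM ads_user_info_1d
WHERE dt = '${bizdate}'
GROUP BY uid
HAVING COUNT(*) > 1;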