
DataWorks:Experiment introduction

Last Updated:Jan 07, 2025

This topic walks you through a simple user profile analysis experiment to describe the main features of DataWorks and the tasks that you can run in DataWorks. The experiment is performed by separately using the service portfolios of DataWorks and E-MapReduce (EMR), DataWorks and MaxCompute, DataWorks and StarRocks, and DataWorks and Spark.

Experiment objectives

Expectations

  1. After you perform the experiment, you will better understand the main features of DataWorks.

  2. After you perform the experiment, you will be able to develop and run common data-related tasks such as data synchronization, data development, and task O&M.

Intended audiences

Development engineers, data analysts, product operations engineers, and engineers who need to query data from data warehouses and analyze and gain insights into data.

Experiment design

Background information

To help your enterprise formulate better operational strategies, you must derive basic user profile data, such as the geographical and social attributes of users, from user behavior on your websites, and then periodically perform a user profile analysis. This enables fine-grained operations on website traffic. To do this, use DataWorks to perform the following operations:

  • Synchronize data.

  • Process data.

  • Manage data.

  • Consume data.

Services involved

To perform a user profile analysis, you need services that store the raw data, services that provide computing and storage, and a development platform. This subsection describes the services that are required in this experiment.

  • Services used to store raw data

    • ApsaraDB RDS for MySQL

      • In this experiment, ApsaraDB RDS for MySQL is used to store the user information. Basic information about the user information data source is provided by default.

      • For information about ApsaraDB RDS for MySQL, see What is ApsaraDB RDS for MySQL?

    • Object Storage Service (OSS)

      • In this experiment, OSS is used to store the log information. Basic information about the log information data source is provided by default.

      • For information about OSS, see What is OSS?

      Important

      If you use the service portfolio of DataWorks and EMR or of DataWorks and StarRocks to perform the experiment, you must prepare the OSS data source that receives the user information and log data for the experiment, or the OSS data source that stores the JAR package required to register a function in StarRocks.

  • Services used for computing and storage

    • EMR

      • In this experiment, EMR on ECS is used to process raw data to generate the required data. After the processing, the data is stored.

      • For information about EMR on ECS, see What is EMR on ECS?

    • MaxCompute

      • In this experiment, MaxCompute is used to process raw data to generate the required data. After the processing, the data is stored.

      • For information about MaxCompute, see What is MaxCompute?

    • EMR Serverless StarRocks

      • In this experiment, EMR Serverless StarRocks is used to process raw data to generate the required data. After the processing, the data is stored.

    • EMR Serverless Spark

      • In this experiment, EMR Serverless Spark is used to process raw data to generate the required data. After the processing, the data is stored.

      • For information about EMR Serverless Spark, see What is EMR Serverless Spark?

    Important

    This experiment is performed in DataWorks, with EMR, MaxCompute, EMR Serverless StarRocks, or EMR Serverless Spark serving as the data source. You can perform the experiment as long as you have activated one of these services.

  • Service used for development and scheduling

    • DataWorks

      • In this experiment, DataWorks serves as a data mid-end used to synchronize, process, monitor the quality of, and consume raw data.

      • For information about DataWorks, see What is DataWorks?

Workflow design

For this experiment, you can select different user profile analysis experiment procedures based on the compute engines that you have. The procedures for MaxCompute, EMR, StarRocks, and Spark compute engines are provided. The procedures contain the following steps:

  1. Use Data Integration to extract basic information and access log information of users from different data sources to a compute engine.

  2. Process and split the access log information in the compute engine into fields that can be analyzed.

  3. Aggregate the basic information and the processed access log information in the compute engine.

  4. Perform further processing to generate a basic user profile.
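The four steps above can be sketched locally in Python. This is a toy illustration with hypothetical in-memory sample data; in the experiment, these steps run as DataWorks synchronization tasks and compute-engine jobs.

```python
from collections import Counter

# Toy stand-ins for the two synchronized sources (step 1); all names
# and values here are hypothetical sample data.
user_info = {  # basic user information, as from ods_user_info_d
    "alice": {"gender": "female", "age_range": "20-30", "zodiac": "Leo"},
}
access_logs = [  # fields already split out of the raw log (step 2)
    {"uid": "alice", "region": "Hangzhou", "device": "mobile"},
    {"uid": "alice", "region": "Hangzhou", "device": "mobile"},
]

# Steps 3-4: aggregate page views per user, then join the result with
# the basic user information to form a basic user profile.
page_views = Counter(log["uid"] for log in access_logs)
profiles = []
for uid, info in user_info.items():
    user_logs = [log for log in access_logs if log["uid"] == uid]
    latest = user_logs[-1]  # take region/device from the latest log entry
    profiles.append({"uid": uid, "region": latest["region"],
                     "device": latest["device"], "pv": page_views[uid],
                     **info})
```

In the experiment itself, the aggregation and join are expressed as SQL tasks in the chosen compute engine rather than Python code.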

DataWorks services involved

The following information describes how different DataWorks services are used in each step of the experiment.

  1. Synchronize data

     Operation: Configure a synchronization task to synchronize the basic user information that is stored in ApsaraDB RDS for MySQL and the user access log information that is stored in OSS to MaxCompute, EMR, StarRocks, or Spark.

     Phase-specific objective: Learn how to perform the following operations:

       • Synchronize data from different data sources to MaxCompute, EMR, StarRocks, or Spark.

       • Create a table for the related data source.

       • Quickly trigger the task.

       • View task logs.

  2. Process data

     Operation: Use DataStudio to split the user access log information into fields that can be analyzed by using methods such as functions and regular expressions, and aggregate the processed user access log information with the basic user information to generate a basic user profile.

     Phase-specific objective: Learn how to perform the following operations:

       • Create and configure nodes in a DataWorks workflow.

       • Run a workflow.

  3. Manage data

     Operation: Use Data Map to view and manage the metadata of the source tables. Monitor the dirty data that is generated when the source data changes. If an error occurs, stop the related task to prevent negative impacts caused by the error.

     Phase-specific objective: Learn how to perform the following operations:

       • Obtain the metadata of a data source table based on DataWorks, search for the table, and view its detailed information.

       • Configure data quality monitoring rules for the table generated by a DataWorks task to quickly identify the dirty data that is generated when the source data changes and prevent the dirty data from affecting descendant tasks.

  4. Consume data

     Operation:

       • Use DataAnalysis to perform SQL-based queries and analysis on data in the final result table. For example, you can analyze the geographical distribution of users and the rankings of cities by the number of registered users.

       • Use the API feature of DataService Studio to create APIs based on the final result table.

     Phase-specific objective: Present data in a visualized manner and create APIs based on DataWorks.
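The consume-data analysis, for example ranking regions by the number of registered users, can be illustrated locally in Python. This is a toy sketch with hypothetical result-table rows; in the experiment, the equivalent analysis is a SQL query run in DataAnalysis.

```python
from collections import Counter

# Hypothetical rows from the final result table (subset of its fields).
result_rows = [
    {"uid": "alice", "region": "Hangzhou", "pv": 12},
    {"uid": "bob",   "region": "Hangzhou", "pv": 3},
    {"uid": "carol", "region": "Beijing",  "pv": 7},
]

# Rank regions by the number of registered users, mirroring a SQL
# "GROUP BY region ORDER BY COUNT(DISTINCT uid) DESC" analysis.
users_per_region = Counter(row["region"] for row in result_rows)
ranking = users_per_region.most_common()  # sorted by user count, descending
```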

Experiment data

Structure of log data for the experiment

Before you perform the operations in this experiment, make sure that you are familiar with the existing business data, the data format, and the basic user profile data structure that is required for business background analysis.

  • The following code shows the raw log data that is stored in the OSS object user_log.txt:

    $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" [unknown_content];

    The following list describes the valid information that is obtained from the raw log data.

      • $remote_addr: The IP address of the client that sends the request.

      • $remote_user: The username that is used to log on to the client.

      • $time_local: The local time of the server.

      • $request: The HTTP request, including the request type, request URL, and HTTP version number.

      • $status: The status code that is returned by the server.

      • $body_bytes_sent: The number of bytes returned to the client, not including the bytes of the header.

      • $http_referer: The source URL of the request.

      • $http_user_agent: The information about the client that sends the request, such as the browser used.
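As a local illustration of how such a log line can be split into the fields above, the following Python sketch uses a regular expression with named groups. The sample line and its values are hypothetical; in the experiment, this splitting is performed by compute-engine functions or regular expressions, not local Python.

```python
import re

# Regex matching the documented log layout; group names follow the
# field list above. The trailing "[unknown_content]" segment is ignored.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" '
    r'"(?P<http_user_agent>[^"]*)"'
)

def parse_log_line(line: str) -> dict:
    """Split one raw log line into the fields described above."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unparseable log line: {line!r}")
    return match.groupdict()

# Hypothetical sample line in the documented format.
sample = ('192.0.2.10 - alice [07/Jan/2025:12:00:00 +0800] '
          '"GET /index.html HTTP/1.1" 200 512 '
          '"https://example.com" "Mozilla/5.0"')
fields = parse_log_line(sample)
```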

Structure of user information data for the experiment

Structure of user information data stored in ApsaraDB RDS for MySQL (ods_user_info_d)

  • uid: The name of the user.

  • gender: The gender.

  • age_range: The age range.

  • zodiac: The zodiac sign.

Structure of final data obtained in the experiment

Confirm the following schema of the final data table based on the valid data that you obtain after you analyze the raw data and on your business requirements:

  • uid: The name of the user.

  • region: The region.

  • device: The terminal type.

  • pv: The number of page views.

  • gender: The gender.

  • age_range: The age range.

  • zodiac: The zodiac sign.
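A minimal Python sketch of a record that conforms to this final schema follows. The class name and sample values are hypothetical; in the experiment, the final table lives in the chosen compute engine.

```python
from dataclasses import dataclass, asdict

# Hypothetical dataclass mirroring the final result-table schema.
@dataclass
class UserProfile:
    uid: str        # name of the user
    region: str     # region
    device: str     # terminal type
    pv: int         # number of page views
    gender: str     # gender
    age_range: str  # age range
    zodiac: str     # zodiac sign

row = UserProfile(uid="alice", region="Hangzhou", device="mobile",
                  pv=12, gender="female", age_range="20-30", zodiac="Leo")
record = asdict(row)  # plain dict keyed by the schema's field names
```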

Consultation

If you have questions when you perform the experiment, join the DingTalk group for consultation.