This experiment describes how to use DataWorks together with MaxCompute.
In a website user profile analysis scenario, the features of DataWorks are used to accomplish the following purposes:
Configure rules to monitor data quality.
Visualize data on a dashboard.
Intended audience: development engineers, data analysts, and engineers who query data from data warehouses to analyze and gain insights into data, such as product operations engineers.
In this experiment, the following services are used:
DataWorks
In this experiment, DataWorks is used to collect and process data, monitor data quality, and visualize data. You must activate DataWorks in advance. For more information, see Activate DataWorks.
MaxCompute
In this experiment, MaxCompute is used to implement the underlying data processing and computing. You must activate MaxCompute in advance. For more information, see Activate MaxCompute.
ApsaraDB RDS for MySQL
In this experiment, ApsaraDB RDS for MySQL is used to store user information. The basic information of an ApsaraDB RDS for MySQL data source is provided by default. You do not need to separately activate this service.
Object Storage Service (OSS)
In this experiment, OSS is used to store website access logs of users. The basic information of an OSS data source is provided by default. You do not need to separately activate this service.
Used DataWorks services
In this experiment, the following DataWorks services are used.
Use the DataWorks Data Integration service to synchronize the user information stored in ApsaraDB RDS for MySQL and the website access logs stored in OSS to MaxCompute. Commit the synchronization nodes to the scheduling system, and then use DataWorks scheduling parameters to periodically synchronize incremental data.
Learn the following items:
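As background for the scheduling parameters mentioned above, the following sketch mimics how a parameter such as $bizdate commonly resolves to the business date (the day before the scheduled run, in yyyymmdd format) and how that value could select one day's incremental data. This is an illustration of the concept, not DataWorks code; the function and filter names are made up.

```python
from datetime import date, timedelta

def resolve_bizdate(run_date: date) -> str:
    """Illustrative stand-in for the $bizdate scheduling parameter:
    the day before the scheduled run date, formatted as yyyymmdd."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")

# A sync node scheduled on 2024-05-02 would process data for 2024-05-01.
bizdate = resolve_bizdate(date(2024, 5, 2))
# Hypothetical incremental filter built from the resolved parameter.
where_clause = f"ds = '{bizdate}'"
print(where_clause)  # ds = '20240501'
```

Because the parameter is resolved fresh on every scheduled run, the same node definition synchronizes a new day's increment each cycle without manual changes.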
Use the DataWorks DataStudio service to split log data into analyzable fields by using methods such as functions and regular expressions. Aggregate the processed log data and the user information tables into basic user profile data. Then, commit the nodes to the scheduling system and use DataWorks scheduling parameters to perform periodic data cleansing.
Learn the following items:
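To make the field-splitting step concrete, here is a minimal sketch (not the workshop's actual node code) of parsing one raw access-log line into named fields with a regular expression; the log format and field names are assumptions for illustration.

```python
import re

# Assumed common access-log layout: ip - - [time] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line: str) -> dict:
    """Split one raw log line into analyzable fields; empty dict if no match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else {}

sample = '1.2.3.4 - - [01/May/2024:12:00:00 +0800] "GET /index.html HTTP/1.1" 200 512'
fields = parse_line(sample)
# fields now holds ip, time, method, url, status, and bytes as separate values.
```

In the workshop itself the equivalent splitting is done inside MaxCompute SQL nodes with built-in functions and regular expressions, but the idea is the same: turn one opaque log string into columns that can be aggregated and joined with the user information table.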
Use the DataWorks Data Quality service to monitor dirty data that is generated when the periodic extract, transform, and load (ETL) operations are performed. If dirty data is detected, the node execution is blocked to prevent the dirty data from spreading.
Learn how to use the DataWorks Data Quality service to configure monitoring rules to monitor the data quality of tables generated by DataWorks nodes. This ensures that the dirty data generated during the ETL process can be detected at the earliest opportunity and effectively prevents the dirty data from spreading downstream.
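Conceptually, a monitoring rule validates a node's output and blocks downstream runs when dirty data appears. The following hedged sketch shows that idea in plain Python; the function, field names, and sample rows are invented for illustration and are not the DataWorks rule API.

```python
def check_partition(rows, required_fields=("uid", "region")):
    """Toy quality rule: a row is dirty if any required field is
    missing or empty. Returns (passed, dirty_rows)."""
    dirty = [r for r in rows
             if any(not r.get(f) for f in required_fields)]
    return len(dirty) == 0, dirty

rows = [
    {"uid": "u1", "region": "Zhejiang"},
    {"uid": "", "region": "Beijing"},   # dirty: empty uid
]
passed, dirty = check_partition(rows)
# With a blocking (strong) rule, a failed check like this would stop
# the node so dirty data cannot spread downstream; here we only report.
print(passed, len(dirty))
```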
Use the DataWorks DataAnalysis service to perform user profile analysis on final result tables. For example, you can analyze the geographical distribution of users and the rankings of the number of registered users in different provinces and cities.
Learn how to visualize data on a dashboard by using DataWorks.
After you perform the experiment, you can understand the main features of DataWorks.
After you perform this experiment, you can independently complete common data-related tasks, such as data collection, data development, and task O&M in DataWorks.
If you perform the experiment online, it requires approximately 1 hour to complete.
You may be charged fees when you run this experiment. To reduce costs, the lifecycle of tables created in this experiment is set to 14 days by default. To avoid fees from long-term node scheduling, after you complete the experiment, configure the Validity Period parameter for the related nodes or freeze the root node of the workflow to which the nodes belong. The root node is the zero load node named WorkShop_Start.
If you have questions during the workshop, join the DingTalk group for consultation.