This topic describes how to use Data Quality to monitor the quality of table data.
Prerequisites
Before you begin this tutorial, ensure that you have completed Synchronize data and Process data.
The basic user information from the RDS for MySQL table ods_user_info_d is synchronized to the MaxCompute table ods_user_info_d using Data Integration.
The website access logs of users from the user_log.txt file in OSS are synchronized to the MaxCompute table ods_raw_log_d using Data Integration.
The collected data is processed into basic user persona data in DataStudio.
Background
Data Quality is an end-to-end platform that lets you check the data quality of heterogeneous data sources, configure alert notifications, and manage data sources. Data Quality monitors data in datasets. You can use Data Quality to monitor MaxCompute tables. When offline MaxCompute data changes, Data Quality checks the data and blocks the nodes that use the data. This prevents downstream data from being affected by dirty data. In addition, Data Quality lets you manage the check result history so that you can analyze and evaluate the data quality.
In this example, the Data Quality feature of DataWorks is used to promptly detect changes to source data and identify dirty data generated during the extract, transform, and load (ETL) process for the user persona analysis case. The following table describes the monitoring requirements for the analysis and transformation procedure of user persona data.
| Table name | Requirement details |
| --- | --- |
| ods_raw_log_d | Monitor the daily synchronized raw log data to ensure that the table is not empty. This prevents invalid downstream processing and surfaces issues promptly. |
| ods_user_info_d | Check the user information data that is synchronized daily. Ensure that the row count is not zero (strong rule) and that the business primary key is unique (soft rule). This prevents invalid processing and stops data issues early. |
| dwd_log_info_di | Not monitored separately. |
| dws_user_info_all_di | Not monitored separately. |
| ads_user_info_1d | Configure a rule to monitor the daily row count of the user data table. This lets you track changes in daily unique visitors (UVs) and promptly understand your application's status. |
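The monitoring plan above can be sketched as plain data, which makes the table-to-rule mapping easy to scan. This is an illustrative Python sketch only; the rule names and severity labels are hypothetical shorthand, not DataWorks identifiers.

```python
# Hypothetical sketch of the monitoring plan from the table above.
# Table names come from the tutorial; rule/severity labels are illustrative.
MONITORING_PLAN = {
    "ods_raw_log_d": [("row_count_gt_0", "strong")],
    "ods_user_info_d": [("row_count_gt_0", "strong"), ("primary_key_unique", "soft")],
    "ads_user_info_1d": [("row_count_7day_volatility", "soft"), ("row_count_gt_0", "strong")],
}

def rules_for(table):
    """Return the planned (rule, severity) pairs for a table, or [] if it is not monitored."""
    return MONITORING_PLAN.get(table, [])

print(rules_for("ods_user_info_d"))
print(rules_for("dwd_log_info_di"))  # not monitored separately
```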
Go to the rule configuration page
Go to the Data Quality page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Quality. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Quality.
Go to the rule configuration page for a table.
In the navigation pane on the left of the Data Quality page, search for the target table based on the following parameters.
Data Source: MaxCompute
Database: Your current production project, such as workshop2024_01.
Table: In this tutorial, you configure quality monitoring for the ods_raw_log_d, ods_user_info_d, and ads_user_info_1d tables.
Find the target table in the search results and click Rule Management in the Actions column. The Table Quality Details page for the table appears. The following section describes how to configure monitoring rules for the table.
Configure monitoring rules
Configure monitoring rules for the ods_raw_log_d table
The ods_raw_log_d table stores user website access records synchronized from OSS. Based on the business properties of this table, you can configure a rule to verify that the table row count is not 0. You can then associate this rule with a quality monitor to trigger a data quality check.
1. Configure a monitor
A monitor checks whether the data quality in a specified timestamp range (partition) of a table meets your expectations.
In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor is run, it searches for the data partitions that match the parameter value and checks whether the data quality meets your expectations.
This means that each time the scheduling job for the ods_raw_log_d table runs, a data quality monitor is triggered. The monitor then uses its associated quality rules to check the data in the corresponding data range and determine whether the data conforms to the quality check rules.
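The data range resolution described above can be illustrated with a minimal Python sketch. The `resolve_partition` helper is hypothetical and only mimics how an expression such as dt=$[yyyymmdd-1] resolves to the previous day's partition at scheduling time.

```python
from datetime import date, timedelta

def resolve_partition(days_back=1, run_date=None):
    """Resolve a data range like dt=$[yyyymmdd-1]: the partition for
    `days_back` days before the scheduling date (illustrative only)."""
    run_date = run_date or date.today()
    return "dt=" + (run_date - timedelta(days=days_back)).strftime("%Y%m%d")

# For a run scheduled on 2024-06-15, dt=$[yyyymmdd-1] resolves to:
print(resolve_partition(run_date=date(2024, 6, 15)))  # dt=20240614
```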
Perform the following steps:
On the Monitor tab, click Create Monitor.
Configure the monitor.
Key parameters:
| Parameter | Configuration example |
| --- | --- |
| Data Range | dt=$[yyyymmdd-1] |
| Trigger Method | Triggered by production scheduling. Select the ods_raw_log_d node that is created during data synchronization. |
| Monitoring Rule | Configure this in a later section. |
Note: For more information about how to configure data quality monitoring, see Configure a monitoring rule for a single table.
In this example, the goal is to monitor whether the table data generated by daily scheduling tasks meets expectations. Because the table generates data for the previous day, the data meets expectations if a task scheduled on the current day produces results for the previous day's partition.
2. Configure monitoring rules
The ods_raw_log_d table stores website access logs of users synchronized from OSS. The table is used as a source table in a user persona analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0. This rule helps you determine whether synchronization tasks wrote data to the related partitions in the table.
If the number of rows in the related partitions of the ods_raw_log_d table is 0, an alert is triggered, the ods_raw_log_d node fails, and the downstream tasks of the ods_raw_log_d node are blocked.
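The rule's behavior can be sketched in a few lines of Python. This is a hypothetical model of the outcome, not a DataWorks API: a failing strong rule blocks descendant nodes, while a failing soft rule only raises an alert.

```python
def check_row_count(row_count, severity="strong"):
    """Sketch of the 'table row count > 0' rule: returns (passed, action).
    In this model, a failing strong rule blocks downstream nodes;
    a failing soft rule only triggers an alert."""
    # row_count would come from a check such as:
    #   SELECT COUNT(*) FROM ods_raw_log_d WHERE dt='20240614';  (illustrative)
    passed = row_count > 0
    if passed:
        return True, "pass"
    return False, "block" if severity == "strong" else "alert"

# An empty partition under a strong rule blocks the pipeline:
print(check_row_count(0, "strong"))   # (False, 'block')
print(check_row_count(1842, "strong"))  # (True, 'pass')
```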
Perform the following steps:
On the Rule Management tab, in the Monitor Perspective section, select an existing quality monitor, such as raw_log_number_of_table_rows_not_0. Then, click Create Rule.
On the System Template Rules tab, find the Table Row Count Is Greater Than 0 rule, click +Use, and then set Degree of importance to Strong rules.
Note: In this example, the rule is defined as a strong rule. This means that if the number of rows in the ods_raw_log_d table is 0, an alert is triggered and the execution of downstream tasks is blocked.
Click OK.
Note: For more information about configuration items, see Configure rules for a single table.
3. Perform a test run on the monitor
A test run verifies whether the monitoring rules in the monitor are configured correctly. To ensure that the rules are correct and meet your expectations, you can perform a test run after you create them.
On the Rule Management tab, in the Monitor Perspective section, select the quality monitor that you created, such as raw_log_number_of_table_rows_not_0. Then, click Test Run on the right to open the Test Run dialog box.
In the Test Run dialog box, select a Scheduling Time and click Test Run.
After the test run is complete, click View Details to check whether the test passed.
4. Subscribe to the monitor
Data Quality provides a monitoring and alerting feature. You can subscribe to monitors to receive alert notifications about data quality issues. This lets you resolve issues promptly and ensures data security, stability, and timeliness.
On the Rule Management tab, in the Monitor Perspective section, select the quality monitor, such as raw_log_number_of_table_rows_not_0, and then click Alert Subscription on the right.
As prompted, add a Notification Method and a Recipient, and then click Save in the Actions column.
After you configure the subscription settings, click My Subscriptions in the navigation pane on the left to view and modify the subscribed tasks.
Configure monitoring rules for the ods_user_info_d table
The ods_user_info_d table stores basic user information synchronized from ApsaraDB RDS for MySQL. Based on the business properties of this table, you can configure a rule to monitor whether the number of rows in the table is 0 and a rule to monitor whether the primary key values are unique. You can then associate the rules with a monitor to trigger a quality check for the table.
1. Configure a monitor
A monitor checks whether the data quality in a specified timestamp range (partition) of a table meets your expectations.
In this step, you need to set the Data Range parameter of the quality monitor to dt=$[yyyymmdd-1]. When the monitor runs, it matches the data in this partition to determine whether it meets your expectations.
This means that each time the scheduling task for the ods_user_info_d table runs, the data quality monitor is triggered. The associated quality rules are then used to check whether the data in the corresponding range meets the data quality requirements.
Perform the following steps:
On the Monitor tab, click Create Monitor.
Configure the monitor.
Key parameters:
| Parameter | Configuration example |
| --- | --- |
| Data Range | dt=$[yyyymmdd-1] |
| Trigger Method | Triggered by production scheduling. Select the ods_user_info_d node that is created during data synchronization. |
| Monitoring Rule | Do not configure this now. You will configure it in a later section. |
Note: For more information about how to configure data quality monitoring, see Configure a monitoring rule for a single table.
2. Configure monitoring rules
The ods_user_info_d table stores basic user information synchronized from RDS for MySQL. The table is used as a source table in a user persona analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0. This rule helps you determine whether synchronization tasks wrote data to the related partitions in the table.
After the rule takes effect, if the number of rows in the related partitions of the ods_user_info_d table is 0, an alert is triggered, the ods_user_info_d node fails, and the downstream tasks of the ods_user_info_d node are blocked.
Perform the following steps:
On the Rule Management tab, in the Monitor Perspective section, select a monitor. In this example, the user_info_quality_control monitor is selected. Then, click Create Rule.
On the System Template Rules tab, find the Table Row Count Is Greater Than 0 rule and click +Use. Then, set Degree of importance to Strong rules.
Note: In this example, the rule is defined as a strong rule. This means that if the number of rows in the ods_user_info_d table is 0, an alert is triggered and the execution of downstream tasks is blocked.
On the System Template Rules tab, find the Unique Value Count, Fixed Value rule, click +Use, and then set the following parameters:
Rule Scope: uid (STRING)
Monitoring Threshold: Normal threshold = 0
Degree of importance: Soft rule
Click OK.
Note: For more information about configuration items, see Configure rules for a single table.
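One way to read the uniqueness rule is that the number of duplicated uid values in the checked partition should be 0. The following Python sketch models that reading; the helpers are hypothetical and only illustrate the pass/fail logic, assuming the threshold counts duplicate rows.

```python
def duplicate_uid_count(uids):
    """Number of rows whose uid duplicates another row's uid (illustrative)."""
    return len(uids) - len(set(uids))

def check_uid_unique(uids, threshold=0):
    """Soft-rule sketch: passes when the duplicate count does not exceed
    the normal threshold (0 in this tutorial)."""
    return duplicate_uid_count(uids) <= threshold

print(check_uid_unique(["u001", "u002", "u003"]))  # True: all uids unique
print(check_uid_unique(["u001", "u001", "u002"]))  # False: u001 is duplicated
```

Because this is a soft rule, a failure would raise an alert without blocking the downstream nodes.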
3. Other configurations
The procedures to run a quality monitoring test and subscribe to quality monitoring are the same as those described in Configure monitoring rules for the ods_raw_log_d table.
Configure monitoring rules for the ads_user_info_1d table
The ads_user_info_1d table is the final result table. Based on the business properties of the table, you can configure a rule that monitors the fluctuation of the number of rows in the table and a rule that checks whether the primary key values are unique. This lets you observe the fluctuation of daily unique visitors (UVs) and learn about online traffic fluctuations at the earliest opportunity. You can then associate the rules with a quality monitor to trigger data quality checks for the table.
1. Configure a monitor
A monitor checks whether the data quality in a specified timestamp range (partition) of a table meets your expectations.
In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor runs, it matches the data in this partition to determine whether the data quality meets your expectations.
This means that each time the scheduling task for the ads_user_info_1d table is run, quality monitoring is triggered and its associated quality rules are used to check whether the data in the corresponding data range meets the data quality check rules.
Perform the following steps:
On the Monitor tab, click Create Monitor.
Configure the monitor.
Key parameters:
| Parameter | Configuration example |
| --- | --- |
| Data Range | dt=$[yyyymmdd-1] |
| Trigger Method | Triggered by production scheduling. Select the ads_user_info_1d node that is created in Data Processing. |
| Monitoring Rule | Do not configure this now. You will configure it in a later section. |
Note: For more information about how to configure data quality monitoring, see Configure a monitoring rule for a single table.
2. Configure monitoring rules
The ads_user_info_1d table is used for user persona analysis. To detect the fluctuation of daily UVs, you need to create and configure a rule that monitors the fluctuation of the number of rows in the aggregate data and a rule that monitors whether the primary key values are unique. This helps you observe the fluctuation of daily UVs and learn about online traffic fluctuations at the earliest opportunity.
When this rule is active, it triggers an alert if duplicate primary keys are detected in the ads_user_info_1d table. The rule also triggers a warning alert if the 7-day row count fluctuation rate is greater than 10% but less than 50%, and a critical alert if the rate exceeds 50%.
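The threshold logic for the 7-day volatility rule can be sketched as follows. This is an illustrative Python model: the `classify_volatility` helper is hypothetical, and the assumption that the fluctuation rate is measured against a 7-day average baseline is mine, not stated by the product.

```python
def classify_volatility(today_rows, baseline_rows):
    """Sketch of the '7-day volatility' rule: map the fluctuation rate of
    today's row count against a baseline (assumed here to be the 7-day
    average) to an alert level. Thresholds follow the tutorial:
    >50% red (critical), >10% orange (warning), otherwise normal."""
    rate = abs(today_rows - baseline_rows) / baseline_rows
    if rate > 0.50:
        return "red"     # critical alert
    if rate > 0.10:
        return "orange"  # warning alert
    return "normal"

print(classify_volatility(1600, 1000))  # red: 60% fluctuation
print(classify_volatility(1150, 1000))  # orange: 15% fluctuation
print(classify_volatility(1050, 1000))  # normal: 5% fluctuation
```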
The following handling policies are configured in data quality monitoring:
Strong rule - Red alert: The handling policy is Block. If a data quality issue is detected in the table, the system identifies the scheduling node in the production environment that writes data to the table and sets the running status of the node to Failed. The descendant nodes of the node cannot run, which blocks the production pipeline and prevents the spread of dirty data.
Other exceptions: The handling policy is Alert. If a data quality issue is detected, the system sends alert notifications to the alert subscription channels of the monitor.
Note the following when you configure monitoring rules:
For a strong rule, a red (error) alert blocks descendant nodes; other alerts do not.
For a soft rule, no alert blocks descendant nodes.
Perform the following steps:
On the Rule Management tab, in the Monitor Perspective section, select an existing quality monitor, such as ads_user_info_quality_control. Then, click Create Rule.
On the System Template Rules tab, find the Number Of Rows, 7-day Volatility rule, click +Use, and then configure the following parameters:
Monitoring Threshold: Red threshold > 50%, Orange threshold > 10%, Normal threshold <= 10%
Degree of importance: Soft rule
On the System Template Rules tab, find the Table Row Count Is Greater Than 0 rule, click +Use, and then set Degree of importance to Strong rules.
Click OK.
Note: For more information about configuration items, see Configure rules for a single table.
3. Other configurations
The procedures to perform a test run on a quality monitor and subscribe to a quality monitor are the same as those described in the Configure monitoring rules for the ods_raw_log_d table section.
What to do next
After the data is processed, you can use DataAnalysis to visualize the data. For more information, see Visualize data on a dashboard.