This topic describes how to use Data Quality to monitor the quality of table data.
Prerequisites
Before you begin, complete the tasks described in Synchronize data and Transform data.
Used data integration to synchronize basic user information from the ApsaraDB RDS for MySQL table ods_user_info_d to the MaxCompute table ods_user_info_d.
Used data integration to synchronize website access logs from the user_log.txt file in OSS to the MaxCompute table ods_raw_log_d.
Used Data Studio to transform the collected data into basic user persona data.
Background information
Data Quality is a one-stop platform that provides quality checks, notifications, and management services for various disparate data sources. Data Quality monitors data in datasets and currently supports MaxCompute tables. When offline MaxCompute data changes, Data Quality checks the data and blocks the production pipeline to prevent the spread of dirty data. Data Quality also provides management of historical check results, which lets you analyze and classify data quality.
In this tutorial, you use the Data Quality feature of DataWorks to promptly detect changes in the source data and identify dirty data generated during the extract, transform, and load (ETL) process for the user persona analysis case. Based on the user persona analysis and processing flow, the quality monitoring rules are summarized as follows:
Table name | Requirement details |
ods_raw_log_d | Monitor the daily synchronized raw log data to ensure the table is not empty. This prevents invalid processing and mitigates issues promptly. |
ods_user_info_d | Check user information data that is synchronized daily. Ensure the row count is not zero (strong rule) and the business primary key is unique (soft rule). This prevents invalid processing and stops data issues early. |
dwd_log_info_di | Not monitored separately. |
dws_user_info_all_di | Not monitored separately. |
ads_user_info_1d | Configure a rule to monitor the daily row count of your user data table. This lets you track changes in daily unique visitors (UVs) and promptly understand your application's status. |
Go to the rule configuration page
Go to the Data Quality page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Quality.
Go to the rule configuration page for the table.
In the navigation pane on the left of the Data Quality page, click , and find the target table based on the following parameters.
Data Source: MaxCompute
Database: Your current production project (workshop2024_01).
Table: In this tutorial, you need to configure quality monitoring for the
ods_raw_log_d,ods_user_info_d, andads_user_info_1dtables.
Find the target table in the search results and click Rule Management in the Actions column to go to the quality details page for the table. For specific configurations, see the following operations.
Configure quality monitoring rules
Configure quality monitoring rules for ods_raw_log_d
The ods_raw_log_d table is used to receive user website access records synchronized from OSS. Based on the business properties of this table, configure a rule to check that the table row count is not 0. Then, associate the rule with a quality monitor to trigger data quality checks.
1. Configure a quality monitor
A quality monitor checks whether the data in a specified data range (partition) of a target table meets expectations.
In this step, you need to set the Data Range for the monitor to dt=$[yyyymmdd-1]. When the monitor runs, it matches the data in this partition and checks whether the data quality meets your expectations.
Each time the scheduling task for the ods_raw_log_d table runs, it triggers the quality monitor. The monitor uses its associated quality rules to check the data in the corresponding data range and determine if it meets the data quality check rules.
Perform the following steps:
Click the Quality Monitoring tab, and then click Create Quality Monitor.
Configure the quality monitor.
Key parameters:
Parameter
Configuration example
Data Range
dt=$[yyyymmdd-1]
Trigger Method
Triggered by production scheduling. Select the
ods_raw_log_dnode created in Synchronize data.Select Quality Rule
Do not configure this now. Configure it in a later section.
NoteFor more information about how to configure a quality monitor, see Configure rules for a single table.
In this tutorial, the goal is to monitor whether the table data generated by daily scheduling tasks meets expectations. Because the table generates data for the previous day, the data is considered as expected if the scheduled time is the current day and the business time is the previous day.
2. Configure monitoring rules
The ods_raw_log_d table is used to receive website access records synchronized from OSS and serves as a source table for user persona analysis. To prevent invalid processing and quality issues, configure a strong rule for this table to check if the row count is greater than 0. This rule determines whether the sync task has written data to the table's partition.
If the number of rows in the corresponding partition of the ods_raw_log_d table is 0, an alert is triggered, the ods_raw_log_d node fails, and the downstream tasks of the ods_raw_log_d node are blocked.
Perform the following steps:
Click the Rule Management tab. In the Quality Monitoring Perspective, select the quality monitor that you created, such as
raw_log_number_of_table_rows_not_0. Then, click the Create Rule button on the right to go to the Create Rule page.In System Templates, find the Table Row Count Is Greater Than 0 rule, click Use, and then set Importance to Strong Rule.
NoteIn this tutorial, the rule is defined as a strong rule. If the row count of the
ods_raw_log_dtable is 0, an alert is triggered and the execution of downstream tasks is blocked.Click OK.
NoteFor more information about configuration items, see Configure rules for a single table.
3. Test run quality monitoring
A test run verifies whether the check rules in a quality monitor are configured correctly. To ensure that the quality rules are correct and meet expectations, test run the quality monitor after you create the rules.
On the Rule Management tab, in the Quality Monitoring Perspective, select the quality monitor that you created, such as
raw_log_number_of_table_rows_not_0. Then, click
Test Run on the right to open the Test Run dialog box.In the Test Run dialog box, select a Scheduled Time and click Test Run.
After the test run is complete, follow the on-screen instructions and click View Details to check whether the test passed.
4. Subscription Quality Monitoring
Data Quality provides a monitoring and alerting feature. Subscribe to quality monitors to promptly receive notifications about quality check exceptions and handle them. This ensures data security, stability, and timely output.
On the Rule Management tab, in the Quality Monitoring Perspective, select the quality monitor that you created, such as
raw_log_number_of_table_rows_not_0. Then, click
Alert Subscription on the right.As prompted on the page, add a Subscription Method and Recipient, and then click Save in the Actions column.
After you configure the subscription, in the navigation pane on the left, click . Select My Subscriptions to view and modify your subscribed tasks.
Configure quality monitoring rules for ods_user_info_d
The ods_user_info_d table is used to receive basic user information synchronized from ApsaraDB RDS for MySQL. Based on the business properties of this table, configure rules to check that the table row count is not 0 and that the business primary key is unique. Then, associate the rules with a quality monitor to trigger data quality checks.
1. Configure a quality monitor
A quality monitor checks whether the data in a specified data range (partition) of a target table meets expectations.
In this step, you need to configure the Data Range of the quality monitor to dt=$[yyyymmdd-1]. When the monitor runs, it matches the data in this partition to determine whether it meets your expectations.
Each time the scheduling task for the ods_user_info_d table runs, it triggers the quality monitor. The monitor uses its associated quality rules to check the data in the corresponding data range and determine if it meets the data quality check rules.
Perform the following steps:
On the Quality Monitoring tab, click Create Quality Monitor.
Configure the quality monitor.
Key parameters:
Parameter
Configuration example
Data Range
dt=$[yyyymmdd-1]
Trigger Method
Triggered by production scheduling. Select the
ods_user_info_dnode created in Synchronize data.Select Quality Rule
Do not configure this now. Configure it in a later section.
NoteFor more information about how to configure a quality monitor, see Configure rules for a single table.
2. Configure monitoring rules
The ods_user_info_d table is used to receive basic user information synchronized from ApsaraDB RDS for MySQL and serves as a source table for user persona analysis. To prevent invalid processing and quality issues, configure a strong rule for this table to check if the row count is greater than 0. This rule determines whether the sync task has written data to the table's partition.
After the rule takes effect, if the number of rows in the corresponding partitions of the ods_user_info_d table is 0, an alert is triggered, the ods_user_info_d node fails, and the downstream tasks of the ods_user_info_d node are blocked.
Perform the following steps:
On the Rule Management tab, in the Quality Monitoring Perspective, select the quality monitor that you created, such as
user_info_quality_control. Then, click Create Rule to go to the Create Rule page.In System Template, find the Table Row Count Is Greater Than 0 rule, click Use, and then set Degree Of Importance to Strong Rule.
NoteIn this tutorial, the rule is defined as a strong rule. If the row count of the
ods_user_info_dtable is 0, an alert is triggered and the execution of downstream tasks is blocked.In the System Template, find the Unique Value Count, Fixed Value rule, click Use, and then modify the Rule Scope, Monitoring Threshold, and Degree Of Importance as follows.
Rule Scope:
uid(STRING)Monitoring Threshold:
Normal threshold = 0Severity:
soft rule
Click OK.
NoteFor more information about configuration items, see Configure rules for a single table.
3. Other configurations
The procedures to test run the quality monitor and subscribe to the quality monitor are the same as those described in Configure quality monitoring rules for ods_raw_log_d.
Configure quality monitoring rules for ads_user_info_1d
The ads_user_info_1d table is the final sink table. Based on its business properties, monitor the table row count fluctuation and check the uniqueness of the business primary key. This lets you observe daily UV fluctuations and stay informed about online traffic changes. Associate the rules with a quality monitor to trigger data quality checks.
1. Configure a partition filter expression
A quality monitor checks whether the data in a specified data range (partition) of a target table meets expectations.
In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor runs, it matches the data in this partition to determine whether the data quality meets your expectations.
Each time the scheduling task for the ads_user_info_1d table runs, it triggers the quality monitor. The monitor uses its associated quality rules to check the data in the corresponding data range and determine if it meets the data quality check rules.
Perform the following steps:
On the Quality Monitoring tab, click Create Quality Monitor.
Configure the quality monitor.
Key parameters:
Parameter
Configuration example
Data Range
dt=$[yyyymmdd-1]
Trigger Method
Triggered by production scheduling. Select the
ads_user_info_1dnode created in Transform data.Select Quality Rule
Do not configure this now. Configure it in a later section.
NoteFor more information about how to configure a quality monitor, see Configure rules for a single table.
2. Create monitoring rules
The ads_user_info_1d table is used for user persona analysis. To detect daily UV fluctuations, monitor the row count fluctuation of the aggregated data and check the uniqueness of the primary key. This helps you observe daily UV fluctuations and stay informed about online traffic changes.
When the rule is in effect, it triggers an alert if it detects that the primary key in the ads_user_info_1d table is not unique. It triggers a warning alert if the 7-day row count fluctuation rate is greater than 10% and less than 50%. It triggers an error alert if the 7-day row count fluctuation rate is greater than 50%.
Data quality monitoring has Handling Policies configured:
Strong rule-Red alert: The handling policy is Blocks. This indicates that if a data quality issue is detected in the table, the scheduling node in the production environment that is used to write data to the table is identified, and the system sets the running status of the node to Failed. In this case, the descendant nodes of the node cannot be run, which blocks the production link and prevents the spread of dirty data.For other exceptions, the handling policy is Alert. This means that if a data quality issue is detected, the system sends alert information to the alert subscription channels of the monitoring job.
Therefore, when you configure quality rules:
If you set a strong rule, an error alert blocks downstream nodes. Other alerts do not.
If you set a soft rule, neither red abnormal alerts nor other abnormal alerts block downstream task nodes.
Perform the following steps:
On the Rule Management tab, in the Quality Monitoring Perspective, select the quality monitor that you created, such as
ads_user_info_quality_control. Then, click Create Rule to go to the Create Rule page.In the System Template tab, find the Number Of Rows, 7-day Volatility rule, click +Use, and then configure the Monitoring Threshold and Importance Level as follows.
Monitoring Threshold:
Error threshold > 50%Warning threshold > 10%Normal threshold <= 10%
Importance:
soft rule
In the System Template, find the Table Row Count Is Greater Than 0 rule, click Use, and then set Importance to Strong Rule.
Click OK.
NoteFor more information about configuration items, see Configure rules for a single table.
3. Other configurations
The procedures to test run the quality monitor and subscribe to the quality monitor are the same as those described in Configure quality monitoring rules for ods_raw_log_d.
What to do next
After the data is transformed, you can use the DataAnalysis module to visualize the data. For more information, see Visualize data.