This topic describes how to use Data Quality to monitor the quality of table data.
Prerequisites
Before you begin this tutorial, ensure that you have completed Synchronize data and Process data.
The basic user information from the RDS for MySQL table ods_user_info_d is synchronized to the MaxCompute table ods_user_info_d using Data Integration.
The website access logs of users from the user_log.txt file in OSS are synchronized to the MaxCompute table ods_raw_log_d using Data Integration.
The collected data is processed into basic user persona data in DataStudio.
Background
Data Quality is an end-to-end platform that lets you check the data quality of heterogeneous data sources, configure alert notifications, and manage data sources. Data Quality monitors data in datasets. You can use Data Quality to monitor MaxCompute tables. When offline MaxCompute data changes, Data Quality checks the data and blocks the nodes that use the data. This prevents downstream data from being affected by dirty data. In addition, Data Quality lets you manage the check result history so that you can analyze and evaluate the data quality.
In this example, the Data Quality feature of DataWorks is used to promptly detect changes to source data and identify dirty data generated during the extract, transform, and load (ETL) process for the user persona analysis case. The following table describes the monitoring requirements for the analysis and transformation procedure of user persona data.
| Table name | Requirement details |
| --- | --- |
| ods_raw_log_d | Monitor the daily synchronized raw log data to ensure that the table is not empty. This prevents invalid downstream processing and surfaces issues promptly. |
| ods_user_info_d | Check the user information data that is synchronized daily. Ensure that the row count is not zero (strong rule) and that the business primary key is unique (soft rule). This prevents invalid processing and stops data issues early. |
| dwd_log_info_di | Not monitored separately. |
| dws_user_info_all_di | Not monitored separately. |
| ads_user_info_1d | Configure a rule to monitor the daily row count of the user data table. This lets you track changes in daily unique visitors (UVs) and promptly understand your application's status. |
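The monitoring plan above can be sketched as plain data, which makes the table-to-rule mapping easy to scan. This is an illustrative Python sketch only; the rule names and severity labels are hypothetical shorthand, not DataWorks identifiers.

```python
# Hypothetical sketch of the monitoring plan from the table above.
# Table names come from the tutorial; rule/severity labels are illustrative.
MONITORING_PLAN = {
    "ods_raw_log_d": [("row_count_gt_0", "strong")],
    "ods_user_info_d": [("row_count_gt_0", "strong"), ("primary_key_unique", "soft")],
    "ads_user_info_1d": [("row_count_7day_volatility", "soft"), ("row_count_gt_0", "strong")],
}

def rules_for(table):
    """Return the planned (rule, severity) pairs for a table, or [] if it is not monitored."""
    return MONITORING_PLAN.get(table, [])

print(rules_for("ods_user_info_d"))
print(rules_for("dwd_log_info_di"))  # not monitored separately
```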
Go to the rule configuration page
Go to the Data Quality page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Quality. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Quality.
Go to the rule configuration page for a table.
In the navigation pane on the left of the Data Quality page, search for the target table based on the following parameters.
Data Source: MaxCompute
Database: Your current production project, such as workshop2024_01.
Table: In this tutorial, you configure quality monitoring for the ods_raw_log_d, ods_user_info_d, and ads_user_info_1d tables.
Find the target table in the search results and click Rule Management in the Actions column. The Table Quality Details page for the table appears. The following section describes how to configure monitoring rules for the table.
Configure monitoring rules
Configure monitoring rules for the ods_raw_log_d table
The ods_raw_log_d table stores user website access records synchronized from OSS. Based on the business properties of this table, you can configure a rule to verify that the table row count is not 0. You can then associate this rule with a quality monitor to trigger a data quality check.
1. Configure a monitor
A monitor checks whether the data quality in a specified timestamp range (partition) of a table meets your expectations.
In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor is run, it searches for the data partitions that match the parameter value and checks whether the data quality meets your expectations.
This means that each time the scheduling job for the ods_raw_log_d table runs, a data quality monitor is triggered. The monitor then uses its associated quality rules to check the data in the corresponding data range and determine whether the data conforms to the quality check rules.
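The data range resolution described above can be illustrated with a minimal Python sketch. The `resolve_partition` helper is hypothetical and only mimics how an expression such as dt=$[yyyymmdd-1] resolves to the previous day's partition at scheduling time.

```python
from datetime import date, timedelta

def resolve_partition(days_back=1, run_date=None):
    """Resolve a data range like dt=$[yyyymmdd-1]: the partition for
    `days_back` days before the scheduling date (illustrative only)."""
    run_date = run_date or date.today()
    return "dt=" + (run_date - timedelta(days=days_back)).strftime("%Y%m%d")

# For a run scheduled on 2024-06-15, dt=$[yyyymmdd-1] resolves to:
print(resolve_partition(run_date=date(2024, 6, 15)))  # dt=20240614
```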
Perform the following steps:
On the Monitor tab, click Create Monitor.
Configure the monitor.
Key parameters:
| Parameter | Configuration example |
| --- | --- |
| Data Range | dt=$[yyyymmdd-1] |
| Trigger Method | Triggered by production scheduling. Select the ods_raw_log_d node that is created during data synchronization. |
| Monitoring Rule | Configure this in a later section. |
Note: For more information about how to configure data quality monitoring, see Configure a monitoring rule for a single table.
In this example, the goal is to monitor whether the table data generated by daily scheduling tasks meets expectations. Because the table generates data for the previous day, the data meets expectations if a task scheduled on the current day produces results for the previous day's partition.
2. Configure monitoring rules
The ods_raw_log_d table stores website access logs of users synchronized from OSS. The table is used as a source table in a user persona analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0. This rule helps you determine whether synchronization tasks wrote data to the related partitions in the table.
If the number of rows in the related partitions of the ods_raw_log_d table is 0, an alert is triggered, the ods_raw_log_d node fails, and the downstream tasks of the ods_raw_log_d node are blocked.
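The rule's behavior can be sketched in a few lines of Python. This is a hypothetical model of the outcome, not a DataWorks API: a failing strong rule blocks descendant nodes, while a failing soft rule only raises an alert.

```python
def check_row_count(row_count, severity="strong"):
    """Sketch of the 'table row count > 0' rule: returns (passed, action).
    In this model, a failing strong rule blocks downstream nodes;
    a failing soft rule only triggers an alert."""
    # row_count would come from a check such as:
    #   SELECT COUNT(*) FROM ods_raw_log_d WHERE dt='20240614';  (illustrative)
    passed = row_count > 0
    if passed:
        return True, "pass"
    return False, "block" if severity == "strong" else "alert"

# An empty partition under a strong rule blocks the pipeline:
print(check_row_count(0, "strong"))   # (False, 'block')
print(check_row_count(1842, "strong"))  # (True, 'pass')
```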
Perform the following steps:
On the Rule Management tab, in the Monitor Perspective section, select an existing quality monitor, such as raw_log_number_of_table_rows_not_0. Then, click Create Rule.
On the System Template Rules tab, find the Table Row Count Is Greater Than 0 rule, click +Use, and then set Degree of importance to Strong rules.
Note: In this example, the rule is defined as a strong rule. This means that if the number of rows in the ods_raw_log_d table is 0, an alert is triggered and the execution of downstream tasks is blocked.
Click OK.
Note: For more information about configuration items, see Configure rules for a single table.
3. Perform a test run on the monitor
A test run verifies whether the monitoring rules in the monitor are configured correctly. To ensure that the rules are correct and meet your expectations, you can perform a test run after you create them.
On the Rule Management tab, in the Monitor Perspective section, select the quality monitor that you created, such as raw_log_number_of_table_rows_not_0. Then, click Test Run on the right to open the Test Run dialog box.
In the Test Run dialog box, select a Scheduling Time and click Test Run.
After the test run is complete, click View Details to check whether the test passed.
4. Subscribe to the monitor
Data Quality provides a monitoring and alerting feature. You can subscribe to monitors to receive alert notifications about data quality issues. This lets you resolve issues promptly and ensures data security, stability, and timeliness.
On the Rule Management tab, in the Monitor Perspective section, select the quality monitor, such as raw_log_number_of_table_rows_not_0, and then click Alert Subscription on the right.
As prompted, add a Notification Method and a Recipient, and then click Save in the Actions column.
After you configure the subscription settings, click My Subscriptions in the navigation pane on the left to view and modify the subscribed tasks.
Configure monitoring rules for the ods_user_info_d table
The ods_user_info_d table stores basic user information synchronized from ApsaraDB RDS for MySQL. Based on the business properties of this table, you can configure a rule to monitor whether the number of rows in the table is 0 and a rule to monitor whether the primary key values are unique. You can then associate the rules with a monitor to trigger a quality check for the table.
1. Configure a monitor
A monitor checks whether the data quality in a specified timestamp range (partition) of a table meets your expectations.
In this step, you need to set the Data Range parameter of the quality monitor to dt=$[yyyymmdd-1]. When the monitor runs, it matches the data in this partition to determine whether it meets your expectations.
This means that each time the scheduling task for the ods_user_info_d table runs, the data quality monitor is triggered. The associated quality rules are then used to check whether the data in the corresponding range meets the data quality requirements.
Perform the following steps:
On the Monitor tab, click Create Monitor.
Configure the monitor.
Key parameters:
| Parameter | Configuration example |
| --- | --- |
| Data Range | dt=$[yyyymmdd-1] |
| Trigger Method | Triggered by production scheduling. Select the ods_user_info_d node that is created during data synchronization. |
| Monitoring Rule | Do not configure this now. You will configure it in a later section. |
Note: For more information about how to configure data quality monitoring, see Configure a monitoring rule for a single table.
2. Configure monitoring rules
The ods_user_info_d table stores basic user information synchronized from RDS for MySQL. The table is used as a source table in a user persona analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0. This rule helps you determine whether synchronization tasks wrote data to the related partitions in the table.
After the rule takes effect, if the number of rows in the related partitions of the ods_user_info_d table is 0, an alert is triggered, the ods_user_info_d node fails, and the downstream tasks of the ods_user_info_d node are blocked.
Perform the following steps:
On the Rule Management tab, in the Monitor Perspective section, select a monitor. In this example, the user_info_quality_control monitor is selected. Then, click Create Rule.
On the System Template Rules tab, find the Table Row Count Is Greater Than 0 rule and click +Use. Then, set Degree of importance to Strong rules.
Note: In this example, the rule is defined as a strong rule. This means that if the number of rows in the ods_user_info_d table is 0, an alert is triggered and the execution of downstream tasks is blocked.
On the System Template Rules tab, find the Unique Value Count, Fixed Value rule, click +Use, and then set the following parameters:
Rule Scope: uid (STRING)
Monitoring Threshold: Normal threshold = 0
Degree of importance: Soft rule
Click OK.
Note: For more information about configuration items, see Configure rules for a single table.
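One way to read the uniqueness rule is that the number of duplicated uid values in the checked partition should be 0. The following Python sketch models that reading; the helpers are hypothetical and only illustrate the pass/fail logic, assuming the threshold counts duplicate rows.

```python
def duplicate_uid_count(uids):
    """Number of rows whose uid duplicates another row's uid (illustrative)."""
    return len(uids) - len(set(uids))

def check_uid_unique(uids, threshold=0):
    """Soft-rule sketch: passes when the duplicate count does not exceed
    the normal threshold (0 in this tutorial)."""
    return duplicate_uid_count(uids) <= threshold

print(check_uid_unique(["u001", "u002", "u003"]))  # True: all uids unique
print(check_uid_unique(["u001", "u001", "u002"]))  # False: u001 is duplicated
```

Because this is a soft rule, a failure would raise an alert without blocking the downstream nodes.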
3. Other configurations
The procedures to run a quality monitoring test and subscribe to quality monitoring are the same as those described in Configure monitoring rules for the ods_raw_log_d table.
Configure monitoring rules for the ads_user_info_1d table
The ads_user_info_1d table is the final result table. Based on the business properties of the table, you can configure a rule that monitors the fluctuation of the number of rows in the table and a rule that checks whether the primary key values are unique. This lets you observe the fluctuation of daily unique visitors (UVs) and learn about online traffic fluctuations at the earliest opportunity. You can then associate the rules with a quality monitor to trigger data quality checks for the table.
1. Configure a monitor
A monitor checks whether the data quality in a specified timestamp range (partition) of a table meets your expectations.
In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor runs, it matches the data in this partition to determine whether the data quality meets your expectations.
This means that each time the scheduling task for the ads_user_info_1d table is run, quality monitoring is triggered and its associated quality rules are used to check whether the data in the corresponding data range meets the data quality check rules.
Perform the following steps:
On the Monitor tab, click Create Monitor.
Configure the monitor.
Key parameters:
| Parameter | Configuration example |
| --- | --- |
| Data Range | dt=$[yyyymmdd-1] |
| Trigger Method | Triggered by production scheduling. Select the ads_user_info_1d node that is created in Data Processing. |
| Monitoring Rule | Do not configure this now. You will configure it in a later section. |
Note: For more information about how to configure data quality monitoring, see Configure a monitoring rule for a single table.
2. Configure monitoring rules
The ads_user_info_1d table is used for user persona analysis. To detect the fluctuation of daily UVs, you need to create and configure a rule that monitors the fluctuation of the number of rows in the aggregate data and a rule that monitors whether the primary key values are unique. This helps you observe the fluctuation of daily UVs and learn about online traffic fluctuations at the earliest opportunity.
When this rule is active, it triggers an alert if duplicate primary keys are detected in the ads_user_info_1d table. The rule also triggers a warning alert if the 7-day row count fluctuation rate is greater than 10% but less than 50%, and a critical alert if the rate exceeds 50%.
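The threshold logic for the 7-day volatility rule can be sketched as follows. This is an illustrative Python model: the `classify_volatility` helper is hypothetical, and the assumption that the fluctuation rate is measured against a 7-day average baseline is mine, not stated by the product.

```python
def classify_volatility(today_rows, baseline_rows):
    """Sketch of the '7-day volatility' rule: map the fluctuation rate of
    today's row count against a baseline (assumed here to be the 7-day
    average) to an alert level. Thresholds follow the tutorial:
    >50% red (critical), >10% orange (warning), otherwise normal."""
    rate = abs(today_rows - baseline_rows) / baseline_rows
    if rate > 0.50:
        return "red"     # critical alert
    if rate > 0.10:
        return "orange"  # warning alert
    return "normal"

print(classify_volatility(1600, 1000))  # red: 60% fluctuation
print(classify_volatility(1150, 1000))  # orange: 15% fluctuation
print(classify_volatility(1050, 1000))  # normal: 5% fluctuation
```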
The following handling policies are configured in data quality monitoring:
Strong rule - Red alert: The handling policy is Block. If a data quality issue is detected in the table, the system identifies the scheduling node in the production environment that writes data to the table and sets the running status of the node to Failed. The descendant nodes of the node cannot run, which blocks the production pipeline and prevents the spread of dirty data.
Other exceptions: The handling policy is Alert. If a data quality issue is detected, the system sends alert notifications to the alert subscription channels of the monitor.
Note the following when you configure monitoring rules:
For a strong rule, a red (error) alert blocks descendant nodes; other alerts do not.
For a soft rule, no alert blocks descendant nodes.
Perform the following steps:
On the Rule Management tab, in the Monitor Perspective section, select an existing quality monitor, such as ads_user_info_quality_control. Then, click Create Rule.
On the System Template Rules tab, find the Number Of Rows, 7-day Volatility rule, click +Use, and then configure the following parameters:
Monitoring Threshold: Red threshold > 50%, Orange threshold > 10%, Normal threshold <= 10%
Degree of importance: Soft rule
On the System Template Rules tab, find the Table Row Count Is Greater Than 0 rule, click +Use, and then set Degree of importance to Strong rules.
Click OK.
Note: For more information about configuration items, see Configure rules for a single table.
3. Other configurations
The procedures to perform a test run on a quality monitor and subscribe to a quality monitor are the same as those described in the Configure monitoring rules for the ods_raw_log_d table section.
What to do next
After the data is processed, you can use DataAnalysis to visualize the data. For more information, see Visualize data on a dashboard.