DataWorks: Configure rules to monitor data quality

Last Updated: Jan 15, 2024

This topic describes how to use Data Quality to monitor the quality of table data.

Prerequisites

Data is collected and processed.

  • The basic user information in the ApsaraDB RDS for MySQL table ods_user_info_d is synchronized to the MaxCompute table ods_user_info_d by using Data Integration.

  • The website access logs of users in user_log.txt in Object Storage Service (OSS) are synchronized to the MaxCompute table ods_raw_log_d by using Data Integration.

  • The collected data is processed into basic user profile data in DataStudio.

Background information

Data Quality is an end-to-end platform that allows you to check the data quality of heterogeneous data sources, configure alert notifications, and manage data sources. Data Quality monitors data in datasets. You can use Data Quality to monitor MaxCompute tables. When offline MaxCompute data changes, Data Quality checks the data and blocks nodes that use the data. This prevents downstream data from being affected by dirty data. In addition, Data Quality allows you to manage the check result history so that you can analyze and evaluate the data quality.

In this example, Data Quality is used to detect changes to source data in the user profile analysis case and dirty data that is generated when the extract, transform, and load (ETL) operations are performed on the source data at the earliest opportunity. The following list describes the monitoring requirements for the analysis and processing procedure of user profile data.

  • ods_raw_log_d: Configure a rule that monitors whether the number of rows synchronized to the raw log data table is 0 on a daily basis. This rule helps prevent invalid data processing.

  • ods_user_info_d: Configure a strong rule that monitors whether the number of rows synchronized to the user information table is 0 on a daily basis, and a weak rule that monitors whether the business primary key in the table is unique on a daily basis. These rules help prevent invalid data processing.

  • dwd_log_info_di: No separate rule is required.

  • dws_user_info_all_di: No separate rule is required.

  • ads_user_info_1d: Configure a rule that monitors the fluctuation of the number of rows in the user information table on a daily basis. The rule is used to observe the fluctuation of daily unique visitors (UVs) and helps you understand the application status at the earliest opportunity.

Go to the rule configuration page

  1. Go to the Data Quality page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > Data Quality. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Quality.

  2. Go to the rule configuration page.

    In the left-side navigation pane of the Data Quality page, choose Rule Management > Configure Rule (by Table) and find the desired table based on the following parameters:

    • Connection: Set it to MaxCompute.

    • Database: Set it to WorkShop2023.

    • Table Name: In this example, search for ods_raw_log_d, ods_user_info_d, and ads_user_info_1d in turn, and configure monitoring rules for each of these tables.

  3. Find the desired table in the search result and click Configure Monitoring Rules in the Actions column to go to the rule configuration page of the table. The following section describes how to configure monitoring rules for each table.

Configure monitoring rules

Configure monitoring rules for the ods_raw_log_d table

The MaxCompute table ods_raw_log_d is used to store website access logs of users synchronized from OSS. Based on the business properties of the table, you can configure a rule that monitors whether the number of rows in the table is 0. After you associate the rule with scheduling nodes, the rule is triggered when the associated nodes meet the specified conditions in the rule.

1. Configure a partition filter expression

Data Quality uses partition filter expressions to determine the table partitions whose data quality needs to be monitored.

Data Quality monitors whether data in the partitions that match the dt=$[yyyymmdd-1] partition filter expression in the ods_raw_log_d table is as expected. Each time the monitoring rule that you create for the ods_raw_log_d table is executed, the rule monitors whether data in the data timestamp-based partitions in the table meets the specified conditions in the monitoring rule. You can perform the following steps to create and configure a partition filter expression.

  1. On the rule configuration page of the table, click the add icon next to Partition Filter Expression.

  2. In the Add Partition dialog box, set Partition Filter Expression to dt=$[yyyymmdd-1].

  3. Click Verify to check whether the calculation result is as expected. After you confirm the result, click OK.

In this example, the monitoring rule is configured to check whether the table data that scheduling nodes generate every day is as expected. The scheduling nodes always generate data whose data timestamp is the previous day. Therefore, in the Add Partition dialog box, if the value of the Scheduling Time parameter is the current day and the value of the Result parameter is the previous day, the partition filter expression resolves as expected.
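
For example, if the rule is triggered on June 15, 2023, the dt=$[yyyymmdd-1] expression resolves to the partition of the previous day, and the check scans only that partition. The following SQL sketch is for illustration only, because Data Quality runs the check internally; the date values are hypothetical:

  -- When the scheduling time is June 15, 2023, dt=$[yyyymmdd-1]
  -- resolves to dt=20230614. The check is then equivalent to
  -- scanning only that partition:
  SELECT COUNT(*) FROM ods_raw_log_d WHERE dt = '20230614';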

2. Configure monitoring rules

The MaxCompute table ods_raw_log_d is used to store website access logs of users synchronized from OSS. The table is used as a source table in the user profile analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0 to determine whether synchronization tasks have written data to the related partitions in the table.

If the number of rows in the related partitions of the ods_raw_log_d table is 0, an alert is triggered, the ods_raw_log_d node fails and exits, and the descendant nodes of the ods_raw_log_d node are blocked from running.

  1. Click Create Rule. The Create Rule panel appears.

  2. Click Add Monitoring Rule on the Template Rules tab, configure the following parameters, and then click Batch Create.

    • Rule Name: the name of the monitoring rule. In this example, set it to "whether the number of rows in the table is greater than 0".

    • Rule Type: the strength of the monitoring rule. The rule strength determines whether to block the running of descendant nodes.

      In this example, Rule Type is set to Strong. If the number of rows in the ods_raw_log_d table is 0, an alert is triggered and the descendant nodes of the ods_raw_log_d node are blocked from running.

    • For information about other items configured for a monitoring rule, see Configure monitoring rules by table (for a single table).
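
Conceptually, the check that this template rule performs can be expressed as the following SQL sketch. The query is for illustration only: Data Quality runs the check internally, and the ${bizdate} placeholder stands for the partition value that the dt=$[yyyymmdd-1] expression resolves to.

  -- Illustration only. '${bizdate}' is a placeholder for the resolved
  -- partition value, for example 20230614.
  SELECT COUNT(*) AS row_cnt
  FROM ods_raw_log_d
  WHERE dt = '${bizdate}';
  -- The strong rule passes when row_cnt is greater than 0. If row_cnt is 0,
  -- an alert is triggered and descendant nodes are blocked from running.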

3. Test the monitoring rule

After you create a monitoring rule, you can test the rule to ensure that the rule configurations are valid and meet your business requirements.

  1. Click Test.

  2. In the Test dialog box, configure the Scheduling Time parameter and click Test.

  3. After the test run is complete, follow the on-screen instructions to view the test run result.

4. Associate the rule with scheduling nodes

Data Quality allows you to associate a monitoring rule with scheduling nodes to detect changes to source data and dirty data that is generated during the ETL process at the earliest opportunity. Associated nodes can trigger the execution of the monitoring rule.

In this example, after the ods_raw_log_d node synchronizes OSS data to the ods_raw_log_d table, the monitoring rule is executed to check whether the number of rows in the related partitions in the table is 0. If the table fails the check, the descendant nodes of the ods_raw_log_d node are blocked from running. This helps prevent dirty data from spreading.

  1. On the rule configuration page of the table, click Manage Linked Nodes.

  2. In the Manage Linked Nodes dialog box, enter ods_raw_log_d in the search box, select the node, and then click Create. When the ods_raw_log_d node is run, the monitoring rule that you configured for the ods_raw_log_d table is triggered to detect data quality issues.

5. Subscribe to table monitoring rules to view the rule check result

Data Quality provides the monitoring and alerting feature. You can subscribe to table monitoring rules to receive alert notifications about data quality issues that occurred on the table and handle the data quality issues at the earliest opportunity. This ensures data security, data stability, and timeliness of data generation.

After you configure subscriptions, you can click My Subscriptions in the left-side navigation pane to view or modify the subscriptions.

Configure monitoring rules for the ods_user_info_d table

The MaxCompute table ods_user_info_d is used to store basic user information synchronized from ApsaraDB RDS for MySQL. Based on the business properties of the table, you can configure a rule that monitors whether the number of rows in the table is 0 and a rule that monitors whether the business primary key is unique. After you associate the rules with scheduling nodes, the rules are triggered when the associated nodes meet the specified conditions in the rules. You can perform the following steps to configure monitoring rules for the table:

1. Configure a partition filter expression

In this step, you can configure the dt=$[yyyymmdd-1] partition filter expression for the ods_user_info_d table. Each time the monitoring rules that you created for the ods_user_info_d table are executed, the rules monitor whether data in the data timestamp-based partitions in the table meets the specified conditions in the monitoring rules.

2. Configure monitoring rules

The MaxCompute table ods_user_info_d is used to store basic user information synchronized from ApsaraDB RDS for MySQL. The table is used as a source table in the user profile analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0 to determine whether synchronization tasks have written data to the related partitions in the table, and a weak rule that monitors whether the business primary key in the table is unique.

After the monitoring rules take effect, if the number of rows in the related partitions of the ods_user_info_d table is 0, an alert is triggered, the ods_user_info_d node that generates the ods_user_info_d table fails and exits, and the descendant nodes of the ods_user_info_d node are blocked from running. You can create and configure the following monitoring rules for the table. For more information about configuration items, see Configure monitoring rules by table (for a single table).

  • Strong rule used to monitor whether the number of rows in the table is 0

  • Weak rule used to monitor whether the business primary key is unique
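
As a rough illustration of what the weak rule checks, the following SQL sketch counts duplicate values of the business primary key. The column name uid is an assumption based on the user profile tutorial schema, and Data Quality runs the actual check internally.

  -- Illustration only. uid is assumed to be the business primary key column;
  -- replace it with the actual column in your schema.
  SELECT COUNT(*) - COUNT(DISTINCT uid) AS duplicate_cnt
  FROM ods_user_info_d
  WHERE dt = '${bizdate}';
  -- The weak rule expects duplicate_cnt to be 0. A nonzero result triggers
  -- an alert, but descendant nodes are not blocked because the rule is weak.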

3. Associate the rules with scheduling nodes

Data Quality allows you to associate a monitoring rule with scheduling nodes to detect changes to source data and dirty data that is generated during the ETL process at the earliest opportunity. Associated nodes can trigger the execution of the monitoring rule.

In this example, after the ods_user_info_d node synchronizes ApsaraDB RDS for MySQL data to the ods_user_info_d table, the monitoring rule is executed to check whether the number of rows in the related partitions in the table is 0. If the table fails the check, the descendant nodes of the ods_user_info_d node are blocked from running. This helps prevent dirty data from spreading.

4. Perform other operations

The operations to test and subscribe to the monitoring rules that you configure for this table are the same as those in Configure monitoring rules for the ods_raw_log_d table.

Configure monitoring rules for the ads_user_info_1d table

The ads_user_info_1d table is the final result table. Based on the business properties of the table, you can configure a rule that monitors the fluctuation of the number of rows in the table and a rule that monitors whether the business primary key is unique. These rules help you observe the fluctuation of daily UVs and detect online traffic fluctuations at the earliest opportunity. After you associate the rules with scheduling nodes, the rules are triggered when the associated nodes meet the specified conditions in the rules. You can perform the following steps to configure monitoring rules for the table:

1. Configure a partition filter expression

In this step, you can configure the dt=$[yyyymmdd-1] partition filter expression for the ads_user_info_1d table. Each time the monitoring rules that you create for the ads_user_info_1d table are executed, the rules monitor whether data in the data timestamp-based partitions in the table meets the specified conditions in the monitoring rules.

2. Configure monitoring rules

The ads_user_info_1d table is used for user profile analysis. To detect the fluctuation of daily UVs, you need to create and configure a rule that monitors the fluctuation of the number of rows of aggregated data in the table and a rule that monitors whether the business primary key is unique. These rules help you observe the fluctuation of daily UVs and detect online traffic fluctuations at the earliest opportunity.

After the monitoring rules take effect, if the business primary key is not unique for the ads_user_info_1d table, an alert is triggered. If the fluctuation rate of the number of rows in the ads_user_info_1d table within seven days is greater than 10% and less than 50%, a warning alert is triggered. If the fluctuation rate of the number of rows in the ads_user_info_1d table within seven days is greater than or equal to 50%, a critical alert is triggered.

Note
  • Strong: If the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If the warning threshold is exceeded, warning alerts are reported but descendant nodes are not blocked.

  • Weak: If the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If the warning threshold is exceeded, warning alerts are not reported and descendant nodes are not blocked.

  • Weak rule used to monitor the fluctuation of the number of rows in the table
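
The following sketch shows the intuition behind the fluctuation thresholds. The exact metric that Data Quality computes is internal, and the window query and placeholder values below are illustrative assumptions.

  -- Illustration only: compare the row count of the latest partition with
  -- the counts of the preceding partitions in a seven-day window.
  -- '${bizdate_minus_6}' and '${bizdate}' are hypothetical placeholders.
  SELECT dt, COUNT(*) AS row_cnt
  FROM ads_user_info_1d
  WHERE dt BETWEEN '${bizdate_minus_6}' AND '${bizdate}'
  GROUP BY dt;
  -- If the latest row_cnt deviates from the recent baseline by more than 10%
  -- but less than 50%, a warning alert is reported. A deviation of 50% or
  -- more triggers a critical alert.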

3. Associate the rules with scheduling nodes

Data Quality allows you to associate a monitoring rule with scheduling nodes to detect changes to source data and dirty data that is generated during the ETL process at the earliest opportunity. Associated nodes can trigger the execution of the monitoring rule.

In this example, the monitoring rules are executed to check data in the related partitions of the processed ads_user_info_1d table. If the table fails the check, the descendant nodes of the ads_user_info_1d node are blocked from running. This helps prevent dirty data from spreading.

4. Perform other operations

The operations to test and subscribe to the monitoring rules that you configure for this table are the same as those in Configure monitoring rules for the ods_raw_log_d table.

What to do next

After data is processed, you can use the DataAnalysis service to visualize the data. For more information, see Visualize data on a dashboard.