You can use the Data Quality service of DataWorks to monitor data accuracy, consistency, and integrity. This topic describes how to use the Data Quality service to monitor data quality at each layer of a data warehouse.

Prerequisites

The steps in the Build an online operation analysis platform tutorial are completed. A DataWorks workspace is created in the China (Shanghai) region. For more information, see Business scenario and development process. The involved data is classified. In this tutorial, the data class is A2. For more information, see Classify data.

Background information

Data Quality can monitor the quality of data and generate alerts based on data classes. You can configure fine-grained monitoring rules based on the following items. For more information about Data Quality, see Overview.
  • Monitoring category: data amount, primary key, discrete value, accumulated value, business rule, and logical rule.
  • Monitoring granularity: field and table.
  • Monitoring layer: operational data store (ODS), common data model (CDM), and application data store (ADS). The CDM layer includes the data warehouse detail (DWD) layer and data warehouse summary (DWS) layer. Data Quality focuses on the integrity and consistency of data at the ODS and DWD layers, and the accuracy of data at the DWS and ADS layers.

Procedure

  1. Monitor data quality at the ODS layer.

    The data in the ods_user_trace_log table at the ODS layer comes from log files in Object Storage Service (OSS). The table is used as the source table. Therefore, you must check whether data exists in the partitions of the table as early as possible. If the partitions contain no data, prevent descendant nodes from running because no source data is available.

    1. Go to the Data Quality page.
      Log on to the DataWorks console. Go to the DataStudio page, click the DataWorks icon in the upper-left corner, and then choose All Products > Data Quality.
    2. Go to the Monitoring Rules page of the ods_user_trace_log table.
      On the Data Quality page, click Monitoring Rules in the left-side navigation pane. On the Monitoring Rules page that appears, find the ods_user_trace_log table and click View Monitoring Rules in the Actions column. The Monitoring Rules page of the table appears.
    3. Add a partition filter expression.
      1. Click + in the Partition Expression section. In the Add Partition dialog box that appears, select dt=$[yyyymmdd-1] from the Partition Expression drop-down list. This partition filter expression specifies the day before the current day. The value is the same as the data timestamp specified by the ${bdp.system.bizdate} parameter. For more information about partition filter expressions, see Scheduling parameters. If the table has no partition key column, select NOTAPARTITIONTABLE from the drop-down list.
      2. Click Verify to verify the calculation result.
      3. Click OK. The partition filter expression is added.
    4. Create a rule to monitor whether the ods_user_trace_log table contains data.
      1. On the Monitoring Rules page of the ods_user_trace_log table, click Create rules.
      2. In the Create rules pane that appears, click Add Monitoring Rule on the Template Rules tab.
      3. Specify the rule parameters.
        Parameter Description
        Rule Name The name of the rule. Enter a custom name.
        Rule Type The type of the rule. Set the value to Strong. Valid values: Strong and Weak.
        • If you select Strong, error alerts are reported and descendant nodes are blocked, whereas warning alerts are reported but descendant nodes are not blocked.
        • If you select Weak, error alerts are reported but descendant nodes are not blocked, whereas warning alerts are not reported and descendant nodes are not blocked.
        Rule Source The source of the rule. Set the value to Built-in Template.
        Field The fields to be monitored. Set the value to All Fields in Table.
        Template The template to apply to the rule. Set the value to Number of rows, fixed value.
        Comparison Method The comparison method of the rule. Set the value to Greater Than.
        Expected Value The expected value of the rule. Set the value to 0.
    5. Create a rule to monitor duplicate values.
      1. Click Add Monitoring Rule again.
      2. Specify the rule parameters.
        Parameter Description
        Rule Name The name of the rule. Enter a custom name.
        Rule Type The type of the rule. Set the value to Strong. Valid values: Strong and Weak.
        • If you select Strong, error alerts are reported and descendant nodes are blocked, whereas warning alerts are reported but descendant nodes are not blocked.
        • If you select Weak, error alerts are reported but descendant nodes are not blocked, whereas warning alerts are not reported and descendant nodes are not blocked.
        Rule Source The source of the rule. Set the value to Built-in Template.
        Field The fields to be monitored. Set the value to ts(bigint). The ts(bigint) field specifies the UNIX timestamp, which the rule monitors to avoid duplicate data at the ODS layer.
        Template The template to apply to the rule. Set the value to Repeated value, fixed value.
        Comparison Method The comparison method of the rule. Set the value to Equal To.
        Expected Value The expected value of the rule. Set the value to 0.
    6. Create a rule to monitor null values.
      1. Click Add Monitoring Rule again.
      2. Specify the rule parameters.
        Parameter Description
        Rule Name The name of the rule. Enter a custom name.
        Rule Type The type of the rule. Set the value to Strong. Valid values: Strong and Weak.
        • If you select Strong, error alerts are reported and descendant nodes are blocked, whereas warning alerts are reported but descendant nodes are not blocked.
        • If you select Weak, error alerts are reported but descendant nodes are not blocked, whereas warning alerts are not reported and descendant nodes are not blocked.
        Rule Source The source of the rule. Set the value to Built-in Template.
        Field The fields to be monitored. Set the value to uid(string). The uid(string) field specifies the user ID, which the rule monitors to avoid dirty data that contains a null value.
        Template The template to apply to the rule. Set the value to Number of null values, fixed value.
        Comparison Method The comparison method of the rule. Set the value to Equal To.
        Expected Value The expected value of the rule. Set the value to 0.
    7. Save the created rules. After you complete the preceding steps, click Batch Create.
    8. Test the created rules.
      Click Test at the top of the Monitoring Rules page of the table. In the Test dialog box that appears, click Test to test the created rules.
    9. View the test result.
      After the test is completed, click The test is complete. Click to view the results to view the test result.

      On the page that appears, you can check whether the table data conforms to the created rules. Based on the test result, you can determine whether data generated in the table is as expected. We recommend that you test each monitoring rule configured for a table to verify that these monitoring rules are applicable.

    10. Link the monitoring rules to the nodes that generate data in the table.
      After you configure and test monitoring rules for the table, you must link the monitoring rules to the nodes that generate data in the table. Then, Data Quality can use the monitoring rules to check the quality of the data generated by the nodes each time the nodes are run. This guarantees data accuracy.
      1. On the Monitoring Rules page of the table, click Manage Linked Nodes.
      2. In the Manage Linked Nodes dialog box that appears, enter the names of the nodes that you want to link to the current partition filter expression and click Create.
      3. Click Close to close the Manage Linked Nodes dialog box. If a check sign (√) appears before Manage Linked Nodes, the monitoring rules are linked to the specified nodes.
    11. Configure subscriptions.
      After you link the monitoring rules to nodes, Data Quality checks the data quality of the table each time the specified nodes are run. Data Quality allows you to subscribe to monitoring rules of tables. After the subscriptions are configured, Data Quality generates alerts for abnormal monitoring results based on alert rules and sends alert notifications to you.

      On the Monitoring Rules page of the table, click Manage Subscriptions. Currently, Data Quality supports four notification methods: Email, Email and SMS, DingTalk Chatbot, and DingTalk Chatbot @ALL.

      After you configure subscriptions, you can click My Subscriptions in the left-side navigation pane to view configured subscriptions. We recommend that you subscribe to all monitoring rules.

  2. Monitor data quality at the CDM layer.
    The method of monitoring data at the CDM layer is the same as that of monitoring data at the ODS layer. The difference lies in rule details.
    1. Add a partition filter expression.
      Go to the Monitoring Rules page of the dw_user_trace_log table and add the dt=$[yyyymmdd-1] partition filter expression for the table.
    2. Create rules to monitor the number of rows and null values in the table. Create rules for monitoring the number of rows and null values in the dw_user_trace_log table in the same way as creating monitoring rules for the ods_user_trace_log table.
    3. Create a rule to monitor the fluctuation of the number of table rows.
      Parameter Description
      Rule Name The name of the rule. Enter a custom name.
      Rule Type The type of the rule. Set the value to Strong. Valid values: Strong and Weak.
      • If you select Strong, error alerts are reported and descendant nodes are blocked, whereas warning alerts are reported but descendant nodes are not blocked.
      • If you select Weak, error alerts are reported but descendant nodes are not blocked, whereas warning alerts are not reported and descendant nodes are not blocked.
      Rule Source The source of the rule. Set the value to Built-in Template.
      Field The fields to be monitored. Set the value to All Fields in Table.
      Template The template to apply to the rule. Set the value to Number of table rows, upper cycle volatility.
      Comparison Method The comparison method of the rule. Set the value to Absolute Value.
      Thresholds The warning threshold and error threshold of the rule. Set Warning Threshold to 10 and Error Threshold to 50, which indicate that an error alert will be triggered when the fluctuation of the number of table rows reaches 50%.
    4. Test the monitoring rules and link the monitoring rules to the nodes that generate data in the table in the same way as testing and linking monitoring rules for the ods_user_trace_log table.
  3. Monitor data quality at the ADS layer.
    The method of monitoring data at the ADS layer is the same as that of monitoring data at the ODS layer. The difference lies in rule details.
    1. Add a partition filter expression.
      Go to the Monitoring Rules page of the rpt_user_trace_log table and add the dt=$[yyyymmdd-1] partition filter expression for the table.
    2. Create rules to monitor the number of rows, fluctuation, and null values in the table.
      Create rules for monitoring the number of rows, fluctuation, and null values in the rpt_user_trace_log table in the same way as creating monitoring rules for the dw_user_trace_log table. The ADS layer in a data warehouse is close to the application layer and involves only a small amount of data. In addition, fewer restrictions are imposed on data at the ADS layer. Therefore, set Rule Type to Weak when you create monitoring rules for the rpt_user_trace_log table.
    3. Create a rule to monitor the abnormal number of page views (PVs) in the table.
      You can create a custom rule to monitor application data at the ADS layer.
      1. On the Monitoring Rules page of the rpt_user_trace_log table, click Create rules. In the Create rules pane that appears, click the Custom Rules tab and then click Add Monitoring Rule.
      2. Specify the rule parameters.
        Parameter Description
        Rule Name The name of the rule. Enter a custom name.
        Rule Type The type of the rule. Set the value to Weak. Valid values: Strong and Weak.
        • If you select Strong, error alerts are reported and descendant nodes are blocked, whereas warning alerts are reported but descendant nodes are not blocked.
        • If you select Weak, error alerts are reported but descendant nodes are not blocked, whereas warning alerts are not reported and descendant nodes are not blocked.
        Field The fields to be monitored. Set the value to pv(bigint).
        Sampling Method The statistical function of the rule. Set the value to sum.
        Check type The threshold type of the rule. Set the value to Numeric type.
        Verification Method The verification method of the rule. Set the value to Compare with a specified value.
        Comparison Method The comparison method of the rule. Set the value to Greater Than.
        Expected Value The expected value of the rule. Set the value to 100, which indicates that when the number of PVs decreases to 100 or less, you will receive an alert.
      3. After you complete the preceding steps, click Batch Create.
    4. Test the monitoring rules and link the monitoring rules to the nodes that generate data in the table in the same way as testing and linking monitoring rules for the ods_user_trace_log table.