This topic describes how to monitor data quality by configuring a quality monitoring rule for each table and subscribing to the monitoring rule.

Prerequisites

The data is collected and processed. For more information, see Collect data and Process data.

Background information

DataWorks provides the Data Quality module for you to control the data quality of heterogeneous data stores. In Data Quality, you can check data quality, configure alert notifications, and manage connections. Data Quality monitors data in datasets. Currently, it allows you to monitor MaxCompute tables and DataHub topics. When offline MaxCompute data changes, Data Quality checks the data and blocks the nodes that use the data if it detects exceptions. This prevents the nodes from being affected. In addition, Data Quality allows you to manage the check result history so that you can analyze and evaluate the data quality.

For streaming data, Data Quality uses DataHub to monitor data streams and sends alert notifications to subscribers if it detects stream discontinuity. You can also set the alert severity such as warning and error alerts, and the alert frequency to minimize repeated alerts.

Development process in Data Quality

  1. Configure a monitoring rule for an existing table and test the monitoring rule to check whether the monitoring rule works properly on the table.
    Based on the test result, you can determine whether data generated in the table is as expected. We recommend that you test every monitoring rule configured for a table to verify that these monitoring rules are applicable.
  2. Link the tested monitoring rule to the nodes that generate data in the table.
    After the monitoring rule is configured and tested, you must link the monitoring rule to the nodes that generate data in the table. Then, Data Quality can use the monitoring rule to check the quality of the data generated by the nodes each time the nodes are run. This helps ensure that the nodes generate accurate data.
  3. Subscribe to the monitoring rule. The monitoring rule is triggered each time a linked node is run to improve the data accuracy.
    Data Quality allows you to subscribe to monitoring rules of important tables. After the subscriptions are configured, Data Quality generates alerts for abnormal monitoring results based on alert rules and sends alert notifications to you. In this way, you can track the monitoring results.
Note
  • Each time you configure a monitoring rule for a table, you must test the monitoring rule, link the monitoring rule to the nodes that generate data in the table, and subscribe to the monitoring rule.
  • Data Quality may incur additional computing fees. For more information, see Data quality overview.

Configure monitoring rules

After completing the operations described in Collect data and Process data, verify that you have created the following tables: ods_raw_log_d, ods_user_info_d, ods_log_info_d, dw_user_info_all_d, and rpt_user_info_d.

Click the DataWorks icon in the upper-left corner and choose All Products > Data Quality. On the page that appears, click Monitoring Rules in the left-side navigation pane. On the Monitoring Rules page, you can enter a table name in the search box to search for the required table.

  1. Configure a monitoring rule for the ods_raw_log_d table.
    1. Click View Monitoring Rules in the Actions column of the ods_raw_log_d table.
    2. On the page that appears, click + in the Partition Expression section to add a partition expression.

      The ods_raw_log_d table stores the log data synchronized from OSS through the oss_workshop_log connection. The partition key values in the table are in the format of ${bdp.system.bizdate}. The bizdate parameter specifies the date that is one day before the batch synchronization node is run.

      You can configure a partition expression for such daily generated log data. The Add Partition dialog box lists the available partition expressions. Select dt=$[yyyymmdd-1] as the partition expression. For more information about partition expressions, see Scheduling parameters.
      Note If your table does not contain any partition key columns, you can select NOTAPARTITIONTABLE. Select a partition expression based on the actual partition key values.

      After selecting a partition expression, click OK.

    3. Click Create rules at the top. The Create rules dialog box appears.
    4. On the Template rules tab, click Add Monitoring Rule, and set Field to All Fields in Table(table), Template to Number of rows, fixed value, Rule Type to Hard, Comparison Method to Greater Than, and Expected Value to 0.

      The data in the ods_raw_log_d table comes from the log files that are uploaded to OSS. The table is used as the source table. Therefore, you must check whether data exists in the partitions of the table as early as possible. If the partitions contain no data, prevent descendant nodes from running because no source data is available.

      Note Data Quality blocks nodes and sets the status of node instances to Failed only when an error alert is generated for a hard rule.
      Click Batch add.
      Note The preceding configuration is to make sure that partitions of the table contain data that can be used by descendant nodes.
    5. Click Test at the top of the page. In the Test dialog box that appears, set Data Timestamp and click Test.
      Data Quality tests the monitoring rule immediately after you click Test. After the test is complete, click the message "The test is complete. Click to view the results" to view the test result.
    6. Link the monitoring rule to the nodes that generate data in the table.
      Data Quality allows you to link a monitoring rule of a table to the nodes that generate data in the table. After you link the monitoring rule to the nodes, Data Quality checks the quality of the data generated by the nodes each time the nodes are run. You can link a monitoring rule to a node in one of the following ways:
      • Link the monitoring rule to the node in Operation Center

        Click the DataWorks icon in the upper-left corner and choose All Products > Operation Center.

        On the page that appears, choose Cycle Task Maintenance > Cycle Task in the left-side navigation pane. On the page that appears, right-click the oss_data synchronization node and select Configure Data Quality Rules.

        In the Configure Data Quality Rules dialog box that appears, set Table Name to ods_raw_log_d and Partition Expression to dt=$[yyyymmdd-1] and click Add.

      • Link the monitoring rule to the node in Data Quality

        On the Monitoring Rules page of the ods_raw_log_d table, click Manage Linked Nodes at the top.

        After you click Manage Linked Nodes, you can link the monitoring rule to the nodes that have been committed to the scheduling system. Data Quality lists recommended nodes based on the lineage. You can also link the monitoring rule to other nodes.

        In the Manage Linked Nodes dialog box, enter the node ID or name and click Create. The monitoring rule is linked to the node.

        After the monitoring rule is linked to the node, a check mark (✓) appears before the Manage Linked Nodes button.

    7. Configure subscriptions.
      Click Manage Subscriptions. In the Manage Subscriptions dialog box, set Notification Method and Recipient, and click Save and Close in sequence. Currently, Data Quality supports the following four notification methods: Email, Email and SMS, DingTalk Chatbot, and DingTalk Chatbot @ALL.
      After you configure subscriptions, you can go to the My Subscriptions page to view or modify the subscriptions.
      Note We recommend that you subscribe to all monitoring rules so that you can receive the monitoring results in a timely manner.
  2. Configure monitoring rules for the ods_user_info_d table.

    The ods_user_info_d table stores user information. You must configure monitoring rules to verify that the table contains the specified number of rows and that the primary key values in the table are unique to avoid duplicate data.

    1. Add the partition expression dt=$[yyyymmdd-1]. After adding the partition expression, you can view the partition expression in the Partition Expression section.
    2. Click Create rules at the top. The Create rules dialog box appears. On the Template rules tab, click Add Monitoring Rule, set Field to All Fields in Table(table), Template to Number of rows, fixed value, Rule Type to Hard, Comparison Method to Greater Than, and Expected Value to 0.
    3. Configure a rule to monitor the values in the primary key column uid. Click Add Monitoring Rule again, set Field to uid(string), Template to Repeated value, fixed value, Rule Type to Soft, Comparison Method to Less Than, and Expected Value to 1.
    4. Click Batch add.
    Note The preceding configuration helps prevent duplicate data from affecting descendant nodes.
  3. Configure a monitoring rule for the ods_log_info_d table.

    The ods_log_info_d table stores the data that is parsed from the ods_raw_log_d table. The log data in the ods_log_info_d table does not need to be monitored. You only need to configure a monitoring rule to verify that the table contains data.

    1. Add the partition expression dt=$[yyyymmdd-1].
    2. Configure a monitoring rule to verify that the table contains data: Click Create rules at the top. The Create rules dialog box appears. On the Template rules tab, click Add Monitoring Rule, set Rule Type to Hard, Field to All Fields in Table(table), Template to Number of rows, fixed value, Comparison Method to Unequal To, and Expected Value to 0.
    3. Click Batch add.
  4. Configure a monitoring rule for the dw_user_info_all_d table.

    The dw_user_info_all_d table aggregates data in the ods_user_info_d and ods_log_info_d tables. The workflow is simple, and a monitoring rule has been configured for the ods_log_info_d table to verify that the table contains data. Therefore, you do not need to configure a monitoring rule for the dw_user_info_all_d table. This saves computing resources.

  5. Configure monitoring rules for the rpt_user_info_d table.

    The rpt_user_info_d table stores the data aggregation results. You can configure rules to monitor the number of rows in the table for any changes and verify that the primary key values are unique.

    1. On the Monitoring Rules page of the rpt_user_info_d table, click + in the Partition Expression section. In the Add Partition dialog box that appears, select the partition expression dt=$[yyyymmdd-1].
    2. Configure a monitoring rule to verify that the primary key values are unique. Click Create rules at the top. The Create rules dialog box appears. On the Template rules tab, click Add Monitoring Rule, set Field to uid(string), Template to Repeated value, fixed value, Rule Type to Soft, Comparison Method to Less Than, and Expected Value to 1.
    3. Configure a rule to monitor the number of rows in the table for any changes. Click Add Monitoring Rule again, set Field to All Fields in Table(table), Template to Number of rows, 7-day volatility, Rule Type to Soft, Warning Threshold to 0.1%, and Error Threshold to 50%. Adjust the thresholds based on your business logic.
      Note
      • The values of Warning Threshold and Error Threshold must be greater than 0%.
      • The purpose of monitoring the number of rows is to monitor the fluctuations of daily unique visitors (UVs) so as to keep up with the traffic changes of the application in a timely manner.
    4. Click Batch add.
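
Every rule in the preceding steps is attached to the partition expression dt=$[yyyymmdd-1]. As a rough illustration of how such an expression maps to a concrete partition, the following Python sketch replaces the placeholder with the scheduled date minus the given number of days. The function name resolve_partition and the regular expression are hypothetical and written for this example only. The sketch handles only the $[yyyymmdd] and $[yyyymmdd-N] forms and is not how DataWorks itself resolves scheduling parameters.

```python
import re
from datetime import date, timedelta

# Matches placeholders such as "$[yyyymmdd]" or "$[yyyymmdd-1]".
# This pattern is an assumption that covers only the forms used in this topic.
_PLACEHOLDER = re.compile(r"\$\[yyyymmdd(?:-(\d+))?\]")


def resolve_partition(expression: str, scheduled: date) -> str:
    """Replace each $[yyyymmdd-N] placeholder with the scheduled date minus N days."""

    def _substitute(match):
        # A missing offset (plain "$[yyyymmdd]") is treated as zero days.
        offset_days = int(match.group(1) or 0)
        resolved = scheduled - timedelta(days=offset_days)
        return resolved.strftime("%Y%m%d")

    return _PLACEHOLDER.sub(_substitute, expression)


if __name__ == "__main__":
    # A node scheduled on 2024-03-15 checks the partition of the previous day.
    print(resolve_partition("dt=$[yyyymmdd-1]", date(2024, 3, 15)))  # dt=20240314
```

Under these assumptions, a node that runs on March 15 checks the partition that holds the data of March 14, which matches the description of the bizdate parameter for the ods_raw_log_d table.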

A hard rule is more likely to be configured for a table at the ODS layer in a data warehouse. This is because data at the ODS layer is used as source data in the data warehouse and must be accurate to prevent data at other layers from being affected.
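
The relationship between rule type, check result, and node blocking can be sketched in Python as follows. This is a simplified model and not the Data Quality implementation. All function names are hypothetical, a failed fixed-value check is assumed to produce an error alert, the repeated-value count is assumed to be the number of rows whose value already appeared, and the volatility check is assumed to compare the current row count with a single baseline count that the caller supplies.

```python
import operator

# Comparison methods named in this topic, mapped to Python operators.
COMPARISONS = {
    "Greater Than": operator.gt,
    "Less Than": operator.lt,
    "Unequal To": operator.ne,
}


def check_fixed_value(actual, method, expected):
    """Return "normal" if the comparison holds, else "error" (assumed mapping)."""
    return "normal" if COMPARISONS[method](actual, expected) else "error"


def count_repeated(values):
    """Count rows whose value already appeared earlier (assumed definition)."""
    return len(values) - len(set(values))


def check_volatility(current, baseline, warning_pct, error_pct):
    """Classify the absolute percentage change from a baseline row count."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change_pct = abs(current - baseline) / baseline * 100
    if change_pct > error_pct:
        return "error"
    if change_pct > warning_pct:
        return "warning"
    return "normal"


def blocks_descendants(rule_type, level):
    """Only an error alert on a hard rule blocks nodes, as noted in this topic."""
    return rule_type == "Hard" and level == "error"


if __name__ == "__main__":
    # ods_raw_log_d: hard rule, the row count must be greater than 0.
    level = check_fixed_value(0, "Greater Than", 0)
    print(level, blocks_descendants("Hard", level))  # error True

    # ods_user_info_d: soft rule, the repeated uid count must be less than 1.
    level = check_fixed_value(count_repeated(["u1", "u2", "u1"]), "Less Than", 1)
    print(level, blocks_descendants("Soft", level))  # error False

    # rpt_user_info_d: soft rule with thresholds of 0.1% and 50%.
    level = check_volatility(1200, 1000, 0.1, 50)
    print(level, blocks_descendants("Soft", level))  # warning False
```

In this model, an empty ods_raw_log_d partition stops descendant nodes, whereas duplicate uid values or a row count fluctuation only raise alerts because those rules are soft.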

Data Quality also provides the Task Query module, where you can view the monitoring results of configured rules. For more information, see View ODPS data source tasks.