This topic describes how to monitor data quality by configuring monitoring rules for tables and setting up alert notifications.

Prerequisites

The data must be collected and processed before you perform this experiment. For more information, see Collect data and Process data.

Background information

Data Quality is a one-stop platform that allows you to check the data quality of heterogeneous data stores, configure alert notifications, and manage connections. Data Quality monitors datasets and allows you to monitor MaxCompute tables and DataHub topics. When offline MaxCompute data changes, Data Quality checks the data and blocks the nodes that involve the data. This prevents downstream data from being affected by dirty data. In addition, Data Quality allows you to manage the history of check results so that you can analyze and evaluate data quality.

For streaming data, Data Quality uses DataHub to monitor data streams and sends alert notifications to subscribers if it detects stream discontinuity. You can also set the alert severity, such as warning or error, and the alert frequency to minimize repeated alerts.

Development process in Data Quality

  1. Configure a monitoring rule for an existing table and test the monitoring rule to check whether the monitoring rule takes effect on the table.
    Based on the test result, you can determine whether data that is generated in the table is as expected. We recommend that you test every monitoring rule configured for a table to verify that these monitoring rules are applicable.
  2. After the test is successful, link the tested monitoring rule to scheduled nodes.
    After you configure and test the monitoring rule for the table, you must link the monitoring rule with the nodes that generate data in the table. Then, Data Quality can use the monitoring rule to check the quality of the data generated by the nodes each time the nodes are run. This ensures data accuracy.
  3. Subscribe to the monitoring rules of important tables.
    After a monitoring rule is linked to a scheduled node, the rule is triggered each time the node is run. Data Quality allows you to subscribe to monitoring rules. After the subscriptions are configured, Data Quality generates alerts based on the monitoring results so that you can track these results. If the monitoring results are abnormal, Data Quality sends alert notifications to you based on the alert rules.
Note
  • Each time you configure a monitoring rule for a table, you must test the monitoring rule, link the monitoring rule to the scheduled nodes, and subscribe to the monitoring rule.
  • Data Quality may charge you additional computing fees. For more information, see Overview.

Configure monitoring rules for tables

After data collection and data processing are completed, verify that you have created the following tables: ods_raw_log_d, ods_user_info_d, ods_log_info_d, dw_user_info_all_d, and rpt_user_info_d. Then, perform the following operations:

  1. Go to the Monitoring Rules page of the ods_raw_log_d table.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. After you select the region where the required workspace resides, find the workspace and click Data Analytics.
    4. Click the icon in the upper-left corner and choose All Products > Data Governance > Data Quality.
    5. In the left-side navigation pane, click Monitoring Rules and select MaxCompute from the Engine/Data Source drop-down list.
    6. From the Engine/Database Instance drop-down list, select the engine instance that contains the required table. Then, find the table for which you want to configure a monitoring rule in the table list. In this example, the table is ods_raw_log_d.
    7. Find the ods_raw_log_d table and click View Monitoring Rules.
  2. Configure a monitoring rule for the ods_raw_log_d table.
    1. Click the plus sign (+) in the Partition Expression section to add a partition filter expression.
      The ods_raw_log_d table stores the log data that is synchronized from Object Storage Service (OSS) by using the oss_workshop_log connection. The partition key values in the table are generated from the ${bdp.system.bizdate} parameter, which specifies the date one day before the batch synchronization node is run.
      Log data of this kind is generated every day, so you can configure a partition filter expression for it. In the Add Partition dialog box, select dt=$[yyyymmdd-1] and click OK. For more information about partition filter expressions, see Scheduling parameters.
      Note If your table does not contain any partition key columns, you can select NOTAPARTITIONTABLE. Select a partition filter expression based on the actual partition key values.
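      For illustration, the following minimal Python sketch shows how the dt=$[yyyymmdd-1] expression resolves for a daily node. The resolution logic is a simplified assumption; the scheduling system performs the actual substitution.

```python
from datetime import date, timedelta

def resolve_partition(run_date: date) -> str:
    """Resolve dt=$[yyyymmdd-1] for a daily node: the partition matches the
    data timestamp, which is one day before the scheduled run date (the same
    value that the ${bdp.system.bizdate} parameter provides)."""
    bizdate = run_date - timedelta(days=1)
    return "dt=" + bizdate.strftime("%Y%m%d")

# If the node runs on 2023-07-15, the rule checks the dt=20230714 partition.
print(resolve_partition(date(2023, 7, 15)))  # -> dt=20230714
```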
    2. Click Create rules. The Template Rules tab appears.
    3. Click Add Monitoring Rule, and set the Template parameter to Number of rows, fixed value, the Rule Type parameter to Hard, the Comparison Method parameter to Greater Than, and the Expected Value parameter to 0.
      The data in the ods_raw_log_d table comes from the log files that are uploaded to OSS, and the table is used as the source table. Therefore, you must check whether the partitions of the table contain data as early as possible. If the partitions contain no data, the descendant nodes must be prevented from running, because they cannot generate valid results without source data.
      Note Data Quality only blocks nodes and sets the status of node instances to Failed when an error alert is generated for a hard rule.
      Then, click Batch Create.
      Note The preceding configuration is to ensure that partitions of the table contain data that can be used by descendant nodes.
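      For reference, the following sketch illustrates the semantics of this rule. Data Quality's internal implementation is not described in this topic; the SQL string and the evaluation logic below are simplified assumptions.

```python
# Simplified illustration of the "Number of rows, fixed value / Greater Than 0"
# hard rule. The partition value is an example resolved from dt=$[yyyymmdd-1].
CHECK_SQL = """
SELECT COUNT(*) AS cnt
FROM ods_raw_log_d
WHERE dt = '20230714';
"""

def evaluate_hard_rule(cnt: int, expected: int = 0) -> str:
    # A hard rule that fails with an error blocks the descendant nodes and
    # sets the status of the node instance to Failed.
    return "pass" if cnt > expected else "error: block descendant nodes"

print(evaluate_hard_rule(0))  # -> error: block descendant nodes
```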
    4. Click Test. In the Test dialog box, set the Data Timestamp parameter and click Test.
      Data Quality tests the configured monitoring rule after you click Test. After the test is complete, click the link in the message that appears to view the test results.
    5. Link the monitoring rule to the nodes.
      Data Quality allows you to link a monitoring rule of a table to the scheduled nodes. After you link the monitoring rule to the nodes, Data Quality checks the quality of the data generated by the nodes each time the nodes are run. You can link a monitoring rule to a node in one of the following ways:
      • Link the monitoring rule to the node in Operation Center

        Click the icon in the upper-left corner and choose All Products > Operation Center.

        In the left-side navigation pane, choose Cycle Task Maintenance > Cycle Task. In the directed acyclic graph (DAG), right-click the oss_Data synchronization node and select Configure Data Quality Rules.

        In the Configure Data Quality Rules dialog box, set the Table Name parameter to ods_raw_log_d, set the Partition Expression parameter to dt=$[yyyymmdd-1], and click Add.

      • Link the monitoring rule to the node in Data Quality

        On the Monitoring Rules page of the table, click Manage Linked Nodes to link the monitoring rule to the node.

        After you click Manage Linked Nodes, you can link the monitoring rule to the nodes that have been committed to the scheduling system. Data Quality lists recommended nodes based on the lineage. You can also link the monitoring rule to other nodes.

        In the Manage Linked Nodes dialog box, enter the node ID or name and click Create. Then, the monitoring rule is linked to the node.
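        Alternatively, you can link a monitoring rule to a node programmatically. The following sketch assumes the DataWorks OpenAPI operation CreateQualityRelativeNode and its Python SDK; the operation, package, and parameter names are assumptions based on the DataWorks API naming style, so verify them against the current API reference before use.

```python
# Hypothetical sketch: link a monitoring rule to a scheduled node through the
# DataWorks OpenAPI instead of the console. Operation, package, and parameter
# names are assumptions; verify them against the DataWorks API reference.
from alibabacloud_dataworks_public20200518.client import Client
from alibabacloud_dataworks_public20200518 import models
from alibabacloud_tea_openapi.models import Config

client = Client(Config(
    access_key_id="<your-access-key-id>",           # placeholder credentials
    access_key_secret="<your-access-key-secret>",
    endpoint="dataworks.cn-shanghai.aliyuncs.com",  # example regional endpoint
))

request = models.CreateQualityRelativeNodeRequest(
    project_name="<your-maxcompute-project>",  # engine instance name
    table_name="ods_raw_log_d",
    match_expression="dt=$[yyyymmdd-1]",       # the partition expression
    node_id=123456789,                         # ID of the scheduled node
    env_type="ODPS",                           # MaxCompute engine type
    project_id=10000,                          # DataWorks workspace ID
    target_node_project_id=10000,
    target_node_project_name="<your-workspace>",
)
client.create_quality_relative_node(request)
```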

    6. Configure subscriptions.

      On the Monitoring Rules page of the table, click Manage Subscriptions to specify the notification method and recipient. Data Quality supports four notification methods: Email, Email and SMS, DingTalk Chatbot, and DingTalk Chatbot @ALL.

      After you configure subscriptions, click My Subscriptions in the left-side navigation pane to view or modify the subscriptions.
      Note We recommend that you subscribe to all monitoring rules so that you can receive the monitoring results at the earliest opportunity.
  3. Configure monitoring rules for the ods_user_info_d table.
    The ods_user_info_d table stores user information. You must configure monitoring rules to verify that the table contains data and that the primary key values in the table are unique, so that duplicate data is avoided.
    1. Configure a monitoring rule for a partition field by adding the partition filter expression dt=$[yyyymmdd-1]. After the partition filter expression is added, you can view it in the Partition Expression section.
    2. Then, click Create rules to configure a monitoring rule in Data Quality.
      Add a monitoring rule for all fields in a table and a monitoring rule for the values in the primary key column:
      • Set the Field parameter to All Fields in Table(table).

        Set the Template parameter to Number of rows, fixed value, the Rule Type parameter to Hard, the Comparison Method parameter to Greater Than, and the Expected Value parameter to 0.

      • Set the Field parameter to uid(string).

        Configure a rule to monitor the values in the primary key column uid. Set the Template parameter to Repeated value, fixed value, the Rule Type parameter to Soft, the Comparison Method parameter to Less Than, and the Expected Value parameter to 1.

    3. Then, click Batch Create.
    Note The preceding configuration is to avoid duplicate data, which prevents downstream data from being affected by dirty data.
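    The duplicate-value rule on the uid column can be illustrated as follows. The sketch assumes that the Repeated value, fixed value template counts duplicate values in the checked partition; it is a simplified illustration, not the actual implementation of Data Quality.

```python
# Simplified illustration of the "Repeated value, fixed value / Less Than 1"
# soft rule on the primary key column uid. The partition value is an example
# resolved from dt=$[yyyymmdd-1].
DUP_CHECK_SQL = """
SELECT COUNT(uid) - COUNT(DISTINCT uid) AS duplicate_cnt
FROM ods_user_info_d
WHERE dt = '20230714';
"""

def evaluate_soft_rule(duplicate_cnt: int, expected: int = 1) -> str:
    # A soft rule raises an alert when it fails but does not block nodes.
    return "pass" if duplicate_cnt < expected else "alert: duplicate uid values"

print(evaluate_soft_rule(0))  # -> pass
print(evaluate_soft_rule(3))  # -> alert: duplicate uid values
```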
  4. Configure a monitoring rule for the ods_log_info_d table.
    The ods_log_info_d table stores the data that is parsed from the ods_raw_log_d table. The uniqueness of the parsed log data cannot be determined, so field-level monitoring is not needed. You need to configure only a monitoring rule that verifies that the table contains data.
    1. Add the partition filter expression dt=$[yyyymmdd-1].
    2. Click Create rules and click Add Monitoring Rule in the panel that appears.

      Configure a monitoring rule to verify that the table contains data: Set the Field parameter to All Fields in Table(table), the Template parameter to Number of rows, fixed value, the Rule Type parameter to Hard, the Comparison Method parameter to Unequal To, and the Expected Value parameter to 0.

    3. Then, click Batch Create.
  5. Configure a monitoring rule for the dw_user_info_all_d table.
    The dw_user_info_all_d table aggregates data in the ods_user_info_d and ods_log_info_d tables. The processing logic is simple, and monitoring rules that verify that the upstream tables contain data are already configured. Therefore, you do not need to configure a monitoring rule for the dw_user_info_all_d table. This saves computing resources.
  6. Configure monitoring rules for the rpt_user_info_d table.
    The rpt_user_info_d table stores the data aggregation results. You can configure rules to monitor the number of rows in the table for any changes and verify that the primary key values are unique.
    1. Click the plus sign (+) in the Partition Expression section. Select the partition filter expression dt=$[yyyymmdd-1].
    2. Click Create rules. In the panel that appears, click Add Monitoring Rule to configure a monitoring rule for the primary key values. Set the Field parameter to uid(string), the Template parameter to Repeated value, fixed value, the Rule Type parameter to Soft, the Comparison Method parameter to Less Than, and the Expected Value parameter to 1.
    3. Configure a rule to monitor the number of rows in the table for any changes: Set the Template parameter to Number of rows, 7-day volatility, the Rule Type parameter to Soft, the Warning Threshold parameter to 1%, and the Error Threshold parameter to 50%. Adjust the thresholds based on your business logic.
      Note
      • The values of the Warning Threshold and Error Threshold parameters must be greater than 0%.
      • The purpose of monitoring the number of rows is to track the fluctuations of daily unique visitors (UVs) so that you can keep up with the traffic changes of the application at the earliest opportunity.
    4. Then, click Batch Create.
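    To see how the volatility thresholds behave, consider the following sketch. The exact volatility formula that Data Quality uses is not described in this topic; the sketch assumes a common definition that compares the current row count against the average of the previous seven days.

```python
# Illustrative evaluation of a "Number of rows, 7-day volatility" soft rule
# with a 1% warning threshold and a 50% error threshold. The formula below is
# an assumption for illustration purposes.
def volatility_status(today: int, last_7_days: list[int],
                      warn: float = 0.01, error: float = 0.50) -> str:
    baseline = sum(last_7_days) / len(last_7_days)   # 7-day average row count
    volatility = abs(today - baseline) / baseline    # relative fluctuation
    if volatility >= error:
        return f"error ({volatility:.1%})"
    if volatility >= warn:
        return f"warning ({volatility:.1%})"
    return f"pass ({volatility:.1%})"

# Example: stable daily UV counts for the past week, then a sharp drop today.
print(volatility_status(480, [1000, 980, 1020, 990, 1010, 1005, 995]))
# -> error (52.0%) because the drop exceeds the 50% error threshold
```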

Hard rules are typically configured for tables at the operational data store (ODS) layer of a data warehouse. Data at the ODS layer serves as the source data of the data warehouse and must be accurate to prevent data at other layers from being affected.

Data Quality also provides the Node Query module, where you can view the monitoring results of configured rules. For more information, see View monitoring results.