Monitoring rules are the most important part of Data Quality. You can configure rules to monitor data in E-MapReduce (EMR), Hologres, AnalyticDB for PostgreSQL, MaxCompute, and DataHub. This topic describes how to configure monitoring rules for DataHub.

Background information

DataHub real-time monitoring supports the following features:
  • Templates for monitoring stream discontinuity
  • Stream processing features, such as custom Flink SQL, dimension table JOIN, multi-stream JOIN, and window functions

Procedure

  1. Add a DataHub data source.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides. Find the workspace and click Data Integration in the Actions column.
    4. In the left-side navigation pane of the page that appears, click Connection. The Data Source page appears.
    5. Click Add data source in the upper-right corner. In the Add data source dialog box, set the parameters as required to add a DataHub data source. For more information, see Add a DataHub data source.
  2. Select the DataHub data source.
    1. Click the More icon in the upper-left corner and choose All Products > Data governance > Data Quality.
    2. In the left-side navigation pane, choose Rule management > Configure by Table.
    3. On the page that appears, select DataHub from the Engine/Data Source drop-down list and select the DataHub data source that you added in the previous step from the Engine/Database Instance drop-down list. All the topics in the data source are displayed.
      GUI element | Description
      Configure Flink Resources | After you add a data source, click Configure Flink Resources to configure the Realtime Compute for Apache Flink or Log Service resources related to the data source.
      Topics | The Topics tab lists all the topics in the DataHub data source. You can perform the following operations in the Actions column for a topic:
      • Click View Monitoring Rules to create rules for the topic. You can create template rules and custom rules.
      • Click Manage Subscriptions to view and modify subscribers to the topic, and change the notification method. You can use a DingTalk chatbot to receive notifications. The changed notification method takes effect for all subscribers to the topic.
      Dimension Tables | When you create custom rules for a topic, you can create and join dimension tables. If the collected data streams lack some of the fields that you need, you can use a dimension table to supplement the fields before data analysis. To do so, you must declare the dimension table in Data Quality.

      Dimension tables of ApsaraDB for HBase, Lindorm, ApsaraDB RDS, Tablestore, Taobao Distributed Data Layer (TDDL), and MaxCompute are supported.

      Flink SQL does not define a dedicated DDL syntax for dimension tables. You can use the standard CREATE TABLE statement. However, you must add PERIOD FOR SYSTEM_TIME to specify the period of the dimension table and declare that the dimension table stores time-varying data.
      Note When you declare a dimension table, you must specify the primary key. When you join a dimension table with another table, the ON condition must contain an equality condition for each primary key column of the dimension table.
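As a rough sketch of the declaration described above, a dimension table named test_dim (the name used in the join example later in this topic) might be declared as follows. The column list, the rds storage type, and every connection value are placeholders chosen for illustration and are not taken from this topic. The exact WITH options depend on the storage service, so check them against the Realtime Compute for Apache Flink documentation before use.

```sql
-- Illustrative sketch only: the columns and WITH options are assumed placeholders.
CREATE TABLE test_dim (
  id   BIGINT,
  name VARCHAR,
  PRIMARY KEY (id),        -- a dimension table must declare its primary key
  PERIOD FOR SYSTEM_TIME   -- declares that the table stores time-varying data
) WITH (
  type      = 'rds',            -- assumed storage type (ApsaraDB RDS)
  url       = '<jdbc-url>',     -- placeholder
  tableName = '<table-name>',   -- placeholder
  userName  = '<user-name>',    -- placeholder
  password  = '<password>'      -- placeholder
);
```

With a declaration like this, a join whose ON condition is e.id = w.id covers the single primary key column id, which is what the note above requires.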
    4. Click the Topics tab. Find the topic for which you want to configure monitoring rules and click View Monitoring Rules in the Actions column.
  3. On the rule configuration page of the topic, click Create Rule.
  4. Create a monitoring rule.
    In Data Quality, you can create template rules and custom rules.
    • On the Template Rules tab of the Create Rule panel, click Create Template Rule. The following template types are available: Data Delay and Stream Discontinuity.
      For example, you can select Data Delay for the Template Type parameter.
      Parameter | Description
      Rule Name | The name of the rule. The name can be a maximum of 255 characters in length.
      Field Type | The fields to be monitored. By default, this parameter is set to All Fields in Table.
      Template Type | The type of the template. Valid values:
      • Data Delay: monitors the interval between the time when data is generated and the time when data is written to DataHub based on the data timestamp field. If the interval exceeds a specified threshold, an alert is generated.
        Note The data timestamp field supports two data types: TIMESTAMP and STRING in the yyyy-MM-dd HH:mm:ss format.
      • Stream Discontinuity: monitors the period during which no data is written to DataHub. If the period exceeds a specified threshold, an alert is generated.

        Before you configure a stream discontinuity rule, you must activate Realtime Compute for Apache Flink and create a project. On the Monitoring Rules page, click Configure Flink Resources in the upper-right corner. In the dialog box that appears, specify the Realtime Compute for Apache Flink project and click OK.

      Alerts Threshold | The maximum number of alerts generated for data latency. Data Quality reports an alert when the number of alerts generated for data latency exceeds this threshold. This parameter is displayed only when you select Data Delay for the Template Type parameter.
      Data Timestamp Field | The data timestamp field of the topic for which the rule is created. This field supports two data types: TIMESTAMP and STRING in the yyyy-MM-dd HH:mm:ss format. This parameter is displayed only when you select Data Delay for the Template Type parameter.
      Alert Frequency | The interval at which alerts are reported. Valid values: 10 minutes, 30 minutes, 1 hour, and 2 hours.
      Warning Threshold | The warning threshold, in seconds. The value must be an integer and less than the error threshold.
      Error Threshold | The error threshold, in seconds. The value must be an integer and greater than the warning threshold.
    • If template rules do not meet your requirements for monitoring the data quality of DataHub topics, you can create a custom rule. On the Custom Rules tab of the Create Rule panel, click Create Custom Rule.
      Note
      • The field in the SELECT clause must be a column. Make sure that you can compare the field values with the warning threshold and error threshold.
      • The FROM clause must include the current topic and all columns of the topic.
      Parameter | Description
      Rule Name | The name of the rule. The name must be unique in the topic and can be a maximum of 20 characters in length.
      Script | The custom SQL script that defines the rule. The SELECT clause must return a single field. Examples:
      • Use a simple SQL statement.
        select id as a from zmr_test02;
      • Join the topic and a dimension table named test_dim.
        select e.id as eid
        from zmr_test02 as e
        join test_dim for system_time as of proctime() as w
        on e.id = w.id;
      • Join the topic and another topic named dp1test_zmr01.
        select count(newtab.biz_date) as aa
        from (
          select o.*
          from zmr_test02 as o
          join dp1test_zmr01 as p
          on o.id = p.id
        ) newtab
        group by id, biz_date, biz_date_str, total_price, `timestamp`;
      Warning Threshold | The warning threshold, in minutes. The value must be an integer and less than the error threshold.
      Error Threshold | The error threshold, in minutes. The value must be an integer and greater than the warning threshold.
      Minimum Alert Interval | The minimum interval at which alerts are reported, in minutes.
      Description | The description of the rule.
  5. In the Manage Subscriptions dialog box, specify the notification method and notification recipient.
    Data Quality supports the following notification methods: Email, Email and SMS, DingTalk Chatbot, DingTalk Chatbot @ALL, Lark Group Chatbot, and Enterprise WeChat Chatbot.
    Note Add a DingTalk chatbot, Lark chatbot, or WeCom chatbot and obtain a webhook URL. Then, copy the webhook URL to the Recipient field in the Manage Subscriptions dialog box.
  6. Click Save. After rules are created for the topic, you can perform the following operations:
    • Click View Log to view the operation logs of the rules.
    • Click Manage Subscriptions to view and modify subscribers to the rules, and change the notification method. The changed notification method takes effect for all the subscribers to the rules.