The Monitoring Rules page is the most important part of Data Quality, where you can configure rules to monitor data in E-MapReduce, MaxCompute, and Datahub. This topic describes how to configure monitoring rules for Datahub.

Background information

Datahub monitoring supports the following features:
  • Templates for monitoring stream discontinuity and data latency
  • Stream processing features, such as custom Flink SQL, dimension table JOIN, multi-stream JOIN, and window functions

Procedure

  1. Add a Datahub connection.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Select the region of the workspace, and click Data Integration.
    4. In the left-side navigation pane, click Connection.
    5. Click Add a Connection in the upper-right corner to add a Datahub connection. For more information, see Configure a Datahub connection.
  2. Select the Datahub connection.
    1. On the current page, click the menu icon in the upper-left corner and choose All Products > Data Quality.
    2. On the page that appears, click Monitoring Rules in the left-side navigation pane.
    3. Set Engine/Data Source to Datahub and select the Datahub connection. All the topics in the selected Datahub data store appear.
      Parameter Description
      Configure Flink Resources After you add a connection, click Configure Flink Resources to configure Flink and Log Service resources related to the connection.
      Topics The Topics tab lists all topics of the Datahub data store. You can click the following actions in the Actions column for a topic:
      • View Monitoring Rules: Click it to create rules for the topic. You can create template rules and custom rules as needed.
      • Manage Subscriptions: Click it to view and modify subscribers to the current topic, and change the notification method. You can configure the DingTalk chatbot notification method. The changed notification method takes effect for all subscribers to the topic.
      Dimension Tables When you create custom rules for a topic, you can declare dimension tables and join them by using the JOIN clause. If the collected data streams lack fields that are required for analysis, declare a dimension table in Data Quality and join it with the data stream to supplement those fields before you analyze the data.

      Datahub supports the dimension tables of ApsaraDB for HBase, Lindorm, ApsaraDB for RDS, Table Store, Taobao Distributed Data Layer (TDDL), and MaxCompute.

      Flink SQL does not define a dedicated data definition language (DDL) syntax for dimension tables. Use the standard CREATE TABLE statement, and add PERIOD FOR SYSTEM_TIME to the column list to declare that the table is a dimension table that stores time-varying data.
      Note When you declare a dimension table, you must specify the primary key. When you join a dimension table with another table, the ON condition must contain an equivalence condition that includes the primary key of either table.
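
      The declaration described above can be sketched as follows. This is a minimal illustration, not a definitive configuration: the column names and types and every value in the WITH clause are placeholders assumed for the example, and the connector parameters depend on the data store that backs the dimension table. Only the table name test_dim, the PRIMARY KEY requirement, and the PERIOD FOR SYSTEM_TIME clause come from this topic.

      ```sql
      -- Hypothetical dimension table, assumed here to be backed by ApsaraDB for RDS.
      -- Columns and WITH parameters are placeholders; replace them with your own.
      CREATE TABLE test_dim (
        id BIGINT,
        name VARCHAR,
        PRIMARY KEY (id),         -- a dimension table must declare its primary key
        PERIOD FOR SYSTEM_TIME    -- declares that the table stores time-varying data
      ) WITH (
        type = 'rds',
        url = '<jdbc-url>',
        tableName = '<table-name>',
        userName = '<user-name>',
        password = '<password>'
      );
      ```

      A custom rule can then join the topic with test_dim on an equality condition that includes the primary key, for example on e.id = w.id.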
    4. Click the Topics tab. Find the target topic and click View Monitoring Rules in the Actions column.
  3. On the Monitoring Rules page of the topic, click Create Rule.
  4. Create a monitoring rule.
    In Data Quality, you can create template rules and custom rules as needed.
    • Click Create Template Rule. Two templates are available: Data Delay and Stream Discontinuity.
      For example, you can select Data Delay for Template Type.
      Parameter Description
      Rule Name The name of the rule. The name can be up to 255 characters in length.
      Field Type The fields to be monitored. By default, this parameter is set to All Fields in Table.
      Template Type
      • Data Delay: monitors the interval between the time when data is generated and the time when data is written to Datahub based on the data timestamp field. If the interval exceeds a specified threshold, an alert is generated.
        Note
        • Before you configure a stream discontinuity rule, you need to activate Alibaba Cloud Realtime Compute in Flink and create a project.
        • The data timestamp field supports two data types: TIMESTAMP and STRING (yyyy-MM-dd HH:mm:ss).
      • Stream Discontinuity: monitors the period during which no data is written to Datahub. If the period exceeds a specified threshold, an alert is generated.
      Alerts Threshold The maximum number of alerts generated for data latency. Data Quality reports an alert when the number of alerts generated for data latency exceeds this threshold. This parameter only takes effect when you select Data Delay for Template Type.
      Data Timestamp Field The data timestamp field of the topic for which the rule is created. This field supports two data types: TIMESTAMP and STRING (yyyy-MM-dd HH:mm:ss). This parameter only takes effect when you select Data Delay for Template Type.
      Alert Frequency The interval for reporting an alert. You can set the alert interval to 10 minutes, 30 minutes, 1 hour, or 2 hours.
      Warning Threshold The warning threshold, in seconds. The value must be an integer and less than the error threshold.
      Error Threshold The error threshold, in seconds. The value must be an integer and greater than the warning threshold.
    • If template rules do not meet your requirements for monitoring the data quality of Datahub topics, you can click Create Custom Rule to create a rule as required.
      Note
      • The field in the SELECT clause must be a column. Ensure that you can compare the field values with the warning and error thresholds.
      • The FROM clause must include the current topic and all its columns.
      Parameter Description
      Rule Name The name of the rule. The name must be unique in the topic and can be up to 20 characters in length.
      Script The custom SQL script, which is used to set a rule. The return value of the SELECT clause must be unique. Example:
      • Use a simple SQL statement.
        select id as a from zmr_test02;
      • Join the topic and a dimension table named test_dim.
        select e.id as eid
        from zmr_test02 as e 
        join test_dim for system_time as of proctime() as w 
        on e.id=w.id
      • Join the topic and another topic named dp1test_zmr01.
        select count(newtab.biz_date) as aa
        from (
          select o.*
          from zmr_test02 as o
          join dp1test_zmr01 as p
          on o.id = p.id
        ) newtab
        group by id, biz_date, biz_date_str, total_price, `timestamp`
      Warning Threshold The warning threshold, in minutes. The value must be an integer and less than the error threshold.
      Error Threshold The error threshold, in minutes. The value must be an integer and greater than the warning threshold.
      Minimum Alert Interval The minimum interval for reporting an alert, in minutes.
      Description The description of the custom rule.
  5. Click Batch Create to add the created rules to the topic. After the rules are added, you can perform the following operations on a rule:
    • View Log: Click it to view the operational logs of rules.
    • Manage Subscriptions: Click it to view and modify subscribers to the current rule, and change the notification method. The changed notification method takes effect for all subscribers to the rule.
      Data Quality supports the following four notification methods: Email, Email and SMS, DingTalk Chatbot, and DingTalk Chatbot @ALL.
      Note To use a DingTalk chatbot notification method, add a DingTalk chatbot and obtain its webhook URL. Then, copy the webhook URL to the Manage Subscriptions dialog box.