Data Quality allows you to configure monitoring rules for data in E-MapReduce (EMR), Hologres, AnalyticDB for PostgreSQL, Cloudera's Distribution including Apache Hadoop (CDH), and MaxCompute data sources. This topic describes how to configure a monitoring rule to monitor data in a MaxCompute data source.

Prerequisites

Before you configure monitoring rules for EMR, Hologres, AnalyticDB for PostgreSQL, and CDH data sources, you must collect metadata from the data sources. For more information about how to collect metadata, see Collect metadata from an EMR data source.

Limits

  • Data standard-generated rules are not supported.
  • After you configure monitoring rules for an EMR, Hologres, AnalyticDB for PostgreSQL, or CDH data source, data checks based on those rules are triggered as expected only if the scheduling node that generates the table data runs on the exclusive resource group for scheduling that is connected to that data source.

Go to the Rule Configuration-Configure by Table page

  1. Go to the Data Quality page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides. On the Workspaces page, find your workspace and click DataStudio in the Actions column.
    4. In the upper-left corner of the DataStudio page, click the icon and choose All Products > Data Governance > Data Quality.
  2. In the left-side navigation pane, choose Rules > Configure by Table. The Rule Configuration-Configure by Table page appears.
    Data Quality allows you to configure template rules and custom rules.
    Important: In Data Quality, you must configure monitoring rules based on partition filter expressions. To configure a partition filter expression for a table, click the Add icon on the left side of the rule configuration page.

    To configure a monitoring rule for a non-partitioned table, you can specify NOTAPARTITIONTABLE as the partition filter expression. To configure a monitoring rule for a partitioned table, you can specify a data timestamp expression, such as $[yyyymmdd], as the partition filter expression. For more information, see Configure a partition filter expression.
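
The two kinds of partition filter expression described above can be illustrated with a minimal Python sketch. The helper name resolve_partition_expression and the reference_date argument are hypothetical, and the sketch assumes that $[yyyymmdd] expands to a reference date formatted as eight digits, with an optional -N day offset. DataWorks performs the real expansion, and the date it uses is described in Configure a partition filter expression.

```python
import re
from datetime import date, timedelta

# Hypothetical pattern for $[yyyymmdd] and $[yyyymmdd-N] placeholders.
_PLACEHOLDER = re.compile(r"\$\[yyyymmdd(?:-(\d+))?\]")


def resolve_partition_expression(expression: str, reference_date: date) -> str:
    """Expand date placeholders in a partition filter expression (illustrative only)."""
    if expression == "NOTAPARTITIONTABLE":
        # Non-partitioned tables use this literal keyword, so nothing is expanded.
        return expression

    def expand(match):
        # Assumption: the optional number after "-" is an offset in days.
        offset_days = int(match.group(1) or 0)
        return (reference_date - timedelta(days=offset_days)).strftime("%Y%m%d")

    return _PLACEHOLDER.sub(expand, expression)


print(resolve_partition_expression("ds=$[yyyymmdd]", date(2024, 3, 1)))    # ds=20240301
print(resolve_partition_expression("pt=$[yyyymmdd-1]", date(2024, 3, 1)))  # pt=20240229
```

The sketch only shows how a placeholder relates to a concrete partition value. It does not reproduce every expression format that Data Quality accepts.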

Create a template rule

  1. Find the table for which you want to configure monitoring rules and click View Monitoring Rules in the Actions column. On the page that appears, you can configure monitoring rules for this table.
  2. Click Create Rule. The Template Rules tab is displayed by default in the Create Rule panel.
    To create a template rule, you can click Add Monitoring Rule or Quick Create.
    • Add Monitoring Rule
      Click Add Monitoring Rule. The following table describes the parameters that are displayed if you set the Rule Source parameter to Built-in Template.
      Parameter: Description
      Rule Name: The name of the monitoring rule.
      Rule Type: The strength of the monitoring rule. Valid values: Strong and Soft.
      • Strong: If the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If the warning threshold is exceeded, warning alerts are reported but descendant nodes are not blocked.
      • Soft: If the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If the warning threshold is exceeded, warning alerts are not reported and descendant nodes are not blocked.
      Auto-Generated Threshold: Specifies whether to use a dynamic threshold. Configure this parameter based on your business requirements. If you set Auto-Generated Threshold to Yes, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the appropriate thresholds based on intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
      Important: You can use the dynamic threshold feature only in DataWorks Enterprise Edition or a more advanced edition.
      Rule Source: The source for the monitoring rule. Valid values: Built-in Template and Rule Templates.
      Important: You can select Rule Templates only in DataWorks Enterprise Edition or a more advanced edition.
      Field: The fields to be monitored. You can select all fields in a table or a specific field of a numeric type or non-numeric type.
      Template: Data Quality provides built-in monitoring rule templates at the table or field level. You can select only the monitoring rule templates that are displayed. For more information, see Built-in monitoring rule templates.
      Note: You can configure field-level monitoring rules of the following types only for numeric fields: average value, sum of values, minimum value, and maximum value.
      Comparison Method: The comparison method for the monitoring rule. Valid values: Absolute Value, Raise, and Drop.
      Thresholds: Calculate the fluctuation.
      You can calculate the fluctuation by using the following formula: Fluctuation = (Sample value - Baseline)/Baseline.
      • Sample value

        The sample value for the current day. For example, if you want to check the fluctuation in the number of table rows on an SQL node within a day, the sample value is the number of table rows in partitions on that day.

      • Baseline
        The comparison value collected from the previous N days. Examples:
        • If you want to check the fluctuation in the number of table rows on an SQL node within a day, the baseline is the number of table rows in partitions on the previous day.
        • If you want to check the average fluctuation in the number of table rows on an SQL node within seven days, the baseline is the average number of table rows in the last seven days.
      You can specify the warning threshold and critical threshold of the fluctuation to monitor data and identify issues of different severities:
      • If the absolute value of the fluctuation does not exceed the warning threshold, the data is considered to be normal.
      • If the absolute value of the fluctuation exceeds the warning threshold and does not exceed the critical threshold, a warning alert is reported.
      • If the absolute value of the fluctuation exceeds the critical threshold, a critical alert is reported.
      Start-Stop Status: Turn the switch on or off to enable or disable the monitoring rule. This setting controls whether the monitoring rule is applied in the production environment.
      Important: If you disable the monitoring rule, you cannot test the monitoring rule, and the monitoring rule cannot be triggered by auto triggered nodes that are associated with the rule.
      Retain problem data: If you turn on the switch and a data quality check based on the monitoring rule fails, the system automatically creates a table to store the problematic data that is identified during the data quality check. For more information, see t2304863.html#task_2304863.
      Important
      • The Retain problem data parameter is supported only for MaxCompute tables.
      • The Retain problem data parameter is supported only for specific monitoring rules in Data Quality. For information about the monitoring rules that support the Retain problem data parameter, see t2304863.html#section_ufu_wkj_g9j.
      • If you set this parameter to Off, problematic data is not stored.
      Description: The description of the monitoring rule.
    • Quick Create
      Click Quick Create. Configure the parameters as required. The following table describes the parameters.
      Parameter: Description
      Rule Name: The name of the monitoring rule.
      Field: The fields to be monitored. You can select all fields in a table or a specific field of a numeric type or non-numeric type.
      Trigger: The trigger condition of the monitoring rule. Valid values: The number of columns is greater than 0 and Table row number dynamic threshold.
      Important: You can select Table row number dynamic threshold only in DataWorks Enterprise Edition or a more advanced edition.
  3. Click Batch Create.
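
The fluctuation formula, the warning and critical thresholds, and the Strong and Soft rule types described for the Add Monitoring Rule parameters can be sketched in Python as follows. The function names are hypothetical. Thresholds are written as fractions (0.1 for 10%), and a zero baseline is rejected; both are assumptions of this sketch, not documented behavior of Data Quality.

```python
def fluctuation(sample_value: float, baseline: float) -> float:
    """Fluctuation = (Sample value - Baseline) / Baseline."""
    if baseline == 0:
        # Assumption: a zero baseline is rejected; the text does not cover this case.
        raise ValueError("baseline must be non-zero")
    return (sample_value - baseline) / baseline


def alert_level(fluct: float, warning: float, critical: float) -> str:
    """Classify the absolute value of the fluctuation against both thresholds."""
    magnitude = abs(fluct)
    if magnitude > critical:
        return "critical"
    if magnitude > warning:
        return "warning"
    return "normal"


def rule_outcome(rule_type: str, level: str) -> dict:
    """Map a rule type (Strong or Soft) and an alert level to the described behavior."""
    if rule_type not in ("Strong", "Soft"):
        raise ValueError("rule_type must be 'Strong' or 'Soft'")
    if level == "critical":
        # Critical alerts are always reported; only Strong rules block descendants.
        return {"alert_reported": True, "descendants_blocked": rule_type == "Strong"}
    if level == "warning":
        # Warning alerts are reported only for Strong rules and never block.
        return {"alert_reported": rule_type == "Strong", "descendants_blocked": False}
    return {"alert_reported": False, "descendants_blocked": False}


# Example: 1,200 rows today compared with 1,000 rows on the previous day.
level = alert_level(fluctuation(1200, 1000), warning=0.1, critical=0.5)
print(level)                          # warning
print(rule_outcome("Strong", level))  # {'alert_reported': True, 'descendants_blocked': False}
print(rule_outcome("Soft", level))    # {'alert_reported': False, 'descendants_blocked': False}
```

In this example, the fluctuation is 20%, which exceeds the 10% warning threshold but not the 50% critical threshold, so a Strong rule reports a warning alert without blocking descendant nodes.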

Create a custom rule

If template rules do not meet your business requirements for monitoring data quality based on a partition filter expression, you can create custom rules.

  1. Find the table for which you want to configure monitoring rules and click View Monitoring Rules in the Actions column. On the page that appears, you can configure monitoring rules for this table.
  2. Click Create Rule. The Template Rules tab is displayed by default in the Create Rule panel.
  3. Click the Custom Rules tab.
    To create a custom rule, you can click Add Monitoring Rule or Quick Create.
    • Add Monitoring Rule
      You can select all fields in a table, SQL statements, or a specific field from the Field drop-down list.
      • All Fields in Table or a specific field
        The parameters that you can configure vary based on the option that you select for Field. If you select all fields in a table or a specific field for Field, configure the parameters that are described in the following table.
        Parameter: Description
        Rule Name: The name of the monitoring rule.
        Rule Type: The strength of the monitoring rule. Valid values: Strong and Soft.
        • Strong: If the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If the warning threshold is exceeded, warning alerts are reported but descendant nodes are not blocked.
        • Soft: If the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If the warning threshold is exceeded, warning alerts are not reported and descendant nodes are not blocked.
        Field: The fields to be monitored. In this example, select All Fields in Table. If you select All Fields in Table, you can use the WHERE clause to customize filter conditions based on your business requirements.
        Sampling Method: The sampling method for the monitoring rule. Valid values: count and count/table_count.
        Note: The value of count/table_count is the ratio of the number of table rows that you obtain based on filter conditions to the total number of table rows in the current partition.
        Filter: The filter conditions. For example, if you want to query the partitions of the table based on a specific data timestamp, you can specify pt=$[yyyymmdd-1] as a filter condition.
        Check type: The threshold type for the monitoring rule. Valid values: Numeric type, Fluctuation, and Auto-Generated Threshold.
        Note: You can select Auto-Generated Threshold only in DataWorks Enterprise Edition or a more advanced edition.
        Comparison Method: The comparison method for the monitoring rule. The comparison methods that can be selected vary based on the threshold type.
        • If you set the Check type parameter to Numeric type, the valid values of the Comparison Method parameter are Greater Than, Greater Than or Equal To, Equal To, Unequal To, Less Than, and Less Than or Equal To.
        • If you set the Check type parameter to Fluctuation, the valid values of the Comparison Method parameter are Absolute Value, Raise, and Drop.
        Verification Method: The verification method for the monitoring rule. The verification methods that can be selected vary based on the threshold type.
        • If you set the Check type parameter to Numeric type, you can set the Verification Method parameter only to Compare with a specified value.
        • If you set the Check type parameter to Fluctuation, the valid values of the Verification Method parameter are Compare the current value with the average value of the last 7 days, Compare the current value with the average value of the last 30 days, Compare the current value with the value 1 day before, Compare the current value with the value 7 days before, Compare the current value with the value 30 days before, The variance between the current value and the value 7 days before, The variance between the current value and the value 30 days before, Compare with the value 1, 7, and 30 days before, and Compare with the value of the previous cycle.
        Expected Value: The expected value for the monitoring rule. If you set the Check type parameter to Numeric type, you must specify an expected value.
        Thresholds: The warning threshold and critical threshold of the fluctuation. If you set the Check type parameter to Fluctuation, you must specify a warning threshold and a critical threshold for the fluctuation. You can enter a threshold or adjust the slider to specify a threshold.
        Start-Stop Status: Turn the switch on or off to enable or disable the monitoring rule. This setting controls whether the monitoring rule is applied in the production environment.
        Description: The description of the custom rule.
      • SQL Statement
        Parameter: Description
        Rule Name: The name of the monitoring rule.
        Rule Type: The strength of the monitoring rule. Valid values: Strong and Soft.
        • Strong: If the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If the warning threshold is exceeded, warning alerts are reported but descendant nodes are not blocked.
        • Soft: If the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If the warning threshold is exceeded, warning alerts are not reported and descendant nodes are not blocked.
        Field: The fields to be monitored. If you select SQL Statement, you can customize the SQL logic for the monitoring rule. The statement must return a single value, that is, one row of one column.
        Sampling Method: The sampling method for the monitoring rule. You can set this parameter only to SQL Statement.
        Set Flag: The SET clause of the SQL statement to be used.
        Custom SQL: The custom SQL statement to be used. The statement must return a single value, that is, one row of one column.

        In the custom SQL statement, enclose the partition filter expression in brackets []. Sample custom SQL statement:

        select count(*) from table_name where ds=$[yyyymmdd];
        Note
        • In this statement, table_name indicates the name of the table for which you configure a monitoring rule. Replace table_name with the actual table name based on your business requirements.
        • For more information about how to configure a partition filter expression, see Configure a partition filter expression.
        • The monitoring rule uses the partition filter expression in the custom SQL statement, not the partition filter expression that you configured in the previous step.
        Check type: The threshold type for the monitoring rule. Valid values: Numeric type and Fluctuation.
        Comparison Method: The comparison method for the monitoring rule. The comparison methods that can be selected vary based on the threshold type.
        • If you set the Check type parameter to Numeric type, the valid values of the Comparison Method parameter are Greater Than, Greater Than or Equal To, Equal To, Unequal To, Less Than, and Less Than or Equal To.
        • If you set the Check type parameter to Fluctuation, the valid values of the Comparison Method parameter are Absolute Value, Raise, and Drop.
        Verification Method: The verification method for the monitoring rule. The verification methods that can be selected vary based on the threshold type.
        • If you set the Check type parameter to Numeric type, you can set the Verification Method parameter only to Compare with a specified value.
        • If you set the Check type parameter to Fluctuation, the valid values of the Verification Method parameter are Compare the current value with the average value of the last 7 days, Compare the current value with the average value of the last 30 days, Compare the current value with the value 1 day before, Compare the current value with the value 7 days before, Compare the current value with the value 30 days before, The variance between the current value and the value 7 days before, The variance between the current value and the value 30 days before, Compare with the value 1, 7, and 30 days before, and Compare with the value of the previous cycle.
        Expected Value: The expected value for the monitoring rule. If you set the Check type parameter to Numeric type, you must specify an expected value.
        Thresholds: The warning threshold and critical threshold of the fluctuation. If you set the Check type parameter to Fluctuation, you must specify a warning threshold and a critical threshold for the fluctuation. You can enter a threshold or adjust the slider to specify a threshold.
        Description: The description of the custom rule.
    • Quick Create
      Parameter: Description
      Rule Name: The name of the monitoring rule.
      Trigger: The trigger condition for the monitoring rule. You can select only Values Duplicated in Multiple Fields.
      Field: The fields to be monitored.
  4. Click Batch Create.
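
The count and count/table_count sampling methods and the Numeric type comparison methods used by custom rules can be sketched in Python as follows. This is an illustration under stated assumptions: the rows of the current partition are held in memory as a list of dictionaries, the filter condition is modeled as a Python predicate instead of a WHERE clause, and the names sample and numeric_check are hypothetical. The sketch only reports whether a comparison holds; how Data Quality turns that result into a passed or failed check is not covered here.

```python
import operator

# Comparison methods that are available when Check type is Numeric type.
COMPARISONS = {
    "Greater Than": operator.gt,
    "Greater Than or Equal To": operator.ge,
    "Equal To": operator.eq,
    "Unequal To": operator.ne,
    "Less Than": operator.lt,
    "Less Than or Equal To": operator.le,
}


def sample(rows, predicate, method):
    """Return the sample value for the count or count/table_count sampling method."""
    matched = sum(1 for row in rows if predicate(row))
    if method == "count":
        return matched
    if method == "count/table_count":
        if not rows:
            # Assumption: an empty partition is rejected to avoid dividing by zero.
            raise ValueError("the partition contains no rows")
        # Ratio of filtered rows to all rows in the current partition.
        return matched / len(rows)
    raise ValueError(f"unknown sampling method: {method}")


def numeric_check(sample_value, comparison_method, expected_value):
    """Return whether the sample value satisfies the comparison with the expected value."""
    return COMPARISONS[comparison_method](sample_value, expected_value)


# Hypothetical partition with one row that has a negative amount.
partition_rows = [{"amount": 10}, {"amount": -5}, {"amount": 0}, {"amount": 7}]


def is_negative(row):
    return row["amount"] < 0


print(sample(partition_rows, is_negative, "count"))              # 1
print(sample(partition_rows, is_negative, "count/table_count"))  # 0.25
print(numeric_check(sample(partition_rows, is_negative, "count"), "Equal To", 0))  # False
```

A rule with the count sampling method, the Equal To comparison method, and an expected value of 0 corresponds to the last line of the example: one row matches the filter, so the comparison with 0 does not hold.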

What to do next

If you want to prevent data that does not meet the requirements of a monitoring rule from blocking the associated auto triggered node on a specific data timestamp, you can configure a noise reduction rule for the monitoring rule. For more information, see Manage noise reduction rules.