Data Quality allows you to configure monitoring rules for data in E-MapReduce (EMR), Hologres, AnalyticDB for PostgreSQL, Cloudera's Distribution including Apache Hadoop (CDH), and MaxCompute data sources. This topic describes how to configure a monitoring rule to monitor data in a MaxCompute data source.
Prerequisites
Before you configure monitoring rules for EMR, Hologres, AnalyticDB for PostgreSQL, and CDH data sources, you must collect metadata from the data sources. For more information about how to collect metadata, see Collect metadata from an EMR data source.
Limits
- Data standard-generated rules are not supported.
- After you configure monitoring rules for an EMR, Hologres, AnalyticDB for PostgreSQL, or CDH data source, rule-based data checks are triggered properly only if the scheduling node that generates the table data runs on the exclusive resource group for scheduling that is connected to that data source.
Go to the Rule Configuration-Configure by Table page
- Go to the Data Quality page.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- In the top navigation bar, select the region where your workspace resides. On the Workspaces page, find your workspace and click DataStudio in the Actions column.
- In the upper-left corner of the DataStudio page, click the icon and choose .
- In the left-side navigation pane, choose . The Rule Configuration-Configure by Table page appears. Data Quality allows you to configure template rules and custom rules.
Important: In Data Quality, you must configure monitoring rules based on partition filter expressions. You can click the icon on the left side of the rule configuration page to configure a partition filter expression for a table.
To configure a monitoring rule for a non-partitioned table, you can specify NOTAPARTITIONTABLE as the partition filter expression. To configure a monitoring rule for a partitioned table, you can specify a data timestamp expression, such as $[yyyymmdd], as the partition filter expression. For more information, see Configure a partition filter expression.
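The following Python sketch illustrates how a partition filter expression such as $[yyyymmdd] or $[yyyymmdd-1] could resolve to a concrete partition value. It is an illustration only, not the Data Quality implementation: the function name expand_partition_expression and the base date argument are hypothetical, the sketch handles only the yyyymmdd format with an optional day offset, and this topic does not state which date the service uses as the base.

```python
import re
from datetime import date, timedelta

# Matches $[yyyymmdd], $[yyyymmdd-N], and $[yyyymmdd+N] only (assumed subset).
_EXPRESSION = re.compile(r"\$\[yyyymmdd(?:([+-])(\d+))?\]")


def expand_partition_expression(expression: str, base: date) -> str:
    """Replace each $[yyyymmdd±N] placeholder with a date derived from `base`.

    `base` is a hypothetical stand-in for the date that the service resolves
    the expression against.
    """
    # Non-partitioned tables use a fixed marker that is passed through as-is.
    if expression == "NOTAPARTITIONTABLE":
        return expression

    def _replace(match: re.Match) -> str:
        sign, days = match.group(1), match.group(2)
        offset = int(days) if days else 0
        if sign == "-":
            offset = -offset
        return (base + timedelta(days=offset)).strftime("%Y%m%d")

    return _EXPRESSION.sub(_replace, expression)


if __name__ == "__main__":
    print(expand_partition_expression("ds=$[yyyymmdd]", date(2024, 3, 1)))
    print(expand_partition_expression("pt=$[yyyymmdd-1]", date(2024, 3, 1)))
```

With a base date of March 1, 2024, the first call yields ds=20240301 and the second yields pt=20240229.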
Create a template rule
- Find the table for which you want to configure monitoring rules and click View Monitoring Rules in the Actions column. On the page that appears, you can configure monitoring rules for this table.
- Click Create Rule. The Template Rules tab is displayed by default in the Create Rule panel. To create a template rule, you can click Add Monitoring Rule or Quick Create.
- Add Monitoring Rule
Click Add Monitoring Rule. The following parameters are displayed if you set the Rule Source parameter to Built-in Template.
Rule Name: The name of the monitoring rule.
Rule Type: The strength of the monitoring rule. Valid values: Strong and Soft.
- Strong: If the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If the warning threshold is exceeded, warning alerts are reported but descendant nodes are not blocked.
- Soft: If the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If the warning threshold is exceeded, warning alerts are not reported and descendant nodes are not blocked.
Auto-Generated Threshold: Specifies whether to use a dynamic threshold. Configure this parameter based on your business requirements. If you set Auto-Generated Threshold to Yes, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the appropriate thresholds based on intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocking at the earliest opportunity.
Important: You can use the dynamic threshold feature only in DataWorks Enterprise Edition or a more advanced edition.
Rule Source: The source of the monitoring rule. Valid values: Built-in Template and Rule Templates.
- For more information about monitoring rules configured based on built-in templates, see Built-in monitoring rule templates.
- If you select Rule Templates, you must specify a rule template. For more information, see Create, manage, and use rule templates.
Important: You can select Rule Templates only in DataWorks Enterprise Edition or a more advanced edition.
Field: The fields to be monitored. You can select all fields in a table or a specific field of a numeric type or non-numeric type.
Template: Data Quality provides built-in monitoring rule templates at the table or field level. You can select only the monitoring rule templates that are displayed. For more information, see Built-in monitoring rule templates.
Note: You can configure field-level monitoring rules of the following types only for numeric fields: average value, sum of values, minimum value, and maximum value.
Comparison Method: The comparison method for the monitoring rule. Valid values: Absolute Value, Raise, and Drop.
Thresholds: The warning threshold and critical threshold of the fluctuation. The fluctuation is calculated by using the following formula: Fluctuation = (Sample value - Baseline)/Baseline
- Sample value: The sample value for the current day. For example, if you want to check the fluctuation in the number of table rows on an SQL node within a day, the sample value is the number of table rows in partitions on that day.
- Baseline: The comparison value collected from the previous N days. Examples:
- If you want to check the fluctuation in the number of table rows on an SQL node within a day, the baseline is the number of table rows in partitions on the previous day.
- If you want to check the average fluctuation in the number of table rows on an SQL node within seven days, the baseline is the average number of table rows in the last seven days.
You can specify the warning threshold and critical threshold of the fluctuation to monitor data and identify issues of different severities:
- If the absolute value of the fluctuation does not exceed the warning threshold, the data is considered to be normal.
- If the absolute value of the fluctuation exceeds the warning threshold and does not exceed the critical threshold, a warning alert is reported.
- If the absolute value of the fluctuation exceeds the critical threshold, a critical alert is reported.
Start-Stop Status: Turn the switch on or off to enable or disable the monitoring rule. This setting controls whether the monitoring rule is applied in the production environment.
Important: If you disable the monitoring rule, you cannot test it, and auto triggered nodes that are associated with the rule cannot trigger it.
Retain problem data: If you turn on the switch and a data quality check based on the monitoring rule fails, the system automatically creates a table to store the problematic data that is identified during the check. For more information, see t2304863.html#task_2304863.
Important:
- The Retain problem data parameter is supported only for MaxCompute tables.
- The Retain problem data parameter is supported only for specific monitoring rules in Data Quality. For information about the monitoring rules that support the Retain problem data parameter, see t2304863.html#section_ufu_wkj_g9j.
- If you set this parameter to Off, problematic data is not stored.
Description: The description of the monitoring rule.
- Quick Create
Click Quick Create. Configure the parameters as required. The following parameters are available.
Rule Name: The name of the monitoring rule.
Field: The fields to be monitored. You can select all fields in a table or a specific field of a numeric type or non-numeric type.
Trigger: The trigger condition of the monitoring rule. Valid values: The number of columns is greater than 0 and Table row number dynamic threshold.
Important: You can select Table row number dynamic threshold only in DataWorks Enterprise Edition or a more advanced edition.
- Click Batch Create.
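The fluctuation formula, the two thresholds, and the Strong and Soft rule types described in this section can be combined into a small Python sketch. The function names, the representation of thresholds as fractions (0.1 for 10%), and the handling of a zero baseline are assumptions made for illustration. The sketch models only the absolute-value comparison described above, not the Raise or Drop comparison methods, and it is not the service's implementation.

```python
def fluctuation(sample_value: float, baseline: float) -> float:
    """Fluctuation = (Sample value - Baseline) / Baseline."""
    if baseline == 0:
        # Assumption: the behavior for a zero baseline is not described in this topic.
        raise ValueError("baseline must be non-zero")
    return (sample_value - baseline) / baseline


def evaluate_rule(sample_value, baseline, warning_threshold, critical_threshold,
                  rule_type="Strong"):
    """Return (level, alert_reported, descendants_blocked) for one check."""
    change = abs(fluctuation(sample_value, baseline))
    if change > critical_threshold:
        level = "critical"
    elif change > warning_threshold:
        level = "warning"
    else:
        level = "normal"

    # Critical alerts are reported for both rule types.
    # Warning alerts are reported only for Strong rules.
    alert_reported = level == "critical" or (level == "warning" and rule_type == "Strong")
    # Descendant nodes are blocked only when a Strong rule exceeds the critical threshold.
    descendants_blocked = level == "critical" and rule_type == "Strong"
    return level, alert_reported, descendants_blocked


if __name__ == "__main__":
    # Yesterday's partition had 1000 rows and today's has 1200: fluctuation is 0.2.
    print(evaluate_rule(1200, 1000, warning_threshold=0.1, critical_threshold=0.5))
    # Row count dropped to 400: fluctuation is -0.6, which exceeds the critical threshold.
    print(evaluate_rule(400, 1000, warning_threshold=0.1, critical_threshold=0.5))
```

In this sketch, the same 20% change produces a reported warning under a Strong rule and no alert under a Soft rule, which mirrors the Rule Type description.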
Create a custom rule
If template rules do not meet your business requirements for monitoring the data quality based on a partition filter expression, you can create custom rules based on your business requirements.
- Find the table for which you want to configure monitoring rules and click View Monitoring Rules in the Actions column. On the page that appears, you can configure monitoring rules for this table.
- Click Create Rule. The Template Rules tab is displayed by default in the Create Rule panel.
- Click the Custom Rules tab. To create a custom rule, you can click Add Monitoring Rule or Quick Create.
- Add Monitoring Rule
You can select all fields in a table, SQL statements, or a specific field from the Field drop-down list.
- The parameters that you can configure vary based on the option you select for Field. If you select all fields in a table or a specific field for Field, you need to configure the parameters that are described in the following table.
Rule Name: The name of the monitoring rule.
Rule Type: The strength of the monitoring rule. Valid values: Strong and Soft.
- Strong: If the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If the warning threshold is exceeded, warning alerts are reported but descendant nodes are not blocked.
- Soft: If the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If the warning threshold is exceeded, warning alerts are not reported and descendant nodes are not blocked.
Field: The fields to be monitored. In this example, select All Fields in Table. If you select All Fields in Table, you can use the WHERE clause to customize filter conditions based on your business requirements.
Sampling Method: The sampling method for the monitoring rule. Valid values: count and count/table_count.
Note: The value of count/table_count is the ratio of the number of table rows that match the filter conditions to the total number of table rows in the current partition.
Filter: The filter conditions. For example, if you want to query the partitions of the table based on a specific data timestamp, you can specify pt=$[yyyymmdd-1] as a filter condition.
Check type: The threshold type for the monitoring rule. Valid values: Numeric type, Fluctuation, and Auto-Generated Threshold.
Note: You can select Auto-Generated Threshold only in DataWorks Enterprise Edition or a more advanced edition.
Comparison Method: The comparison method for the monitoring rule. The comparison methods that can be selected vary based on the threshold type.
- If you set the Check type parameter to Numeric type, the valid values of the Comparison Method parameter are Greater Than, Greater Than or Equal To, Equal To, Unequal To, Less Than, and Less Than or Equal To.
- If you set the Check type parameter to Fluctuation, the valid values of the Comparison Method parameter are Absolute Value, Raise, and Drop.
Verification Method: The verification method for the monitoring rule. The verification methods that can be selected vary based on the threshold type.
- If you set the Check type parameter to Numeric type, you can set the Verification Method parameter only to Compare with a specified value.
- If you set the Check type parameter to Fluctuation, the valid values of the Verification Method parameter are Compare the current value with the average value of the last 7 days, Compare the current value with the average value of the last 30 days, Compare the current value with the value 1 day before, Compare the current value with the value 7 days before, Compare the current value with the value 30 days before, The variance between the current value and the value 7 days before, The variance between the current value and the value 30 days before, Compare with the value 1, 7, and 30 days before, and Compare with the value of the previous cycle.
Expected Value: The expected value for the monitoring rule. If you set the Check type parameter to Numeric type, you must specify an expected value.
Thresholds: The warning threshold and critical threshold of the fluctuation. If you set the Check type parameter to Fluctuation, you must specify a warning threshold and a critical threshold for the fluctuation. You can enter a threshold or adjust the slider to specify a threshold.
Start-Stop Status: Turn the switch on or off to enable or disable the monitoring rule. This setting controls whether the monitoring rule is applied in the production environment.
Description: The description of the custom rule.
- SQL Statement
Rule Name: The name of the monitoring rule.
Rule Type: The strength of the monitoring rule. Valid values: Strong and Soft.
- Strong: If the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If the warning threshold is exceeded, warning alerts are reported but descendant nodes are not blocked.
- Soft: If the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If the warning threshold is exceeded, warning alerts are not reported and descendant nodes are not blocked.
Field: The fields to be monitored. If you select SQL Statement, you can customize the SQL logic for the monitoring rule. The statement must return a single value, that is, one row of one column.
Sampling Method: The sampling method for the monitoring rule. You can set this parameter only to SQL Statement.
Set Flag: The SET clause of the SQL statement to be used.
Custom SQL: The custom SQL statement to be used. You can specify only a custom SQL statement that returns a single value (one row of one column). In the custom SQL statement, enclose the partition filter expression in brackets []. Sample custom SQL statement:
select count(*) from table_name where ds=$[yyyymmdd];
Note:
- In this statement, table_name indicates the name of the table for which you configure a monitoring rule. Replace table_name with the actual table name.
- For more information about how to configure a partition filter expression, see Configure a partition filter expression.
- The monitoring rule uses the partition filter expression in the custom SQL statement, not the partition filter expression that you configured in the previous step.
Check type: The threshold type for the monitoring rule. Valid values: Numeric type and Fluctuation.
Comparison Method: The comparison method for the monitoring rule. The comparison methods that can be selected vary based on the threshold type.
- If you set the Check type parameter to Numeric type, the valid values of the Comparison Method parameter are Greater Than, Greater Than or Equal To, Equal To, Unequal To, Less Than, and Less Than or Equal To.
- If you set the Check type parameter to Fluctuation, the valid values of the Comparison Method parameter are Absolute Value, Raise, and Drop.
Verification Method: The verification method for the monitoring rule. The verification methods that can be selected vary based on the threshold type.
- If you set the Check type parameter to Numeric type, you can set the Verification Method parameter only to Compare with a specified value.
- If you set the Check type parameter to Fluctuation, the valid values of the Verification Method parameter are Compare the current value with the average value of the last 7 days, Compare the current value with the average value of the last 30 days, Compare the current value with the value 1 day before, Compare the current value with the value 7 days before, Compare the current value with the value 30 days before, The variance between the current value and the value 7 days before, The variance between the current value and the value 30 days before, Compare with the value 1, 7, and 30 days before, and Compare with the value of the previous cycle.
Expected Value: The expected value for the monitoring rule. If you set the Check type parameter to Numeric type, you must specify an expected value.
Thresholds: The warning threshold and critical threshold of the fluctuation. If you set the Check type parameter to Fluctuation, you must specify a warning threshold and a critical threshold for the fluctuation. You can enter a threshold or adjust the slider to specify a threshold.
Description: The description of the custom rule.
- Quick Create
Rule Name: The name of the monitoring rule.
Trigger: The trigger condition for the monitoring rule. You can select only Values Duplicated in Multiple Fields.
Field: The fields to be monitored.
- Click Batch Create.
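For a custom rule whose Check type is Numeric type, the sampled value is compared with the expected value by using the selected comparison method. The Python sketch below illustrates the two sampling methods (count and count/table_count) and the six comparison methods listed above. The function names are hypothetical, and the row counts are passed in directly rather than queried from a table. This topic does not state how the service maps the comparison result to a passed or failed check, so the sketch returns only whether the comparison holds.

```python
import operator

# Comparison Method values for the Numeric type check, mapped to Python operators.
COMPARISONS = {
    "Greater Than": operator.gt,
    "Greater Than or Equal To": operator.ge,
    "Equal To": operator.eq,
    "Unequal To": operator.ne,
    "Less Than": operator.lt,
    "Less Than or Equal To": operator.le,
}


def sample(filtered_rows: int, total_rows: int, method: str) -> float:
    """Compute the sampled value for a custom rule.

    filtered_rows: rows in the current partition that match the Filter condition.
    total_rows: all rows in the current partition.
    """
    if method == "count":
        return filtered_rows
    if method == "count/table_count":
        if total_rows == 0:
            # Assumption: the behavior for an empty partition is not described in this topic.
            raise ValueError("total_rows must be non-zero")
        return filtered_rows / total_rows
    raise ValueError(f"unknown sampling method: {method}")


def comparison_holds(sampled: float, comparison: str, expected: float) -> bool:
    """Return True if `sampled <comparison> expected` holds."""
    return COMPARISONS[comparison](sampled, expected)


if __name__ == "__main__":
    ratio = sample(25, 100, "count/table_count")
    print(ratio, comparison_holds(ratio, "Less Than or Equal To", 0.3))
```

In the usage example, 25 of 100 rows match the filter, so the sampled ratio is 0.25, and the comparison Less Than or Equal To 0.3 holds.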