Manage problem data - DataWorks - Alibaba Cloud Documentation Center

Data Quality allows you to retain unexpected data that is detected from tables whose data quality is monitored based on rules. This helps you quickly locate and troubleshoot issues and improve data quality. This topic describes how to manage unexpected data that is detected from a data table during data quality monitoring.

Background information

Tables that store problem data

If a node fails the data quality check based on specific monitoring rules in Data Quality, the system automatically creates a table to store problem data that is detected during the check.

Name format of the table that stores problem data: {Name of the monitored data table}_dirtydata_dw_system_dqc.
Workspace to which the table that stores problem data belongs: determined based on the environment where the current table resides. If the name of the monitored data table is dev_project.table1, the table that stores problem data is also stored in the dev_project directory.
Owner of the table that stores problem data: the owner of the monitored data table.
Retention period of problem data: Problem data is retained for 15 days by default. This indicates that the lifecycle of the table that stores problem data is 15 days. The problem data will be deleted if the lifecycle-based reclaim policy of a MaxCompute table is met. For more information about the lifecycle of a MaxCompute table, see Lifecycle.

The following table describes the partition structure of the table that stores problem data.

Partition level	Partition field name	Description
Level-1 partition	dqc_task_run_dt	The time when the problem data is collected, which is the time when the related monitoring rule is run. The time is in the yyyymmdd format.
Level-2 partition	dqc_task_id	The ID of an instance generated for the monitoring rule that is relevant to the collected problem data. Note An instance is generated when a monitoring rule is executed.

Monitoring rules that support the problem data retention feature and specifications for problem data
For information about the monitoring rules that support the problem data retention feature and the definitions of problem data that can be identified based on different types of monitoring rules, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.

Precautions

You can configure the problem data retention feature only for MaxCompute tables.
Only specific types of monitoring rules support the problem data retention feature. For more information, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.
DataWorks generates data collection SQL statements to collect problem data based on your monitoring rules. The SQL statements are executed on the MaxCompute compute engine side. Computing fees of the MaxCompute compute engine will be generated during this process.
The problem data collection results are stored in a MaxCompute temporary table. Storage fees of the MaxCompute compute engine will be generated during this process.

Entry point for configuring the problem data retention feature

You can enable the problem data retention feature only for a specific field-level monitoring rule on the Template Rules tab of the Create Rule panel of a table. After you enable the problem data retention feature for a field-level monitoring rule on the Template Rules tab, the system retains abnormal data in the automatically generated problem data table if a node fails the check of the rule. For information about how to configure monitoring rules for a single table, see Configure monitoring rules for a single table.

Note

Problem data is retained only for specific types of monitoring rules. For information about the list of monitoring rules that support the problem data retention feature, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.

View problem data

On the Running Records page in Data Quality, find the running record of a quality check exception or a quality check failure and click Details in the Actions column. On the Monitor Running Details page, click the Problem Data Processing tab.

Note

If you do not have the permissions to preview the problem data or query the table that stores the problem data by running commands, go to Security Center to apply for the query permissions on the table. For more information, see Manage permissions on MaxCompute.

View problem data collection logs

You can use one of the following methods to view problem data collection logs. The logs contain the SQL statements for collecting problem data and the execution process. If an error occurs during problem data collection, the logs help you quickly troubleshoot issues.

Method 1: View problem data collection logs in Data Quality
On the Running Records page in Data Quality, you can find the desired table and click Details in the Actions column. On the page that appears, select the desired monitoring rule and click the Raw Log tab to view the problem data collection process. For more information, see View monitoring results.
Method 2: View problem data collection logs based on the list of scheduling nodes
After you associate a monitoring rule with a scheduling node, you can go to the Running Details page of the scheduling node. On the DQC tab of the Execution step, find the monitoring rule and click View details in the Actions column to view the problem data collection process. For more information, see Intelligent diagnosis.

Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data

Built-in templates

Check type	Check rule	Rule description	Problem data specifications
Unique value	Unique value, fixed value	Data Quality compares the number of unique values of a field after deduplication with a fixed value.	If the check is not passed: If all values of the field are unique, the system generates a problem data table but stores no problem data in the table. If some values of the field are not unique, the system generates a problem data table and stores the duplicate data of the field in the table. In this case, the duplicate data is the problem data.
Unique value	Number of unique values/Total number of rows	Data Quality compares the ratio of the number of unique values of a field to the total number of rows with a fixed value.
Null value	Number of null values, fixed value	Data Quality compares the number of null values of a field with a fixed value. Note The IS NULL expression is used in the SQL statements to check whether a value of a field is a null value.	The system generates a problem data table and stores null values of the field in the table. In this case, all null values are problem data.
Null value	Number of null values/Total number of rows, fixed value	Data Quality compares the ratio of the number of null values of a field to the total number of rows with a fixed value. Note The fixed value is a decimal.
Duplicate value	Number of duplicate values/Total number of rows, fixed value	Data Quality compares the ratio of the number of duplicate values of a field to the total number of rows with a fixed value.	The system generates a problem data table and stores duplicate values of the field in the table. In this case, all duplicate values are problem data.
Duplicate value	Number of duplicate values, fixed value	Data Quality subtracts the number of values of a field after deduplication from the total number of rows to obtain the number of duplicate values of the field. Then, Data Quality compares the number of duplicate values with a fixed value.

Custom SQL statement
After you turn on Retain problem data, problem data is saved after the monitoring rule that is created based on the custom SQL statement is executed and an alert is generated.