Data Quality allows you to retain unexpected data that is detected from tables whose data quality is monitored based on rules. This helps you quickly locate and troubleshoot issues and improve data quality. This topic describes how to manage unexpected data that is detected from a data table during data quality monitoring.
Background information
Tables that store problem data
If a node fails the data quality check based on specific monitoring rules in Data Quality, the system automatically creates a table to store problem data that is detected during the check.
Name format of the table that stores problem data:
{Name of the monitored data table}_dirtydata_dw_system_dqc.Workspace to which the table that stores problem data belongs: determined based on the environment where the current table resides. If the name of the monitored data table is
dev_project.table1, the table that stores problem data is also stored in thedev_projectdirectory.Owner of the table that stores problem data: the owner of the monitored data table.
Retention period of problem data: Problem data is retained for 15 days by default. This indicates that the lifecycle of the table that stores problem data is 15 days. The problem data will be deleted if the lifecycle-based reclaim policy of a MaxCompute table is met. For more information about the lifecycle of a MaxCompute table, see Lifecycle.
The following table describes the partition structure of the table that stores problem data.
Partition level
Partition field name
Description
Level-1 partition
dqc_task_run_dt
The time when the problem data is collected, which is the time when the related monitoring rule is run. The time is in the yyyymmdd format.
Level-2 partition
dqc_task_id
The ID of an instance generated for the monitoring rule that is relevant to the collected problem data.
NoteAn instance is generated when a monitoring rule is executed.
Monitoring rules that support the problem data retention feature and specifications for problem data
For information about the monitoring rules that support the problem data retention feature and the definitions of problem data that can be identified based on different types of monitoring rules, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.
Precautions
You can configure the problem data retention feature only for MaxCompute tables.
Only specific types of monitoring rules support the problem data retention feature. For more information, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.
DataWorks generates data collection SQL statements to collect problem data based on your monitoring rules. The SQL statements are executed on the MaxCompute compute engine side. Computing fees of the MaxCompute compute engine will be generated during this process.
The problem data collection results are stored in a MaxCompute temporary table. Storage fees of the MaxCompute compute engine will be generated during this process.
Entry point for configuring the problem data retention feature
You can enable the problem data retention feature only for a specific field-level monitoring rule on the Template Rules tab of the Create Rule panel of a table. After you enable the problem data retention feature for a field-level monitoring rule on the Template Rules tab, the system retains abnormal data in the automatically generated problem data table if a node fails the check of the rule. For information about how to configure monitoring rules for a single table, see Configure monitoring rules for a single table.
Problem data is retained only for specific types of monitoring rules. For information about the list of monitoring rules that support the problem data retention feature, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.
View problem data
On the Running Records page in Data Quality, find the running record of a quality check exception or a quality check failure and click Details in the Actions column. On the Monitor Running Details page, click the Problem Data Processing tab.
If you do not have the permissions to preview the problem data or query the table that stores the problem data by running commands, go to Security Center to apply for the query permissions on the table. For more information, see Manage permissions on MaxCompute.
View problem data collection logs
You can use one of the following methods to view problem data collection logs. The logs contain the SQL statements for collecting problem data and the execution process. If an error occurs during problem data collection, the logs help you quickly troubleshoot issues.
Method 1: View problem data collection logs in Data Quality
On the Running Records page in Data Quality, you can find the desired table and click Details in the Actions column. On the page that appears, select the desired monitoring rule and click the Raw Log tab to view the problem data collection process. For more information, see View monitoring results.
Method 2: View problem data collection logs based on the list of scheduling nodes
After you associate a monitoring rule with a scheduling node, you can go to the Running Details page of the scheduling node. On the DQC tab of the Execution step, find the monitoring rule and click View details in the Actions column to view the problem data collection process. For more information, see Intelligent diagnosis.
Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data
Built-in templates
Check type
Check rule
Rule description
Problem data specifications
Unique value
Unique value, fixed value
Data Quality compares the number of unique values of a field after deduplication with a fixed value.
If the check is not passed:
If all values of the field are unique, the system generates a problem data table but stores no problem data in the table.
If some values of the field are not unique, the system generates a problem data table and stores the duplicate data of the field in the table. In this case, the duplicate data is the problem data.
Number of unique values/Total number of rows
Data Quality compares the ratio of the number of unique values of a field to the total number of rows with a fixed value.
Null value
Number of null values, fixed value
Data Quality compares the number of null values of a field with a fixed value.
NoteThe IS NULL expression is used in the SQL statements to check whether a value of a field is a null value.
The system generates a problem data table and stores null values of the field in the table. In this case, all null values are problem data.
Number of null values/Total number of rows, fixed value
Data Quality compares the ratio of the number of null values of a field to the total number of rows with a fixed value.
NoteThe fixed value is a decimal.
Duplicate value
Number of duplicate values/Total number of rows, fixed value
Data Quality compares the ratio of the number of duplicate values of a field to the total number of rows with a fixed value.
The system generates a problem data table and stores duplicate values of the field in the table. In this case, all duplicate values are problem data.
Number of duplicate values, fixed value
Data Quality subtracts the number of values of a field after deduplication from the total number of rows to obtain the number of duplicate values of the field. Then, Data Quality compares the number of duplicate values with a fixed value.
Custom SQL statement
After you turn on Retain problem data, problem data is saved after the monitoring rule that is created based on the custom SQL statement is executed and an alert is generated.