All Products
Search
Document Center

DataWorks:Manage problem data

Last Updated:Oct 27, 2023

Data Quality allows you to retain unexpected data that is detected from tables whose data quality is monitored based on rules. This helps you quickly locate and troubleshoot issues and improve data quality. This topic describes how to manage unexpected data that is detected from a data table during data quality monitoring.

Background information

  • Tables that store problem data

    If a node fails the data quality check based on specific monitoring rules in Data Quality, the system automatically creates a table to store problem data that is detected during the check.

    • Name format of the table that stores problem data: {Name of the monitored data table}_dirtydata_dw_system_dqc.

    • Workspace to which the table that stores problem data belongs: determined based on the environment where the current table resides. If the name of the monitored data table is dev_project.table1, the table that stores problem data is also stored in the dev_project directory.

    • Owner of the table that stores problem data: the owner of the monitored data table.

    • Retention period of problem data: Problem data is retained for 15 days by default. This indicates that the lifecycle of the table that stores problem data is 15 days. The problem data will be deleted if the lifecycle-based reclaim policy of a MaxCompute table is met. For more information about the lifecycle of a MaxCompute table, see Lifecycle.

    • The following table describes the partition structure of the table that stores problem data.

      Partition level

      Partition field name

      Description

      Level-1 partition

      dqc_task_run_dt

      The time when the problem data is collected, which is the time when the related monitoring rule is run. The time is in the yyyymmdd format.

      Level-2 partition

      dqc_task_id

      The ID of an instance generated for the monitoring rule that is relevant to the collected problem data.

      Note

      An instance is generated when a monitoring rule is executed.

  • Monitoring rules that support the problem data retention feature and specifications for problem data

    For information about the monitoring rules that support the problem data retention feature and the definitions of problem data that can be identified based on different types of monitoring rules, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.

Precautions

  • You can configure the problem data retention feature only for MaxCompute tables.

  • Only specific types of monitoring rules support the problem data retention feature. For more information, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.

  • DataWorks generates data collection SQL statements to collect problem data based on your monitoring rules. The SQL statements are executed on the MaxCompute compute engine side. Computing fees of the MaxCompute compute engine will be generated during this process.

  • The problem data collection results are stored in a MaxCompute temporary table. Storage fees of the MaxCompute compute engine will be generated during this process.

Entry point for configuring the problem data retention feature

You can enable the problem data retention feature only for a specific field-level monitoring rule on the Template Rules tab of the Create Rule panel of a table. After you enable the problem data retention feature for a field-level monitoring rule on the Template Rules tab, the system retains abnormal data in the automatically generated problem data table if a node fails the check of the rule. For information about how to configure monitoring rules for a single table, see Configure monitoring rules for a single table.

Note

Problem data is retained only for specific types of monitoring rules. For information about the list of monitoring rules that support the problem data retention feature, see Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data.

View problem data

On the Node Query page, find the desired table and click Details in the Actions column. On the node details page, click Problem Data. On the page that appears, you can view the monitoring rules that are relevant to the problem data and the details of the problem data table, and preview the problem data.

Note

If you do not have the permissions to preview the problem data or query the table that stores the problem data by running commands, go to Security Center to apply for the query permissions on the table. For more information, see Manage permissions on MaxCompute.

View problem data collection logs

You can use one of the following methods to view problem data collection logs. The logs contain the SQL statements for collecting problem data and the execution process. If an error occurs during problem data collection, the logs help you quickly troubleshoot issues.

  • Method 1: View problem data collection logs in Data Quality

    On the Node Query page in Data Quality, you can click Execution Logs in the Actions column of the desired table to view problem data collection logs. For more information, see View monitoring results.

  • Method 2: View problem data collection logs based on the list of scheduling nodes

    After you associate a monitoring rule with a scheduling node, you can go to the Running Details page of the scheduling node. On the DQC tab of the Execution step, find the monitoring rule and click View details in the Actions column to view the problem data collection process. For more information, see Intelligent diagnosis.

Appendix: Monitoring rules that support the problem data retention feature and specifications for problem data

Check type

Check rule

Rule description

Problem data specifications

Unique value

Unique value, fixed value

Data Quality compares the number of unique values of a field after deduplication with a fixed value.

If the check is not passed:

  • If all values of the field are unique, the system generates a problem data table but stores no problem data in the table.

  • If some values of the field are not unique, the system generates a problem data table and stores the duplicate data of the field in the table. In this case, the duplicate data is the problem data.

Number of unique values/Total number of rows

Data Quality compares the ratio of the number of unique values of a field to the total number of rows with a fixed value.

Null value

Number of null values, fixed value

Data Quality compares the number of null values of a field with a fixed value.

Note

The IS NULL expression is used in the SQL statements to check whether a value of a field is a null value.

The system generates a problem data table and stores null values of the field in the table. In this case, all null values are problem data.

Number of null values/Total number of rows, fixed value

Data Quality compares the ratio of the number of null values of a field to the total number of rows with a fixed value.

Note

The fixed value is a decimal.

Duplicate value

Number of duplicate values/Total number of rows, fixed value

Data Quality compares the ratio of the number of duplicate values of a field to the total number of rows with a fixed value.

The system generates a problem data table and stores duplicate values of the field in the table. In this case, all duplicate values are problem data.

Number of duplicate values, fixed value

Data Quality subtracts the number of values of a field after deduplication from the total number of rows to obtain the number of duplicate values of the field. Then, Data Quality compares the number of duplicate values with a fixed value.