The quality monitoring node in DataWorks lets you configure rules to monitor the data quality of tables in your data sources and detect dirty data. You can also customize scheduling policies to periodically run data validation tasks. This topic describes how to use a quality monitoring node to monitor data quality.
Background
The Data Quality feature in DataWorks helps you detect changes in source data and identify dirty data generated during the extract, transform, and load (ETL) process. It automatically intercepts problematic tasks so that dirty data does not propagate downstream, where it could disrupt operations and affect business decisions. It also reduces troubleshooting time and saves resources by avoiding task reruns. For more information, see Data Quality.
Limitations
- Supported table types: MaxCompute, E-MapReduce, Hologres, CDH Hive, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and StarRocks.
- Scope of supported tables:
  - You can monitor only tables in data sources that are bound to the workspace where the quality monitoring node is located.
  - Each node can monitor only one table, but you can configure multiple monitoring rules for it. The monitoring scope varies by table type:
    - For a non-partitioned table, the entire table is monitored by default.
    - For a partitioned table, you must specify a partition to monitor by using a partition filter expression.
    Note: To monitor multiple tables, create multiple nodes.
- Operational limitations:
  - Quality monitoring rules created in Data Studio can be run, modified, and deployed only within Data Studio. Although these rules are visible in the Data Quality module, you cannot trigger them on a schedule or manage them from there.
  - If you modify the monitoring rules in a quality monitoring node and then deploy the node, the previously generated monitoring rules are replaced.
Prerequisites
- A computing engine is bound to your workspace, and the table that you want to monitor has been created in it. The table must exist before you run a data quality monitoring task. For more information, see Bind a computing engine and Develop a node.
- A resource group has been created. Quality monitoring nodes can run only on a serverless resource group. For more information, see Manage resource groups.
- (Optional, for RAM users) The RAM user for task development is added to the corresponding workspace and granted the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions and should be granted with caution. For more information about adding and authorizing members, see Add members to a workspace.
Step 1: Create a quality monitoring node
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
-
In the left-side navigation pane, click
to go to Data Development. Next to Project Directory, click
and choose . Follow the on-screen instructions to enter the path and name for the node and create it.
Step 2: Configure quality monitoring rules
1. Select the table to monitor
On the quality monitoring node editor page, click Add Table. In the Add Table dialog box, select the table to monitor. You can use the More filter to locate the table faster.
If your table is not listed, go to My Data in Data Map.
2. Configure the monitoring data scope
- Non-partitioned table: The entire table is monitored by default. You can skip this step.
- Partitioned table: You must select the partition data to monitor. You can use scheduling parameters in the partition filter expression. Click Preview to verify that the expression resolves correctly.
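To make the Preview step concrete, the following Python sketch mimics how a partition filter expression with a date placeholder might resolve for a given run date. The placeholder form `$[yyyymmdd-N]`, the function name `resolve_partition_filter`, and the partition column `dt` are assumptions made for illustration; the Preview button remains the authoritative check of what DataWorks actually resolves.

```python
import re
from datetime import date, timedelta

# Assumed placeholder form for this sketch: $[yyyymmdd] or $[yyyymmdd-N],
# where N is an offset in days.
_PLACEHOLDER = re.compile(r"\$\[yyyymmdd(?:-(\d+))?\]")


def resolve_partition_filter(expression: str, run_date: date) -> str:
    """Replace each date placeholder with a concrete yyyymmdd string."""

    def _substitute(match: re.Match) -> str:
        # A missing offset means "the run date itself".
        offset_days = int(match.group(1) or 0)
        return (run_date - timedelta(days=offset_days)).strftime("%Y%m%d")

    return _PLACEHOLDER.sub(_substitute, expression)


# Hypothetical expression on a hypothetical partition column named dt.
print(resolve_partition_filter("dt=$[yyyymmdd-1]", date(2024, 3, 1)))
# prints "dt=20240229"
```

Checking a resolved value like this for a few run dates (month boundaries, leap days) is a quick way to catch an off-by-one offset before the node is scheduled.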
3. Configure data quality monitoring rules
You can create new rules or import existing ones. Configured rules are enabled by default.
- When you create rules in a quality monitoring node, you can use the DataWorks Copilot rule recommendation feature to intelligently generate quality rules based on your table information. You can then accept or reject the suggestions as needed.
- DataWorks Copilot is in public preview in some regions. If it is not available in the region where your workspace is located, you can manually create or import rules as described in this topic.
- Create a new rule
Click Create Rule to create a quality monitoring rule based on a template or custom SQL. The following sections describe these methods.
Built-in template
You can follow these steps to quickly create a quality monitoring rule from a rule template.
Note: You can also find the required rule template in the built-in template list on the left and click + Use to create it.
Click the + Built-in Template Rule tab on the right side of the dialog box. In the Built-in Templates panel on the left, expand a template category such as Table Row Count, find the target template, and click + Use. In the form on the right, configure the rule parameters and click OK.
Custom template
Before you use this method, you must first create a custom rule template in . You can then create quality monitoring rules from that template. For more information, see Create and manage custom rule templates.
The following steps show how to create a data quality rule from a custom template.
Note: You can also find the required rule template in the custom template list on the left and click + Use to create it.
In the Create Rule dialog box, click the + Custom Template Rule tab. Configure Rule Name, Rule Template, Quality Dimension, FLAG Parameter, SQL, Comparison Method, and Monitoring Threshold (including normal and red thresholds), and then click OK.
Custom SQL
This method allows you to define custom data quality validation logic for a table.
Click the + Custom SQL tab at the top. In the form on the right, configure the Rule Name, Rule Template (select Custom SQL), Quality Dimension, FLAG Parameter, SQL, Comparison Method (choose Manual Settings or Intelligent Dynamic Threshold), and Monitoring Threshold, and then click OK.
- Import existing rules
If monitoring rules for the target table already exist in the Data Quality module, you can import them to quickly clone the rules. If no rules exist, go to Data Quality to create them first. For more information, see Configure rules: By table (single table).
Note: This method supports importing multiple rules in bulk and allows for configuring monitoring rules at the table field level.
Click Import Rule. You can search for and select the rules to import by rule ID or name, rule template, or associated scope (entire table or specific fields).
After you select the rules, click OK to complete the import.
After a quality monitoring node is deployed, the rules it contains can be viewed in the Data Quality module, but management operations such as modifying or deleting them are not allowed there.
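As an illustration of the Custom SQL method, the following Python sketch runs a validation query that returns a single metric and compares it with two thresholds. It uses an in-memory SQLite database so that it is self-contained. The `orders` table, its columns, the threshold values, and the mapping of an empty result to a "check failed" status are all assumptions for this sketch, not DataWorks behavior.

```python
import sqlite3

# Hypothetical custom SQL: percentage of rows in one partition with no amount.
CHECK_SQL = (
    "SELECT 100.0 * SUM(amount IS NULL) / COUNT(*) "
    "FROM orders WHERE dt = ?"
)


def classify(value, warning_threshold, critical_threshold):
    """Map a metric to a status, assuming that higher values are worse."""
    if value > critical_threshold:
        return "critical"
    if value > warning_threshold:
        return "warning"
    return "normal"


def run_check(conn, partition, warning_threshold=5.0, critical_threshold=20.0):
    """Run the check SQL for one partition and classify the result."""
    (value,) = conn.execute(CHECK_SQL, (partition,)).fetchone()
    if value is None:
        # The partition has no rows, so the metric cannot be computed.
        # Treating this as "check_failed" is an assumption of this sketch.
        return None, "check_failed"
    return value, classify(value, warning_threshold, critical_threshold)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, dt TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "20240229"),
        (2, None, "20240229"),
        (3, 7.5, "20240229"),
        (4, 3.0, "20240229"),
    ],
)

value, status = run_check(conn, "20240229")
print(value, status)  # prints "25.0 critical"
```

Keeping the query to a single numeric result makes the comparison step simple: the rule only needs a comparison method and thresholds to turn that number into a status.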
4. Configure runtime resources
Select the runtime resources for the quality rule checks. This selection determines the data source where the quality monitoring task runs. By default, this is the data source where the monitored table is located.
If you select another data source, confirm that it has access permissions to the table.
Step 3: Configure handling policies
In the Handling Policy section of the node editor page, you can configure handling policies and notification subscriptions for exceptions that the quality monitoring rules detect.
Exception categories
| Exception category | Description |
| --- | --- |
| Strong rule: Check failed | |
| Strong rule: Critical exception | |
| Strong rule: Warning exception | |
| Weak rule: Check failed | |
| Weak rule: Critical exception | |
| Weak rule: Warning exception | |
Exception handling policies
You can configure handling policies for exceptions generated by rule checks:
- Do not ignore: If a specific exception category is detected (for example, a strong rule triggers a critical exception), you can configure the system to stop the current node and set its status to Failed.
  Note:
  - After the current node fails, downstream nodes are not executed. This blocks the production pipeline and prevents the spread of problematic data.
  - You can add multiple exception categories to check for.
  - This policy is typically used when an exception has a major impact and needs to block downstream tasks.
- Ignore: Ignore the exception and continue to execute downstream nodes.
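The two policies can be pictured as a membership test: the node fails only if at least one detected exception belongs to a category configured as Do not ignore. The following Python sketch models that decision. The `RuleException` class, the string labels, and the `node_should_fail` function are hypothetical names introduced for illustration and do not correspond to a DataWorks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleException:
    """One exception category: rule strength plus check outcome."""

    strength: str  # "strong" or "weak"
    outcome: str   # "check_failed", "critical", or "warning"


def node_should_fail(detected, blocking):
    """Return True if any detected exception is in the Do not ignore set."""
    return any(exc in blocking for exc in detected)


# Categories configured as Do not ignore (several can be added).
blocking = {
    RuleException("strong", "critical"),
    RuleException("strong", "check_failed"),
}

# A weak-rule warning alone is ignored, so downstream nodes keep running.
print(node_should_fail([RuleException("weak", "warning")], blocking))  # prints "False"

# A strong-rule critical exception fails the node and blocks downstream nodes.
detected = [RuleException("weak", "warning"), RuleException("strong", "critical")]
print(node_should_fail(detected, blocking))  # prints "True"
```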
Exception notification methods
You can configure how to receive notifications for exceptions (for example, by email). When an exception occurs, the platform sends a notification through the specified method so you can handle the exception promptly.
The platform supports multiple notification methods. The methods available to you are those shown in the UI. Note the following:
- Email, Email and SMS, and Phone notifications can be sent only to users under the current account. Make sure that the recipients' email addresses and phone numbers are configured correctly. For more information, see View and set alert contacts.
- For other methods, you must enter the webhook URL for receiving notifications. For instructions on how to obtain this URL, see Obtain a webhook URL.
Step 4: Configure scheduling
To run the node task periodically, click Scheduling Settings on the right side of the node editor page and configure the scheduling properties based on your business requirements. For more information, see Configure scheduling for a node.
Step 5: Debug the task
Perform the following debugging operations as needed to check whether the task runs as expected.
- (Optional) Select the runtime resource group and assign values to custom parameters.
  - On the right side of the quality monitoring node, click Run Configuration and configure the Resource Group for Scheduling to use for the debug run.
  - If your task uses scheduling parameters, you can assign values to variables in the Script Parameters section for debugging. For more information about the parameter assignment logic, see Debug a task.
- Save and run the task.
  Click the save icon in the top toolbar to save the task, and then click the run icon to run the task. After the task is complete, you can view the run results at the bottom of the node editor page. If the run fails, troubleshoot the issue based on the error message.
Step 6: Deploy the task
After the node task is configured, you must deploy it. After deployment, the node runs periodically according to its scheduling configuration.
When you deploy a quality monitoring node, the quality rules configured within it are also deployed.
- In the top toolbar, click the save icon to save the node.
- In the top toolbar, click the deploy icon to deploy the node.
For more information about how to deploy nodes, see Deploy nodes and workflows.
Next steps
- Task O&M: After a task is deployed, it runs periodically based on its scheduling configuration. You can click O&M in the upper-right corner of the node editor page to go to Operation Center and view the scheduling and run details of the task, such as the node run status and triggered rule details. For more information, see Manage scheduled tasks.
- Data Quality: After a quality monitoring rule is deployed, you can go to the Data Quality module to view the rule details. However, you cannot perform management operations such as modifying or deleting the rule there. For more information, see Data Quality.