DataWorks: Create and use a data comparison node

Last Updated: Jan 26, 2025

DataWorks provides data comparison nodes that allow you to compare data between different tables in multiple ways. You can use data comparison nodes in workflows. This topic describes how to use a data comparison node to develop tasks.

Node introduction

Data comparison nodes can be used in data synchronization scenarios and support data comparison between tables. You can specify custom ranges and metrics to compare data from different aspects.

Limits

Data comparison nodes support only serverless resource groups. For more information about how to use a serverless resource group, see Resource group management.

Procedure

Step 1: Create a data comparison node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In the Scheduled Workflow pane of the DataStudio page, move the pointer over the create icon and choose Create Node > Data Quality > Data Comparison.

    In the Create Node dialog box, configure the Path and Name parameters as prompted and click Confirm. The configuration tab of the node appears.

Step 2: Configure the data comparison node

Configure parameters in the Configure Information of Tables to Compare section

Data comparison nodes allow you to compare table data across different data sources after you configure only basic information about the tables that you want to compare. The following list describes the parameters.

  • Resource Group: Select an existing resource group from the drop-down list.

  • Task Resource Usage: The number of compute units (CUs) that are allocated to run the data comparison node. Configure this parameter based on your business requirements.

  • Data Source Type: Select the types of the data sources to which the source and destination tables separately belong.

  • Data Source Name: Select the data sources to which the source and destination tables separately belong.

  • Connection Status: Click Test to the right of the Connection Status parameter to check whether the selected data sources are connected to the selected resource group.

  • Table Name: Separately select the source and destination tables from the drop-down lists.

  • WHERE Condition: Enter a WHERE condition to filter the data to compare in the source and destination tables. An example follows this list.

  • Shard Key: Specify a column in the source table as the shard key. We recommend that you use the primary key or an indexed column as the shard key.
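
A WHERE condition is an SQL filter expression, typically entered without the WHERE keyword. The following condition is a minimal sketch that limits the comparison to one day's data; the ds and status columns and their values are hypothetical:

    ds = '20250101' AND status = 'completed'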

Configure parameters in the Configure Comparison Rule section

You can configure metric-based comparison or full-text comparison rules for data comparison.

Metric-based comparison

  • Table-level Comparison:

    A table-level comparison compares the numbers of rows in the source and destination tables. If the difference rate of the comparison result is less than the threshold specified by the Error Threshold parameter, the comparison is successful.

    Note

    You can set the Error Threshold parameter to Percentage, Absolute Value, or Consistent or Not.
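
    Conceptually, a table-level comparison checks row counts such as those returned by the following queries. This is a minimal sketch, not the statement that DataWorks actually runs; the table names and partition filter are hypothetical:

        -- Row count of the source table.
        SELECT COUNT(*) FROM src_db.orders WHERE ds = '20250101';

        -- Row count of the destination table.
        SELECT COUNT(*) FROM dest_db.orders WHERE ds = '20250101';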

  • Field-level Comparison:

    By default, fields that have the same name are selected for comparison. If the source and destination tables contain different fields, you can click Add Field for Comparison to manually select source and destination table fields for field-level comparison.

    • Source Field: Select fields from the source table.

    • Destination Field: Select fields from the destination table.

    • Comparison Metric: Select a comparison metric. Valid values: MAX, AVG, MIN, and SUM.

      • You can configure multiple comparison metrics for a pair of source and destination table fields.

      • You can set the Error Threshold and Ignored Object parameters to different values for different comparison metrics.

    • Error Threshold: the threshold that is used to determine whether the comparison is successful. If the difference rate of the comparison result is less than the difference threshold specified by the Error Threshold parameter, the comparison is successful. Valid values: Percentage, Absolute Value, and Consistent or Not.

      Note
      • Absolute value specified by the Error Threshold parameter = |Metric value in the source table - Metric value in the destination table|

      • Percentage specified by the Error Threshold parameter = (|Metric value in the source table - Metric value in the destination table|)/Metric value in the source table × 100%
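
      For example, if the SUM of a field is 1,000 in the source table and 990 in the destination table, the absolute value is |1,000 - 990| = 10, and the percentage is 10/1,000 × 100% = 1%.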

    • Ignored Object: Specify the differences to ignore based on the field data type. A conceptual sketch follows this list.

      • If fields are of the INT data type, you can select the Difference Between Null Value and Value 0 check box.

      • If fields are of the VARCHAR or STRING data type, you can select the Difference Between Null Value and Empty String check box.

      • If fields are of the DECIMAL data type, you can select the Floating Precision check box and the Difference Between Null Value and Value 0 check box.

    • Operation: You can delete redundant or unnecessary fields.
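
    The following snippet illustrates what ignoring the difference between null values and the value 0 means: null values are normalized before a metric is computed. This is a conceptual sketch with hypothetical table and column names, not the logic that DataWorks actually executes:

        -- Treat NULL and 0 as equal for a numeric field by normalizing NULL to 0.
        SELECT SUM(COALESCE(amount, 0)) FROM src_db.orders;

        -- Treat NULL and the empty string as equal for a string field.
        SELECT MAX(COALESCE(remark, '')) FROM src_db.orders;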

  • Configure Custom Comparison Rules:

    You can perform the following operations to add custom SQL comparison metrics to compare data in the source and destination tables. An example follows these steps.

    1. Click Add Custom SQL Comparison Metric to add SQL comparison metrics based on your business requirements. You can rename metrics.

    2. Configure the Error Threshold parameter based on your business requirements. You can set the Error Threshold parameter to Percentage, Absolute Value, or Consistent or Not.

    3. After you configure the Error Threshold parameter, click Configure in the Custom SQL column to configure SQL statements for the source and destination tables to specify custom computing metrics.

    4. After the configuration is complete, click Confirm.
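
    For example, a custom SQL comparison metric can count the distinct buyers in each table. The following statements are a minimal sketch with hypothetical table and column names; the two results are checked against the configured Error Threshold:

        -- Custom metric for the source table: the number of distinct buyers.
        SELECT COUNT(DISTINCT buyer_id) FROM src_db.orders WHERE ds = '20250101';

        -- Custom metric for the destination table: the same definition.
        SELECT COUNT(DISTINCT buyer_id) FROM dest_db.orders WHERE ds = '20250101';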

Full-text comparison

  1. If you set the Comparison Type parameter to Full-text Comparison, you must also configure the Full-text Comparison Type parameter to specify how the tables are compared.

    • Source Data Contained in Destination: If each row of data in a source table exists in the related destination table, the comparison is successful. In this case, the number of data records in the destination table may be greater than the number of data records in the source table.

    • Comparison by Row: The source and destination tables are compared row by row. Both the numbers of rows and the data differences between rows are checked.

      If you set the Full-text Comparison Type parameter to Comparison by Row, you must configure the Error Threshold parameter. You can set the Error Threshold parameter to Percentage, Absolute Value, or Consistent or Not.

      Note
      • Absolute value specified by the Error Threshold parameter = |Metric value in the source table - Metric value in the destination table|

      • Percentage specified by the Error Threshold parameter = (|Metric value in the source table - Metric value in the destination table|)/Metric value in the source table × 100%

  2. After you configure the full-text comparison type, you can select the fields that you want to compare. By default, fields that have the same name are selected. If you want to compare fields that have different names, you can click Add Field for Comparison to manually select source and destination table fields.

    • Source Field: Select fields from the source table.

    • Destination Field: Select fields from the destination table.

    • Full-text Comparison Based on Primary Keys: The system uses the primary keys to match rows between the source and destination tables and compares all columns except the primary key columns. A conceptual sketch follows this list.

    • Ignored Object:

      • If fields are of the INT data type, you can select the Difference Between Null Value and Value 0 check box.

      • If fields are of the VARCHAR or STRING data type, you can select the Difference Between Null Value and Empty String check box.

      • If fields are of the DECIMAL data type, you can select the Floating Precision check box and the Difference Between Null Value and Value 0 check box.

    • Operation: You can delete redundant or unnecessary fields.
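
    Conceptually, a primary key-based full-text comparison matches rows on the primary key and then compares the non-key columns, similar to the following query. This is a minimal sketch with hypothetical table, key, and column names, not the statement that DataWorks actually runs:

        -- Find rows that exist in only one table or whose non-key columns differ.
        SELECT s.id AS source_id, d.id AS destination_id
        FROM src_db.orders s
        FULL OUTER JOIN dest_db.orders d
          ON s.id = d.id
        WHERE s.id IS NULL
           OR d.id IS NULL
           OR COALESCE(s.amount, 0) <> COALESCE(d.amount, 0)
           OR COALESCE(s.status, '') <> COALESCE(d.status, '');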

  3. Full-text comparison results must be stored so that you can view the data comparison details after the comparison is complete. Configure a data source to store the comparison results.

    • Data Source Type: Only MaxCompute is supported.

    • Data Source Name: Select a MaxCompute data source that is associated with the current workspace from the drop-down list.

    • Connection Status: Make sure that the selected MaxCompute data source is connected to the resource group that is selected in the Configure Information of Tables to Compare section.

    • Table For Storage: Click Generate Storage Table to generate a table whose name is in the data_comparison_xxxxxx format.

    • Tunnel Quota: Select data transmission resources of MaxCompute from the drop-down list. For more information, see Purchase and use exclusive resource groups for data transmission service.

Configure scheduling properties

After you configure comparison rules, you can configure scheduling properties for the data comparison node. For more information, see Node scheduling configuration.

Step 3: Deploy and perform O&M operations on the data comparison node

Deploy the data comparison node

After you configure the task on the data comparison node, you must commit and deploy the node. After the node is committed and deployed, the system periodically runs the node based on the scheduling configurations.

  1. Click the save icon in the top toolbar to save the node.

  2. Click the commit icon in the top toolbar to commit the node.

    In the Submit dialog box, configure the Change description parameter. Then, based on your business requirements, determine whether to review the node code and whether to perform smoke testing after you commit the node.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the node.

    • You can use the code review feature to ensure the code quality of nodes and prevent execution errors caused by invalid node code. If you enable the code review feature, the node code that is committed can be deployed only after the node code passes the code review. For more information, see Code review.

    • To ensure that a task on the node you created can be run as expected, we recommend that you perform smoke testing before you deploy the node. For more information, see Perform smoke testing.

If the workspace that you use is in standard mode, you must click Deploy in the upper-right corner of the node configuration tab after you commit the node to deploy the task to the production environment. For more information, see Deploy nodes.

Perform O&M operations on the data comparison node

After the data comparison node is deployed, you can perform O&M operations on the node in Operation Center. For more information, see Operation Center.

View a data comparison report

You can use one of the following methods to view a data comparison report:

  • View in Operation Center:

    1. In the upper-left corner of the current page, click the menu icon and choose All Products > Data Development And Task Operation > Operation Center.

    2. In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Instances. On the Instance Perspective tab of the page that appears, find the instance that is generated for the data comparison node and choose More > View Runtime Log in the Actions column.

    3. On the Running Details tab of the page that appears, click the Data Comparison tab in the Execution step.

  • View on the Log tab:

    If the data comparison node is run only on the DataStudio page, you can click the data comparison report link on the Log tab in the Execution step to go to the data comparison report page and view details.