The DataWorks Data Comparison node compares data between tables across different data sources and integrates the comparison into a scheduled workflow.
Limitations
- The Data Comparison node supports only the Serverless resource group. For setup instructions, see Use a Serverless resource group.
- Full-text comparison results are stored only in MaxCompute data sources. Make sure a MaxCompute data source is bound to the workspace before configuring full-text comparison.
- For partitioned tables, the Where filter is required. Omitting the partition filter causes the task to fail.
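As an illustration of the partition filter requirement, the following Python sketch builds a filter expression suitable for the Where filter field. The helper names, the partition column `ds`, and the value `20240101` are hypothetical and not part of DataWorks. The sketch only relies on two facts stated in this topic: the expression must not contain the WHERE keyword, and a partitioned table needs a partition predicate.

```python
import re


def partition_predicate(column: str, value: str) -> str:
    """Build an equality predicate on a partition column (hypothetical helper)."""
    # Accept only simple identifiers and values so no quoting or escaping is needed.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", column):
        raise ValueError(f"unexpected partition column name: {column!r}")
    if not re.fullmatch(r"[A-Za-z0-9_-]+", value):
        raise ValueError(f"unexpected partition value: {value!r}")
    return f"{column} = '{value}'"


def normalize_where_filter(expr: str) -> str:
    """Strip a leading WHERE keyword, since the field expects only the condition."""
    cleaned = re.sub(r"^where\b\s*", "", expr.strip(), flags=re.IGNORECASE)
    if not cleaned:
        raise ValueError("empty filter: a partitioned table needs a partition predicate")
    return cleaned


# Assumed example: a table partitioned by a column named "ds".
print(normalize_where_filter("WHERE " + partition_predicate("ds", "20240101")))
```

The printed expression is what you would paste into the Where filter field for both the source and destination tables, adjusted to your own partition column and value.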
Step 1: Create a Data Comparison node
- Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a region. Find the target workspace and click Shortcuts > Data Studio in the Actions column.
- In the left navigation pane, click the Data Development icon to go to Data Development. To the right of Project Directory, click the icon and select Create Node > Data Quality > Data Comparison. Enter a path and name for the node, then confirm.
Step 2: Configure the node
Configure table information
Specify the source and destination tables to compare. The following table describes the parameters.
| Parameter | Description |
|---|---|
| Resource Group | Select a Serverless resource group from the drop-down list. |
| Task resource usage | Adjust the compute resources allocated to the node when it runs. |
| Data Source Type | Select the data source types for the source and destination tables. |
| Data Source Name | Select the data sources for the source and destination tables. |
| Connectivity | Click Test to verify that the resource group can connect to the selected data sources. |
| Table name | Select the source and destination tables from the drop-down lists. For MaxCompute data sources, a schema selection is also available. |
| Where filter | Filter rows for comparison. Do not include the WHERE keyword. For partitioned tables, specify a partition predicate; omitting this causes the task to fail with: Semantic analysis exception - physical plan generation failed: Table(<MaxCompute Project Name>,<Table Name>) is full scan with all partitions, please specify partition predicates. |
| Shard Key | Select a column to split the source data for parallel processing. Use a primary key or an indexed column as the Shard Key. |
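To show why a primary key or an indexed column makes a good Shard Key, the following Python sketch splits an integer key range into contiguous chunks that could be read in parallel. This is a generic illustration of range-based splitting, not a description of how DataWorks splits data internally, which this topic does not specify. The function name and the sample key range are assumptions.

```python
from typing import List, Tuple


def split_key_range(min_key: int, max_key: int, shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_key, max_key] into contiguous inclusive chunks."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")
    total = max_key - min_key + 1
    shards = min(shards, total)          # never create empty chunks
    base, extra = divmod(total, shards)  # the first `extra` chunks get one more key
    ranges = []
    start = min_key
    for i in range(shards):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Assumed example: keys 1..10 split across 3 parallel readers.
for low, high in split_key_range(1, 10, 3):
    print(f"id >= {low} AND id <= {high}")
```

Each chunk maps to a range condition on the key column. When that column is a primary key or is indexed, such range reads are typically cheap and the chunks tend to be evenly sized.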
Configure comparison rules
The node offers two comparison methods:
- Metric-based comparison
- Full-text comparison
Configure scheduling
After configuring the comparison rules, click Scheduling Configuration on the right side of the page to set when the node runs. See Node scheduling configuration for details.
Step 3: Deploy the node
- Click the save icon in the top toolbar to save the node.
- Click the deploy icon in the top toolbar to deploy the node.
Once deployed, the node runs on the schedule you configured. For detailed deployment steps, see Deploy a node or workflow.
View the validation report
Access the report in two ways:
From the Operation Center:
- Click the menu icon and go to All Products > Data Development And Operations > Operation Center (Workflow).
- In the left navigation pane, choose Cycle Task Maintenance > Cycle Instance. Find the instance for the Data Comparison node, click More in the Operation column, and select View Running Log.
- On the log page, click the Data Comparison tab to view the report.
From the run log (Data Development page):
Run the node from Data Development, then click the link in the run log to open the report directly.
What's next
- To manage the node after deployment, go to the Operation Center. See Operation Center.
