All Products
Search
Document Center

DataWorks:Data comparison

Last Updated:Mar 26, 2026

The DataWorks Data Comparison node compares data between tables across different data sources and integrates the comparison into a scheduled workflow.

Limitations

  • The Data Comparison node supports only the Serverless resource group. For setup instructions, see Use a Serverless resource group.

  • Full-text comparison results are stored only in MaxCompute data sources. Make sure a MaxCompute data source is bound to the workspace before configuring full-text comparison.

  • For partitioned tables, the Where filter is required. Omitting the partition filter causes the task to fail.

Step 1: Create a Data Comparison node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a region. Find the target workspace and click Shortcuts > Data Studio in the Actions column.

  2. In the left navigation pane, click image to go to Data Development. To the right of Project Directory, click image and select Create Node > Data Quality > Data Comparison. Enter a path and name for the node, then confirm.

Step 2: Configure the node

Configure table information

Specify the source and destination tables to compare. The following table describes the parameters.

Parameter Description
Resource Group Select a Serverless resource group from the drop-down list.
Task resource usage Adjust the compute resources allocated to the node when it runs.
Data Source Type Select the data source types for the source and destination tables.
Data Source Name Select the data sources for the source and destination tables.
Connectivity Click Test to verify that the resource group can connect to the selected data sources.
Table name Select the source and destination tables from the drop-down lists. For MaxCompute data sources, a schema selection is also available.
Where filter Filter rows for comparison. Do not include the WHERE keyword. For partitioned tables, specify a partition predicate; omitting this causes the task to fail with: Semantic analysis exception - physical plan generation failed: Table(<MaxCompute Project Name>,<Table Name>) is full scan with all partitions, please specify partition predicates.
Shard Key Select a column to split the source data for parallel processing. Use a primary key or an indexed column as the Shard Key.

Configure comparison rules

Metric-based comparison

Metric-based comparison validates aggregated values between the source and destination tables. It includes three levels:

Table-level: row count

Table Row Count Comparison checks whether the total number of rows in both tables is within an acceptable difference. Set the Error Threshold as one of the following:

  • Percentage

  • Absolute Value

  • Exact Match

Column-level: aggregate metrics

By default, columns with the same name are mapped automatically. To compare columns with different names, click Add Comparison Field and map a Source Field to a Destination Field.

For each column pair, configure:

  • Comparison Metric: Select one or more of MAX, AVG, MIN, or SUM. Each metric can have its own error threshold and ignore options.

  • Error Threshold: Applies per metric.

    • Absolute Value = |Source Metric Value - Destination Metric Value|

    • Percentage = (|Source Metric Value - Destination Metric Value|) / Source Metric Value × 100%

  • Ignore options: Available options depend on the column data types being compared.

    Column types Available ignore options
    Integer (INT, BIGINT) Ignore difference between null and 0
    String (STRING, VARCHAR, TEXT) Ignore difference between null and empty string
    Numeric (integer and floating-point) Set Floating Precision; ignore null vs. 0; ignore trailing zeros in the decimal part
    Integer compared with string Ignore trailing zeros in the decimal part
    Integer compared with floating-point Ignore trailing zeros in the decimal part; ignore null vs. 0
    Floating-point compared with string Ignore trailing zeros in the decimal part

Custom SQL

To compare metrics defined by custom business logic:

  1. Click Add Custom SQL Comparison Metric. Optionally rename the metric.

    image

  2. Set the Error Threshold (Percentage, Absolute Value, or Exact Match).

  3. Click Configure in the Custom SQL column and write SQL queries that calculate the metric for the source and destination tables.

  4. Click OK.

Full-text comparison

Full-text comparison checks for row-level differences between the source and destination tables. Set up the comparison as follows:

1. Select the comparison method

  • Destination Data Contains Source Data: passes if every source row exists in the destination. The destination may contain additional rows.

  • Row-by-Row Comparison: compares tables row by row for differences in both row count and content. Requires an Error Threshold (Percentage, Absolute Value, or Exact Match).

    • Absolute Value = |Source Metric Value - Destination Metric Value|

    • Percentage = (|Source Metric Value - Destination Metric Value|) / Source Metric Value × 100%

2. Map columns for comparison

Columns with the same name are mapped automatically. Click Add Comparison Field to map columns with different names.

For each mapped pair, configure:

  • Source Field: column from the source table

  • Destination Field: column from the destination table

  • Comparison Primary Key: required. Specifies the key used to match rows across both tables before comparing column values.

  • Ignore options: same as those available in metric-based comparison (see table above).

3. Configure result storage

Full-text comparison stores row-level diff results in a MaxCompute data source so you can review them after the task completes.

Parameter Description
Data Source Type Only MaxCompute is supported.
Data Source Name Select a MaxCompute data source bound to the workspace.
Connectivity Make sure the selected MaxCompute data source can connect to the resource group configured for the table comparison.
Storage Table Click Generate Storage Table to create a table in the format data_comparison_xxxxxx.
Tunnel Quota Select a MaxCompute data transmission resource. See Purchase and use exclusive data transmission service resource groups.

Configure scheduling

After configuring the comparison rules, click Scheduling Configuration on the right side of the page to set when the node runs. See Node scheduling configuration for details.

Step 3: Deploy the node

  1. Click image in the top toolbar to save the node.

  2. Click image in the top toolbar to deploy the node.

Once deployed, the node runs on the schedule you configured. For detailed deployment steps, see Deploy a node or workflow.

View the validation report

Access the report in two ways:

From the Operation Center:

  1. Click the image icon and go to All Products > Data Development And Operations > Operation Center (Workflow).

  2. In the left navigation pane, choose Cycle Task Maintenance > Cycle Instance. Find the instance for the Data Comparison node, click More in the Operation column, and select View Running Log.

  3. On the log page, click the Data Comparison tab to view the report.

From the run log (Data Development page):

Run the node from Data Development, then click the link in the run log to open the report directly.

image

What's next

  • To manage the node after deployment, go to the Operation Center. See Operation Center.