DataWorks: Create and use a data comparison node

Last Updated: Jan 26, 2025

DataWorks provides data comparison nodes that allow you to compare data between different tables in multiple ways. You can use data comparison nodes in workflows. This topic describes how to use a data comparison node to develop tasks.

Node introduction

Data comparison nodes can be used in data synchronization scenarios and support data comparison between tables. You can specify custom ranges and metrics to compare data from different aspects.

Limits

Data comparison nodes support only serverless resource groups. For more information about how to use a serverless resource group, see Resource group management.

Procedure

Step 1: Create a data comparison node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In the Scheduled Workflow pane of the DataStudio page, move the pointer over the create icon and choose Create Node > Data Quality > Data Comparison.

    In the Create Node dialog box, configure the Path and Name parameters as prompted and click Confirm. The configuration tab of the node appears.

Step 2: Configure the data comparison node

Configure parameters in the Configure Information of Tables to Compare section

Data comparison nodes allow you to compare table data across different data sources after you configure only basic information about the tables that you want to compare. The following list describes the parameters.

  • Resource Group: Select an existing resource group from the drop-down list.

  • Task Resource Usage: The number of compute units (CUs) that are allocated to run the data comparison node. Configure this parameter based on your business requirements.

  • Data Source Type: Select the types of the data sources to which the source and destination tables separately belong.

  • Data Source Name: Select the data sources to which the source and destination tables separately belong.

  • Connection Status: Click Test to the right of the Connection Status parameter to check whether the selected data sources are connected to the selected resource group.

  • Table Name: Separately select the source and destination tables from the drop-down lists.

  • WHERE Condition: Enter a WHERE condition to filter the data to compare in the source and destination tables. An example follows this list.

  • Shard Key: Specify a column in the source table as the shard key. We recommend that you use the primary key or an indexed column as the shard key.
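
A WHERE condition is an SQL filter expression, typically entered without the WHERE keyword. The following condition is a minimal sketch that limits the comparison to one day's data; the ds and status columns and their values are hypothetical:

    ds = '20250101' AND status = 'completed'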

Configure parameters in the Configure Comparison Rule section

You can configure metric-based comparison or full-text comparison rules for data comparison.

Metric-based comparison

  • Table-level Comparison:

    A table-level comparison compares the numbers of rows in the source and destination tables. If the difference rate of the comparison result is less than the threshold specified by the Error Threshold parameter, the comparison is successful.

    Note

    You can set the Error Threshold parameter to Percentage, Absolute Value, or Consistent or Not.
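
    Conceptually, a table-level comparison checks row counts such as those returned by the following queries. This is a minimal sketch, not the statement that DataWorks actually runs; the table names and partition filter are hypothetical:

        -- Row count of the source table.
        SELECT COUNT(*) FROM src_db.orders WHERE ds = '20250101';

        -- Row count of the destination table.
        SELECT COUNT(*) FROM dest_db.orders WHERE ds = '20250101';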

  • Field-level Comparison:

    By default, fields that have the same name are selected for comparison. If the source and destination tables contain different fields, you can click Add Field for Comparison to manually select source and destination table fields for field-level comparison.

    • Source Field: Select fields from the source table.

    • Destination Field: Select fields from the destination table.

    • Comparison Metric: Select a comparison metric. Valid values: MAX, AVG, MIN, and SUM.

      • You can configure multiple comparison metrics for a pair of source and destination table fields.

      • You can set the Error Threshold and Ignored Object parameters to different values for different comparison metrics.

    • Error Threshold: the threshold that is used to determine whether the comparison is successful. If the difference rate of the comparison result is less than the difference threshold specified by the Error Threshold parameter, the comparison is successful. Valid values: Percentage, Absolute Value, and Consistent or Not.

      Note
      • Absolute value specified by the Error Threshold parameter = |Metric value in the source table - Metric value in the destination table|

      • Percentage specified by the Error Threshold parameter = (|Metric value in the source table - Metric value in the destination table|)/Metric value in the source table × 100%
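
      For example, if the SUM of a field is 1,000 in the source table and 990 in the destination table, the absolute value is |1,000 - 990| = 10, and the percentage is 10/1,000 × 100% = 1%.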

    • Ignored Object: Specify the differences to ignore based on the field data type. A conceptual sketch follows this list.

      • If fields are of the INT data type, you can select the Difference Between Null Value and Value 0 check box.

      • If fields are of the VARCHAR or STRING data type, you can select the Difference Between Null Value and Empty String check box.

      • If fields are of the DECIMAL data type, you can select the Floating Precision check box and the Difference Between Null Value and Value 0 check box.

    • Operation: You can delete redundant or unnecessary fields.
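
    The following snippet illustrates what ignoring the difference between null values and the value 0 means: null values are normalized before a metric is computed. This is a conceptual sketch with hypothetical table and column names, not the logic that DataWorks actually executes:

        -- Treat NULL and 0 as equal for a numeric field by normalizing NULL to 0.
        SELECT SUM(COALESCE(amount, 0)) FROM src_db.orders;

        -- Treat NULL and the empty string as equal for a string field.
        SELECT MAX(COALESCE(remark, '')) FROM src_db.orders;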

  • Configure Custom Comparison Rules:

    You can perform the following operations to add custom SQL comparison metrics to compare data in the source and destination tables. An example follows these steps.

    1. Click Add Custom SQL Comparison Metric to add SQL comparison metrics based on your business requirements. You can rename metrics.

    2. Configure the Error Threshold parameter based on your business requirements. You can set the Error Threshold parameter to Percentage, Absolute Value, or Consistent or Not.

    3. After you configure the Error Threshold parameter, click Configure in the Custom SQL column to configure SQL statements for the source and destination tables to specify custom computing metrics.

    4. After the configuration is complete, click Confirm.
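
    For example, a custom SQL comparison metric can count the distinct buyers in each table. The following statements are a minimal sketch with hypothetical table and column names; the two results are checked against the configured Error Threshold:

        -- Custom metric for the source table: the number of distinct buyers.
        SELECT COUNT(DISTINCT buyer_id) FROM src_db.orders WHERE ds = '20250101';

        -- Custom metric for the destination table: the same definition.
        SELECT COUNT(DISTINCT buyer_id) FROM dest_db.orders WHERE ds = '20250101';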

Full-text comparison

  1. If you set the Comparison Type parameter to Full-text Comparison, you must also configure the Full-text Comparison Type parameter to specify how the tables are compared.

    • Source Data Contained in Destination: If each row of data in a source table exists in the related destination table, the comparison is successful. In this case, the number of data records in the destination table may be greater than the number of data records in the source table.

    • Comparison by Row: The source and destination tables are compared row by row. Both the numbers of rows and the data differences between rows are checked.

      If you set the Full-text Comparison Type parameter to Comparison by Row, you must configure the Error Threshold parameter. You can set the Error Threshold parameter to Percentage, Absolute Value, or Consistent or Not.

      Note
      • Absolute value specified by the Error Threshold parameter = |Metric value in the source table - Metric value in the destination table|

      • Percentage specified by the Error Threshold parameter = (|Metric value in the source table - Metric value in the destination table|)/Metric value in the source table × 100%

  2. After you configure the full-text comparison type, you can select the fields that you want to compare. By default, fields that have the same name are selected. If you want to compare fields that have different names, you can click Add Field for Comparison to manually select source and destination table fields.

    • Source Field: Select fields from the source table.

    • Destination Field: Select fields from the destination table.

    • Full-text Comparison Based on Primary Keys: The system uses the primary keys to match rows between the source and destination tables and compares all columns except the primary key columns. A conceptual sketch follows this list.

    • Ignored Object:

      • If fields are of the INT data type, you can select the Difference Between Null Value and Value 0 check box.

      • If fields are of the VARCHAR or STRING data type, you can select the Difference Between Null Value and Empty String check box.

      • If fields are of the DECIMAL data type, you can select the Floating Precision check box and the Difference Between Null Value and Value 0 check box.

    • Operation: You can delete redundant or unnecessary fields.
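
    Conceptually, a primary key-based full-text comparison matches rows on the primary key and then compares the non-key columns, similar to the following query. This is a minimal sketch with hypothetical table, key, and column names, not the statement that DataWorks actually runs:

        -- Find rows that exist in only one table or whose non-key columns differ.
        SELECT s.id AS source_id, d.id AS destination_id
        FROM src_db.orders s
        FULL OUTER JOIN dest_db.orders d
          ON s.id = d.id
        WHERE s.id IS NULL
           OR d.id IS NULL
           OR COALESCE(s.amount, 0) <> COALESCE(d.amount, 0)
           OR COALESCE(s.status, '') <> COALESCE(d.status, '');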

  3. Full-text comparison results must be stored so that you can view the data comparison details after the comparison is complete. Configure a data source to store the comparison results.

    • Data Source Type: Only MaxCompute is supported.

    • Data Source Name: Select a MaxCompute data source that is associated with the current workspace from the drop-down list.

    • Connection Status: Make sure that the selected MaxCompute data source is connected to the resource group that is selected in the Configure Information of Tables to Compare section.

    • Table For Storage: Click Generate Storage Table to generate a table whose name is in the data_comparison_xxxxxx format.

    • Tunnel Quota: Select data transmission resources of MaxCompute from the drop-down list. For more information, see Purchase and use exclusive resource groups for data transmission service.

Configure scheduling properties

After you configure comparison rules, you can configure scheduling properties for the data comparison node. For more information, see Node scheduling configuration.

Step 3: Deploy and perform O&M operations on the data comparison node

Deploy the data comparison node

After you configure the task on the data comparison node, you must commit and deploy the node. After the node is committed and deployed, the system periodically runs the node based on the scheduling configurations.

  1. Click the save icon in the top toolbar to save the node.

  2. Click the commit icon in the top toolbar to commit the node.

    In the Submit dialog box, configure the Change description parameter. Then, based on your business requirements, determine whether to review the node code and whether to perform smoke testing after you commit the node.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the node.

    • You can use the code review feature to ensure the code quality of nodes and prevent execution errors caused by invalid node code. If you enable the code review feature, the node code that is committed can be deployed only after the node code passes the code review. For more information, see Code review.

    • To ensure that a task on the node you created can be run as expected, we recommend that you perform smoke testing before you deploy the node. For more information, see Perform smoke testing.

If the workspace that you use is in standard mode, you must click Deploy in the upper-right corner of the node configuration tab after you commit the node to deploy the task to the production environment. For more information, see Deploy nodes.

Perform O&M operations on the data comparison node

After the data comparison node is deployed, you can perform O&M operations on the node in Operation Center. For more information, see Operation Center.

View a data comparison report

You can use one of the following methods to view a data comparison report:

  • View in Operation Center:

    1. In the upper-left corner of the current page, click the menu icon and choose All Products > Data Development And Task Operation > Operation Center.

    2. In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Instances. On the Instance Perspective tab of the page that appears, find the instance that is generated for the data comparison node and choose More > View Runtime Log in the Actions column.

    3. On the Running Details tab of the page that appears, click the Data Comparison tab in the Execution step.

  • View on the Log tab:

    If the data comparison node is run only on the DataStudio page, you can click the data comparison report link on the Log tab in the Execution step to go to the data comparison report page and view details.