
DataWorks:Batch synchronization data quality troubleshooting

Last Updated: Mar 27, 2026

After a batch sync task completes, the destination may contain fewer records than expected, different content than the source, or data that shifts between runs. This topic explains how Data Integration processes data and walks through the most common causes of data quality issues — organized by where they occur: the Writer plugin, the Reader plugin, or the environment.

How it works

Data Integration uses parallel processing and a plugin-based architecture to move data efficiently.

Execution model

A sync task runs as a Job. Each Job is split into multiple Tasks — the smallest execution unit. Tasks run concurrently on one or more machines, each processing an independent data shard.
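
For a relational source, this split is typically expressed as independent range queries over a split key. The following SQL is a hypothetical sketch, assuming a numeric primary key id serves as the split key and the Job splits into three Tasks; the actual split logic depends on the Reader plugin and its configuration.

  -- Each Task issues its own independent range query against the source:
  SELECT * FROM src_table WHERE id >= 0     AND id < 10000;  -- Task 1
  SELECT * FROM src_table WHERE id >= 10000 AND id < 20000;  -- Task 2
  SELECT * FROM src_table WHERE id >= 20000;                 -- Task 3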

Data flow inside each Task

Source → Reader plugin → memory buffer → Writer plugin → Destination

  • Reader plugin: Connects to the source, reads data, and pushes it into the buffer.

  • Writer plugin: Consumes data from the buffer and writes it to the destination.

Reader and Writer plugins follow the native read/write protocols and data constraints of their respective sources — including data types and primary key limits. The final synchronization result depends on those rules, not just on the task configuration.

Troubleshoot writer-side issues

The Writer plugin connects to the destination and writes data using Java Database Connectivity (JDBC) or the destination's software development kit (SDK). The actual result depends on the configured write mode and the destination table's constraints.

Review the following causes if records are missing, duplicated, or contain unexpected content at the destination.

Cause: Improperly configured write mode
Description: The Writer uses the selected write mode to resolve conflicts between source data and destination table constraints. The wrong mode can cause inserts to fail (dirty data), be silently skipped, or overwrite records incorrectly.
Solution: Select the correct write mode for your use case. See Write modes for relational databases.

Cause: Dirty data threshold reached
Description: Dirty data occurs when a value is incompatible with the destination column, for example a string value such as abc written to an integer column, or a value that exceeds the column length limit. Each such row counts against the configured threshold. When the threshold is exceeded, the task fails and the remaining records are not written. To check how many rows were skipped, open the task run log and look for dirty data or parsing error entries.
Solution: Identify and fix the source of dirty data. If some dirty data is acceptable, increase the threshold in Task Configuration > Channel Configuration. For configuration details, see Codeless UI configuration. For what counts as dirty data, see Terms.

Cause: Querying data too early
Description: For some data sources, such as Hive and MaxCompute (depending on configuration), written data may be partially or fully invisible until the task completes.
Solution: Verify destination data only after confirming that the sync task instance has completed successfully.

Cause: Missing node dependencies
Description: If no dependency is configured between a downstream analysis task and the upstream sync task, the downstream task may start before synchronization finishes and read incomplete data.
Solution: In DataStudio, configure parent-child node dependencies between upstream and downstream tasks. Avoid weak dependencies such as max_pt.

Cause: Concurrent writes to the same table or partition
Description: MaxCompute or Hologres: two tasks write to the same partition and both are configured to truncate before writing, so the second task clears the data written by the first. Relational databases: pre-SQL or post-SQL statements from one task interfere with data written by another.
Solution: For recurring instances of the same node, configure a self-dependency so that each instance waits for the previous one to complete. Avoid designing tasks that write concurrently to the same destination.

Cause: Task not configured for idempotent execution
Description: A task that is not idempotent produces different results on each run. Rerunning it can cause duplicate inserts or incorrect overwrites.
Solution: Design the task to be idempotent, for example by using the replace into write mode. If idempotency is not achievable, exercise caution when you rerun the task. Configure success alerts to avoid unnecessary retries.

Cause: Incorrect partition expression
Description: In MaxCompute, a scheduling parameter such as $bizdate in the partition expression must be replaced at runtime. Common errors: the parameter remains literal (data lands in a partition named ds=$bizdate instead of ds=20230118), or a downstream query reads from the wrong partition.
Solution: Check the variable expressions in the sync task. Confirm the scheduling parameter configuration and verify that runtime parameters are replaced with the expected values. See the verification sketch after this table.

Cause: Data type or time zone mismatch
Description: Inconsistent data types or time zone settings between source and destination can truncate values, cause incorrect conversions, or produce discrepancies during data comparison.
Solution: Confirm the type and time zone differences between source and destination. Decide whether to keep the current settings or update the data type and time zone parameters at the destination.

Cause: Destination data changed
Description: Concurrent writes by other applications make the destination inconsistent with the source.
Solution: Ensure that no other processes write to the destination table during the synchronization window. If concurrent writing is expected, accept the resulting discrepancy.
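
To confirm whether a scheduling parameter was replaced, compare the expected partition with the literal one. The following MaxCompute SQL is a minimal sketch; the table name target_table, the partition column ds, and the resolved value 20230118 are hypothetical placeholders taken from the example above.

  -- Sketch: verify that $bizdate was resolved at runtime.
  -- Rows in the expected partition mean the replacement worked:
  SELECT COUNT(*) FROM target_table WHERE ds = '20230118';
  -- Rows in a literal partition mean the parameter was NOT replaced:
  SELECT COUNT(*) FROM target_table WHERE ds = '$bizdate';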

Write modes for relational databases

Select the write mode that matches your conflict-handling requirements.

General/MySQL protocol

Write mode: insert into
Behavior on conflict: Fails and generates dirty data
Behavior without conflict: Inserts normally
When to use: Full or incremental append when you do not want to overwrite existing records

Write mode: replace into
Behavior on conflict: Deletes the old row and inserts the new row
Behavior without conflict: Inserts normally
When to use: Overwrite old records completely with the latest data

Write mode: insert into ... on duplicate key update
Behavior on conflict: Updates the specified fields in the existing row
Behavior without conflict: Inserts normally
When to use: Update some fields while keeping others, such as a creation timestamp

Write mode: insert ignore into
Behavior on conflict: Ignores the new row without error
Behavior without conflict: Inserts normally
When to use: Insert only new data; take no action on existing records
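
These modes map directly to MySQL statements. The following is a minimal sketch against a hypothetical table sync_target with primary key id:

  -- Hypothetical table: CREATE TABLE sync_target (id BIGINT PRIMARY KEY, name VARCHAR(64));
  INSERT INTO sync_target (id, name) VALUES (1, 'a');         -- insert into: a duplicate id fails and becomes dirty data
  REPLACE INTO sync_target (id, name) VALUES (1, 'b');        -- replace into: deletes the old row, then inserts the new one
  INSERT INTO sync_target (id, name) VALUES (1, 'c')
    ON DUPLICATE KEY UPDATE name = VALUES(name);              -- on conflict, updates only the listed columns
  INSERT IGNORE INTO sync_target (id, name) VALUES (1, 'd');  -- insert ignore into: skips the conflicting row silently

Because replace into deletes the old row before inserting, destination columns that are not mapped from the source are reset to their defaults. Choose insert into ... on duplicate key update if those columns must be preserved.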

PostgreSQL

Write mode: insert on conflict do nothing
Behavior on conflict: Ignores the new row without error
Behavior without conflict: Inserts normally
When to use: Insert only new data; take no action on existing records

Write mode: insert on conflict do update
Behavior on conflict: Updates the specified fields in the conflicting row
Behavior without conflict: Inserts normally
When to use: Update some fields while keeping others, such as a creation timestamp

Write mode: copy on conflict do nothing
Behavior on conflict: Discards conflicting rows using the high-performance COPY protocol; no dirty data is generated
Behavior without conflict: Bulk inserts normally
When to use: Efficiently append large data batches while skipping duplicates

Write mode: copy on conflict do update
Behavior on conflict: Overwrites the conflicting row using the COPY protocol
Behavior without conflict: Bulk inserts normally
When to use: Efficiently sync large data batches, overwriting old records with new data

Write mode: merge into
Behavior: Unsupported
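
The insert variants map to standard PostgreSQL ON CONFLICT clauses; the copy variants use the COPY protocol internally and have no single-statement equivalent shown here. A minimal sketch against the same hypothetical sync_target table:

  -- Hypothetical table: CREATE TABLE sync_target (id BIGINT PRIMARY KEY, name VARCHAR(64));
  INSERT INTO sync_target (id, name) VALUES (1, 'a')
    ON CONFLICT (id) DO NOTHING;                              -- skips the conflicting row silently
  INSERT INTO sync_target (id, name) VALUES (1, 'b')
    ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;      -- updates only the listed columns on conflict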

Troubleshoot reader-side issues

The Reader plugin connects to the source and extracts data using JDBC or the source's SDK. The actual read result depends on the data extraction mode — including filter conditions, tables, partitions, and columns — as well as changes to the source data and task configuration.

Review the following causes if the amount or content of data read does not match expectations.

Cause: Concurrent changes to source data
Description: The sync task captures a data snapshot at the moment of reading, not the absolute latest state. Because a Job splits into multiple Tasks that issue independent queries, each Task may capture a snapshot from a slightly different point in time. Changes that occur after all Tasks have started are not captured.
Solution: Accept this as normal behavior for high-throughput synchronization. Running the task multiple times may produce different results because the source data changes in real time.

Cause: Incorrect query conditions
Description: MySQL: a WHERE clause that uses a scheduling parameter, such as gmt_modify >= ${bizdate}, must be replaced correctly at runtime. A common error is filtering one day's data when two days are needed. MaxCompute: a partition parameter such as pt=${bizdate} is easily misconfigured or fails to be replaced.
Solution: Check the scheduling variable expressions in the sync task. Confirm the scheduling parameter configuration and verify the actual replacement values in the task instance's runtime parameters. See the sketch after this table.

Cause: Reader-side dirty data
Description: Parsing fails when reading source data. This is rare for structured databases but common for semi-structured sources, for example a CSV or JSON file in Object Storage Service (OSS) or Hadoop Distributed File System (HDFS) that contains format errors.
Solution: Check the task run log for parsing errors or format exceptions, then fix the source files. Alternatively, adjust the dirty data tolerance configuration.
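
The following sketch shows the difference between a correctly replaced filter and a window that is too narrow, assuming a MySQL filter of gmt_modify >= ${bizdate} and a business date of 20230118; the table and column names are hypothetical.

  -- Query issued when the parameter is replaced correctly (one day onward):
  SELECT id, name, gmt_modify FROM src_table WHERE gmt_modify >= '20230118';
  -- If two days of data are needed, widen the condition explicitly:
  SELECT id, name, gmt_modify FROM src_table
    WHERE gmt_modify >= '20230117' AND gmt_modify < '20230119';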

Troubleshoot environment issues

Data quality issues can also stem from environment configuration rather than the sync task itself.

Cause: Querying the wrong data source, table, or partition
Solution: In a DataWorks workspace in standard mode, data sources are isolated between the development and production environments. An offline single-table sync task uses the development data source in the development environment and the production data source in the production environment. Confirm which environment you are querying. Also check whether production has a corresponding pre-release or testing environment, because those databases differ from the production database. For semi-structured data, confirm that the full set of source and destination files is included.

Cause: Dependent data not ready
Solution: If data is generated periodically, by a recurring sync task or a recurring full or incremental merge task, the upstream task may not have finished when the downstream task starts. Check that all dependent data generation tasks have completed successfully before verifying the result.

Note

For general troubleshooting when you encounter data quality issues, run the task multiple times and compare the synchronization results. You can also switch the source or destination data source for comparison testing. Running multiple comparison tests helps you narrow down the scope of the problem.