After a batch sync task completes, the destination may contain fewer records than expected, different content than the source, or data that shifts between runs. This topic explains how Data Integration processes data and walks through the most common causes of data quality issues — organized by where they occur: the Writer plugin, the Reader plugin, or the environment.
How it works
Data Integration uses parallel processing and a plugin-based architecture to move data efficiently.
Execution model
A sync task runs as a Job. Each Job is split into multiple Tasks — the smallest execution unit. Tasks run concurrently on one or more machines, each processing an independent data shard.
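As a sketch of the splitting step, the helper below divides a numeric primary-key range into independent shards, one per Task. The function name and the even-range strategy are illustrative assumptions, not Data Integration's actual splitting algorithm.

```python
# Illustrative sketch: split a numeric primary-key range into N shards.
# Each Task would then query its own half-open range independently.

def split_into_shards(min_pk: int, max_pk: int, num_tasks: int):
    """Return (lo, hi) half-open ranges that together cover [min_pk, max_pk]."""
    total = max_pk - min_pk + 1
    base, extra = divmod(total, num_tasks)
    shards, lo = [], min_pk
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        shards.append((lo, lo + size))  # Task i queries WHERE lo <= pk < hi
        lo += size
    return shards

print(split_into_shards(1, 10, 3))  # → [(1, 5), (5, 8), (8, 11)]
```

Because each shard is queried independently, Tasks can run concurrently on different machines without coordinating with each other.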
Data flow inside each Task
Source → Reader plugin → memory buffer → Writer plugin → Destination

Reader plugin: Connects to the source, reads data, and pushes it into the buffer.
Writer plugin: Consumes data from the buffer and writes it to the destination.
Reader and Writer plugins follow the native read/write protocols and data constraints of their respective sources — including data types and primary key limits. The final synchronization result depends on those rules, not just on the task configuration.
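The buffer handoff above can be modeled as a bounded producer/consumer queue. This is a minimal sketch, not Data Integration internals; the names, sentinel mechanism, and queue size are illustrative.

```python
# Sketch of the Reader → buffer → Writer flow inside one Task,
# modeled as a bounded producer/consumer queue.
import queue
import threading

SENTINEL = object()  # signals end of the data stream

def reader(source_rows, buffer):
    for row in source_rows:      # Reader: pull rows from the source
        buffer.put(row)          # push into the bounded memory buffer
    buffer.put(SENTINEL)

def writer(buffer, destination):
    while True:
        row = buffer.get()       # Writer: consume rows from the buffer
        if row is SENTINEL:
            break
        destination.append(row)  # write via the destination's protocol

buffer = queue.Queue(maxsize=4)  # bounded: applies backpressure to the Reader
destination = []
t = threading.Thread(target=writer, args=(buffer, destination))
t.start()
reader(range(10), buffer)
t.join()
print(destination)  # all 10 rows arrive, in order, within this one Task
```

The bounded queue is what lets a fast Reader and a slow Writer coexist: when the buffer is full, `put` blocks until the Writer drains it.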
Troubleshoot writer-side issues
The Writer plugin connects to the destination and writes data using Java Database Connectivity (JDBC) or the destination's software development kit (SDK). The actual result depends on the configured write mode and the destination table's constraints.
Review the following causes if records are missing, duplicated, or contain unexpected content at the destination.
| Cause | Description | Solution |
| --- | --- | --- |
| Improperly configured write mode | The Writer uses the selected write mode to resolve conflicts between source data and destination table constraints. The wrong mode can cause inserts to fail (dirty data), be silently skipped, or overwrite records incorrectly. | Select the correct write mode for your use case. See Write modes for relational databases. |
| Dirty data threshold reached | Dirty data occurs when a value is incompatible with the destination column — for example, a string value written to an integer column. Dirty records are discarded rather than written, and the task fails once their number exceeds the configured threshold. | Identify and fix the source of dirty data. If some dirty data is acceptable, increase the threshold in Task Configuration > Channel Configuration. For configuration details, see Codeless UI configuration. For what counts as dirty data, see Terms. |
| Querying data too early | For data sources such as Hive and MaxCompute (configurable), written data may be partially or fully invisible until the task completes. | Always verify destination data after confirming the sync task instance has completed successfully. |
| Missing node dependencies | If no dependency is configured between a downstream analysis task and the upstream sync task, the downstream task may start before synchronization finishes — reading incomplete data. | In DataStudio, configure parent-child node dependencies between upstream and downstream tasks. Avoid weak dependencies such as staggered start times that merely assume the upstream task will have finished. |
| Concurrent writes to the same table or partition | MaxCompute / Hologres: Two tasks write to the same partition, both configured to truncate before writing — the second task clears data written by the first. Relational databases: pre-SQL or post-SQL statements from one task interfere with data written by another. | For recurring instances of the same node, configure a self-dependency so each instance waits for the previous one to complete. Avoid designing tasks that write concurrently to the same destination. |
| Task not configured for idempotent execution | A task that is not idempotent produces different results on each run. Rerunning it can cause duplicate inserts or incorrect overwrites. | Design the task to be idempotent — for example, write with overwrite semantics to a fixed destination partition so that every rerun produces the same result. |
| Incorrect partition expression | In MaxCompute, a scheduling parameter in the partition expression (such as `$bizdate`) that resolves to an unexpected value writes data to the wrong partition. | Check the variable expressions in the sync task. Confirm the scheduling parameter configuration and verify that runtime parameters are replaced with the expected values. |
| Data type or time zone mismatch | Inconsistent data types or time zone settings between source and destination can truncate values, cause incorrect conversions, or produce discrepancies during data comparison. | Confirm the type and time zone differences between source and destination. Decide whether to keep the current settings or update the data type and time zone parameters at the destination. |
| Destination data changed | Concurrent writes by other applications make the destination inconsistent with the source. | Ensure no other processes write to the destination table during the synchronization window. If concurrent writing is expected, accept the resulting discrepancy. |
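To make the partition-expression cause concrete, the sketch below resolves a `${bizdate}`-style placeholder in a partition expression. The `pt=` format follows MaxCompute conventions; the helper name and replacement logic are illustrative assumptions, not the scheduler's actual implementation.

```python
# Illustrative sketch: resolving a scheduling parameter inside a
# partition expression. If the runtime value is wrong, data is written
# to (or read from) the wrong partition.
import re

def resolve_partition(expr: str, params: dict) -> str:
    """Replace ${name} placeholders with their runtime values."""
    return re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], expr)

print(resolve_partition("pt=${bizdate}", {"bizdate": "20240101"}))  # → pt=20240101
```

When verifying a task instance, compare the replaced values in its runtime parameters against the partition you expected the data to land in.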
Write modes for relational databases
Select the write mode that matches your conflict-handling requirements.
General/MySQL protocol
| Write mode | Behavior on conflict | Behavior without conflict | When to use |
| --- | --- | --- | --- |
| `insert into` | Fails and generates dirty data | Inserts normally | Full or incremental append — you do not want to overwrite existing records |
| `replace into` | Deletes the old row and inserts the new row | Inserts normally | Overwrite old records completely with the latest data |
| `insert into ... on duplicate key update` | Updates specified fields in the existing row | Inserts normally | Update some fields while keeping others, such as a creation timestamp |
| `insert ignore into` | Ignores the new row without error | Inserts normally | Insert only new data, take no action on existing records |
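The conflict behaviors above can be demonstrated with SQLite's closest equivalents (`INSERT OR REPLACE` for replace-style writes, `INSERT OR IGNORE` for ignore-style writes). This is a sketch of the semantics, not MySQL itself; the table name and values are illustrative.

```python
# Demonstrate write-mode conflict handling with SQLite equivalents.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'old')")

# Plain insert: a primary-key conflict raises an error.
# In a sync task, the conflicting record would count as dirty data.
try:
    conn.execute("INSERT INTO t VALUES (1, 'new')")
except sqlite3.IntegrityError:
    pass

# Replace-style: the old row is deleted and the new row inserted.
conn.execute("INSERT OR REPLACE INTO t VALUES (1, 'replaced')")
print(conn.execute("SELECT val FROM t WHERE id = 1").fetchone())  # ('replaced',)

# Ignore-style: the new row is silently skipped, the old row survives.
conn.execute("INSERT OR IGNORE INTO t VALUES (1, 'ignored')")
print(conn.execute("SELECT val FROM t WHERE id = 1").fetchone())  # ('replaced',)
```

Note the practical difference: replace-style writes reset every column of the row, while ignore-style writes leave the destination untouched on conflict.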
PostgreSQL
| Write mode | Behavior on conflict | Behavior without conflict | When to use |
| --- | --- | --- | --- |
| `insert into ... on conflict do nothing` | Ignores the new row without error | Inserts normally | Insert only new data, take no action on existing records |
| `insert into ... on conflict do update` | Updates specified fields in the conflicting row | Inserts normally | Update some fields while keeping others, such as a creation timestamp |
| `copy on conflict do nothing` | Discards conflicting rows using the high-performance `COPY` mechanism | Bulk inserts normally | Efficiently append large data batches while skipping duplicates |
| `copy on conflict do update` | Overwrites the conflicting row using the `COPY` mechanism | Bulk inserts normally | Efficiently sync large data batches, overwriting old records with new data |
| `replace into` | Unsupported | — | — |
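The `on conflict do nothing` / `do update` behaviors can be demonstrated with SQLite, which adopted PostgreSQL's upsert syntax in version 3.24. This is a sketch of the semantics only; the table and the `created` column are illustrative.

```python
# Demonstrate ON CONFLICT DO NOTHING vs DO UPDATE (PostgreSQL-style
# upsert syntax, as supported by SQLite 3.24+).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT, created TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'old', '2024-01-01')")

# DO NOTHING: the conflicting row is skipped without error.
conn.execute(
    "INSERT INTO t VALUES (1, 'new', '2024-06-01') ON CONFLICT(id) DO NOTHING"
)

# DO UPDATE: only the listed fields change; 'created' is preserved.
conn.execute(
    "INSERT INTO t VALUES (1, 'new', '2024-06-01') "
    "ON CONFLICT(id) DO UPDATE SET val = excluded.val"
)
print(conn.execute("SELECT val, created FROM t WHERE id = 1").fetchone())
# → ('new', '2024-01-01')
```

This is why `do update` suits the "keep the creation timestamp" use case in the table: the update clause names exactly the columns to refresh.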
Troubleshoot reader-side issues
The Reader plugin connects to the source and extracts data using JDBC or the source's SDK. The actual read result depends on the data extraction mode — including filter conditions, tables, partitions, and columns — as well as changes to the source data and task configuration.
Review the following causes if the amount or content of data read does not match expectations.
| Cause | Description | Solution |
| --- | --- | --- |
| Concurrent changes to source data | The sync task captures a data snapshot at the moment of reading, not the absolute latest state. Because a Job splits into multiple Tasks that issue independent queries, each Task may capture a snapshot from a slightly different point in time. Changes that occur after all Tasks have started are not captured. | Accept this as normal behavior for high-throughput synchronization. Running the task multiple times may produce different results due to real-time changes in the source data. |
| Incorrect query conditions | MySQL: A `where` filter condition built from a scheduling parameter resolves to an unexpected value, so the task reads a different data range than intended. | Check the scheduling variable expressions in the sync task. Confirm the scheduling parameter configuration and verify the actual replacement values in the task instance's runtime parameters. |
| Reader-side dirty data | Parsing fails when reading source data. This is rare for structured databases but common for semi-structured sources — for example, a CSV or JSON file in Object Storage Service (OSS) or Hadoop Distributed File System (HDFS) that contains format errors. | Check the task run log for parsing errors or format exceptions, then fix the source files. Alternatively, adjust the dirty data toleration configuration. |
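The snapshot behavior in the first row can be reproduced in miniature: two shard queries run at different moments, and a write that lands between them is visible to one shard but not the other. The table, key ranges, and timing below are illustrative assumptions.

```python
# Sketch: sharded reads against a source that changes mid-Job.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (pk INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO src VALUES (?)", [(i,) for i in (1, 2, 3, 6, 7, 8)])

# Task 1 reads its shard (pk 1-5).
shard1 = [r[0] for r in conn.execute("SELECT pk FROM src WHERE pk BETWEEN 1 AND 5")]

# A concurrent application writes while the Job is still running.
conn.executemany("INSERT INTO src VALUES (?)", [(4,), (9,)])

# Task 2 reads its shard (pk 6-10) afterwards.
shard2 = [r[0] for r in conn.execute("SELECT pk FROM src WHERE pk BETWEEN 6 AND 10")]

print(shard1)  # [1, 2, 3]    -- pk=4 is missed: it arrived after this read
print(shard2)  # [6, 7, 8, 9] -- pk=9 is captured: it arrived before this read
# The combined result matches neither the source's "before" nor "after" state.
```

This is why rerunning a task against a live source can legitimately produce different counts each time, and why comparisons should be made against a quiesced or point-in-time source when exactness matters.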
Troubleshoot environment issues
Data quality issues can also stem from environment configuration rather than the sync task itself.
| Cause | Solution |
| --- | --- |
| Querying the wrong data source, table, or partition | In a DataWorks workspace in standard mode, data sources are isolated between development and production environments. An offline single-table sync task uses the development data source in the development environment and the production data source in the production environment. Confirm which environment you are querying. Also check whether production has a corresponding pre-release or testing environment, since those databases differ from the production database. For semi-structured data, confirm that the full set of source and destination files is included. |
| Dependent data not ready | If data is generated periodically — by a recurring sync task or a recurring full/incremental merge task — the upstream task may not have finished when the downstream task starts. Check that all dependent data generation tasks have completed successfully before verifying the result. |
For general troubleshooting of data quality issues, run the task multiple times and compare the synchronization results. You can also switch the source or destination data source for comparison testing. Repeated comparison tests help you narrow down the scope of the problem.