This topic explains how data synchronization works in Data Integration. Understanding this process can help you evaluate the results of a sync task, including the amount of data synchronized and the final record count at the destination. This topic also describes common Data Quality scenarios to help you identify and resolve related issues.
Synchronization principles
DataWorks Data Integration uses parallel processing and a plugin-based architecture to achieve efficient and stable data synchronization.
Parallel execution model (Job and Task)
To maximize data throughput, a sync task uses a two-level execution structure:
Job: A running instance of a sync task.
Task: The smallest execution unit of a Job. A Job is split into multiple Tasks. These Tasks can run concurrently on one or more machines.
Each Task processes an independent data shard. This parallel processing mechanism significantly improves the overall efficiency of data synchronization.
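As a rough illustration of this Job/Task split, the following Java sketch runs several data shards concurrently in a thread pool. It is only a simplified picture of the parallel model described above, not DataWorks internals; the shard expressions and pool size are hypothetical.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSyncSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical shards: each element stands for the data range one Task would process.
        List<String> shards = List.of("id BETWEEN 1 AND 1000",
                                      "id BETWEEN 1001 AND 2000",
                                      "id BETWEEN 2001 AND 3000");

        // The Job splits its work into Tasks and runs them concurrently.
        ExecutorService taskPool = Executors.newFixedThreadPool(3);
        for (String shard : shards) {
            taskPool.submit(() -> {
                // Each Task reads and writes only its own shard.
                System.out.println("Syncing shard: " + shard);
            });
        }
        taskPool.shutdown();
        taskPool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```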
Plugin-based data stream (Reader and Writer)
Inside each Task, the data stream connects a Reader plugin and a Writer plugin through a memory buffer:
Reader plugin: Connects to the source data storage, reads data, and pushes it to the internal buffer.
Writer plugin: Consumes data from the buffer and writes it to the destination data storage.
Reader and Writer plugins adhere to the native read/write protocols and data constraints of their respective data sources. These constraints can include data types and primary key limits. The final synchronization result and data consistency behavior depend on the implementation rules of the source and destination data sources.
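The Reader-buffer-Writer pipeline can be pictured as a producer and a consumer sharing a bounded in-memory queue, as in the following sketch. The threads stand in for a Reader plugin and a Writer plugin; the record format and the end-of-data sentinel are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderWriterBufferSketch {
    // Sentinel record that tells the writer the reader has finished.
    private static final String END_OF_DATA = "__END__";

    public static void main(String[] args) throws InterruptedException {
        // In-memory buffer between the Reader and the Writer, as described above.
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {
            try {
                // A real Reader plugin would pull rows from the source storage here.
                for (int i = 0; i < 5; i++) {
                    buffer.put("record-" + i);
                }
                buffer.put(END_OF_DATA);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try {
                // A real Writer plugin would push rows to the destination storage here.
                String record;
                while (!(record = buffer.take()).equals(END_OF_DATA)) {
                    System.out.println("Writing " + record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```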
Troubleshooting writer-side data consistency
The Writer plugin in Data Integration writes data from the source to the destination. Each destination data source type has a corresponding Writer plugin. The Writer plugin uses the configured write mode, which includes conflict resolution policies. It uses Java Database Connectivity (JDBC) or the data source's software development kit (SDK) to write data to the destination.
The actual write result and data content at the destination depend on the write mode and the destination table's constraints.
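For a JDBC-based destination, the write path can be pictured as a batched prepared statement whose SQL template reflects the configured write mode. The following sketch is a simplified illustration, not the actual Writer plugin code; the connection settings, table, and columns are hypothetical, and a MySQL JDBC driver is assumed to be on the classpath.

```java
import java.sql.BatchUpdateException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;

public class JdbcWriterSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection settings; a real Writer builds these from the data source configuration.
        String url = "jdbc:mysql://localhost:3306/demo";
        // The SQL template corresponds to the configured write mode (plain INSERT here).
        String sql = "INSERT INTO target_table (id, name) VALUES (?, ?)";

        List<Map.Entry<Integer, String>> rows = List.of(
                Map.entry(1, "alice"),
                Map.entry(2, "bob"));

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Map.Entry<Integer, String> row : rows) {
                ps.setInt(1, row.getKey());
                ps.setString(2, row.getValue());
                ps.addBatch();
            }
            ps.executeBatch();
        } catch (BatchUpdateException e) {
            // Rows that violate destination constraints (for example, duplicate primary keys)
            // fail here; Data Integration counts such rows as dirty data.
            System.err.println("Some rows were rejected: " + e.getMessage());
        }
    }
}
```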
If you encounter Data Quality issues, such as discrepancies in record counts or data content, after a data synchronization task completes, review the following common writer-side scenarios:
| Cause | Description | Solution |
| --- | --- | --- |
| Improperly configured write mode | The Writer plugin writes source data to the destination based on the selected write mode. If the source data conflicts with the destination table's constraints, the write can result in insert failures (dirty data), ignored inserts, or replaced rows. | Select the write mode that matches your business needs. For more information, see Appendix: Write modes for relational databases. |
| Dirty data threshold reached | The amount of dirty data, which can be caused by issues such as data type mismatches or oversized content, exceeds the configured threshold. The task fails, and some data is not written to the destination. | Identify the cause of the dirty data and resolve the issue, or increase the threshold if you can tolerate and ignore the dirty data. Note: If your task cannot tolerate dirty data, modify the threshold accordingly. For more information about how to configure the dirty data threshold, see Codeless UI configuration. For more information about what is considered dirty data, see Terms. |
| Querying data too early | You query the data before the sync task is complete. For some data sources, such as Hive and MaxCompute (configurable), data might be partially or completely invisible before the task finishes. | Query and verify the destination table only after you confirm that the sync task instance has run successfully. |
| Missing node dependencies | No dependency is configured between the downstream analysis task and the upstream sync task. The downstream task starts before the data synchronization is complete and reads incomplete data. | In DataStudio, configure parent-child node dependencies between upstream and downstream tasks instead of relying on weak dependencies. |
| Concurrent writes by multiple sync tasks | Multiple sync tasks write to the same destination table or partition at the same time and interfere with one another. | Check whether multiple sync tasks write to the same table or partition, and adjust their configurations or schedules so that their writes do not overlap. |
| Task is not configured for idempotent execution | The task is not designed to be idempotent, meaning multiple runs produce different results. Rerunning the task can lead to duplicate inserts or incorrect overwrites. | Design the task to be idempotent so that rerunning it for the same data range always produces the same result at the destination, as shown in the sketch after this table. |
| Incorrect partition expression | In MaxCompute, for example, most data tables are partitioned. The partition value is a DataWorks scheduling parameter such as $bizdate, and an incorrect expression or parameter value causes data to be written to an unexpected partition. | Check the variable expressions in the data synchronization task. Confirm that the scheduling parameter configuration is correct and that the runtime parameters of the task instance are replaced with the expected values. |
| Data type or time zone mismatch | The data types or time zone settings of the source and destination are inconsistent. Data can be truncated or incorrectly converted during the write, or appear inconsistent during data comparison. | Align the data type mappings and time zone settings of the source and destination, and verify the converted values at the destination after synchronization. |
| Destination data has changed | Other applications write to the destination data source concurrently, so its content becomes inconsistent with the source data. | Ensure that no other processes write to the destination table during the synchronization window. If concurrent writing is the expected behavior, you must accept the resulting data discrepancy. |
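As a minimal illustration of the idempotency point in the table above, the following sketch rewrites the destination rows for one business date inside a transaction, so that a rerun for the same date yields the same result. The JDBC URL, table, columns, and partition column ds are hypothetical; this is one possible pattern, not the only way to make a task idempotent.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentRerunSketch {
    /**
     * Clears the destination rows for one business date before rewriting them,
     * so that rerunning the sync for the same date always yields the same result.
     */
    public static void resyncDay(String jdbcUrl, String user, String password,
                                 String bizdate) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
            conn.setAutoCommit(false);
            try (PreparedStatement delete = conn.prepareStatement(
                         "DELETE FROM target_table WHERE ds = ?");
                 PreparedStatement insert = conn.prepareStatement(
                         "INSERT INTO target_table (id, name, ds) VALUES (?, ?, ?)")) {
                delete.setString(1, bizdate);
                delete.executeUpdate();

                // A real rerun would re-insert all rows extracted for this business date.
                insert.setInt(1, 1);
                insert.setString(2, "alice");
                insert.setString(3, bizdate);
                insert.executeUpdate();

                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```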
Appendix: Write modes for relational databases
| Protocol type | Write mode | Behavior (on data conflict) | Behavior (no data conflict) | Primary scenario |
| --- | --- | --- | --- | --- |
| General/MySQL protocol | insert into | Fails and generates dirty data. | Inserts new data normally. | Full or incremental append when you do not want to overwrite or modify existing data. |
| General/MySQL protocol | replace into | Replaces the old row: deletes the old row, then inserts the new row. | Inserts new data normally. | Scenarios that require completely overwriting old records with the latest data. |
| General/MySQL protocol | insert into ... on duplicate key update | Updates the old row: keeps the old row and updates only the specified fields with the new data. | Inserts new data normally. | Scenarios that require updating some fields of a record while keeping others, such as the creation time. |
| General/MySQL protocol | insert ignore into | Ignores the new row: does not write it or report an error. | Inserts new data normally. | You want to insert only new data and take no action on existing records. |
| PostgreSQL | insert into ... on conflict do nothing | Ignores the new row: does not write it or report an error. | Inserts new data normally. | You want to insert only new data and take no action on existing records. |
| PostgreSQL | insert into ... on conflict do update | Updates the old row: uses the new data to update the specified fields of the conflicting row. | Inserts new data normally. | Updating some fields of a record while keeping others, such as the creation time. |
| PostgreSQL | copy on conflict do nothing | Discards conflicting rows and uses the high-performance COPY statement to load the remaining data. | Bulk inserts new data normally. | Efficiently appending large batches of data while skipping existing duplicate records. |
| PostgreSQL | copy on conflict do update | Updates conflicting rows and uses the COPY statement to load the data. | Bulk inserts new data normally. | Efficiently synchronizing large batches of data when you need to completely overwrite old records with the latest data. |
| - | - | Unsupported. | - | - |
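The following sketch lists SQL templates that correspond to the insert-based write modes in the table above, assuming a hypothetical table t with primary key id and a single name column. The exact statements that Data Integration generates depend on the plugin and data source version; these templates only illustrate the conflict behavior of each mode.

```java
public class WriteModeSqlTemplates {
    // Plain INSERT: a primary key or unique key conflict fails the row (dirty data).
    static final String INSERT =
            "INSERT INTO t (id, name) VALUES (?, ?)";

    // REPLACE INTO (MySQL): deletes the conflicting old row, then inserts the new row.
    static final String REPLACE =
            "REPLACE INTO t (id, name) VALUES (?, ?)";

    // ON DUPLICATE KEY UPDATE (MySQL): keeps the old row and updates only the listed columns.
    static final String UPSERT_MYSQL =
            "INSERT INTO t (id, name) VALUES (?, ?) ON DUPLICATE KEY UPDATE name = VALUES(name)";

    // INSERT IGNORE (MySQL): silently skips conflicting rows.
    static final String INSERT_IGNORE =
            "INSERT IGNORE INTO t (id, name) VALUES (?, ?)";

    // ON CONFLICT DO NOTHING (PostgreSQL): silently skips conflicting rows.
    static final String INSERT_IGNORE_PG =
            "INSERT INTO t (id, name) VALUES (?, ?) ON CONFLICT (id) DO NOTHING";

    // ON CONFLICT DO UPDATE (PostgreSQL): updates the listed columns of the conflicting row.
    static final String UPSERT_PG =
            "INSERT INTO t (id, name) VALUES (?, ?) ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name";
}
```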
Troubleshooting reader-side data consistency
The Reader plugin in Data Integration connects to a source data storage. It extracts the data to be synchronized and delivers it to the Writer plugin. Each source data source type has a corresponding Reader plugin. The Reader plugin uses the configured data extraction mode, which includes filter conditions, tables, partitions, and columns. It uses JDBC or the data source's SDK to extract the data.
The actual read result depends on the data synchronization mechanism, changes to the source data, and the task configuration.
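For a JDBC-based source, extraction can be pictured as a parameterized query whose WHERE clause plays the role of the configured filter condition. The following sketch is a simplified illustration, not the actual Reader plugin code; the connection settings, table, and partition column ds are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcReaderSketch {
    /**
     * Reads the rows that match the configured filter condition, similar to what a
     * JDBC-based Reader plugin does before handing records to the Writer.
     */
    public static void readByDate(String jdbcUrl, String user, String password,
                                  String bizdate) throws SQLException {
        // The WHERE clause stands in for the configured data filter condition.
        String sql = "SELECT id, name FROM source_table WHERE ds = ?";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, bizdate);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // A real Reader pushes each record into the buffer shared with the Writer.
                    System.out.println(rs.getInt("id") + "," + rs.getString("name"));
                }
            }
        }
    }
}
```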
If you encounter Data Quality issues, such as discrepancies in record counts or data content, after a data synchronization task completes, review the following common reader-side scenarios:
| Issue | Description | Solution |
| --- | --- | --- |
| Concurrent changes to source data | The source data changes while the sync task is running, for example because other applications insert, update, or delete records during extraction. | Accept this behavior as normal for high-throughput data synchronization. Running the task multiple times may produce different results due to real-time changes in the source data. |
| Incorrect query conditions | The data filter condition, which is typically built from scheduling parameters, does not select the intended data range. | Check the scheduling variable expressions of the data synchronization task. Confirm that the scheduling parameter configuration is as expected and that each parameter is replaced with the expected value during scheduling. |
| Reader-side dirty data | Parsing fails when reading source data. This is rare in structured databases. However, in semi-structured data sources such as CSV or JSON files in OSS or Hadoop Distributed File System (HDFS), format errors can prevent some data from being read. | Locate the records that cannot be parsed, fix the format issues at the source, or adjust the dirty data threshold if the affected records can be ignored. A parsing sketch follows this table. |
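To illustrate reader-side dirty data in a semi-structured source, the following sketch parses a local CSV file and counts lines that do not match an expected two-column layout. The file name and schema are hypothetical; a real Reader applies the same idea to objects in OSS or files in HDFS.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CsvDirtyDataSketch {
    public static void main(String[] args) throws IOException {
        long dirty = 0;
        long ok = 0;
        // Hypothetical local file standing in for a CSV object in OSS or a file in HDFS.
        try (BufferedReader reader = Files.newBufferedReader(Path.of("source.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                // Expect exactly two fields: id,name. Anything else is treated as dirty data.
                if (fields.length != 2) {
                    dirty++;
                    continue;
                }
                try {
                    Integer.parseInt(fields[0].trim());
                    ok++;
                } catch (NumberFormatException e) {
                    // A value that cannot be parsed into the expected type is also dirty data.
                    dirty++;
                }
            }
        }
        System.out.println("readable records: " + ok + ", dirty records: " + dirty);
    }
}
```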
Troubleshooting the environment context
| Issue | Solution |
| --- | --- |
| Incorrect data source, table, or partition selected for query | Confirm that you are querying the same data source, table, and partition that the sync task actually wrote to. |
| Dependent output is not ready | If the data is generated periodically, for example, by a recurring data synchronization task or a recurring full/incremental data merge task, check that the dependent data generation tasks have run and completed successfully. |
For general troubleshooting of Data Quality issues, run the task multiple times and compare the synchronization results. You can also switch the source or destination data source for comparison testing. Running multiple comparison tests helps you narrow down the scope of the problem.
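One simple way to compare results across runs is to count rows on both sides after the task instance completes, as in the following sketch. The JDBC URLs, credentials, and table names are hypothetical; for partitioned destinations you would count only the synchronized partition.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RecordCountCheckSketch {
    /** Returns the row count of one table, used to compare source and destination after a run. */
    static long countRows(String jdbcUrl, String user, String password,
                          String table) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement();
             // The table name is assumed to come from trusted configuration, not user input.
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection settings for the source and destination databases.
        long sourceCount = countRows("jdbc:mysql://source-host:3306/demo", "user", "password", "source_table");
        long destCount   = countRows("jdbc:mysql://dest-host:3306/demo", "user", "password", "target_table");
        System.out.println("source=" + sourceCount + ", destination=" + destCount
                + (sourceCount == destCount ? " (match)" : " (mismatch, investigate further)"));
    }
}
```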