DataWorks: Troubleshooting Data Quality for offline synchronization

Last Updated: Nov 14, 2025

This topic explains how data synchronization works in Data Integration. Understanding this process can help you evaluate the results of a sync task, including the amount of data synchronized and the final record count at the destination. This topic also describes common Data Quality scenarios to help you identify and resolve related issues.

Synchronization principles

DataWorks Data Integration uses parallel processing and a plugin-based architecture to achieve efficient and stable data synchronization.

Parallel execution model (Job and Task)

To maximize data throughput, a sync task uses a two-level execution structure:

  1. Job: A running instance of a sync task.

  2. Task: The smallest execution unit of a Job. A Job is split into multiple Tasks. These Tasks can run concurrently on one or more machines.

Each Task processes an independent data shard. This parallel processing mechanism significantly improves the overall efficiency of data synchronization.
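
For example, when the source supports range-based splitting, a Job is typically divided by the value range of a split key (the splitPk setting in many relational database Readers). The following is a minimal sketch of this idea; the table, column names, and row counts are hypothetical, and the actual split strategy depends on the Reader plugin.

```sql
-- Hypothetical Job: read the table `orders` (about 10,000,000 rows, primary
-- key `id`) with a concurrency of 4. The Job is split into 4 Tasks, each of
-- which issues its own range query against the source:

SELECT id, buyer_id, amount, gmt_modified FROM orders WHERE id >= 1       AND id < 2500001;   -- Task 1
SELECT id, buyer_id, amount, gmt_modified FROM orders WHERE id >= 2500001 AND id < 5000001;   -- Task 2
SELECT id, buyer_id, amount, gmt_modified FROM orders WHERE id >= 5000001 AND id < 7500001;   -- Task 3
SELECT id, buyer_id, amount, gmt_modified FROM orders WHERE id >= 7500001 AND id <= 10000000; -- Task 4

-- Each Task runs as an independent query, and therefore reads its own
-- snapshot of the source, which matters for the reader-side consistency
-- scenarios described later in this topic.
```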

Plugin-based data stream (Reader and Writer)

Inside each Task, the data stream connects a Reader plugin and a Writer plugin through a memory buffer:

  • Reader plugin: Connects to the source data storage, reads data, and pushes it to the internal buffer.

  • Writer plugin: Consumes data from the buffer and writes it to the destination data storage.

Note

Reader and Writer plugins adhere to the native read/write protocols and data constraints of their respective data sources. These constraints can include data types and primary key limits. The final synchronization result and data consistency behavior depend on the implementation rules of the source and destination data sources.

Troubleshooting writer-side data consistency

The Writer plugin in Data Integration writes data from the source to the destination. Each destination data source type has a corresponding Writer plugin. The Writer plugin uses the configured write mode, which includes conflict resolution policies. It uses Java Database Connectivity (JDBC) or the data source's software development kit (SDK) to write data to the destination.

Note

The actual write result and data content at the destination depend on the write mode and the destination table's constraints.

If you encounter Data Quality issues, such as discrepancies in record counts or data content, after a data synchronization task completes, review the following common writer-side scenarios:

Cause: Improperly configured write mode

Description: The Writer plugin writes source data to the destination based on the selected write mode. If the source data conflicts with the destination table's constraints, such as a primary key or unique index, the write can fail (producing dirty data), be ignored, or replace the existing row, depending on the write mode.

Solution: Select the write mode that matches your business needs. For more information, see Appendix: Write modes for relational databases.

Cause: Dirty data threshold reached

Description: The amount of dirty data, which can be caused by issues such as data type mismatches or oversized content, exceeds the configured threshold. This causes the task to fail, and as a result, some data is not written to the destination.

Solution: Identify the cause of the dirty data and resolve the issue. Alternatively, you can increase the threshold if you can tolerate and ignore the dirty data.

Note: If your task cannot tolerate dirty data, you can modify the threshold in Task Configuration > Channel Configuration. For more information about how to configure the dirty data threshold, see Codeless UI configuration. For more information about what is considered dirty data, see Terms.
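
As a hedged illustration of how to locate likely dirty rows before rerunning the task, the following query assumes a hypothetical source table src_user and a destination column nick_name defined as VARCHAR(32) NOT NULL; adjust it to your actual schema.

```sql
-- Find source rows that would violate the assumed destination constraints:
SELECT id, nick_name
FROM   src_user
WHERE  nick_name IS NULL             -- would violate NOT NULL at the destination
   OR  CHAR_LENGTH(nick_name) > 32;  -- would exceed VARCHAR(32) and be rejected as dirty data
```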

Cause: Querying data too early

Description: You query the data before the sync task is complete. For some data sources, such as Hive and MaxCompute (configurable), data might be partially or completely invisible before the task finishes.

Solution: Always query and verify the destination table after you confirm that the sync task instance has successfully run.

Cause: Missing node dependencies

Description: A dependency is not configured between the downstream analysis task and the upstream sync task. This causes the downstream task to start before the data synchronization is complete, which results in the downstream task reading incomplete data.

Solution: In DataStudio, configure parent-child node dependencies for upstream and downstream tasks. Avoid using weak dependencies such as max_pt.
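
For reference, the following sketch contrasts the weak-dependency pattern with an explicit partition filter. The project and table names are hypothetical; the point is that a max_pt-style query succeeds even while the latest partition is still being written, whereas a node dependency guarantees that the upstream sync has finished.

```sql
-- Weak dependency (avoid): reads whatever the current largest partition
-- contains, whether or not the upstream sync task has finished writing it.
SELECT *
FROM   my_project.dwd_orders
WHERE  ds = max_pt('my_project.dwd_orders');

-- Preferred: declare the upstream sync node as a parent in DataStudio and
-- read the partition for the expected business date explicitly.
SELECT *
FROM   my_project.dwd_orders
WHERE  ds = '${bizdate}';
```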

Cause: Multiple sync tasks write to the same table or partition concurrently and interfere with each other

Description: Sync tasks run concurrently in a way that causes them to interfere with each other:

  • In MaxCompute or Hologres, two tasks write to the same data partition and are configured to truncate the partition before synchronization. The data written by the first task might be cleared by the second task.

  • For a relational database, if pre-SQL or post-SQL statements are configured, the data written by the first task might be affected by the pre-SQL or post-SQL statements of the second task.

Solution:

  • If the conflict is caused by multiple recurring instances of the same node, configure a self-dependency for the node. This ensures that the next recurring instance starts only after the previous one is complete.

  • Avoid designing tasks that write to the same destination concurrently.
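
The MaxCompute case can be pictured with the following hedged sketch; the table and partition values are hypothetical. When each task clears or overwrites the partition before writing, the second task effectively discards the first task's output:

```sql
-- Task A finishes first and fills the partition.
INSERT OVERWRITE TABLE ods_orders PARTITION (ds = '20230118')
SELECT * FROM src_orders_part1;

-- Task B then targets the same partition. Because it also overwrites (or
-- truncates before writing), the rows written by Task A are removed.
INSERT OVERWRITE TABLE ods_orders PARTITION (ds = '20230118')
SELECT * FROM src_orders_part2;

-- Result: only Task B's rows remain in partition ds='20230118'.
```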

Cause: Task is not configured for idempotent execution

Description: The task is not designed to be idempotent, meaning multiple runs produce different results. Rerunning the task can lead to duplicate inserts or incorrect overwrites.

Solution:

  1. Design the task to be idempotent. For example, you can use the REPLACE INTO mode.

  2. If you cannot make the task idempotent, exercise caution when you rerun it. Configure success alerts for the task to avoid unnecessary retries.
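
The following is a minimal sketch of idempotent write patterns, with hypothetical table and column names. Writes that are keyed on a primary key, or that overwrite a whole partition, leave the destination in the same state no matter how many times the task is rerun:

```sql
-- Relational destination: REPLACE INTO is keyed on the primary or unique key,
-- so rerunning the task does not create duplicate rows.
REPLACE INTO dim_user (user_id, nick_name, gmt_modified)
VALUES (1001, 'alice', '2023-01-18 10:00:00');

-- MaxCompute-style destination: overwriting the target partition replaces any
-- partial result left by a previous run.
INSERT OVERWRITE TABLE dim_user_daily PARTITION (ds = '20230118')
SELECT user_id, nick_name, gmt_modified FROM src_user;
```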

Cause: Incorrect partition expression

Description: For example, in MaxCompute, most data tables are partitioned tables, and the partition value is a DataWorks scheduling parameter such as $bizdate. Common errors include the following:

  • The scheduling parameter is not correctly replaced at runtime. Data is written to a literal partition named ds=$bizdate instead of the partition for the actual data timestamp, such as ds=20230118.

  • A downstream query reads the data with an incorrect partition expression and therefore uses data from the wrong partition.

Solution: Check the variable expressions in the data synchronization task. Confirm that the scheduling parameter configuration is correct, and check whether the runtime parameters of the task instance are correct.
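
A quick, hedged way to confirm where the data actually landed (the table name is hypothetical) is to list the destination partitions and compare the literal partition with the expected one:

```sql
-- A partition literally named ds=$bizdate indicates that the scheduling
-- parameter was not replaced at runtime.
SHOW PARTITIONS ods_orders;

-- Compare row counts in the literal and the expected partitions.
SELECT COUNT(*) FROM ods_orders WHERE ds = '$bizdate';
SELECT COUNT(*) FROM ods_orders WHERE ds = '20230118';
```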

Cause: Data type or time zone mismatch

Description: The data types or time zone settings of the source and destination are inconsistent. This can cause data to be truncated or incorrectly converted during the write process, or lead to discrepancies during data comparison.

Solution:

  • Confirm the differences in data types and time zones between the source and destination.

  • Decide whether to keep the current settings or modify the data type and time zone parameters of the destination.
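
For example, a timestamp stored in UTC at the source but interpreted as UTC+8 at the destination is off by eight hours when compared directly. The following is a hedged MySQL-style check; the table and column names, and the assumption that the source stores UTC, are hypothetical:

```sql
-- Convert the source value from UTC to UTC+8 before comparing it with the
-- destination value; rows returned here differ only because of the time zone.
SELECT s.id,
       s.gmt_modified                                  AS src_utc,
       CONVERT_TZ(s.gmt_modified, '+00:00', '+08:00')  AS src_in_utc8,
       d.gmt_modified                                  AS dest_value
FROM   src_orders s
JOIN   dest_orders d ON d.id = s.id
WHERE  CONVERT_TZ(s.gmt_modified, '+00:00', '+08:00') <> d.gmt_modified;
```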

Cause: Destination data has changed

Description: Other applications also write to the destination data source, so the destination's content diverges from the source data.

Solution: Ensure that no other processes write to the destination table during the synchronization window. If concurrent writing is the expected behavior, you must accept the resulting data discrepancy.

Appendix: Write modes for relational databases

General/MySQL protocol

Write mode: insert into

  • On data conflict: Fails and generates dirty data.

  • Without data conflict: Inserts new data normally.

  • Primary scenario: Full or incremental append. You do not want to overwrite or modify existing data.

Write mode: replace into

  • On data conflict: Replaces the old row. Deletes the old row, then inserts the new row.

  • Without data conflict: Inserts new data normally.

  • Primary scenario: Scenarios that require completely overwriting old records with the latest data.

Write mode: insert into ... on duplicate key update

  • On data conflict: Updates the old row. Keeps the old row and updates only specified fields with new data.

  • Without data conflict: Inserts new data normally.

  • Primary scenario: Scenarios that require updating some fields of a record while keeping others, such as the creation time.

Write mode: insert ignore into

  • On data conflict: Ignores the new row. Does not write or report an error.

  • Without data conflict: Inserts new data normally.

  • Primary scenario: You want to insert only new data and take no action on existing records.

PostgreSQL protocol

Write mode: insert on conflict do nothing

  • On data conflict: Ignores the new row. Does not write or report an error.

  • Without data conflict: Inserts new data normally.

  • Primary scenario: You want to insert only new data and take no action on existing records.

Write mode: insert on conflict do update

  • On data conflict: Updates the old row. Uses new data to update specified fields in the conflicting row.

  • Without data conflict: Inserts new data normally.

  • Primary scenario: Updating some fields of a record while keeping others, such as the creation time.

Write mode: copy on conflict do nothing

  • On data conflict: Discards conflicting rows. Uses the high-performance COPY protocol. Ignores new data on conflict and does not generate dirty data.

  • Without data conflict: Bulk inserts new data normally.

  • Primary scenario: Efficiently appending large batches of data, allowing you to skip existing duplicate records.

Write mode: copy on conflict do update

  • On data conflict: Updates conflicting rows. Uses the COPY protocol. Overwrites old data with new data on conflict.

  • Without data conflict: Bulk inserts new data normally.

  • Primary scenario: Efficiently synchronizing large batches of data. You need to completely overwrite old records with the latest data.

Write mode: merge into

  • Unsupported.
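
The following hedged sketch shows how these write modes behave when an incoming row collides with an existing primary key. The table definition is MySQL-style and the table, column, and data values are hypothetical; the PostgreSQL statements assume an analogous table.

```sql
CREATE TABLE dest_user (
  user_id    BIGINT PRIMARY KEY,
  nick_name  VARCHAR(64),
  gmt_create DATETIME
);
-- Assume an existing row: (1001, 'alice', '2023-01-01 00:00:00')

-- insert into: the duplicate key raises an error, and the row is counted as dirty data.
INSERT INTO dest_user VALUES (1001, 'alice_new', '2023-01-18 00:00:00');

-- replace into: deletes the old row and inserts the new one, replacing all columns.
REPLACE INTO dest_user VALUES (1001, 'alice_new', '2023-01-18 00:00:00');

-- insert ... on duplicate key update: keeps the old row and updates only the
-- listed columns, so gmt_create is preserved.
INSERT INTO dest_user VALUES (1001, 'alice_new', '2023-01-18 00:00:00')
ON DUPLICATE KEY UPDATE nick_name = VALUES(nick_name);

-- insert ignore into: the conflicting row is silently skipped.
INSERT IGNORE INTO dest_user VALUES (1001, 'alice_new', '2023-01-18 00:00:00');

-- PostgreSQL: skip the conflicting row.
INSERT INTO dest_user VALUES (1001, 'alice_new', '2023-01-18 00:00:00')
ON CONFLICT (user_id) DO NOTHING;

-- PostgreSQL: update only the specified fields of the conflicting row.
INSERT INTO dest_user VALUES (1001, 'alice_new', '2023-01-18 00:00:00')
ON CONFLICT (user_id) DO UPDATE SET nick_name = EXCLUDED.nick_name;
```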

Troubleshooting reader-side data consistency

The Reader plugin in Data Integration connects to the source data storage, extracts the data to be synchronized, and delivers it to the Writer plugin. Each source data source type has a corresponding Reader plugin. The Reader plugin uses the configured data extraction settings, which include filter conditions, tables, partitions, and columns, and extracts the data through JDBC or the data source's SDK.

Note

The actual read result depends on the data synchronization mechanism, changes to the source data, and the task configuration.

If you encounter Data Quality issues, such as discrepancies in record counts or data content, after a data synchronization task completes, review the following common reader-side scenarios:

Issue: Concurrent changes to source data

Description:

  • During data reading, external applications might still be modifying the source data. The sync task captures a data snapshot at the moment of reading, not the absolute latest data.

  • To achieve parallel reading, a sync Job is split into multiple Tasks, which run as independent database queries. Because of the database's transaction isolation, each Task reads its own snapshot, taken at a slightly different point in time, so the Job cannot capture changes that occur after its Tasks have started.

Solution: Accept this behavior as normal for high-throughput data synchronization. Running the task multiple times may produce different results due to real-time changes in the source data.

Issue: Incorrect query conditions

Description:

  • For example, in MySQL, you can configure a `WHERE` clause to filter extracted data. If the `WHERE` clause contains a scheduling parameter variable, such as gmt_modify >= ${bizdate}, a common error is that the scheduling parameter is not replaced correctly. For example, you might need data from the last two days but filter for only one day's data.

  • For example, in MaxCompute, when reading a partitioned table, you often configure a variable expression for the partition parameter, such as pt=${bizdate}. It is also easy to misconfigure the partition parameter or fail to replace it.

Solution: Check the scheduling variable expressions of the data synchronization task. Confirm that the scheduling parameter configuration is as expected and that the parameter is replaced with the expected value during scheduling.
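
As a hedged illustration with a hypothetical filter and table name, comparing the configured expression against the SQL that actually runs makes this kind of error easy to spot:

```sql
-- Configured filter (before parameter replacement): gmt_modify >= '${bizdate}'

-- Expected query after replacement for business date 2023-01-18:
SELECT * FROM src_orders WHERE gmt_modify >= '20230118';

-- If the parameter is not replaced, the literal string is compared instead
-- and typically matches nothing:
SELECT * FROM src_orders WHERE gmt_modify >= '${bizdate}';

-- If the last two days are needed, the filter must cover both days:
SELECT * FROM src_orders WHERE gmt_modify >= '20230117';
```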

Issue: Reader-side dirty data

Description: Parsing fails when reading source data. This is rare in structured databases. However, in semi-structured data sources such as CSV or JSON files in OSS or Hadoop Distributed File System (HDFS), format errors can prevent some data from being read.

Solution:

  • Check the task run log for parsing errors or format exceptions, and fix the source data files.

  • Adjust the dirty data toleration configuration.

Troubleshooting the environment context

Issue: Incorrect data source, table, or partition selected for query

Solution:

  • In a DataWorks workspace in standard mode, data sources are isolated between the development and production environments. An offline single-table sync task uses the development data source when run in the development environment and the production data source when run in the production environment. When you compare the amount and content of data, confirm which environment's data source you are querying to avoid inconsistencies between development and production queries.

  • In production, data stores often have corresponding pre-release or testing environments. The production database used by the sync task is different from the pre-release or testing database. Check for environmental differences when you compare data.

  • When you synchronize semi-structured data, multiple files are often involved. Confirm that the set of files being read and written is complete.

Issue: Dependent output is not ready

Solution: If the data is generated periodically, for example, by a recurring data synchronization task or a recurring full/incremental data merge task, check that the dependent data generation tasks have run and completed successfully.

Note

For general troubleshooting when you encounter Data Quality issues, you can run the task multiple times to observe and compare the synchronization results. You can also switch the source or destination data source for comparison testing. Running multiple comparison tests can help you narrow down the scope of the problem.
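
When comparing runs or data sources, a lightweight count-and-checksum query on both sides often narrows the problem down quickly. The following is a hedged MySQL-style sketch; the table, column, and date values are hypothetical:

```sql
-- Run the same aggregation against the source and the destination and compare
-- the two result rows: a mismatch in cnt points to missing or extra rows,
-- while a mismatch in chk with equal counts points to differing column values.
SELECT COUNT(*)                                          AS cnt,
       SUM(CRC32(CONCAT_WS('|', id, buyer_id, amount)))  AS chk
FROM   src_orders
WHERE  gmt_modified >= '2023-01-18' AND gmt_modified < '2023-01-19';

SELECT COUNT(*)                                          AS cnt,
       SUM(CRC32(CONCAT_WS('|', id, buyer_id, amount)))  AS chk
FROM   dest_orders
WHERE  gmt_modified >= '2023-01-18' AND gmt_modified < '2023-01-19';
```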