DataWorks Data Integration supports data synchronization in complex network environments. Two synchronization modes are available: batch synchronization for periodic offline data transfers, and real-time synchronization for continuous incremental replication. Configure both on the DataStudio page.
Choose a synchronization mode
The two modes differ in transfer cadence and data volume per run.
| | Batch synchronization | Real-time synchronization |
|---|---|---|
| Transfer cadence | Scheduled (periodic) | Continuous |
| Data transferred | Full or incremental snapshots | Incremental changes only |
| Typical use case | Periodic reporting, data warehousing | Low-latency pipelines |
| Source topology | Single table to single table; tables in sharded databases to single table | Star-shaped multi-source link |
| Configuration | Codeless UI or code editor | Input/output configuration |
Use batch synchronization when:
Downstream workloads tolerate a delay (for example, daily or hourly refreshes)
You need to backfill historical data into specific partitions
Your source is one of the 40+ supported data source types, including relational databases, unstructured storage systems, big data storage systems, and message queues
Use real-time synchronization when:
Data must arrive at the destination within seconds of a source change
You want to continuously replicate an entire database to a destination
Batch synchronization is not ideal when:
You need sub-minute data freshness
Your source is not one of the 40+ supported data source types
For additional synchronization solutions — including combined full and incremental sync and whole-database batch sync — see Supported data source types and data synchronization solutions.
Prerequisites
Before you begin, ensure that you have:
The Development role in your DataWorks workspace
To add a RAM (Resource Access Management) user and assign roles, see Add a RAM user to a workspace as a member and assign roles to the member.
Batch synchronization
How it works
Batch synchronization reads data from a source using a Reader plug-in and writes it to a destination using a Writer plug-in. Before creating a batch synchronization node, add the data sources to DataWorks so they are available during node configuration.
Each run transfers either full data or incremental data to a specific partition in the destination table. Use the built-in scheduling parameter $bizdate — assigned to the built-in variable ${bizdate} by default — to target the correct partition for each scheduled run. You can also use the data backfill feature in Operation Center to synchronize historical data to specific tables or specific partitions based on the configurations of the batch synchronization node.
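As a rough illustration of how a scheduled run maps to a destination partition, the sketch below assumes that $bizdate resolves to the day before the scheduled run time, formatted as yyyymmdd. The partition column name `pt` and the helper names are hypothetical and do not come from DataWorks itself; treat this as a mental model rather than the scheduler's actual implementation.

```python
from datetime import datetime, timedelta


def bizdate_for(scheduled_time: datetime) -> str:
    """Return the business date as yyyymmdd.

    Assumption: the business date is the calendar day before the scheduled run.
    """
    return (scheduled_time - timedelta(days=1)).strftime("%Y%m%d")


def partition_spec(scheduled_time: datetime, partition_column: str = "pt") -> str:
    """Build a partition spec such as 'pt=20240114' for a scheduled run.

    'pt' is a hypothetical partition column name used only for illustration.
    """
    return f"{partition_column}={bizdate_for(scheduled_time)}"


if __name__ == "__main__":
    # A run scheduled shortly after midnight on 2024-01-15 targets the
    # partition for 2024-01-14 under the assumption stated above.
    print(partition_spec(datetime(2024, 1, 15, 0, 30)))
```

Under the same assumption, a data backfill over a date range would amount to evaluating this mapping once per day in the range.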
Configure a batch synchronization node
Choose the configuration method based on your data source and requirements:
| Scenario | Method | Reference |
|---|---|---|
| Data source is added to DataWorks and supports the codeless UI | Codeless UI | Configure a batch synchronization node by using the codeless UI (2.0) |
| Data source cannot be added to DataWorks | Code editor | Configure a batch synchronization node by using the code editor (2.0) |
| Data source does not support the codeless UI | Code editor | Configure a batch synchronization node by using the code editor (2.0) |
| Reader or Writer plug-in parameters can only be set in script mode | Code editor | Configure a batch synchronization node by using the code editor (2.0) |
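The decision table above can be read as a simple rule: use the codeless UI only when none of the code-editor conditions applies. The following sketch encodes that rule. The function and parameter names are hypothetical and exist only to make the logic explicit; they are not part of any DataWorks API.

```python
def choose_config_method(
    added_to_dataworks: bool,
    supports_codeless_ui: bool,
    needs_script_only_params: bool = False,
) -> str:
    """Return 'Codeless UI' or 'Code editor' following the scenario table."""
    if not added_to_dataworks:
        # The data source cannot be added to DataWorks.
        return "Code editor"
    if not supports_codeless_ui:
        # The data source does not support the codeless UI.
        return "Code editor"
    if needs_script_only_params:
        # Some Reader or Writer parameters can only be set in script mode.
        return "Code editor"
    return "Codeless UI"


if __name__ == "__main__":
    print(choose_config_method(True, True))
    print(choose_config_method(True, True, needs_script_only_params=True))
```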
For the full list of supported data sources, Reader plug-ins, and Writer plug-ins, see Supported data source types, Reader plug-ins, and Writer plug-ins and Overview of the batch synchronization feature.
Real-time synchronization
Real-time synchronization uses a star-shaped synchronization link that combines multiple data source types. Configure the input and output of a real-time synchronization node to sync from a single table to another single table, or to replicate all data from an entire database to a destination.
For supported data source types and setup details, see Data source types that support real-time synchronization and Overview of the real-time synchronization feature.
Configure scheduling dependencies
Scheduling dependencies control when a node runs relative to other nodes in the workspace.
Batch synchronization node
Ancestor node: Set the root node of the workspace or a zero load node as the ancestor. This triggers the batch synchronization node within the workspace scheduling cycle.
Descendant node: To let DataWorks automatically parse the dependency between a batch synchronization node and a downstream SQL node, configure the output of the batch synchronization node in the Project name.Table name format.
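To make the Project name.Table name output format concrete, here is a minimal sketch that builds such an output name from its two parts. The helper name and the sample project and table names are hypothetical; the only rule taken from the text is that the two parts are joined with a period.

```python
def node_output_name(project_name: str, table_name: str) -> str:
    """Join a project name and a table name as 'project.table'."""
    project = project_name.strip()
    table = table_name.strip()
    if not project or not table:
        raise ValueError("Both project name and table name are required")
    return f"{project}.{table}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(node_output_name("my_project", "ods_orders"))
```

A downstream SQL node that reads from the same table would then be matched to this output when DataWorks parses dependencies automatically.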
Real-time synchronization node
Real-time synchronization nodes run continuously and do not generate outputs the same way auto-triggered nodes do. Table-lineage-based scheduling dependencies are not supported for downstream nodes. Instead, set the root node of the workspace or a zero load node as the ancestor of the downstream node directly.
Use scheduling parameters in batch synchronization
DataWorks provides the built-in variable ${bizdate} for batch synchronization nodes. By default, the scheduling parameter $bizdate is assigned to ${bizdate} as its value.
For how to use scheduling parameters in data synchronization, see the Description for using scheduling parameters in data synchronization section in Overview of the batch synchronization feature.
For common use cases of scheduling parameters, see Common use scenarios of scheduling parameters.