Unified Batch & Real-Time Data Integration Across 40+ Sources - DataWorks

DataWorks Data Integration supports data synchronization in complex network environments. Two synchronization modes are available: batch synchronization for periodic offline data transfers, and real-time synchronization for continuous incremental replication. Configure both on the DataStudio page.

Choose a synchronization mode

The two modes differ in transfer cadence, the data each one transfers, the supported source topology, and how they are configured.

	Batch synchronization	Real-time synchronization
Transfer cadence	Scheduled (periodic)	Continuous
Data transferred	Full or incremental snapshots	Incremental changes only
Typical use case	Periodic reporting, data warehousing	Low-latency pipelines
Source topology	Single table to single table; tables in sharded databases to single table	Star-shaped multi-source link
Configuration	Codeless UI or code editor	Input/output configuration

Use batch synchronization when:

Downstream workloads tolerate a delay (for example, daily or hourly refreshes)
You need to backfill historical data into specific partitions
Your source is one of the 40+ supported data source types, including relational databases, unstructured storage systems, big data storage systems, and message queues

Use real-time synchronization when:

Data must arrive at the destination within seconds of a source change
You want to continuously replicate an entire database to a destination

Batch synchronization is not ideal when:

You need sub-minute data freshness
Your source is not one of the 40+ supported data source types

For additional synchronization solutions, including combined full and incremental sync and whole-database batch sync, see Supported data source types and data synchronization solutions.

Prerequisites

Before you begin, ensure that you have:

The Development role in your DataWorks workspace

To add a RAM (Resource Access Management) user and assign roles, see Add a RAM user to a workspace as a member and assign roles to the member.

Batch synchronization

How it works

Batch synchronization reads data from a source using a Reader plug-in and writes it to a destination using a Writer plug-in. Before creating a batch synchronization node, add the data sources to DataWorks so they are available during node configuration.

Each run transfers either full data or incremental data to a specific partition in the destination table. To target the correct partition for each scheduled run, use the built-in variable ${bizdate}, which takes the value of the scheduling parameter $bizdate by default. You can also use the data backfill feature in Operation Center to synchronize historical data to specific tables or partitions, based on the configuration of the batch synchronization node.
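The following Python sketch illustrates how a ${bizdate} placeholder could resolve to one partition per scheduled run. It assumes that $bizdate is the day before the scheduled run date in yyyymmdd format, which you should confirm in the scheduling parameter documentation. The function names and the partition expression pt=${bizdate} are hypothetical and are not part of any DataWorks API.

```python
from datetime import date, timedelta


def resolve_bizdate(scheduled_day: date) -> str:
    """Assumption: $bizdate is the day before the scheduled date, as yyyymmdd."""
    return (scheduled_day - timedelta(days=1)).strftime("%Y%m%d")


def render_partition(expression: str, scheduled_day: date) -> str:
    """Replace the ${bizdate} placeholder in a partition expression."""
    return expression.replace("${bizdate}", resolve_bizdate(scheduled_day))


def backfill_partitions(expression: str, first_day: date, last_day: date) -> list[str]:
    """List the partitions that a backfill over a range of scheduled days would target."""
    partitions = []
    day = first_day
    while day <= last_day:
        partitions.append(render_partition(expression, day))
        day += timedelta(days=1)
    return partitions


if __name__ == "__main__":
    # A run scheduled for 2024-03-01 writes to the partition for the previous day.
    print(render_partition("pt=${bizdate}", date(2024, 3, 1)))
    # A three-day backfill targets three consecutive partitions.
    print(backfill_partitions("pt=${bizdate}", date(2024, 1, 1), date(2024, 1, 3)))
```

Because the partition is derived from the scheduled date rather than from the time the job actually executes, rerunning or backfilling a past instance writes to the same partition as the original run would have.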

Configure a batch synchronization node

Choose the configuration method based on your data source and requirements:

Scenario	Method	Reference
Data source is added to DataWorks and supports the codeless UI	Codeless UI	Configure a batch synchronization node by using the codeless UI (2.0)
Data source cannot be added to DataWorks	Code editor	Configure a batch synchronization node by using the code editor (2.0)
Data source does not support the codeless UI	Code editor	Configure a batch synchronization node by using the code editor (2.0)
Reader or Writer plug-in parameters can only be set in script mode	Code editor	Configure a batch synchronization node by using the code editor (2.0)
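The decision table above can be expressed as a small Python helper. This is only a sketch for reasoning about the choice. The SourceProfile class, its field names, and choose_method are hypothetical and do not correspond to any DataWorks API.

```python
from dataclasses import dataclass


@dataclass
class SourceProfile:
    """Hypothetical description of a data source, mirroring the table's scenarios."""
    added_to_dataworks: bool
    supports_codeless_ui: bool
    needs_script_only_parameters: bool = False


def choose_method(profile: SourceProfile) -> str:
    """Return the configuration method suggested by the scenario table."""
    if not profile.added_to_dataworks:
        return "code editor"
    if not profile.supports_codeless_ui:
        return "code editor"
    if profile.needs_script_only_parameters:
        return "code editor"
    return "codeless UI"


if __name__ == "__main__":
    print(choose_method(SourceProfile(True, True)))
    print(choose_method(SourceProfile(True, True, needs_script_only_parameters=True)))
```

The codeless UI is the result only when none of the three code-editor conditions applies.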

For the full list of supported data sources, Reader plug-ins, and Writer plug-ins, see Supported data source types, Reader plug-ins, and Writer plug-ins and Overview of the batch synchronization feature.

Real-time synchronization

Real-time synchronization uses a star-shaped synchronization link that combines multiple data source types. Configure the input and output of a real-time synchronization node to sync from a single table to another single table, or to replicate all data from an entire database to a destination.

For supported data source types and setup details, see Data source types that support real-time synchronization and Overview of the real-time synchronization feature.

Configure scheduling dependencies

Scheduling dependencies control when a node runs relative to other nodes in the workspace.

Batch synchronization node

Ancestor node: Set the root node of the workspace or a zero load node as the ancestor. This triggers the batch synchronization node within the workspace scheduling cycle.
Descendant node: To let DataWorks automatically parse the dependency between a batch synchronization node and a downstream SQL node, configure the output of the batch synchronization node in Project name.Table name format.
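To see why the output name must follow the Project name.Table name format, consider this simplified Python model of dependency parsing. It is not how DataWorks implements parsing. The regular expression, function names, node names, and table names are all hypothetical, and the matching handles only simple FROM and JOIN clauses.

```python
import re


def output_name(project: str, table: str) -> str:
    """Build a node output in Project name.Table name format."""
    return f"{project}.{table}"


def referenced_tables(sql: str) -> set[str]:
    """Collect project.table names that follow FROM or JOIN (simplified)."""
    pattern = r"\b(?:from|join)\s+([A-Za-z_]\w*\.[A-Za-z_]\w*)"
    return {name.lower() for name in re.findall(pattern, sql, flags=re.IGNORECASE)}


def upstream_nodes(sql: str, node_outputs: dict[str, str]) -> list[str]:
    """Return nodes whose declared output matches a table read by the SQL."""
    refs = referenced_tables(sql)
    return sorted(node for node, out in node_outputs.items() if out.lower() in refs)


if __name__ == "__main__":
    outputs = {
        "sync_orders": output_name("sales_prj", "ods_orders"),
        "sync_users": output_name("sales_prj", "ods_users"),
    }
    sql = (
        "INSERT OVERWRITE TABLE sales_prj.dws_orders "
        "SELECT * FROM sales_prj.ods_orders"
    )
    print(upstream_nodes(sql, outputs))
```

In this model, a node whose output is declared in any other format never matches a table reference, so no dependency is inferred for it.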

Real-time synchronization node

Real-time synchronization nodes run continuously and do not generate outputs the same way auto-triggered nodes do. Table-lineage-based scheduling dependencies are not supported for downstream nodes. Instead, set the root node of the workspace or a zero load node as the ancestor of the downstream node directly.

Note: To make sure a real-time synchronization node produces data as expected, configure a monitoring rule for the node.

Use scheduling parameters in batch synchronization

DataWorks provides the built-in variable ${bizdate} for batch synchronization nodes. By default, the scheduling parameter $bizdate is assigned to ${bizdate} as its value.

For how to use scheduling parameters in data synchronization, see the Description for using scheduling parameters in data synchronization section in Overview of the batch synchronization feature.
For common use cases of scheduling parameters, see Common use scenarios of scheduling parameters.