DataWorks Data Integration synchronizes data between a wide range of data sources, including MySQL, MaxCompute, Hologres, and Kafka. It provides solutions for offline batch synchronization, real-time data synchronization, and entire-database migration to support use cases like T+1 offline ETL, real-time data replication with latency in seconds, and database migration.
Synchronization solutions
| Synchronization type | Source granularity | Destination granularity | Latency | Synchronization scenario |
| --- | --- | --- | --- | --- |
| Batch synchronization for a single table | A single table | A single table or partition | T+1 or periodic | Periodic full synchronization, periodic incremental synchronization |
| Batch synchronization for sharded tables | Multiple tables with identical structures | A single table or partition | T+1 or periodic | Periodic full synchronization, periodic incremental synchronization |
| Real-time synchronization for a single table | A single table | A single table or partition | Seconds to minutes | Real-time incremental synchronization (Change Data Capture (CDC)) |
| Real-time synchronization for sharded tables | Multiple logical tables (logical aggregations of multiple physical tables) | Map to multiple destination tables or merge into a single destination table | Seconds to minutes | Full synchronization + real-time incremental synchronization (CDC) |
| Batch synchronization for an entire database | An entire database or multiple tables | Map to multiple tables and their partitions | One-time or periodic | One-time/periodic full synchronization, one-time/periodic incremental synchronization, one-time full + periodic incremental synchronization |
| Real-time synchronization for an entire database | An entire database or multiple tables | Map to multiple tables and their partitions | Seconds to minutes | Full synchronization + real-time incremental synchronization (CDC) |
| Full and incremental synchronization for an entire database | An entire database or multiple tables | Map to multiple tables and their partitions | Initial full synchronization: batch; subsequent incremental synchronization: T+1 | One-time full + T+1 incremental synchronization (using a Base + Log pattern) |
Recommended synchronization solutions
When choosing a data synchronization solution, consider two key factors:
Latency requirement: How often does your business need data to be synchronized? Is a daily update (batch) sufficient, or do you need real-time updates with second-to-minute latency?
Scale and complexity: How many tables do you need to synchronize? Is the processing logic for these tables uniform (entire database) or varied (single table)?
Based on these factors, we recommend two main categories of solutions: batch synchronization and real-time synchronization.
1. Batch synchronization (T+1/periodic)
Batch solutions are suitable for scenarios that do not require high data freshness (for example, T+1) and involve periodic batch processing.
Prerequisite: To perform incremental batch synchronization, the source table must have a watermark field that defines incremental data, such as a timestamp like gmt_modified or an auto-incrementing ID. Without such a field, you can only perform periodic full synchronization.
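As an illustration of how a watermark field drives incremental extraction, here is a minimal Python sketch. This is not DataWorks code: the build_incremental_filter helper and the daily scheduling window are assumptions for the example.

```python
from datetime import datetime, timedelta

def build_incremental_filter(watermark_column: str,
                             window_start: datetime,
                             window_end: datetime) -> str:
    """Build the WHERE clause a periodic incremental task pushes to the source.

    Each run reads only rows whose watermark falls inside its scheduling
    window, so consecutive runs pick up exactly the rows changed in between.
    """
    return (
        f"{watermark_column} >= '{window_start:%Y-%m-%d %H:%M:%S}' "
        f"AND {watermark_column} < '{window_end:%Y-%m-%d %H:%M:%S}'"
    )

# A T+1 run on 2024-06-02 reads everything modified during 2024-06-01.
run_date = datetime(2024, 6, 2)
print(build_incremental_filter("gmt_modified",
                               run_date - timedelta(days=1), run_date))
# gmt_modified >= '2024-06-01 00:00:00' AND gmt_modified < '2024-06-02 00:00:00'
```

Without such a column, the source offers no way to tell new or changed rows apart from old ones, which is why only periodic full synchronization remains possible.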
a. Batch synchronization for a single table
Use this option for fine-grained processing of a small number of critical or heterogeneous data sources.
Key advantage: Flexible processing logic.
Fine-grained transformations: Supports complex field mapping, data filtering, constant assignment, function-based transformations, and even AI-assisted processing.
Heterogeneous source integration: The best choice for processing non-standard data sources like APIs and log files.
Key limitation: High cost at scale.
High configuration overhead: Configuring and maintaining individual tasks for a large number of tables requires significant effort.
High resource consumption: Because each task is scheduled independently, 100 single-table tasks consume far more resources than one entire-database task.
Recommended solution: Create a batch synchronization task
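For intuition about the kinds of per-row logic a single-table task can apply, here is a minimal Python sketch covering field mapping, data filtering, constant assignment, and a function-based transformation. The transform_row helper and the record shapes are illustrative assumptions, not DataWorks' configuration schema.

```python
def transform_row(row: dict) -> dict | None:
    """Apply per-row logic of the kind a single-table batch task supports."""
    # Data filtering: drop test records before they reach the destination.
    if row.get("env") == "test":
        return None
    return {
        "user_id": row["id"],            # field mapping: rename id -> user_id
        "email": row["email"].lower(),   # function-based transformation
        "source_system": "crm",          # constant assignment
    }

rows = [
    {"id": 1, "email": "A@Example.com", "env": "prod"},
    {"id": 2, "email": "b@example.com", "env": "test"},
]
transformed = [out for row in rows if (out := transform_row(row)) is not None]
print(transformed)
# [{'user_id': 1, 'email': 'a@example.com', 'source_system': 'crm'}]
```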
b. Batch synchronization for an entire database
Use this option to efficiently migrate a large number of homogeneous tables from one location to another.
Key advantages: High O&M efficiency and low cost.
High efficiency: Configure hundreds of tables at once with automatic object matching, greatly improving development efficiency.
Cost-effective: Resources are scheduled and optimized together, resulting in extremely low costs. For example, the resource consumption of one entire-database task might be 2 CUs, while 100 single-table tasks could consume 100 CUs.
Typical use cases: Building the ODS layer of a data warehouse, performing periodic database backups, and migrating data to the cloud.
Key limitation: Limited processing logic.
Designed primarily for "copying" data and does not support complex, table-specific transformation logic.
Recommended solution: Configure a solution task for data synchronization.
2. Real-time synchronization (second-to-minute latency)
Real-time solutions are suitable for use cases that require capturing real-time data changes (inserts, updates, and deletes) from the source to support real-time analytics and business responses.
Prerequisite: The source must support Change Data Capture (CDC) or be a message queue. For example, a MySQL source must have its binary log enabled, or the source must be a Kafka instance.
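As a quick way to verify the MySQL prerequisite before configuring a task, the sketch below reads the relevant server variables with the pymysql client. The connection settings are placeholders, and treating ROW-format binary logging as necessary reflects general CDC practice rather than a statement from this page.

```python
import pymysql  # a common MySQL client; install with `pip install pymysql`

# Placeholder connection settings; point these at the source instance.
conn = pymysql.connect(host="mysql.example.internal", user="checker",
                       password="***", database="mysql")
try:
    with conn.cursor() as cur:
        for var in ("log_bin", "binlog_format"):
            cur.execute("SHOW VARIABLES LIKE %s", (var,))
            name, value = cur.fetchone()
            print(f"{name} = {value}")
    # Expect log_bin = ON for CDC; ROW-format binary logs are what CDC
    # readers generally need to reconstruct complete change records.
finally:
    conn.close()
```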
Real-time: single table vs. entire database
The decision logic is similar to that for batch solutions:
Real-time synchronization for a single table: Best for complex processing of real-time change streams from a single, critical table.
Real-time synchronization for an entire database: The primary choice for building real-time data warehouses, implementing real-time disaster recovery for databases, and enabling real-time data lakes. It offers the same significant advantages in efficiency and cost-effectiveness.
Recommended solutions: Create a real-time synchronization task, Configure a solution task for data synchronization
3. Real-time CDC to append-only destinations
The challenge: Real-time synchronization captures CDC data, which includes Insert, Update, and Delete operations. For append-only storage systems, such as MaxCompute non-Delta tables, that do not natively support physical Update or Delete operations, writing a raw CDC stream can cause data inconsistencies. For example, a row deleted at the source would never be removed at the destination.
DataWorks solution: The Base + Log pattern
This solution is implemented as the full and incremental synchronization for an entire database task. It solves the problem by creating a Base table (for the full snapshot) and a Log table (for incremental changes) at the destination.
How it works: The CDC data stream is written to the Log table in real time. Then, on a T+1 schedule, a system-scheduled task automatically applies the changes from the Log table to the Base table using a MERGE operation, creating an updated full snapshot. This approach provides minute-level latency for ingesting incremental data, while the final, consistent state becomes visible on a T+1 basis. It balances real-time data capture with the eventual consistency required by offline data warehouses.
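To make the merge step concrete, the following minimal Python sketch replays one day's Log-table events onto a Base snapshot keyed by primary key. The event format and the merge_log_into_base helper are illustrative assumptions; in DataWorks the equivalent work is performed by the system-scheduled MERGE task at the destination.

```python
def merge_log_into_base(base: dict[int, dict],
                        log: list[dict]) -> dict[int, dict]:
    """Replay one day's CDC events from the Log table onto the Base snapshot."""
    for event in log:  # events must be applied in commit order
        pk = event["pk"]
        if event["op"] in ("INSERT", "UPDATE"):
            base[pk] = event["row"]    # upsert: the latest row image wins
        elif event["op"] == "DELETE":
            base.pop(pk, None)         # deletes are applied, not appended
    return base

base = {1: {"pk": 1, "amount": 10}, 2: {"pk": 2, "amount": 20}}
log = [
    {"op": "UPDATE", "pk": 1, "row": {"pk": 1, "amount": 15}},
    {"op": "DELETE", "pk": 2},
    {"op": "INSERT", "pk": 3, "row": {"pk": 3, "amount": 30}},
]
print(merge_log_into_base(base, log))
# {1: {'pk': 1, 'amount': 15}, 3: {'pk': 3, 'amount': 30}}
```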
Recommended solution: Configure a solution task for data synchronization.
Data source capabilities
| Data source | Batch (single table) | Real-time (single table) | Batch (database) | Real-time (database) | Full and incremental (database) |
| --- | --- | --- | --- | --- | --- |
|  | Read | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | Write | Read | Write | - |
|  | Read/Write | - | Read | - | - |
|  | Read/Write | - | - | Read | Read |
|  | Read | - | - | - | - |
|  | Read | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read | - | - | - | - |
|  | Read | - | - | - | - |
|  | Read/Write | Read/Write | - | Write | - |
|  | Read/Write | Write | Write | Write | - |
|  | Read/Write | - | Read | - | - |
|  | Read/Write | Write | Read | - | - |
|  | Read/Write | - | Read | - | - |
| Enterprise Distributed Application Service (EDAS) data source | Read/Write | - | Read | - | - |
|  | Read/Write | Write | Write | Write | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write (Native API), Read (SQL 2.0.x), Write (SQL 1.1.x) | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | Read/Write | - | - |
|  | Read/Write | Read/Write | Read/Write | Write | - |
|  | Read | - | - | - | - |
|  | Read/Write | Read/Write | - | Write | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | Write | - | Write | - |
|  | Read/Write | Read | - | - | - |
|  | Read/Write | Write | Write | Write | Write |
|  | Read/Write | - | - | - | - |
|  | Write | - | - | - | - |
|  | Write | - | - | - | - |
|  | Read | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | - | Read | - |
|  | Read/Write | Read | Read | Read | Read |
|  | Write | - | - | - | - |
|  | Read/Write | Read | Read | Read | Read |
| On-premises PolarDB for PostgreSQL (compatible with Oracle) data source | Read/Write | - | Write | Write | - |
|  | Read/Write | - | Write | Write | - |
|  | Read/Write | Read | Read | Read | Read |
|  | Read/Write | - | Read | Read | - |
|  | Read/Write | - | Read | Read | - |
|  | Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | Write | Write | Write | - |
|  | Read/Write | - | Read | - | - |
|  | Read/Write | Write | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Write | - | - | - | - |
|  | Read/Write | - | - | - | - |
|  | Read | - | - | - | - |
Related documentation
The following topics provide essential information to help you get started with Data Integration.
To learn how to configure a data source, see Data source configuration.
For details on configuring a synchronization task, see the following topics:
For answers to frequently asked questions about data synchronization, see FAQ about data synchronization.