DataWorks Data Integration supports data synchronization between multiple data sources, such as MySQL, MaxCompute, Hologres, and Kafka. Data Integration offers batch processing, real-time data synchronization, and whole-database migration solutions. You can use these solutions for scenarios such as T+1 batch extract, transform, and load (ETL), real-time data replication with second-level latency, and whole-database migration.
Synchronization solutions
Type | Source granularity | Target granularity | Timeliness | Synchronization scenario |
Single-table batch | A single table | A single table or partition | T+1 or periodic | Periodic full, periodic incremental |
Sharded database and table batch | Multiple tables with identical structures | A single table or partition | T+1 or periodic | Periodic full, periodic incremental |
Single-table real-time | A single table | A single table or partition | Seconds to minutes | Real-time incremental (CDC) |
Whole-database batch | An entire database or multiple tables | Corresponding multiple tables and their partitions | One-time or periodic | One-time/periodic full, one-time/periodic incremental, one-time full + periodic incremental |
Whole-database real-time | An entire database or multiple tables | Corresponding multiple tables and their partitions | Seconds to minutes | Full + real-time incremental (CDC) |
Whole-database full-plus-incremental | An entire database or multiple tables | Corresponding multiple tables and their partitions | Initial full load: Batch processing Subsequent incremental: T+1 | Full + real-time incremental (CDC) |
Recommended synchronization solutions
When selecting a data synchronization solution, consider two key questions:
Timeliness requirements: How frequently does your business need data synchronized — once a day (batch), or with second-level or minute-level real-time updates (real-time)?
Synchronization scale and complexity: How many tables need to be synchronized, and whether the processing logic is uniform across tables (single-table vs. whole-database)?
Based on these considerations, we recommend synchronization solutions in two categories: batch synchronization solutions and real-time synchronization solutions.
1. Choosing a batch synchronization solution (T+1/periodic)
Batch solutions are suitable for scenarios where data timeliness requirements are not high (for example, T+1) and periodic batch processing is needed.
Key prerequisite: To implement batch incremental synchronization, source tables must contain columns that can be used to identify incremental data, such as timestamps like gmt_modified or auto-increment IDs. If such columns are unavailable, you can only fall back to periodic full synchronization.
1. Choose single-table batch
Use this when you need fine-grained processing of a small number of core, heterogeneous data sources.
Key advantages: Flexible processing logic.
Fine-grained transformation: Supports complex column mapping, data filtering, constant assignment, function-based transformations, and even AI-assisted processing.
Heterogeneous source integration: The best choice for non-standard data sources such as APIs and log files.
Key limitations: High cost at scale.
High configuration overhead: When synchronizing a large number of tables, configuring and maintaining tasks one by one involves significant effort.
High resource consumption: Each task is scheduled independently. The resource consumption of 100 single-table tasks far exceeds that of one whole-database task.
Single-table batch solution: Configure a single-table batch synchronization task
2. Choose whole-database batch
Use this when you need to efficiently migrate a large number of homogeneous tables from one location to another.
Key advantages: High O&M efficiency and low cost.
High efficiency: Configure hundreds of tables at once with automatic object matching, significantly improving development efficiency.
Cost-effective: Resources are scheduled and optimized holistically at a very low cost. For example, one whole-database task vs. 100 single-table tasks may consume 2 CUs vs. 100 CUs.
Typical scenarios: Building the ODS layer of a data warehouse, periodic database backup, and data migration to the cloud.
Key limitations: Limited processing logic.
Primarily designed for data replication and does not support complex transformation logic for individual tables.
Whole-database batch solution: Configure a whole-database batch synchronization task.
2. Choosing a real-time synchronization solution (seconds to minutes)
Real-time solutions are suitable for scenarios that require capturing real-time data changes (inserts, updates, and deletes) at the source to support real-time analytics and business responses.
Key prerequisite: The source must support change data capture (CDC) or be a message queue. For example, MySQL requires Binlog to be enabled, or the source must be a Kafka instance.
Choose single-table real-time or whole-database real-time
The selection logic is similar to that of batch solutions:
Single-table real-time: Suitable for scenarios that require complex processing of real-time change streams from a single core table.
Whole-database real-time: The mainstream choice for building real-time data warehouses, implementing real-time database disaster recovery, and connecting to real-time data lakes. This option also provides significant advantages in efficiency and cost-effectiveness.
Real-time solutions: Configure a single-table real-time synchronization task, Configure a whole-database real-time synchronization task
3. Special scenario: Writing real-time CDC data to append-only target tables
Background: CDC data captured by real-time synchronization includes three types of operations: Insert, Update, and Delete. For append-only storage systems that do not natively support Update/Delete operations at the physical level, such as non-Delta Table types in MaxCompute, directly writing the CDC stream causes data state inconsistencies (for example, delete operations cannot be reflected).
DataWorks solution: Base + Log mode
This solution uses the whole-database full-plus-incremental task and resolves this issue by creating a
Base table(full snapshot) and aLog table(incremental log) at the target.How it works: The CDC data stream is written in real time to the
Log table. Then, on a T+1 basis, the system automatically schedules a task to merge the changes from theLog tableinto theBase tableto generate the latest full snapshot. The timeliness of this solution is "incremental data written to the log table within minutes, with the final state merged and visible at T+1." It balances real-time data capture with eventual consistency for offline data warehouses.
Recommended solution: Configure a whole-database full-plus-incremental synchronization task.
Data source read/write capabilities
Data source | Single-table batch | Single-table real-time | Whole-database batch | Whole-database real-time | Whole-database full-plus-incremental |
Read | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | Write | Read | Write | - | |
Read/Write | - | Read | - | - | |
Read/Write | Write | Read | Read/Write | Read | |
Read | - | - | - | - | |
Read | - | - | - | - | |
Read/Write | - | Read | - | - | |
Read | - | - | - | - | |
Read | - | - | - | - | |
Read/Write | Read/Write | - | Write | - | |
Read/Write | Write | Write | Write | - | |
Read/Write | - | Read | - | - | |
Read/Write | Write | Read | - | - | |
Read/Write | - | Read | - | - | |
Read/Write | - | Read | - | - | |
Elasticsearch | Read/Write | Write | Write | Write | - |
Read/Write | - | - | - | - | |
GBase8a | Read/Write | - | - | - | - |
HBase | hbase Read/Write HBase 20xsql Read HBase 11xsql Write | - | - | - | - |
Read/Write | - | - | - | - | |
Hive | Read/Write | - | Read/Write | - | - |
Read/Write | Read/Write | Read/Write | Write | - | |
Read | - | - | - | - | |
Read/Write | Read/Write | - | Write | - | |
Read/Write | - | - | - | - | |
Read/Write | Write | - | Write | - | |
Read/Write | Read | - | - | - | |
Read/Write | Write | Write | Write | Write | |
Read/Write | - | - | - | - | |
Write | - | - | - | - | |
Write | - | - | - | - | |
Read | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | Read | - | |
Read/Write | - | Read | Read | Read | |
Write | - | - | - | - | |
Read/Write | Read | Read | Read | Read | |
Read/Write | - | Write | Write | - | |
Read/Write | - | Write | Write | - | |
Read/Write | - | Read | Read | Read | |
Read/Write | - | Read | Read | - | |
Read/Write | - | Read | Read | - | |
Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | Write | Write | Write | - | |
Read/Write | - | Read | - | - | |
Read/Write | Write | - | - | - | |
Read/Write | - | - | - | - | |
Write | - | - | - | - | |
Vertica | Read/Write | - | - | - | - |
Read | - | - | - | - |
References
The following is a curated list of core Data Integration documents to help you get started quickly.
For data source configuration, see Configure a data source.
For synchronization task configuration, see:
For more practical scenarios:
For common data synchronization issues, see FAQ about data synchronization.