DataWorks Data Integration synchronizes data across multiple data sources, including MySQL, MaxCompute, Hologres, and Kafka. It offers solutions for various scenarios, such as T+1 batch extract, transform, and load (ETL), real-time data synchronization with latency in seconds, and full database migration.
Synchronization solutions
| Synchronization type | Source granularity | Target granularity | Timeliness | Scenario |
| --- | --- | --- | --- | --- |
| Single-table batch | Single table | Single table or partition | T+1 or periodic | Periodic full load or periodic incremental load |
| Sharded database and table batch | Sharded tables | Single table or partition | T+1 or periodic | Periodic full load or periodic incremental load |
| Single-table real-time | Single table | Single table or partition | Seconds to minutes | Change data capture (CDC) |
| Full-database batch | Full database or multiple tables | Multiple corresponding tables and their partitions | One-time or periodic | One-time or periodic full loads, one-time or periodic incremental loads, or a one-time full load followed by periodic incremental loads |
| Full-database real-time | Full database or multiple tables | Multiple corresponding tables and their partitions | Seconds to minutes | Full load + change data capture (CDC) |
| Full-database full and incremental | Full database or multiple tables | Multiple corresponding tables and their partitions | Initial full load: batch; ongoing incremental loads: T+1 | Initial full load with periodic incremental loads |
Recommended synchronization solutions
When selecting a data synchronization solution, consider two core factors:
Data timeliness: How frequently does your business require data to be synchronized? Does it need daily batch updates or real-time updates at the second or minute level?
Synchronization scope and complexity: How many tables do you need to synchronize? Is the processing logic for these tables consistent? This determines whether you need a single-table or whole-database solution.
Based on these factors, we recommend two main types of synchronization solutions: batch and real-time.
1. Batch synchronization (T+1 or periodic)
Batch solutions are ideal for use cases without strict data timeliness requirements, such as T+1 periodic batch processing.
Prerequisite: To perform incremental batch synchronization, the source table must contain a field for tracking increments, such as a timestamp like gmt_modified or an auto-incrementing ID. If no such field exists, you can only perform periodic full synchronization.
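The increment-tracking requirement above can be sketched as a watermark filter: each run extracts only rows whose tracking field is newer than the last successful sync. The `gmt_modified` field name follows the example in the text; everything else here is illustrative, not a DataWorks API.

```python
from datetime import datetime

# Illustrative source rows; in practice these come from the source table.
rows = [
    {"id": 1, "name": "a", "gmt_modified": datetime(2024, 1, 1)},
    {"id": 2, "name": "b", "gmt_modified": datetime(2024, 1, 2)},
    {"id": 3, "name": "c", "gmt_modified": datetime(2024, 1, 3)},
]

def incremental_extract(rows, watermark):
    """Return only rows modified after the last sync (the watermark).

    Logically equivalent to:
        SELECT * FROM t WHERE gmt_modified > :watermark
    """
    return [r for r in rows if r["gmt_modified"] > watermark]

# The last successful sync covered everything up to Jan 1,
# so the next periodic run picks up only the newer rows.
delta = incremental_extract(rows, datetime(2024, 1, 1))
print([r["id"] for r in delta])  # → [2, 3]
```

Without such a field there is no watermark to compare against, which is why only periodic full synchronization remains possible.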
1. Single-table batch
This approach is best suited for scenarios where you need to perform fine-grained processing on a small number of core, heterogeneous data tables.
Key benefit: Flexible processing logic.
Fine-grained transformations: Supports complex field mapping, data filtering, constant value assignment, function-based transformations, and even AI-assisted processing.
Heterogeneous source integration: Ideal for processing data from non-standard sources such as APIs and log files.
Key limitation: High cost at scale.
High configuration overhead: Configuring and maintaining individual tasks for many tables requires significant effort.
High resource consumption: Each task is scheduled independently. The resource consumption of 100 single-table tasks is far greater than that of one whole-database task.
Recommended solution: Batch synchronization task for a single table.
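The kind of fine-grained, per-record logic a single-table task supports can be sketched as a transform function. All field names (`uid`, `status`, `source`) are illustrative assumptions, not DataWorks configuration:

```python
def transform(record):
    """Sketch of per-table processing: field mapping, data filtering,
    constant value assignment, and a function-based transformation.
    Field names are hypothetical."""
    if record.get("status") == "deleted":        # data filtering
        return None
    return {
        "user_id": record["uid"],                # field mapping (rename)
        "name": record["name"].strip().title(),  # function-based transform
        "source": "crm",                         # constant value assignment
    }

out = [t for r in [
    {"uid": 7, "name": " alice ", "status": "active"},
    {"uid": 8, "name": "bob", "status": "deleted"},
] if (t := transform(r)) is not None]
print(out)  # only the non-deleted record survives the filter
```

This per-record flexibility is exactly what a whole-database task trades away for configuration efficiency.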
2. Whole-database batch
Use this approach when you need to efficiently migrate a large number of homogeneous tables from one location to another.
Key benefits: High operational efficiency and low cost.
High efficiency: Configure hundreds of tables at once with automatic object matching to significantly improve development efficiency.
Cost-effective: Optimized resource scheduling results in extremely low costs. For example, the resource consumption of one whole-database task might be 2 CUs, compared to 100 CUs for 100 single-table tasks.
Typical use cases: Building the ODS layer of a data warehouse, performing periodic database backups, and migrating data to the cloud.
Key limitation: Limited processing logic.
Designed primarily for replication, it does not support complex, table-specific transformation logic.
Recommended solution: Batch whole-database synchronization task.
2. Real-time synchronization (second- or minute-level)
Real-time solutions are ideal for use cases that require capturing data changes (inserts, updates, and deletes) from a source to support real-time analytics and business responses.
Prerequisite: The source must support Change Data Capture (CDC) or be a message queue. For example, a MySQL database must have the Binary Log enabled, or the source could be a Kafka instance.
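The MySQL prerequisite can be made concrete with a small check. The `SHOW VARIABLES` statements in the comments are real MySQL SQL; the `ROW` binlog format is a common additional requirement for CDC tools but is an assumption not stated in this document, and `binlog_ready` is a hypothetical helper:

```python
# The binlog prerequisite is typically verified with:
#   SHOW VARIABLES LIKE 'log_bin';
#   SHOW VARIABLES LIKE 'binlog_format';
# Here `variables` stands in for that query result; in practice you
# would fetch it with a MySQL client.

def binlog_ready(variables):
    """CDC-style capture needs binary logging enabled; ROW format
    (an assumption here) records every changed row, not just statements."""
    return (variables.get("log_bin") == "ON"
            and variables.get("binlog_format") == "ROW")

print(binlog_ready({"log_bin": "ON", "binlog_format": "ROW"}))   # → True
print(binlog_ready({"log_bin": "OFF", "binlog_format": "ROW"}))  # → False
```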
Single-table real-time or whole-database real-time
The selection logic is similar to that for batch solutions:
Single-table real-time: Ideal for scenarios that involve complex processing of real-time change streams from a single, core table.
Whole-database real-time: The mainstream choice for building a real-time data warehouse, implementing real-time database disaster recovery, and creating a real-time data lake. It also offers significant advantages in efficiency and cost-effectiveness.
Recommended solutions: Real-time synchronization task for a single table, or Real-time whole-database synchronization task.
3. Special use case: CDC to append-only targets
Context: The CDC data captured by real-time synchronization includes three types of operations: Insert, Update, and Delete. For append-only storage systems like MaxCompute non-Delta tables that do not natively support physical Update or Delete operations, writing the CDC stream directly can lead to data inconsistencies. For example, delete operations would not be reflected in the target table.
DataWorks solution: The Base + Log pattern
This solution is implemented as a whole-database full and incremental (near real-time) task. It works by creating a Base table (full snapshot) and a Log table (incremental log) at the target.
How it works: The CDC data stream is written to the Log table in real time. Then, on a T+1 basis, the system automatically schedules a task to merge the changes from the Log table into the Base table, generating an updated full snapshot. This approach writes data to the incremental table with minute-level latency, and the final, consistent state becomes visible after the T+1 merge. It balances real-time data capture with the eventual consistency required for a batch data warehouse.
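The T+1 merge step of the Base + Log pattern can be sketched as replaying the day's change log over the previous snapshot. The in-memory dict stands in for the Base table and the event list for the Log table; names and shapes are illustrative:

```python
def merge(base, log):
    """Merge one period's CDC log into the base snapshot.

    `base` maps primary key -> row; `log` is an ordered list of
    (operation, row) events. Replaying the log in order yields the
    next full snapshot, so deletes finally take effect even though
    the target storage is append-only.
    """
    snapshot = dict(base)
    for op, row in log:
        if op in ("insert", "update"):
            snapshot[row["id"]] = row      # upsert into the snapshot
        elif op == "delete":
            snapshot.pop(row["id"], None)  # remove the deleted key
    return snapshot

base = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
log = [
    ("update", {"id": 1, "v": "a2"}),
    ("insert", {"id": 3, "v": "c"}),
    ("delete", {"id": 2}),
]
print(merge(base, log))  # → {1: {'id': 1, 'v': 'a2'}, 3: {'id': 3, 'v': 'c'}}
```

Writing the CDC stream straight to an append-only table would keep the deleted row forever; the periodic merge is what restores consistency.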
Recommended solution: Whole-database full and incremental (near real-time) task.
Data source capabilities
| Data source | Single-table batch | Single-table real-time | Whole-database batch | Whole-database real-time | Whole-database full and incremental |
| --- | --- | --- | --- | --- | --- |
| Elasticsearch | Read/Write | Write | Write | Write | - |
| GBase8a | Read/Write | - | - | - | - |
| HBase | HBase: Read/Write; HBase 2.x SQL: Read; HBase 1.1.x SQL: Write | - | - | - | - |
| Hive | Read/Write | - | Read/Write | - | - |
| Vertica | Read/Write | - | - | - | - |
References
Use these articles to get started with Data Integration.
For data source configuration, see Data Source Management.
For synchronization task configuration, see the following articles:
For frequently asked questions about data synchronization, see Data Integration FAQ.