Data sources and synchronization solutions - DataWorks - Alibaba Cloud Documentation Center

DataWorks Data Integration supports data synchronization between multiple data sources, such as MySQL, MaxCompute, Hologres, and Kafka. Data Integration offers batch processing, real-time data synchronization, and whole-database migration solutions. You can use these solutions for scenarios such as T+1 batch extract, transform, and load (ETL), real-time data replication with second-level latency, and whole-database migration.

Synchronization solutions

Type	Source granularity	Target granularity	Timeliness	Synchronization scenario
Single-table batch	A single table	A single table or partition	T+1 or periodic	Periodic full, periodic incremental
Sharded database and table batch	Multiple tables with identical structures	A single table or partition	T+1 or periodic	Periodic full, periodic incremental
Single-table real-time	A single table	A single table or partition	Seconds to minutes	Real-time incremental (CDC)
Whole-database batch	An entire database or multiple tables	Corresponding multiple tables and their partitions	One-time or periodic	One-time/periodic full, one-time/periodic incremental, one-time full + periodic incremental
Whole-database real-time	An entire database or multiple tables	Corresponding multiple tables and their partitions	Seconds to minutes	Full + real-time incremental (CDC)
Whole-database full-plus-incremental	An entire database or multiple tables	Corresponding multiple tables and their partitions	Initial full load: Batch processing Subsequent incremental: T+1	Full + real-time incremental (CDC)

Recommended synchronization solutions

When selecting a data synchronization solution, consider two key questions:

Timeliness requirements: How frequently does your business need data synchronized — once a day (batch), or with second-level or minute-level real-time updates (real-time)?
Synchronization scale and complexity: How many tables need to be synchronized, and whether the processing logic is uniform across tables (single-table vs. whole-database)?

Based on these considerations, we recommend synchronization solutions in two categories: batch synchronization solutions and real-time synchronization solutions.

1. Choosing a batch synchronization solution (T+1/periodic)

Batch solutions are suitable for scenarios where data timeliness requirements are not high (for example, T+1) and periodic batch processing is needed.

Important

Key prerequisite: To implement batch incremental synchronization, source tables must contain columns that can be used to identify incremental data, such as timestamps like gmt_modified or auto-increment IDs. If such columns are unavailable, you can only fall back to periodic full synchronization.

1. Choose single-table batch

Use this when you need fine-grained processing of a small number of core, heterogeneous data sources.

Key advantages: Flexible processing logic.
- Fine-grained transformation: Supports complex column mapping, data filtering, constant assignment, function-based transformations, and even AI-assisted processing.
- Heterogeneous source integration: The best choice for non-standard data sources such as APIs and log files.
Key limitations: High cost at scale.
- High configuration overhead: When synchronizing a large number of tables, configuring and maintaining tasks one by one involves significant effort.
- High resource consumption: Each task is scheduled independently. The resource consumption of 100 single-table tasks far exceeds that of one whole-database task.

Single-table batch solution: Configure a single-table batch synchronization task

2. Choose whole-database batch

Use this when you need to efficiently migrate a large number of homogeneous tables from one location to another.

Key advantages: High O&M efficiency and low cost.
- High efficiency: Configure hundreds of tables at once with automatic object matching, significantly improving development efficiency.
- Cost-effective: Resources are scheduled and optimized holistically at a very low cost. For example, one whole-database task vs. 100 single-table tasks may consume 2 CUs vs. 100 CUs.
- Typical scenarios: Building the ODS layer of a data warehouse, periodic database backup, and data migration to the cloud.
Key limitations: Limited processing logic.
- Primarily designed for data replication and does not support complex transformation logic for individual tables.

Whole-database batch solution: Configure a whole-database batch synchronization task.

2. Choosing a real-time synchronization solution (seconds to minutes)

Real-time solutions are suitable for scenarios that require capturing real-time data changes (inserts, updates, and deletes) at the source to support real-time analytics and business responses.

Important

Key prerequisite: The source must support change data capture (CDC) or be a message queue. For example, MySQL requires Binlog to be enabled, or the source must be a Kafka instance.

Choose single-table real-time or whole-database real-time

The selection logic is similar to that of batch solutions:

Single-table real-time: Suitable for scenarios that require complex processing of real-time change streams from a single core table.
Whole-database real-time: The mainstream choice for building real-time data warehouses, implementing real-time database disaster recovery, and connecting to real-time data lakes. This option also provides significant advantages in efficiency and cost-effectiveness.

Real-time solutions: Configure a single-table real-time synchronization task, Configure a whole-database real-time synchronization task

3. Special scenario: Writing real-time CDC data to append-only target tables

Important

Background: CDC data captured by real-time synchronization includes three types of operations: Insert, Update, and Delete. For append-only storage systems that do not natively support Update/Delete operations at the physical level, such as non-Delta Table types in MaxCompute, directly writing the CDC stream causes data state inconsistencies (for example, delete operations cannot be reflected).

DataWorks solution: Base + Log mode
- This solution uses the whole-database full-plus-incremental task and resolves this issue by creating a Base table (full snapshot) and a Log table (incremental log) at the target.
- How it works: The CDC data stream is written in real time to the Log table. Then, on a T+1 basis, the system automatically schedules a task to merge the changes from the Log table into the Base table to generate the latest full snapshot. The timeliness of this solution is "incremental data written to the log table within minutes, with the final state merged and visible at T+1." It balances real-time data capture with eventual consistency for offline data warehouses.

Recommended solution: Configure a whole-database full-plus-incremental synchronization task.

Data source read/write capabilities

Data source	Single-table batch	Single-table real-time	Whole-database batch	Whole-database real-time	Whole-database full-plus-incremental
Public dataset data source	Read	-	-	-	-
Amazon S3 data source	Read/Write	-	-	-	-
Amazon Redshift data source	Read/Write	-	-	-	-
AnalyticDB for MySQL 2.0	Read/Write	-	-	-	-
AnalyticDB for MySQL 3.0 data source	Read/Write	Write	Read	Write	-
AnalyticDB for PostgreSQL data source	Read/Write	-	Read	-	-
ApsaraDB for OceanBase data source	Read/Write	Write	Read	Read/Write	Read
Azure Blob Storage data source	Read	-	-	-	-
BigQuery	Read	-	-	-	-
ClickHouse data source	Read/Write	-	Read	-	-
COS data source	Read	-	-	-	-
Databricks data source	Read	-	-	-	-
DataHub data source	Read/Write	Read/Write	-	Write	-
Data Lake Formation data source	Read/Write	Write	Write	Write	-
DB2	Read/Write	-	Read	-	-
Doris	Read/Write	Write	Read	-	-
DM (Dameng)	Read/Write	-	Read	-	-
DRDS (PolarDB-X 1.0)	Read/Write	-	Read	-	-
Elasticsearch	Read/Write	Write	Write	Write	-
FTP data source	Read/Write	-	-	-	-
GBase8a	Read/Write	-	-	-	-
HBase	hbase Read/Write HBase 20xsql Read HBase 11xsql Write	-	-	-	-
HDFS	Read/Write	-	-	-	-
Hive	Read/Write	-	Read/Write	-	-
Hologres data source	Read/Write	Read/Write	Read/Write	Write	-
HttpFile	Read	-	-	-	-
Kafka data source	Read/Write	Read/Write	-	Write	-
KingbaseES	Read/Write	-	-	-	-
Lindorm data source	Read/Write	Write	-	Write	-
LogHub (SLS)	Read/Write	Read	-	-	-
MaxCompute data source	Read/Write	Write	Write	Write	Write
MariaDB	Read/Write	-	-	-	-
Maxgraph	Write	-	-	-	-
Memcache (OCS)	Write	-	-	-	-
MetaQ data source	Read	-	-	-	-
Milvus	Read/Write	-	-	-	-
MongoDB	Read/Write	-	-	Read	-
MySQL data source	Read/Write	-	Read	Read	Read
OpenSearch	Write	-	-	-	-
Oracle data source	Read/Write	Read	Read	Read	Read
OSS	Read/Write	-	Write	Write	-
OSS-HDFS	Read/Write	-	Write	Write	-
PolarDB data source	Read/Write	-	Read	Read	Read
PolarDB-X 2.0	Read/Write	-	Read	Read	-
PostgreSQL	Read/Write	-	Read	Read	-
Redis	Write	-	-	-	-
RestAPI (HTTP)	Read/Write	-	-	-	-
Salesforce data source	Read/Write	-	-	-	-
SAP HANA	Read/Write	-	-	-	-
Sensors Data data source	Write	-	-	-	-
Snowflake	Read/Write	-	-	-	-
StarRocks data source	Read/Write	Write	Write	Write	-
SQL Server data source	Read/Write	-	Read	-	-
Tablestore data source	Read/Write	Write	-	-	-
TiDB data source	Read/Write	-	-	-	-
TSDB	Write	-	-	-	-
Vertica	Read/Write	-	-	-	-
TOS data source	Read	-	-	-	-

References

The following is a curated list of core Data Integration documents to help you get started quickly.

For data source configuration, see Configure a data source.
For synchronization task configuration, see:
For more practical scenarios:
For common data synchronization issues, see FAQ about data synchronization.