Data Integration in DataWorks offers an efficient way to perform batch synchronization of entire databases. This feature lets you migrate all or selected tables from a source database to a destination data store in a single operation or on a recurring schedule, eliminating the need to create a separate synchronization task for each table. It supports both full and incremental data transfers and automatically creates table schemas in the destination, simplifying database migrations.
Use cases
Data migration and cloud adoption
Migrate databases like MySQL and Oracle from an on-premises data center to a cloud-based Data Warehouse or Data Lake.
Migrate data between different cloud platforms or database systems.
Data warehouse and data lake construction
Periodically migrate full or incremental data from online transactional (OLTP) databases to the Operational Data Store (ODS) layer of a Data Warehouse or Data Lake. This data serves as the foundation for downstream data analytics.
Data backup and disaster recovery
Regularly back up entire production databases to cost-effective storage like HDFS or Object Storage Service (OSS).
Implement disaster recovery solutions across different regions or Availability Zones.
Core capabilities
The following table describes the core capabilities of batch synchronization for entire databases.
| Core capability | Feature | Description |
| --- | --- | --- |
| Batch synchronization between heterogeneous data sources | - | Batch synchronization supports migrating data from on-premises data centers or other cloud platforms to MaxCompute, Hologres, Object Storage Service (OSS), and other Data Warehouses or Data Lakes. For more information, see Supported data sources and synchronization solutions. |
| Data synchronization in complex network environments | - | Batch synchronization supports data transfer from various environments, including ApsaraDB, on-premises data centers, self-hosted databases on Elastic Compute Service (ECS) instances, and databases on third-party clouds. Before you configure the task, ensure network connectivity between the resource group and both the source and destination data sources. For more information, see Network connectivity. |
| Synchronization modes | Full synchronization | Supports one-time or periodic full data synchronization to a destination table or a specified partition. |
| | Incremental synchronization | Supports one-time or periodic incremental data synchronization based on time, partitions, or primary keys. |
| | Combined full and incremental synchronization | Initial run: performs a one-time full data synchronization. Subsequent runs: automatically switches to periodic incremental data synchronization to a specified partition. |
| Database and table mapping | Batch table sync | Supports synchronizing all tables in a database. You can also select specific tables by checking them or by configuring filtering rules. |
| | Automatic schema creation | A single configuration can handle hundreds of tables from the source database. The feature automatically creates table schemas at the destination, eliminating the need for manual intervention. |
| | Flexible mapping | Supports custom naming rules for destination databases and tables (see the naming-rule sketch after this table). You can also define custom mappings for field types between the source and destination to adapt to the destination data model. |
| Scheduling and dependency management | Scheduling | Supports scheduling by the minute, hour, day, week, month, and year. If you synchronize a large number of tables at once, stagger the execution times in your scheduling configuration to prevent task buildup and resource contention. |
| | Task dependencies | In DataWorks, both the main database task and the individual table-level subtasks can serve as upstream dependencies for other development tasks. When a table synchronization task completes, DataWorks automatically triggers its downstream development tasks. |
| | Parameter support | Supports scheduling parameters for incremental synchronization, such as the built-in $bizdate variable, which resolves to the data timestamp of each run (see the incremental filter sketch after this table). |
| Advanced parameters | Dirty data handling | Dirty data refers to any record that cannot be written to the destination due to an error, such as a type conflict or a constraint violation. By default, the task fails when dirty data is encountered. You can instead configure the task to ignore dirty data and continue (see the settings sketch after this table). |
| | Reader and writer configuration | You can configure the maximum number of connections for both the reader (source) and writer (destination). You can also define a cleanup policy for the destination before data is written. |
| | Concurrency and rate limiting | You can set the task concurrency and limit the synchronization rate to control the load that the task places on the source and destination (see the settings sketch after this table). |
| Operations and maintenance | Manual intervention | You can perform manual interventions such as rerunning tasks, backfilling data, marking tasks as successful, and freezing or restoring tasks. |
| | Monitoring and alerting | You can configure monitoring rules for baselines, task statuses, and run durations, and send alerts when the rules are triggered. |
| | Data Quality | After a task is committed and deployed, you can configure data quality monitoring rules for the destination table in the Operation Center. You can configure rules manually or use AI-powered generation. Currently, quality rule monitoring is available only for specific database types. For more information, see Data Quality. |
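The incremental modes and scheduling parameters above usually come down to a filter on the source table. The following sketch shows the general shape of a per-table reader configuration for date-based incremental synchronization, written as a Python dict that mirrors the Data Integration script-mode JSON. The key layout is an approximation that can vary by DataWorks version, and the data source, table, and column names (my_mysql_source, orders, gmt_modified) are hypothetical; only $bizdate is a genuine DataWorks built-in scheduling parameter.

```python
# A minimal sketch of a per-table reader configuration for date-based
# incremental synchronization, modeled on the Data Integration
# script-mode JSON. The key layout is an approximation; the data source,
# table, and column names are hypothetical.
incremental_reader = {
    "stepType": "mysql",        # source type (assuming a MySQL source)
    "category": "reader",
    "parameter": {
        "datasource": "my_mysql_source",  # hypothetical data source name
        "table": ["orders"],              # hypothetical source table
        "column": ["*"],                  # read all columns
        # $bizdate is a built-in DataWorks scheduling parameter that
        # resolves to each run's data timestamp (yyyymmdd), so a daily
        # instance reads only the rows modified on that day.
        "where": "DATE_FORMAT(gmt_modified, '%Y%m%d') = '$bizdate'",
    },
}
```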
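The dirty data handling and concurrency rows map to job-level settings rather than per-table ones. The sketch below assumes the script-mode convention in which an errorLimit block caps the number of tolerated dirty records and a speed block controls parallelism and throttling; treat the key names, including mbps, as assumptions that may differ across DataWorks versions.

```python
# A sketch of job-level advanced settings, assuming the script-mode
# convention. All key names are assumptions.
job_setting = {
    # Fail the task as soon as one dirty record is encountered; raise
    # the value to tolerate that many dirty records instead.
    "errorLimit": {"record": "0"},
    "speed": {
        "concurrent": 2,    # number of parallel read/write threads
        "throttle": True,   # enable rate limiting
        "mbps": "1",        # throughput cap when throttled (key name is an assumption)
    },
}
```

Lowering concurrency and setting an explicit rate cap is the usual way to keep a large whole-database synchronization from overloading the source during business hours.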
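To make the flexible mapping row concrete, a destination naming rule typically amounts to a simple transformation of the source table name, such as adding a layer prefix. The helper below is purely illustrative: in DataWorks the rule is configured in the synchronization task UI, and the ods_ prefix is an assumption, not a fixed convention of the feature.

```python
# A hypothetical illustration of a destination table naming rule:
# prefix every source table name with "ods_" and lowercase it.
def destination_table_name(source_table: str, prefix: str = "ods_") -> str:
    return f"{prefix}{source_table.lower()}"

# For example, source table "Orders" maps to destination "ods_orders".
assert destination_table_name("Orders") == "ods_orders"
```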
Get started
To create a batch synchronization task for entire databases, see Configure batch synchronization for entire databases.
Supported data sources
DataWorks supports migrating entire databases from various data sources, such as MySQL and Oracle, to the following destinations. For the full list of supported source types, see Supported data sources and synchronization solutions.

| Destination |
| --- |
| MaxCompute |
| Data Lake Formation |
| Hologres |
| Object Storage Service (OSS) |
| Elasticsearch |
| StarRocks |