Data Integration is a stable, efficient, and scalable data synchronization platform that provides high-speed data synchronization between disparate data sources across complex network environments.
Process guide
Data Integration must be accessed via a PC using Chrome version 69 or later.
The general development flow for Data Integration is as follows:
1. Configure a data source, prepare a resource group, and establish network connectivity between the data source and the resource group.
2. Select a batch or real-time synchronization method based on your scenario and develop a task. Follow the on-screen guide to complete the resource and task configuration.
3. Use data preview and trial runs to debug the task. After debugging succeeds, submit and publish the task. Batch tasks must be published to the production environment.
4. Enter the continuous O&M phase: monitor the synchronization status, set alerts, and optimize resources to achieve full lifecycle management.
Synchronization methods
DataWorks Data Integration provides synchronization methods that can be combined across three dimensions: latency, scope, and data policy. For more information about the solutions and recommendations, see Supported data sources and synchronization solutions.
- Latency: Batch or real-time. Batch synchronization uses scheduled tasks to migrate data on an hourly or daily basis. Real-time synchronization captures source data changes to achieve second-level latency.
- Scope: Single table, full database, or sharding. Data Integration supports fine-grained transfer of a single table, as well as batch migration and merging of an entire database or sharded tables.
- Data policy: Full, incremental, or initial full and incremental. Full migration moves all historical data, incremental synchronization transfers only new or changed data, and the initial full and incremental mode combines both. Batch, real-time, and near real-time implementations are available depending on data source features and timeliness requirements.
| Method | Description |
| --- | --- |
| Batch | A data transfer method based on a batch scheduling mechanism. It uses scheduled tasks (hourly or daily) to migrate full or incremental source data to the destination. |
| Real-time | Uses a stream processing engine to capture source data changes (CDC logs) in real time, achieving data synchronization with second-level latency. |
| Single table | Data transfer for a single table. It supports fine-grained field mapping, transform rules, and control configurations. |
| Full database | Migrates the schemas and data of multiple tables from a source database instance to a destination in one go. It supports automatic table creation, and a single task can synchronize multiple tables, which reduces the number of tasks and resource consumption. |
| Sharding | Writes data from multiple source tables with identical schemas into a single destination table. It automatically detects sharding routing rules and merges the data. |
| Full | A one-time migration of all historical data from a source table. This is typically used for data warehouse initialization or data archiving. |
| Incremental | Synchronizes only new or changed data from the source (see the example after this table). |
| Full and incremental | Performs a one-time full synchronization of historical data, then automatically transitions to writing incremental data. Data Integration supports initial full and incremental synchronization for various scenarios. Select a method based on the features and timeliness requirements of the source and destination data sources. |
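The following fragment is a minimal sketch of how an incremental batch task can express its filter in the code editor's JSON script. It assumes a MySQL reader whose rows carry a modification timestamp; the data source, table, and column names are hypothetical placeholders, and the exact filter expression depends on the source and on how the `${bizdate}` scheduling parameter is formatted.

```json
{
  "stepType": "mysql",
  "category": "reader",
  "name": "Reader",
  "parameter": {
    "datasource": "your_mysql_datasource",
    "table": ["orders"],
    "column": ["order_id", "amount", "gmt_modified"],
    "where": "gmt_modified >= '${bizdate}'"
  }
}
```

On each scheduled run, the scheduler substitutes the business date into `${bizdate}`, so the task reads only rows changed since that date rather than rescanning the full table.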
Terms
| Concept | Description |
| --- | --- |
| Data synchronization | Data synchronization reads data from a source, extracts and filters it, and then writes it to a destination. Data Integration focuses on transferring data that can be parsed into a logical two-dimensional table schema. It does not provide data stream consumption or extract, transform, and load (ETL) transformations. |
| Field mapping | Field mapping defines the read/write relationship between source and destination fields in a sync task. When you configure mapping, ensure strict compatibility between field types to prevent conversion errors, dirty data, or task failures caused by type mismatches. For how field mapping and the settings below appear in a task configuration, see the sketch after this table. |
| Concurrency | Concurrency is the maximum number of parallel threads that a sync task uses to read data from the source or write data to the destination. |
| Rate limiting | Rate limiting is the transfer speed limit for a Data Integration sync task. |
| Dirty data | Dirty data is data that is invalid, incorrectly formatted, or that causes a synchronization error. Any single record that fails to be written to the destination is classified as dirty data, for example, a record whose value cannot be converted to the destination field type. If a task fails due to dirty data, data that has already been written is not rolled back. Data Integration uses a batch writing mechanism; when a batch fails, rollback depends on whether the destination supports transactions. Data Integration itself does not provide transaction support. |
| Data source | A data source is a standardized configuration unit in DataWorks for connecting to external systems. Through pre-configured connection templates for disparate systems, such as MaxCompute, MySQL, and OSS, it provides unified read and write endpoint definitions for Data Integration tasks. |
| Data consistency | Data Integration provides an at-least-once delivery guarantee and does not support exactly-once delivery, so data may be duplicated after transfer. Ensure uniqueness through primary keys and the capabilities of the destination. |
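As a minimal sketch of how these terms map onto a task configuration, the following `setting` fragment, assuming the code editor's JSON script mode, sets the concurrency, enables rate limiting, and sets the dirty data tolerance. The exact keys can vary by task type and plugin version, so treat them as illustrative.

```json
{
  "setting": {
    "speed": {
      "concurrent": 2,
      "throttle": true,
      "mbps": 5
    },
    "errorLimit": {
      "record": "0"
    }
  }
}
```

Here `concurrent` caps the parallel read/write threads, `throttle` and `mbps` enable and size the rate limit, and `errorLimit.record` is the number of dirty data records tolerated before the task fails (`"0"` fails the task on the first dirty record). Field mapping itself is configured per task: visually in the codeless UI, or positionally in script mode, where the n-th column read from the source is written to the n-th column of the destination.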
Product features and core values
DataWorks Data Integration features broad connectivity, flexible solutions, excellent performance, simplified development and O&M, and comprehensive security controls.
Broad data ecosystem connectivity
Break down data silos to achieve data aggregation and migration.
- Supports a wide range of data sources: Covers relational databases, big data storage, NoSQL databases, message queues, file storage, and SaaS applications.
- Compatible with complex networks: By configuring network connectivity settings, it supports data forwarding in hybrid cloud and multicloud architectures over the Internet, VPCs, Express Connect, or Cloud Enterprise Network (CEN).
Flexible and rich synchronization solutions
Meets synchronization needs ranging from batch to real-time, from single table to full database, and from full to incremental.
- Batch synchronization: Supports various batch scenarios, such as single table, full database, and sharding. It provides data filtering, column pruning, and transformation logic, and is suitable for periodic T+1 ETL loading of large-scale data.
- Real-time synchronization: Captures data changes from sources such as MySQL, Oracle, and Hologres in near real time, then writes the data to a real-time data warehouse or message queue to support real-time business decisions.
- Integrated full and incremental synchronization: Provides full database synchronization solutions in batch, real-time, and integrated full and incremental (near real-time) modes. A task performs an initial full data synchronization on its first run and then automatically switches to incremental synchronization, which simplifies initial data warehousing and subsequent updates.
Elastic scaling and performance
Adaptive resource scheduling provides highly stable data transfer guarantees for core business operations.
- Elastic resources: Serverless resource groups support on-demand elastic scaling and pay-as-you-go billing to effectively handle traffic fluctuations.
- Performance tuning: Supports concurrency control, rate limiting, dirty data processing, and distributed processing to ensure stable synchronization under different loads.
Low-code development and intelligent O&M
Reduces the complexity and cost of data synchronization development and O&M through visual configuration and workflows.
- Low-code development: The codeless UI provides a visual configuration interface, so you can configure most sync tasks with simple clicks and no code. The code editor supports advanced configuration through JSON scripts to meet complex requirements, such as parameterization and dynamic column mapping (see the sketch after this list).
- End-to-end O&M: Batch sync tasks can be integrated into directed acyclic graph (DAG) workflows, which supports scheduling orchestration, monitoring, and alerting.
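To make the code editor concrete, the following is a minimal sketch of the overall shape of a script-mode batch task, assuming a MySQL-to-MaxCompute job. The data source, table, and column names are hypothetical placeholders, and each reader and writer plugin documents its own authoritative parameters.

```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "mysql",
      "category": "reader",
      "name": "Reader",
      "parameter": {
        "datasource": "your_mysql_datasource",
        "table": ["orders"],
        "column": ["order_id", "amount"]
      }
    },
    {
      "stepType": "odps",
      "category": "writer",
      "name": "Writer",
      "parameter": {
        "datasource": "your_maxcompute_datasource",
        "table": "ods_orders",
        "partition": "pt=${bizdate}",
        "column": ["order_id", "amount"]
      }
    }
  ],
  "setting": {
    "speed": { "concurrent": 2 },
    "errorLimit": { "record": "0" }
  },
  "order": {
    "hops": [{ "from": "Reader", "to": "Writer" }]
  }
}
```

The `steps` array declares one reader and one writer, `order.hops` wires the reader's output to the writer's input, and `setting` carries the concurrency and dirty data limits described under Terms.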
Comprehensive security control
Integrates multi-layered security mechanisms to ensure data control and compliance throughout its entire lifecycle.
- Centralized management: A unified data source management center supports permission control for data sources and isolation between development and production environments.
- Security protection: Uses Resource Access Management (RAM) for access control and supports role-based authentication and data masking.
Billing description
The costs for Data Integration tasks mainly include resource group fees, scheduling fees, and data transfer costs. Data Integration tasks require resource groups, and you are charged based on resource group usage. Scheduling fees apply to certain batch synchronization tasks and full database batch tasks. Data transfer costs are also incurred if data is transferred over the Internet. For more billing details, see Core billing scenarios.
Network connectivity
A network connection between a data source and a resource group is required for Data Integration tasks to run. The task will fail if a connection cannot be established.

Data Integration supports data synchronization between disparate data sources in complex network environments. It supports the following complex scenarios:
- Data synchronization across different Alibaba Cloud accounts or regions.
- Connectivity for hybrid clouds and on-premises data centers.
- Configuration of multiple network channels, such as the Internet, VPC, and CEN.
For detailed network configuration solutions, see Overview of network connectivity solutions.
References
To transfer and migrate data, you can configure a data source and create a sync task in Data Integration or Data Studio. For more information, see: