Data Integration is a stable, efficient, and elastic data synchronization platform. It provides high-speed and stable data movement and synchronization between disparate data sources across complex network environments.
Process guide
Data Integration can be used only on a PC running Chrome 69 or later.
The general development workflow for Data Integration is as follows:
1. Configure a data source, prepare a resource group, and establish network connectivity between the data source and the resource group.
2. Select an offline or real-time synchronization method to create a task for your scenario. Then, follow the on-screen instructions to complete the resource and task configuration.
3. Debug the task using the data preview and trial run features. After debugging, submit and publish the task. Offline tasks must be published to the production environment.
4. Enter the continuous O&M phase, where you can monitor the synchronization status, set alerts, and optimize resources to create a closed-loop management system.
Synchronization methods
DataWorks Data Integration provides synchronization methods that can be combined across three dimensions: timeliness, scope, and data policy. For more information about solutions and recommendations, see Supported data sources and synchronization solutions.
- Timeliness: Includes offline and real-time synchronization. Offline synchronization uses auto triggered tasks to migrate data on an hourly or daily basis. Real-time synchronization uses change data capture (CDC) to capture source data changes with second-level latency.
- Scope: Includes single table, entire database, and sharding. Data Integration supports fine-grained transfer of a single table, along with batch migration and merging of an entire database or sharded databases.
- Data policy: Includes full, incremental, and full-and-incremental synchronization. Full migration moves all historical data, incremental synchronization transfers only new or changed data, and the full-and-incremental mode combines both approaches. Multiple implementations are available, such as offline, real-time, and near-real-time, depending on data source attributes and timeliness requirements.
| Method | Description |
| --- | --- |
| Offline | A data transfer method based on a batch scheduling mechanism. It uses auto triggered tasks (hourly or daily) to migrate full or incremental source data to the destination. |
| Real-time | Uses a stream processing engine to capture source data changes (CDC logs) in real time, achieving second-level latency for data synchronization. |
| Single table | Data transfer for a single table. It supports fine-grained field mapping, transformation rules, and control configurations. |
| Entire database | Migrates the schemas and data of multiple tables within a source database instance to the destination at once. It supports automatic table creation. You can synchronize multiple tables in a single task to reduce the number of tasks and resource consumption. |
| Sharding | Writes data from multiple source tables with identical structures into a single destination table. It automatically detects sharding routing rules and merges the data. |
| Full | A one-time migration of all historical data from a source table. It is typically used for data warehouse initialization or data archiving. |
| Incremental | Synchronizes only new or changed data from the source, such as records filtered by a timestamp or date column. |
| Full and incremental | After a one-time full synchronization of historical data, it automatically proceeds to write incremental data. Data Integration supports full and incremental synchronization for various scenarios. Select a method based on the attributes and timeliness requirements of the source and destination data sources. |
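As a rough illustration of the data policies above, the following Python sketch contrasts a full load with watermark-based incremental pulls and shows how the full-and-incremental mode chains the two. It is not DataWorks code: the `orders` table, its `modify_time` column, and the in-memory SQLite database are hypothetical stand-ins for a real source.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical source: an in-memory table stands in for a real database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, modify_time TEXT)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 9.9, "2024-01-01T00:00:00+00:00"), (2, 19.9, "2024-01-02T00:00:00+00:00")],
)

def full_sync(conn):
    """Full policy: read all historical rows once."""
    return conn.execute("SELECT id, amount, modify_time FROM orders").fetchall()

def incremental_sync(conn, watermark):
    """Incremental policy: read only rows modified after the watermark."""
    return conn.execute(
        "SELECT id, amount, modify_time FROM orders WHERE modify_time > ?",
        (watermark,),
    ).fetchall()

# Full-and-incremental policy: one full pass, then repeated incremental pulls.
rows = full_sync(src)
watermark = max(r[2] for r in rows)  # remember the high-water mark
src.execute(
    "INSERT INTO orders VALUES (3, 5.0, ?)",
    (datetime.now(timezone.utc).isoformat(),),
)
changes = incremental_sync(src, watermark)  # picks up only the new row
print(len(rows), len(changes))  # prints: 2 1
```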
Terms
| Term | Description |
| --- | --- |
| Data synchronization | Data synchronization involves reading data from a source data source, extracting and filtering it, and then writing it to a destination. Data Integration focuses on transferring data that can be parsed into a logical two-dimensional table structure. It does not provide data stream consumption or extract, transform, and load (ETL) transformations. |
| Field mapping | Field mapping defines the read-write relationship between source and destination fields in a sync task. When you configure mapping, strictly check the compatibility of field types at both ends. Common risks of type mismatches include precision loss, value truncation, and conversion errors that produce dirty data or task failures. |
| Number of concurrent threads | The maximum number of threads that can read from the source or write to the destination data store in parallel during a data synchronization task. |
| Rate limiting | The maximum transfer speed that a Data Integration sync task is allowed to reach. |
| Dirty data | Data that is invalid, incorrectly formatted, or fails to synchronize. When a single data record cannot be written to the destination, it is classified as dirty data (for example, a source value whose type cannot be converted to the destination field's type). If a task fails due to dirty data, successfully written data is not rolled back. Data Integration uses a batch writing mechanism, and whether a failed batch can be rolled back depends on whether the destination supports transactions. Data Integration itself does not provide transaction support. |
| Data source | A standardized configuration unit in DataWorks for connecting to external systems. It provides unified read and write endpoint definitions for Data Integration tasks through pre-configured connection templates for disparate data sources (such as MaxCompute, MySQL, and OSS). |
| Data consistency | Data Integration synchronization supports only an at-least-once delivery guarantee. It does not support exactly-once delivery. This means that data may be duplicated after transfer. Uniqueness must be ensured using primary keys and the capabilities of the destination. |
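Because delivery is at-least-once, a replayed record must not produce a second row at the destination. A common pattern is an idempotent upsert keyed on the primary key. The sketch below is a minimal illustration, using SQLite's ON CONFLICT clause as a stand-in for whatever upsert mechanism the real destination provides.

```python
import sqlite3

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def write(record):
    """Idempotent write: replaying the same record leaves one row, not two."""
    dst.execute(
        "INSERT INTO users (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (record["id"], record["name"]),
    )

# Simulate an at-least-once stream: the first record is delivered twice.
for rec in [{"id": 1, "name": "a"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}]:
    write(rec)

print(dst.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # prints: 2
```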
Features and core values
DataWorks Data Integration provides extensive connectivity, flexible solutions, excellent performance, convenient development and operations, and comprehensive security controls.
Extensive data ecosystem connectivity
Break down data silos to enable data aggregation and migration.
- Support for various data sources: Covers multiple types of data sources, including relational databases, big data storage, NoSQL databases, message queues, file storage, and Software as a Service (SaaS) applications.
- Complex network compatibility: By configuring network connectivity settings, it supports data forwarding in hybrid cloud or multicloud architectures over the Internet, VPCs, Express Connect, or Cloud Enterprise Network (CEN).
Flexible and rich synchronization solutions
Meets synchronization needs ranging from offline to real-time, from single table to entire database, and from full to incremental.
- Offline synchronization: Supports various offline batch synchronization scenarios, such as single table, entire database, and sharding (see the sketch after this list). It provides data filtering, column pruning, and transformation logic, making it suitable for large-scale, periodic T+1 ETL loading.
- Real-time synchronization: Captures data changes from data sources such as MySQL, Oracle, and Hologres in near real time, then writes the changes to a real-time data warehouse or message queue to support real-time business decisions.
- Integrated full and incremental synchronization: Provides solutions such as offline entire database, real-time entire database, and near-real-time full and incremental entire database synchronization. It performs a full data synchronization on the first run and then automatically switches to incremental synchronization, simplifying both initial data warehousing and subsequent updates.
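The sharding scenario referenced above can be pictured as follows. This Python sketch is purely illustrative: the shard tables `orders_0` and `orders_1` and the in-memory SQLite database are hypothetical, and real sharding support also detects routing rules automatically, which this sketch does not attempt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical shards: identical schemas, data split across tables.
for shard in ("orders_0", "orders_1"):
    conn.execute(f"CREATE TABLE {shard} (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders_0 VALUES (1, 10.0)")
conn.execute("INSERT INTO orders_1 VALUES (2, 20.0)")

# Destination keeps an extra column recording which shard each row came from.
conn.execute("CREATE TABLE orders_all (id INTEGER, amount REAL, src_table TEXT)")
for shard in ("orders_0", "orders_1"):
    conn.execute(f"INSERT INTO orders_all SELECT id, amount, '{shard}' FROM {shard}")

print(conn.execute("SELECT COUNT(*) FROM orders_all").fetchone()[0])  # prints: 2
```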
Elastic scaling and performance
Dynamic resource scheduling ensures highly stable data transfers for core business operations.
- Elastic resources: Serverless resource groups support on-demand elastic scaling and pay-as-you-go billing to effectively handle traffic fluctuations.
- Performance tuning: Supports concurrency control, rate limiting, dirty data processing, and distributed processing to keep synchronization stable under different workloads (see the sketch below).
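These tuning knobs map onto familiar building blocks. The following Python sketch illustrates the concepts rather than the platform's implementation: a thread pool stands in for concurrency control, a token bucket for rate limiting, and an error counter with a threshold for dirty data handling. All names and limits are made up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class TokenBucket:
    """Simple rate limiter: allows roughly `rate` operations per second."""
    def __init__(self, rate):
        self.rate, self.tokens, self.last = rate, rate, time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(1 / self.rate)

bucket = TokenBucket(rate=100)   # rate limiting: ~100 writes per second
dirty = []                       # records that failed to convert
DIRTY_LIMIT = 10                 # abort once too many records are bad

def write_record(rec):
    bucket.acquire()             # throttle before each write
    try:
        int(rec["value"])        # stand-in for a type conversion at write time
    except (ValueError, TypeError):
        dirty.append(rec)        # dirty data: skip the record, but keep counting
        if len(dirty) > DIRTY_LIMIT:
            raise RuntimeError("dirty data threshold exceeded")

records = [{"value": str(i)} for i in range(500)] + [{"value": "oops"}]
with ThreadPoolExecutor(max_workers=4) as pool:  # concurrency control
    list(pool.map(write_record, records))
print(f"done, {len(dirty)} dirty record(s) skipped")
```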
Low-code development and intelligent O&M
Reduces the complexity and cost of data synchronization development and O&M through visual configurations and workflows.
- Low-code development: The codeless UI provides a visual interface where you can configure most sync tasks with a few clicks. The code editor supports advanced configuration using JSON scripts to meet the needs of complex scenarios, such as parameterization and dynamic column mapping.
- Full-link O&M: Offline sync tasks can be integrated into directed acyclic graph (DAG) workflows to support scheduling, orchestration, monitoring, and alerting.
Comprehensive security control
Integrates multilayer security mechanisms to ensure data controllability and compliance throughout the data lifecycle.
- Centralized management: A unified data source management center supports permission control for data sources and isolation between the development and production environments.
- Security protection: Integrates with Resource Access Management (RAM) for access control and supports role-based authentication and data masking.
Billing
The costs of Data Integration tasks mainly consist of resource group fees, scheduling fees, and data transfer costs. Data Integration tasks run on resource groups, and you are charged for these resources. Some offline or offline entire database sync tasks involve scheduled runs, which incur scheduling fees. If a data source transfers data over the Internet, data transfer costs are also incurred. For more information about billing, see Core billing scenarios.
Network connectivity
Network connectivity between the data source and the resource group is a prerequisite for a Data Integration task to run successfully. You must ensure that they can connect to each other. Otherwise, the task will fail.

Data Integration supports data synchronization between disparate data sources in complex network environments. It supports the following complex scenarios:
- Data synchronization across different Alibaba Cloud accounts or regions.
- Integration with hybrid clouds and on-premises data centers.
- Configuration of multiple network channels, such as the Internet, VPC, and CEN.
For more information about network configuration solutions, see Overview of network connectivity solutions.
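As a quick first check, you can probe whether a data source endpoint is reachable at the TCP level. The sketch below is a generic check with a placeholder host and port. Note that connectivity must ultimately hold from the resource group's own network, so a probe from your local machine is only a rough preliminary signal.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False

# Placeholder endpoint: substitute your data source's actual host and port.
print(can_reach("example-db.internal", 3306))
```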
References
You can configure a data source and create a synchronization task in Data Integration or Data Studio to transfer and migrate data. For more information, see the following documents: