DataWorks: Data Integration

Last Updated: Dec 09, 2025

Data Integration is a stable, efficient, and elastic data synchronization platform that provides high-speed, reliable data movement and synchronization between disparate data sources across complex network environments.

Process guide

Important

Data Integration can be used only on a PC running Google Chrome 69 or later.

The general development workflow for Data Integration is as follows:

  1. Configure a data source, prepare a resource group, and establish network connectivity between the data source and the resource group.

  2. Select an offline or real-time synchronization method to create a task for your scenario. Then, follow the on-screen instructions to complete the resource and task configuration (a minimal configuration sketch follows this list).

  3. Debug the task using the data preview and trial run features. After debugging, submit and publish the task. Offline tasks must be published to the production environment.

  4. Finally, enter the continuous O&M phase, where you can monitor the synchronization status, set alerts, and optimize resources to create a closed-loop management system.
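
For reference, the following is a minimal sketch of the kind of script-mode (code editor) task configuration referred to in step 2, assuming an offline MySQL-to-MaxCompute sync. The data source names, table, and column names are illustrative placeholders, not fixed values:

```json
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "mysql",
      "category": "reader",
      "name": "Reader",
      "parameter": {
        "datasource": "mysql_source",
        "table": ["orders"],
        "column": ["id", "amount", "modify_time"],
        "splitPk": "id"
      }
    },
    {
      "stepType": "odps",
      "category": "writer",
      "name": "Writer",
      "parameter": {
        "datasource": "odps_dest",
        "table": "ods_orders",
        "partition": "ds=${bizdate}",
        "column": ["id", "amount", "modify_time"],
        "truncate": true
      }
    }
  ],
  "setting": {
    "errorLimit": { "record": "0" },
    "speed": { "throttle": false, "concurrent": 2 }
  },
  "order": {
    "hops": [{ "from": "Reader", "to": "Writer" }]
  }
}
```

In the codeless UI, the same reader, writer, and channel settings are configured visually instead of in JSON.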

Synchronization methods

DataWorks Data Integration provides synchronization methods that can be combined across three dimensions: timeliness, scope, and data policy. For more information about solutions and recommendations, see Supported data sources and synchronization solutions.

  • Timeliness: This includes offline and real-time synchronization. Offline synchronization uses auto triggered tasks to migrate data on an hourly or daily basis. Real-time synchronization uses change data capture (CDC) to capture source data changes with second-level latency.

  • Scope: This includes single table, entire database, and sharding. Data Integration supports fine-grained transfer of a single table, along with batch migration and merging of an entire database or sharded databases.

  • Data policy: This includes full, incremental, and full and incremental synchronization. Full migration moves all historical data, incremental synchronization synchronizes only new or changed data, and the full and incremental mode combines both approaches. Data Integration provides multiple implementations of this mode, such as offline, real-time, and Near Real-Time, based on data source attributes and timeliness requirements.

The following list describes each method.

  • Offline: A data transfer method based on a batch scheduling mechanism. It uses auto triggered tasks (hourly or daily) to migrate full or incremental source data to the destination.

  • Real-time: Uses a stream processing engine to capture source data changes (CDC logs) in real time, achieving second-level latency for data synchronization.

  • Single table: Data transfer for a single table. It supports fine-grained field mapping, transformation rules, and control configurations.

  • Entire database: Migrates the schemas and data of multiple tables within a source database instance to the destination at once. It supports automatic table creation, and you can synchronize multiple tables in a single task to reduce the number of tasks and resource consumption.

  • Sharding: Writes data from multiple source tables with identical structures into a single destination table. It automatically detects sharding routing rules and merges the data.

  • Full: A one-time migration of all historical data from a source table. It is typically used for data warehouse initialization or data archiving.

  • Incremental: Syncs only new or changed data from the source, such as rows affected by INSERT or UPDATE operations. Data Integration supports both offline and real-time incremental modes, implemented by setting data filter conditions (incremental conditions) and by reading source CDC data, respectively.

  • Full and incremental: After a one-time full synchronization of historical data, the task automatically switches to writing incremental data. Data Integration supports full and incremental synchronization for various scenarios. Select a method based on the attributes and timeliness requirements of the source and destination data sources.

      • Offline scenario: A one-time full synchronization followed by periodic incremental synchronization. This is suitable for data sources that do not have high timeliness requirements and whose source tables have a suitable incremental field, such as modify_time (see the sketch after this list).

      • Real-time scenario: A one-time full synchronization followed by real-time incremental synchronization. This is suitable for data sources with high timeliness requirements, such as message queues or databases that support CDC logs.

      • Near Real-Time scenario: A one-time full synchronization is performed into a base table, and real-time incremental data is written to a log table. At T+1, the data from the log table is merged into the base table. This scenario complements the real-time scenario and is suitable for destination table formats that do not support updates or deletions, such as standard MaxCompute tables.
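
As an illustration of the offline incremental scenario, a script-mode reader can carry a filter condition on the incremental field. This is a sketch with illustrative names: the where clause syntax depends on the reader type, and ${bizdate} is a DataWorks scheduling parameter that resolves to the data timestamp at run time:

```json
{
  "stepType": "mysql",
  "category": "reader",
  "name": "Reader",
  "parameter": {
    "datasource": "mysql_source",
    "table": ["orders"],
    "column": ["id", "amount", "modify_time"],
    "where": "modify_time >= '${bizdate}'"
  }
}
```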

Terms

The following list describes key terms.

  • Data synchronization: Reading data from a source data source, extracting and filtering it, and then writing it to a destination. Data Integration focuses on transferring data that can be parsed into a logical two-dimensional table structure. It does not provide data stream consumption or extract, transform, and load (ETL) transformations.

  • Field mapping: The read-write relationship between source and destination fields in a sync task. When you configure mapping, strictly check the compatibility of the field types at both ends to prevent conversion errors, dirty data, or task failures caused by type mismatches. Common risks include the following (see the sketch after this entry):

      • Type conversion failure: If the source and destination field types are inconsistent (for example, the source is STRING and the destination is INTEGER), the task is interrupted or dirty data is generated.

      • Loss of precision or range: If the maximum value of the destination field type is less than the source's maximum, its minimum is greater than the source's minimum, or its precision is lower than the source's, a write failure or precision truncation may occur. This risk applies regardless of the source and destination types and whether the synchronization is offline or real-time.
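
In script mode, field mapping is positional: the first reader column is written to the first writer column, and so on. The following abbreviated fragment (column names are illustrative, and only the column lists are shown) maps three fields; each source-destination pair must have compatible types:

```json
{
  "steps": [
    {
      "category": "reader",
      "parameter": { "column": ["id", "user_name", "score"] }
    },
    {
      "category": "writer",
      "parameter": { "column": ["id", "name", "score"] }
    }
  ]
}
```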

  • Number of concurrent threads: The maximum number of threads that a data synchronization task uses to read from the source or write to the destination in parallel.

  • Rate limiting: The maximum transfer rate that a Data Integration sync task is allowed to reach.
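
In script-mode tasks, both of these settings are typically configured in the job's setting.speed block. A sketch with illustrative values, where throttle enables rate limiting and mbps caps the transfer rate:

```json
{
  "setting": {
    "speed": {
      "concurrent": 4,
      "throttle": true,
      "mbps": 10
    }
  }
}
```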

  • Dirty data: Data that is invalid, incorrectly formatted, or fails to synchronize. When a single data record fails to be written to the destination, it is classified as dirty data (for example, a source VARCHAR value that cannot be converted to a destination INT type). You can configure a dirty data toleration policy in the task configuration: set a threshold to limit the number of dirty data records, and if the threshold is exceeded, the task fails and exits.

    If a task fails due to dirty data, successfully written data is not rolled back. Data Integration uses a batch writing mechanism, and the ability to roll back a failed batch depends on whether the destination supports transactions. Data Integration itself does not provide transaction support.
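
In script mode, the dirty data threshold is typically set in setting.errorLimit. In the following sketch, the task tolerates up to 10 dirty records and fails once that threshold is exceeded; a value of "0" makes the task fail on the first dirty record:

```json
{
  "setting": {
    "errorLimit": { "record": "10" }
  }
}
```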

  • Data source: A standardized configuration unit in DataWorks for connecting to external systems. It provides unified read and write endpoint definitions for Data Integration tasks through pre-configured connection templates for disparate data sources, such as MaxCompute, MySQL, and OSS.

  • Data consistency: Data Integration synchronization supports only an at-least-once delivery guarantee, not exactly-once delivery. This means that data may be duplicated after transfer. Uniqueness must be ensured using primary keys and the capabilities of the destination.
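
For example, when the destination is a relational database with a primary key, the duplicates produced by at-least-once delivery can be absorbed by an idempotent write mode. A sketch for a MySQL writer (names are illustrative), where writeMode "replace" overwrites rows that collide on the primary key:

```json
{
  "stepType": "mysql",
  "category": "writer",
  "name": "Writer",
  "parameter": {
    "datasource": "mysql_dest",
    "table": "orders",
    "column": ["id", "amount"],
    "writeMode": "replace"
  }
}
```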

Features and core benefits

DataWorks Data Integration provides extensive connectivity, flexible solutions, excellent performance, convenient development and operations, and comprehensive security controls.

Extensive data ecosystem connectivity

Break down data silos to enable data aggregation and migration.
  • Support for various data sources: Covers multiple types of data sources, including relational databases, big data storage, NoSQL databases, message queues, file storage, and Software as a Service (SaaS) applications.

  • Complex network compatibility: By configuring network connectivity settings, it supports data forwarding in hybrid cloud or multicloud architectures over the Internet, VPCs, Express Connect, or Cloud Enterprise Network (CEN).

Flexible and rich synchronization solutions

Meets synchronization needs ranging from offline to real-time, from single table to entire database, and from full to incremental.
  • Offline synchronization: Supports various offline batch synchronization scenarios, such as single table, entire database, and sharding. It provides capabilities for data filtering, column pruning, and transformation logic, making it suitable for large-scale, periodic T+1 ETL loading.

  • Real-time synchronization: Captures data changes from data sources such as MySQL, Oracle, and Hologres in near real time, then writes the changes to a real-time data warehouse or message queue to support real-time business decisions.

  • Integrated full and incremental synchronization: Provides solutions such as offline entire database, real-time entire database, and full and incremental entire database (Near Real-Time) synchronization. It performs an initial full data synchronization on the first run and then automatically switches to incremental data synchronization. This simplifies the process of initial data warehousing and subsequent updates, providing data ingestion capabilities for full migration, incremental capture, and automatic transition between full and incremental modes.

Elastic scaling and performance

Dynamic resource scheduling ensures highly stable data transfers for core business operations.
  • Elastic resources: Serverless resource groups support on-demand elastic scaling and pay-as-you-go billing to effectively handle traffic fluctuations.

  • Performance tuning: Supports concurrency control, traffic limiting, dirty data processing, and distributed processing to ensure stable synchronization under different workloads.

Low-code development and intelligent O&M

Reduces the complexity and cost of data synchronization development and O&M through visual configurations and workflows.
  • Low-code development: The codeless UI provides a visual interface where you can configure most sync tasks with simple clicks. The code editor supports advanced configuration using JSON scripts to meet the needs of complex scenarios, such as parameterization and dynamic column mapping.

  • Full-link O&M: Offline sync tasks can be integrated into directed acyclic graph (DAG) workflows to support scheduling, orchestration, monitoring, and alerting.

Comprehensive security control

Integrates multilayer security mechanisms to ensure data controllability and compliance throughout the data lifecycle.
  • Centralized management: A unified Management Center supports permission control for data sources and isolation between the development and production environments.

  • Security protection: Integrates with Resource Access Management (RAM) for access control and supports role-based authentication and data masking.

Billing

The costs of Data Integration tasks mainly consist of resource group fees, scheduling fees, and data transfer costs. Data Integration tasks run on resource groups, and you are charged for these resources. Some offline or offline entire database sync tasks involve scheduled runs, which incur scheduling fees. If a data source transfers data over the Internet, data transfer costs are also incurred. For more information about billing, see Core billing scenarios.

Network connectivity

Network connectivity between the data source and the resource group is a prerequisite for a Data Integration task to run successfully. You must ensure that they can connect to each other. Otherwise, the task will fail.

Data Integration supports data synchronization between disparate data sources in complex network environments. It supports the following complex scenarios:

  • Data synchronization across different Alibaba Cloud accounts or regions.

  • Integration with hybrid clouds and on-premises data centers.

  • Configuration of multiple network channels, such as the Internet, VPC, and CEN.

For more information about network configuration solutions, see Overview of network connectivity solutions.

References

You can configure a data source and create a synchronization task in Data Integration or Data Studio to transfer and migrate data. For more information, see the following documents: