
DataWorks: Overview

Last Updated: Dec 09, 2025

Data Integration is a stable, efficient, and scalable data synchronization platform that provides high-speed data synchronization between disparate data sources across complex network environments.

Process guide

Important

Data Integration must be accessed via a PC using Chrome version 69 or later.

The general development flow for Data Integration is as follows:

  1. Configure a data source, prepare a resource group, and establish network connectivity between the data source and the resource group.

  2. Select a batch or real-time synchronization method based on your scenario to develop a task. Follow the on-screen guide to complete the resource and task configuration (a configuration sketch follows this list).

  3. Use data preview and trial runs to debug the task. After successful debugging, submit and publish the task. Batch tasks must be published to the production environment.

  4. Enter the continuous O&M phase. Monitor the synchronization status, set alerts, and optimize resources to achieve full lifecycle management.
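
As a reference for step 2, the following is a minimal sketch of what a batch task might look like in the code editor (script mode). The overall shape follows Data Integration JSON scripts, but the step types, data source names, tables, and columns shown here are illustrative assumptions, not a definitive configuration:

    {
      "type": "job",
      "version": "2.0",
      "steps": [
        {
          "category": "reader",
          "name": "Reader",
          "stepType": "mysql",
          "parameter": {
            "datasource": "my_mysql_source",
            "table": "orders",
            "column": ["id", "amount", "modify_time"]
          }
        },
        {
          "category": "writer",
          "name": "Writer",
          "stepType": "odps",
          "parameter": {
            "datasource": "my_odps_dest",
            "table": "ods_orders",
            "partition": "ds=${bizdate}",
            "column": ["id", "amount", "modify_time"]
          }
        }
      ],
      "setting": {
        "speed": { "concurrent": 2 }
      }
    }

The reader and writer reference data sources configured in step 1, and ${bizdate} is a scheduling parameter resolved at run time, so the same script can load a different partition on each scheduled run.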

Synchronization methods

DataWorks Data Integration provides synchronization methods that can be combined across three dimensions: latency, scope, and data policy. For more information about the solutions and recommendations, see Supported data sources and synchronization solutions.

  • Latency: Includes batch and real-time. Batch synchronization uses scheduled tasks to migrate data on an hourly or daily basis. Real-time synchronization captures source data changes to achieve latency within seconds.

  • Scope: Includes single table, full database, and sharding. It supports fine-grained transfer of a single table, along with batch migration and merging of an entire database or sharded tables.

  • Data policy: Includes full, incremental, and initial full and incremental. Full migration moves all historical data, while incremental synchronization transfers only new or changed data. The initial full and incremental mode combines both, with batch, real-time, and near real-time implementation options depending on data source features and timeliness requirements.

The synchronization methods are described in more detail below.

Batch

A data transfer method based on a batch scheduling mechanism. It uses scheduled tasks (hourly/daily) to migrate full or incremental source data to the destination.

Real-time

Uses a stream processing engine to capture source data changes (CDC logs) in real time. This achieves data synchronization with latency in seconds.

Single table

Data transfer for a single table. It supports fine-grained field mapping, transform rules, and control configurations.

Full database

Migrates the schemas and data of multiple tables from a source database instance to a destination in one go. It supports automatic table creation. A single task can synchronize multiple tables, which reduces the number of tasks and resource consumption.

Sharding

Writes data from multiple source tables with identical schemas into a single destination table. It automatically detects sharding routing rules and merges the data.

Full

A one-time migration of all historical data from a source table. This is typically used for data warehouse initialization or data archiving.

Incremental

Synchronizes only new or changed data from the source, such as INSERT or UPDATE operations. Data Integration supports both batch and real-time incremental modes. These are implemented by setting data filters (incremental conditions) and reading source CDC data, respectively.
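
As an illustration of the batch incremental mode, the incremental condition is typically an SQL-style filter on the incremental field in the reader configuration. In the following hypothetical fragment, ${start_time} and ${end_time} are assumed custom scheduling parameters, and modify_time is an assumed incremental field:

    "parameter": {
      "datasource": "my_mysql_source",
      "table": "orders",
      "where": "modify_time >= '${start_time}' AND modify_time < '${end_time}'"
    }

On each scheduled run, the parameters resolve to the current time window, so only rows changed within that window are read.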

Full and incremental

Performs a one-time full synchronization of historical data, then automatically transitions to writing incremental data. Data Integration supports initial full and incremental synchronization for various scenarios. Select a method as needed based on the features and timeliness requirements of the source and destination data sources.

  • Batch scenario: One-time full and periodic incremental. This is suitable for data sources that do not have high timeliness requirements and have a valid incremental field (such as modify_time) in the source table.

  • Real-time scenario: One-time full and real-time incremental. This is suitable for data with high timeliness requirements, where the source is a message queue or a database that supports enabling CDC logs.

  • Near real-time scenario: One-time full synchronization to a base table, and real-time incremental synchronization to a log table. The data from the log table is merged into the base table on the next day (T+1). The Near Real-Time scenario complements the real-time scenario. It is suitable for destination table formats that do not support updates or deletions, such as standard MaxCompute tables.

Terms

The following terms are commonly used in Data Integration.

Data synchronization

Data synchronization reads data from a source, extracts and filters it, and then writes it to a destination. Data Integration focuses on transferring data that can be parsed into a logical two-dimensional table schema. It does not provide data stream consumption or extract, transform, and load (ETL) transformations.

Field mapping

Field mapping defines the read/write relationship between source and destination data in a sync task. When you configure mapping, ensure strict compatibility between field types. This prevents conversion errors, dirty data, or task failures caused by type mismatches. Common risks include the following:

  • Type conversion failure: Inconsistent field types between the source and destination (for example, String at the source and Integer at the destination) will directly cause task interruption or generate dirty data.

  • Loss of precision or range: If the maximum value of the destination field type is less than the source's maximum value (or its minimum value is greater than the source's minimum, or its precision is lower than the source's precision), there is a risk of write failure or precision truncation. This applies regardless of source and destination types, or whether the synchronization is batch or real-time.
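
As an illustration, in script mode the mapping is expressed as ordered column lists on the reader and writer, where positions correspond one to one. In this hypothetical fragment, the type at each position of the writer's list must be able to hold the value read at the same position of the reader's list:

    "steps": [
      {
        "category": "reader",
        "parameter": { "column": ["id", "name", "price"] }
      },
      {
        "category": "writer",
        "parameter": { "column": ["id", "name", "price"] }
      }
    ]

For example, if the source price column is DECIMAL(38,18) and the destination column is FLOAT, the task may truncate precision or produce dirty data even though the mapping itself is valid.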

Concurrency

Concurrency is the maximum number of parallel threads that a data synchronization task can use to read data from the source or write data to the destination.

Rate limiting

Rate limiting is the transfer speed limit for a Data Integration sync task.

Dirty data

Dirty data refers to data that is invalid, has a format error, or causes a synchronization error. When a single data record fails to be written to the destination, it is classified as dirty data. For example, a VARCHAR type from the source cannot be converted to an INT type at the destination. You can control the dirty data tolerance policy in the task configuration. Set a threshold to limit the number of dirty data records. If the threshold is exceeded, the task fails and exits.

If a task fails due to dirty data, data that has been successfully written will not be rolled back. Data Integration uses a batch writing mechanism. In case of a batch error, the rollback capability depends on whether the destination supports transactions. Data Integration itself does not provide transaction support.
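
The concurrency, rate limiting, and dirty data controls described above are typically grouped in a task's setting section. The following sketch uses illustrative values, and the exact keys can vary by task type and configuration mode:

    "setting": {
      "speed": {
        "concurrent": 4,
        "throttle": true,
        "mbps": 10
      },
      "errorLimit": {
        "record": 100
      }
    }

Read this as: run up to 4 parallel threads, cap throughput at 10 MB/s, and fail the task once the number of dirty data records exceeds 100.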

Data source

A data source is a standardized configuration unit in DataWorks for connecting to external systems. It provides unified read and write endpoint definitions for data integration tasks through various pre-configured connection templates for disparate data sources, such as MaxCompute, MySQL, and OSS.

Data consistency

Data Integration synchronization supports an at-least-once delivery guarantee. It does not support exactly-once delivery. This means data may be duplicated after transfer. Uniqueness must be ensured using primary keys and the capabilities of the destination.
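
For example, when the destination is a relational database, at-least-once delivery can often be made effectively idempotent by defining a primary key on the destination table and selecting a conflict-replacing write mode. The following is a hypothetical writer fragment; which writeMode values are available depends on the writer plugin:

    "category": "writer",
    "parameter": {
      "datasource": "my_mysql_dest",
      "table": "orders_copy",
      "column": ["id", "amount", "modify_time"],
      "writeMode": "replace"
    }

With replace semantics, a redelivered record overwrites the earlier copy that shares its primary key instead of creating a duplicate row.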

Product features and core values

DataWorks Data Integration features broad connectivity, flexible solutions, excellent performance, simplified development and O&M, and comprehensive security controls.

Broad data ecosystem connectivity

Breaks down data silos to enable data aggregation and migration.
  • Supports a wide range of data sources: Covers various types of data sources, such as relational databases, big data storage, NoSQL databases, message queues, file storage, and SaaS applications.

  • Compatible with complex networks: By configuring network connectivity settings, it supports data forwarding in hybrid cloud and multicloud architectures over the Internet, VPCs, Express Connect, or Cloud Enterprise Network (CEN).

Flexible and rich synchronization solutions

Meets synchronization needs ranging from batch to real-time, from single table to full database, and from full to incremental.
  • Batch synchronization: Supports various batch synchronization scenarios, such as single table, full database, and sharding. It provides capabilities for data filtering, column pruning, and transformation logic. It is suitable for periodic T+1 ETL loading of large-scale data.

  • Real-time synchronization: Captures data changes from data sources such as MySQL, Oracle, and Hologres in real time. It then writes the data to a real-time data warehouse or message queue to support real-time business decisions.

  • Integrated full and incremental synchronization: Provides full database synchronization solutions in batch, real-time, and integrated full and incremental (near real-time) modes. A task performs an initial full data synchronization on its first run and then automatically switches to incremental synchronization, which simplifies initial data warehousing and subsequent updates.

Elastic scaling and performance

Adaptive resource scheduling provides highly stable data transfer guarantees for core business operations.
  • Elastic resources: Serverless resource groups support on-demand elastic scaling and pay-as-you-go billing to effectively handle traffic fluctuations.

  • Performance tuning: Supports concurrency control, rate limiting, dirty data processing, and distributed processing to ensure stable synchronization under different loads.

Low-code development and intelligent O&M

Reduces the complexity and cost of data synchronization development and O&M through visual configuration and workflows.
  • Low-code development: The codeless UI provides a visual configuration interface. You can configure most sync tasks with simple clicks, without writing code. The code editor supports advanced configuration through JSON scripts to meet complex requirements, such as parameterization and dynamic column mapping.

  • End-to-end O&M: Batch sync tasks can be integrated into directed acyclic graph (DAG) workflows. This supports scheduling orchestration, monitoring, and alerting.

Comprehensive security control

Integrates multi-layered security mechanisms to ensure data control and compliance throughout its entire lifecycle.
  • Centralized management: A unified data source management center supports permission control for data sources and isolation between development and production environments.

  • Security protection: Uses Resource Access Management (RAM) for access control and supports role-based authentication and data masking.

Billing description

The costs for Data Integration tasks mainly include resource group fees, scheduling fees, and data transfer costs. Data Integration tasks require resource groups, and you are charged based on resource group usage. Scheduling fees apply to certain batch synchronization tasks and full database batch tasks. Data transfer costs are also incurred if data is transferred over the Internet. For more billing details, see Core billing scenarios.

Network connectivity

A network connection between a data source and a resource group is required for Data Integration tasks to run. The task will fail if a connection cannot be established.

Data Integration supports data synchronization between disparate data sources in complex network environments. It supports the following complex scenarios:

  • Data synchronization across different Alibaba Cloud accounts or regions.

  • Connectivity for hybrid clouds and on-premises data centers.

  • Configuration of multiple network channels, such as the Internet, VPC, and CEN.

For detailed network configuration solutions, see Overview of network connectivity solutions.

References

To get started, configure a data source and create a sync task in Data Integration or Data Studio to transfer and migrate data.