
DataWorks: Overview

Last Updated: Mar 27, 2026

Data Integration is a stable, efficient, and scalable data synchronization platform that moves data at high speed between disparate data sources across complex network environments.

Important

Access Data Integration from a PC using Chrome 69 or later.


How it works

A typical Data Integration workflow has four stages:

  1. Connect: Configure a data source, provision a resource group, and establish network connectivity between them.

  2. Develop: Choose a batch or real-time synchronization method, then complete resource and task configuration.

  3. Test and publish: Use data preview and trial runs to debug. After debugging succeeds, submit and publish the task. Batch tasks must be published to the production environment.

  4. Operate: Monitor synchronization status, set alerts, and optimize resources for full lifecycle management.

Synchronization methods

DataWorks Data Integration provides synchronization methods that can be combined across three dimensions: latency, scope, and data policy. For more information about the solutions and recommendations, see Supported data sources and synchronization solutions.

How to read the dimensions:

  • Latency — how often data moves (scheduled batch vs. continuous real-time)

  • Scope — how much of the source is transferred (one table, a full database, or merged shards)

  • Data policy — which records are transferred (all history, only new changes, or both)

Latency

  • Batch: Uses scheduled tasks (hourly or daily) to migrate full or incremental data. Suitable for periodic T+1 ETL workloads.

  • Real-time: Captures source data changes using Change Data Capture (CDC) via a stream processing engine, achieving synchronization latency within seconds.
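To make the real-time method concrete, the sketch below shows how CDC-style change events can be applied to a destination. The event shape (`op`, `key`, `row`) is a simplified assumption for illustration, not the actual DataWorks wire format.

```python
def apply_cdc_events(dest, events):
    """Apply INSERT/UPDATE/DELETE change events to a dict keyed by primary key."""
    for ev in events:
        if ev["op"] in ("INSERT", "UPDATE"):
            dest[ev["key"]] = ev["row"]   # upsert the latest row image
        elif ev["op"] == "DELETE":
            dest.pop(ev["key"], None)     # drop the row if present
    return dest

dest = {1: {"id": 1, "v": "a"}}
events = [
    {"op": "UPDATE", "key": 1, "row": {"id": 1, "v": "b"}},
    {"op": "INSERT", "key": 2, "row": {"id": 2, "v": "c"}},
    {"op": "DELETE", "key": 1},
]
apply_cdc_events(dest, events)  # dest is now {2: {"id": 2, "v": "c"}}
```

A real stream processing engine applies the same logic continuously as events arrive from the source's change log, which is what keeps end-to-end latency within seconds.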

Scope

  • Single table: Transfers one table at a time with fine-grained field mapping, transformation rules, and control configurations.

  • Full database: Migrates schemas and data from multiple tables in one task. Supports automatic table creation, which reduces the number of tasks and overall resource consumption.

  • Sharding: Merges data from multiple source tables with identical schemas into a single destination table. Automatically detects sharding routing rules.
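The sharding scope can be pictured as the following sketch: rows from several source tables with identical schemas are consolidated into one destination table, with an extra column recording the originating shard. The table names and the `_source_table` column are illustrative assumptions.

```python
def merge_shards(shards):
    """Consolidate rows from identically-schemed shard tables into one table."""
    merged = []
    for shard_name, rows in shards.items():
        for row in rows:
            # Tag each row with its source shard so lineage is preserved.
            merged.append({**row, "_source_table": shard_name})
    return merged

shards = {
    "orders_00": [{"order_id": 1, "amount": 10}],
    "orders_01": [{"order_id": 2, "amount": 25}],
}
merge_shards(shards)
```

In the product, the routing rules that decide which source tables belong to the same logical table are detected automatically rather than listed by hand as above.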

Data policy

  • Full: A one-time migration of all historical data. Typically used for data warehouse initialization or data archiving.

  • Incremental: Transfers only new or changed records (such as INSERT or UPDATE operations). Implemented through data filters in batch mode or CDC log reading in real-time mode.

  • Full and incremental: Performs a one-time full synchronization, then automatically switches to incremental synchronization. Three sub-modes are available based on timeliness requirements:
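Batch-mode incremental extraction via a data filter can be sketched as a watermark comparison: only rows whose modification timestamp is newer than the last watermark are transferred, and the watermark then advances. The `modify_time` field name is an assumption for illustration.

```python
def incremental_extract(rows, watermark):
    """Return rows changed since the watermark, plus the advanced watermark."""
    picked = [r for r in rows if r["modify_time"] > watermark]
    # Advance the watermark to the newest timestamp we transferred.
    new_watermark = max((r["modify_time"] for r in picked), default=watermark)
    return picked, new_watermark

rows = [
    {"id": 1, "modify_time": "2026-03-25 10:00:00"},
    {"id": 2, "modify_time": "2026-03-26 09:30:00"},
]
picked, wm = incremental_extract(rows, "2026-03-25 23:59:59")
# picked contains only id 2; wm advances to "2026-03-26 09:30:00"
```

In a real batch task the filter is typically pushed down to the source as a WHERE clause parameterized by the scheduling time, rather than evaluated in memory.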

Full and incremental sub-modes:

  • Batch: A one-time full load followed by periodic incremental synchronization. Use when the source has no strict timeliness requirements and provides a valid incremental field (for example, modify_time).

  • Real-time: A one-time full load followed by real-time CDC incremental synchronization. Use when the data has high timeliness requirements and the source is a message queue or a database that supports CDC logs.

  • Near real-time: A one-time full load into a base table, real-time incremental writes to a log table, and a T+1 merge of the log data into the base table. Use when the destination format does not support updates or deletes (for example, standard MaxCompute tables).
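The near real-time sub-mode's T+1 merge can be sketched as follows: the base table holds the last full snapshot, the log table holds the day's change records, and the merge keeps the latest image per primary key. The record shapes (`op`, `ts`, `row`) are simplified assumptions.

```python
def merge_base_with_log(base, log):
    """Merge a day's change log into the base snapshot, latest change wins."""
    merged = {row["id"]: row for row in base}
    for change in sorted(log, key=lambda c: c["ts"]):  # apply in time order
        if change["op"] == "DELETE":
            merged.pop(change["id"], None)
        else:  # INSERT or UPDATE: keep the newest row image
            merged[change["id"]] = change["row"]
    return list(merged.values())

base = [{"id": 1, "v": "old"}, {"id": 2, "v": "keep"}]
log = [
    {"op": "UPDATE", "id": 1, "ts": 1, "row": {"id": 1, "v": "new"}},
    {"op": "DELETE", "id": 2, "ts": 2},
    {"op": "INSERT", "id": 3, "ts": 3, "row": {"id": 3, "v": "added"}},
]
result = merge_base_with_log(base, log)
```

This is why the sub-mode suits destinations that cannot update or delete in place: all mutations land in an append-only log table, and the rewrite happens once per day.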

Key concepts

These terms appear at specific stages of task configuration and operation. Understanding them before you start reduces configuration errors.

  • Data synchronization (Step 1: Connect): Reads data from a source, extracts and filters it, and writes it to a destination. Data Integration focuses on transferring data that can be parsed into a logical two-dimensional table schema; it does not provide data stream consumption or ETL transformations.

  • Data source (Step 1: Connect): A standardized connection configuration in DataWorks for an external system (such as MaxCompute, MySQL, or OSS). Think of it as a saved connection string that tasks reuse.

  • Field mapping (Step 2: Develop): Defines which source fields are read and which destination fields are written. Ensure strict type compatibility during configuration; mismatches cause two common problems. Type conversion failure: inconsistent field types (for example, String at the source and Integer at the destination) interrupt the task or produce dirty data. Loss of precision or range: if the destination field's maximum value is smaller than the source's, or its precision is lower, writes can fail or values can be truncated, regardless of sync method.

  • Concurrency (Step 2: Develop): The maximum number of parallel read/write threads for a sync task.

  • Rate limiting (Step 2: Develop): A transfer speed cap for a sync task.

  • Dirty data (Step 3: Test and publish): A record that fails to write to the destination (for example, a VARCHAR value that cannot be converted to INT). Set a dirty data threshold in the task configuration; if the threshold is exceeded, the task fails and exits. Data already written is not rolled back. Data Integration writes in batches, and rollback after a batch error depends on whether the destination supports transactions; Data Integration itself does not provide transaction support.

  • Data consistency (Step 4: Operate): Data Integration guarantees at-least-once delivery. Exactly-once delivery is not supported, so duplicate records are possible. Use primary keys and the capabilities of the destination to enforce uniqueness.
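Because delivery is at-least-once, the same record can arrive more than once, and deduplication falls to the destination side. A minimal sketch, assuming each record carries a primary key (`id`) and a `modify_time` field:

```python
def dedupe(records):
    """Keep the newest version of each record, identified by primary key."""
    latest = {}
    for rec in records:
        key = rec["id"]
        # Keep the record with the most recent modify_time per key.
        if key not in latest or rec["modify_time"] > latest[key]["modify_time"]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: r["id"])

records = [
    {"id": 1, "modify_time": "2026-03-25", "v": "a"},
    {"id": 1, "modify_time": "2026-03-26", "v": "b"},  # duplicate delivery, newer
    {"id": 2, "modify_time": "2026-03-25", "v": "c"},
]
dedupe(records)
```

In practice this logic is usually expressed in the destination itself, for example through primary-key upserts or a windowed query that keeps the latest row per key.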

Features

Connect to your data ecosystem

Data Integration connects to relational databases, big data stores, NoSQL databases, message queues, file storage systems, and SaaS applications.

For cross-account, cross-region, hybrid cloud, and on-premises environments, configure network connectivity to route data over the Internet, virtual private clouds (VPCs), Express Connect, or Cloud Enterprise Network (CEN).

Synchronize data flexibly

  • Batch synchronization: Covers single table, full database, and sharding scenarios. Supports data filtering, column pruning, and transformation logic for large-scale periodic ETL loads.

  • Real-time synchronization: Captures changes from sources such as MySQL, Oracle, and Hologres and writes them to a real-time data warehouse or message queue with latency within seconds.

  • Full and incremental synchronization: Combines an initial full load with ongoing incremental synchronization (batch, real-time, or near real-time) to simplify initial data warehousing and continuous updates.

Scale to your workload

Serverless resource groups scale on demand with pay-as-you-go billing, handling traffic fluctuations without manual intervention. Concurrency control, rate limiting, dirty data handling, and distributed processing keep synchronization stable under varying loads.

Develop and operate at low cost

A codeless visual interface covers most sync task configurations. A JSON script editor handles advanced requirements such as parameterization and dynamic column mapping. Batch sync tasks integrate into directed acyclic graph (DAG) workflows for scheduling orchestration, monitoring, and alerting.
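The script editor's advantage is that a task is just structured configuration. The sketch below shows a hypothetical shape for a scripted batch sync task as a Python dict; the step types, parameter keys, and `${bizdate}` placeholder are illustrative assumptions, not the exact DataWorks script schema.

```python
import json

# Hypothetical scripted batch sync task: one reader step, one writer step,
# plus settings for concurrency, rate limiting, and the dirty data threshold.
task = {
    "type": "job",
    "steps": [
        {"stepType": "mysql", "category": "reader",
         "parameter": {"table": "orders",
                       "column": ["id", "amount", "modify_time"],
                       "where": "modify_time > '${bizdate}'"}},  # parameterized filter
        {"stepType": "odps", "category": "writer",
         "parameter": {"table": "ods_orders", "partition": "ds=${bizdate}"}},
    ],
    "setting": {"speed": {"concurrent": 2, "throttle": True, "mbps": 10},
                "errorLimit": {"record": 0}},  # fail on any dirty record
}
print(json.dumps(task, indent=2))
```

Keeping tasks in this form makes parameterization (the `${bizdate}` scheduling variable) and programmatic generation of many similar tasks straightforward.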

Control access and protect data

A unified data source management center provides permission controls and isolates development from production environments. Resource Access Management (RAM) handles access control with role-based authentication. Data masking is available.

Billing

Data Integration costs come from three sources. For a full breakdown, see Core billing scenarios.

Network connectivity

Every Data Integration task requires a working network connection between the data source and the resource group. A task fails if this connection cannot be established.


Data Integration supports synchronization across:

  • Different Alibaba Cloud accounts or regions

  • Hybrid cloud and on-premises data centers

  • Multiple network channels: Internet, VPC, Express Connect, and CEN

For configuration details, see Overview of network connectivity solutions.

What's next

Configure a data source and create a sync task in Data Integration or Data Studio.