Data Integration is a stable, efficient, and scalable data synchronization platform that moves data at high speed between disparate data sources across complex network environments.
Access Data Integration from a PC using Chrome 69 or later.
## How it works
A typical Data Integration workflow has four stages:
- Connect: Configure a data source, provision a resource group, and establish network connectivity between them.
- Develop: Choose a batch or real-time synchronization method, then complete resource and task configuration.
- Test and publish: Use data preview and trial runs to debug. After debugging succeeds, submit and publish the task. Batch tasks must be published to the production environment.
- Operate: Monitor synchronization status, set alerts, and optimize resources for full lifecycle management.
## Synchronization methods
DataWorks Data Integration provides synchronization methods that can be combined across three dimensions: latency, scope, and data policy. For more information about the solutions and recommendations, see Supported data sources and synchronization solutions.
How to read the dimensions:
- Latency: how often data moves (scheduled batch vs. continuous real-time)
- Scope: how much of the source is transferred (one table, a full database, or merged shards)
- Data policy: which records are transferred (all history, only new changes, or both)
### Latency
| Method | Description |
|---|---|
| Batch | Uses scheduled tasks (hourly or daily) to migrate full or incremental data. Suitable for periodic T+1 ETL workloads. |
| Real-time | Captures source data changes using Change Data Capture (CDC) via a stream processing engine, achieving synchronization latency within seconds. |
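The two latency modes above can be sketched in a few lines. This is an illustrative simplification, not the platform's implementation: the batch path is a scheduled pull that advances a watermark over an incremental field, while the real-time path applies individual change events from a CDC stream. The record and event shapes (`modify_time`, `op`, `pk`, `row`) are assumptions made for the example.

```python
def batch_incremental_pull(rows, last_watermark):
    """Scheduled batch pass: pick up rows whose modify_time is newer than
    the last watermark, then advance the watermark for the next run."""
    new_rows = [r for r in rows if r["modify_time"] > last_watermark]
    new_watermark = max((r["modify_time"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

def apply_cdc_event(table, event):
    """Real-time pass: apply one change event from a CDC stream as it
    arrives, keeping the destination within seconds of the source."""
    if event["op"] in ("INSERT", "UPDATE"):
        table[event["pk"]] = event["row"]
    elif event["op"] == "DELETE":
        table.pop(event["pk"], None)
```

The batch function runs hourly or daily under a scheduler; the CDC function runs inside a continuous consumer loop.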
### Scope
| Method | Description |
|---|---|
| Single table | Transfers one table at a time with fine-grained field mapping, transform rules, and control configurations. |
| Full database | Migrates schemas and data from multiple tables in one task. Supports automatic table creation, reducing the number of tasks and resource consumption. |
| Sharding | Merges data from multiple source tables with identical schemas into a single destination table. Automatically detects sharding routing rules. |
### Data policy
| Method | Description |
|---|---|
| Full | One-time migration of all historical data. Typically used for data warehouse initialization or data archiving. |
| Incremental | Transfers only new or changed records (such as INSERT or UPDATE operations). Implemented via data filters (batch mode) or CDC log reading (real-time mode). |
| Full and incremental | Performs a one-time full synchronization, then automatically switches to incremental synchronization. Three sub-modes are available based on timeliness requirements: |
Full and incremental sub-modes:
| Sub-mode | How it works | When to use |
|---|---|---|
| Batch | One-time full load, then periodic incremental | Source has no strict timeliness requirements and has a valid incremental field (e.g., modify_time) |
| Real-time | One-time full load, then real-time CDC incremental | Data has high timeliness requirements; source is a message queue or a database that supports CDC logs |
| Near real-time | One-time full load to a base table; real-time incremental to a log table; log data merged into base table at T+1 | Destination format does not support updates or deletes (e.g., standard MaxCompute tables) |
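The near real-time sub-mode's T+1 merge step can be sketched as follows. This is a minimal model, not the platform's merge logic: the day's change log is replayed onto the base snapshot, the latest event per primary key wins, and DELETEs drop the row. The event shape (`seq`, `op`, `pk`, `row`) is an assumption made for the example.

```python
def merge_log_into_base(base, log_events):
    """T+1 merge: fold a day's change-log events into the base snapshot.
    Only the latest event per primary key takes effect."""
    latest = {}
    for ev in sorted(log_events, key=lambda e: e["seq"]):
        latest[ev["pk"]] = ev  # later events overwrite earlier ones

    merged = dict(base)
    for pk, ev in latest.items():
        if ev["op"] == "DELETE":
            merged.pop(pk, None)
        else:  # INSERT or UPDATE
            merged[pk] = ev["row"]
    return merged
```

This pattern is why the sub-mode suits destinations that cannot update or delete in place: the log table is append-only, and updates only materialize during the scheduled merge.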
## Key concepts
These terms appear at specific stages of task configuration and operation. Understanding them before you start reduces configuration errors.
| Concept | What it means | Where it matters |
|---|---|---|
| Data synchronization | Reads data from a source, extracts and filters it, and writes it to a destination. Data Integration focuses on transferring data that can be parsed into a logical two-dimensional table schema. It does not provide data stream consumption or ETL transformations. | Step 1: Connect |
| Data source | A standardized connection configuration in DataWorks for external systems (such as MaxCompute, MySQL, and OSS). Think of it as a saved connection string that tasks reuse. | Step 1: Connect |
| Field mapping | Defines which source fields are read and which destination fields are written. Type mismatches between source and destination fields cause task failures or dirty data, so ensure strict type compatibility during configuration. Common risks: type conversion failure (for example, String at the source and Integer at the destination) directly interrupts the task or produces dirty data; loss of precision or range occurs when the destination field's maximum value or precision is lower than the source's, risking write failure or truncation regardless of sync method. | Step 2: Develop |
| Concurrency | The maximum number of parallel read/write threads for a sync task. | Step 2: Develop |
| Rate limiting | A transfer speed cap for a sync task. | Step 2: Develop |
| Dirty data | A record that fails to write to the destination (for example, a VARCHAR value that cannot convert to INT). Set a dirty data threshold in the task configuration; if the threshold is exceeded, the task fails and exits. Data already written is not rolled back. Data Integration uses a batch writing mechanism, and if a batch fails, rollback depends on whether the destination supports transactions; Data Integration itself provides no transaction support. | Step 3: Test and publish |
| Data consistency | Data Integration guarantees at-least-once delivery. Exactly-once delivery is not supported, so duplicate records are possible. Use primary keys and the capabilities of the destination to enforce uniqueness. | Step 4: Operate |
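Two behaviors from the table above, the dirty-data threshold and at-least-once delivery, can be sketched as follows. This is an illustrative model, not the platform's code: `convert` stands in for per-record type conversion during field mapping, and the dedup helper shows how a primary key at the destination absorbs duplicate deliveries.

```python
def write_batch(records, convert, dirty_threshold):
    """Write records one by one, counting conversion failures as dirty
    data. Once the threshold is exceeded the task fails; rows already
    written are kept, mirroring the no-rollback behavior described above."""
    written, dirty = [], 0
    for rec in records:
        try:
            written.append(convert(rec))
        except (TypeError, ValueError):
            dirty += 1
            if dirty > dirty_threshold:
                raise RuntimeError(
                    f"dirty-data threshold exceeded ({dirty} > {dirty_threshold}); "
                    f"{len(written)} rows already written are not rolled back"
                )
    return written, dirty

def dedup_by_primary_key(rows, pk_field="id"):
    """At-least-once delivery can replay rows; keeping one row per
    primary key at the destination restores uniqueness."""
    return list({r[pk_field]: r for r in rows}.values())
```

In practice the dedup step is delegated to the destination (for example, an upsert on the primary key) rather than done in application code.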
## Features

### Connect to your data ecosystem
Data Integration connects to relational databases, big data stores, NoSQL databases, message queues, file storage systems, and SaaS applications.
For cross-account, cross-region, hybrid cloud, and on-premises environments, configure network connectivity to route data over the Internet, virtual private clouds (VPCs), Express Connect, or Cloud Enterprise Network (CEN).
### Synchronize data flexibly
- Batch synchronization: Covers single table, full database, and sharding scenarios. Supports data filtering, column pruning, and transformation logic for large-scale periodic ETL loads.
- Real-time synchronization: Captures changes from sources such as MySQL, Oracle, and Hologres and writes them to a real-time data warehouse or message queue with latency within seconds.
- Full and incremental synchronization: Combines an initial full load with ongoing incremental synchronization (batch, real-time, or near real-time) to simplify initial data warehousing and continuous updates.
### Scale to your workload
Serverless resource groups scale on demand with pay-as-you-go billing, handling traffic fluctuations without manual intervention. Concurrency control, rate limiting, dirty data handling, and distributed processing keep synchronization stable under varying loads.
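Two of the stability controls mentioned above can be sketched together: a semaphore caps the number of parallel writer threads (concurrency control), and a token bucket caps throughput (rate limiting). This is a minimal illustration with hypothetical names, not the platform's scheduler.

```python
import threading
import time

class RateLimitedWriter:
    """Sketch: a semaphore limits concurrent writers, and a token
    bucket limits bytes written per second."""

    def __init__(self, max_concurrency, bytes_per_second):
        self.slots = threading.Semaphore(max_concurrency)
        self.rate = bytes_per_second
        self.tokens = bytes_per_second
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def _take(self, nbytes):
        """Block until the bucket holds enough tokens for this chunk."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
            time.sleep(0.01)

    def write(self, chunk, sink):
        with self.slots:            # concurrency control
            self._take(len(chunk))  # rate limiting
            sink.append(chunk)
```

Bounding both knobs keeps a sync task from overwhelming either the source database or the destination under traffic spikes.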
### Develop and operate at low cost
A codeless visual interface covers most sync task configurations. A JSON script editor handles advanced requirements such as parameterization and dynamic column mapping. Batch sync tasks integrate into directed acyclic graph (DAG) workflows for scheduling orchestration, monitoring, and alerting.
### Control access and protect data
A unified data source management center provides permission controls and isolates development from production environments. Resource Access Management (RAM) handles access control with role-based authentication. Data masking is available.
## Billing
Data Integration costs come from three sources:
- Resource group fees: Charged based on resource group usage. All tasks require a resource group.
- Scheduling fees: Apply to certain batch sync tasks and full database batch tasks.
- Data transfer costs: Incurred when data crosses the Internet.
For a full breakdown, see Core billing scenarios.
## Network connectivity
Every Data Integration task requires a working network connection between the data source and the resource group. A task fails if this connection cannot be established.
Data Integration supports synchronization across:
- Different Alibaba Cloud accounts or regions
- Hybrid cloud and on-premises data centers
- Multiple network channels: Internet, VPC, Express Connect, and CEN
For configuration details, see Overview of network connectivity solutions.
## What's next
Configure a data source and create a sync task in Data Integration or Data Studio.