Data Integration in Data Studio: ETL task management in DataWorks

Overview

Define and manage Data Integration tasks in Data Studio alongside other task nodes, so that the full ETL/ELT pipeline can be built and run in a single view without switching modules.

A single-table batch synchronization task can be added to a workflow as a node with dependency configuration. All other integration task types run as standalone nodes.

Core mechanisms:

Configuration consistency: The configuration interface, parameters, and underlying functions are identical whether you create a task in Data Studio or in the Data Integration module.
Bidirectional synchronization: Tasks created in the Data Integration module are automatically synchronized to the data_integration_jobs directory in Data Studio, where they are grouped by channel (source type and destination type).

Prerequisites

Data source preparation:
- Source and destination data sources are configured. See Data source management.
- Your data sources support real-time synchronization. See Supported data sources and synchronization solutions.
- For data sources that require logging (such as Hologres and Oracle), logging is enabled. See Data source list.
Resource group: You have purchased and configured a serverless resource group.
Network connectivity: You have established network connectivity between the resource group and your data sources.

Create an integration task in Data Studio

The following example creates a single-table batch synchronization task from MySQL to MaxCompute.

Create a node

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the region. Find the workspace and choose Shortcuts > Data Studio in the Actions column.
Click the + icon at the top of the page, or use the toolbar of the workflow canvas, and select New Node > Data Integration.

Configure basic information

Choose batch or real-time synchronization based on your data latency, volume, and complexity requirements. Then select a compatible synchronization solution that matches your source and destination database types, network environment, and feature requirements.

In the New Node dialog box, configure the following settings:

Path: Select where to store the task node in the Data Studio directory tree.
Data Source Type: Select the type of the source, for example, MySQL.
Data destination type: Select the type of the destination, for example, MaxCompute.
Specific Type: Select a synchronization solution based on your needs, such as full-database real-time, full-database batch, full-incremental for an entire database, or single-table batch.
Name: Enter a name for the task node, for example, mysql_to_mc_user_table.

Click Confirm. The task is created and its visual configuration page opens.

Configure task details

The remaining configuration is identical to that in the Data Integration module.

Configure scheduling (single-table batch synchronization only)

Configure node properties such as schedule time, scheduling dependencies, and scheduling policies for automatic execution. Set scheduling parameters to pass dynamic variables to node instances at runtime.
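As an illustration of how a scheduling parameter becomes a concrete value at runtime, the following Python sketch mimics a T+1 business-date variable: it takes the scheduled run date and substitutes the previous day, in yyyymmdd format, into a filter expression. The function names, the `${bizdate}` placeholder handling, and the `ds` column are assumptions made for this example; they do not describe DataWorks internals.

```python
from datetime import date, timedelta


def resolve_bizdate(scheduled: date) -> str:
    """Return the day before the scheduled run date as yyyymmdd (T+1 convention)."""
    return (scheduled - timedelta(days=1)).strftime("%Y%m%d")


def render(template: str, scheduled: date) -> str:
    """Replace the hypothetical ${bizdate} placeholder with its resolved value."""
    return template.replace("${bizdate}", resolve_bizdate(scheduled))


# An instance scheduled for 2024-03-01 reads the data of 2024-02-29.
where_clause = render("ds = '${bizdate}'", date(2024, 3, 1))
print(where_clause)  # ds = '20240229'
```

Each scheduled instance resolves the placeholder against its own run date, so the same node definition reads a different partition every day.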

Publish and maintain the task

- Click Publish in the toolbar to deploy the task to the production environment. The task is then managed in Operation Center.
- After publishing, view run status, logs, alerts, and dependencies in Operation Center. For instance management, rerunning failed tasks, performance tuning, and dirty data handling, see Task O&M and tuning.

Task types

Data Integration supports multiple synchronization modes. Each mode differs in scheduling, dependency support, and use cases.

Single-table batch synchronization

Configuration: See Wizard mode configuration or Script mode configuration.

Description: Synchronizes data from a single source table to a single destination table on a defined schedule, such as daily or hourly.
Use cases:
- Daily T+1 synchronization of business data to build the ODS or DWD layers of a data warehouse.
- Periodically archiving business tables from a production database to a data lake or data warehouse.
- Periodically migrating report data across data sources.
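Script mode describes a task as a JSON document instead of a form. The Python sketch below assembles such a document for the MySQL-to-MaxCompute example. The overall reader/writer layout and the field names (`stepType`, `parameter`, `order.hops`, and so on) are assumptions based on the general shape of these scripts and should be checked against the script mode reference; the data source names, table names, and columns are hypothetical.

```python
import json
from typing import Any, Dict, List


def build_script(source_ds: str, source_table: str,
                 dest_ds: str, dest_table: str,
                 columns: List[str]) -> Dict[str, Any]:
    """Assemble a simplified single-table batch script (assumed field names)."""
    reader = {
        "stepType": "mysql",          # assumed identifier for the MySQL reader
        "name": "Reader",
        "category": "reader",
        "parameter": {
            "datasource": source_ds,
            "table": [source_table],
            "column": columns,
            "where": "",              # optional row filter for incremental reads
        },
    }
    writer = {
        "stepType": "odps",           # assumed identifier for the MaxCompute writer
        "name": "Writer",
        "category": "writer",
        "parameter": {
            "datasource": dest_ds,
            "table": dest_table,
            "column": columns,
            "partition": "pt=${bizdate}",  # filled in by a scheduling parameter
            "truncate": True,              # overwrite the partition on rerun
        },
    }
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader, writer],
        "setting": {"errorLimit": {"record": "0"}, "speed": {"concurrent": 2}},
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


# Hypothetical names, chosen only to match the example in this document.
script = build_script("my_mysql", "user_table", "my_maxcompute",
                      "ods_user_table", ["id", "name", "gmt_modified"])
print(json.dumps(script, indent=2))
```

Keeping the script as data makes it easy to generate one definition per table when many similar single-table tasks are needed.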

Single-table real-time synchronization

Configuration: See Single-table real-time task configuration.

Description: Captures real-time data changes (inserts, updates, deletes) from a single source table and synchronizes them to a destination table using CDC or a message queue.
Use cases:
- Synchronizing real-time data changes from a business database to MaxCompute or Hologres to build a real-time data warehouse.
- Providing data for real-time monitoring dashboards or recommendation systems.
- Replicating a single table in real time between database instances.

Full-database batch synchronization

Configuration: See Full-database batch synchronization task.

Description: Performs a one-time or periodic batch synchronization of all or selected tables from a source database.
Use cases:
- Migrating an entire business database to a cloud data warehouse for the first time.
- Performing regular full or incremental backups of an entire database.
- Initializing a new data analytics environment by importing all historical data at once.

Full-database real-time synchronization

Configuration: See Full-database real-time synchronization task.

Description: Captures schema and data changes in real time from all or specified tables in a source database and synchronizes them to a destination.
Use cases:
- Replicating a production OLTP database to an analytical database in real time for read/write splitting and real-time analytics.
- Building a real-time disaster recovery solution for a database.
- Maintaining real-time data consistency between a data lake or data platform and multiple upstream business systems.

Full-incremental database synchronization

Configuration: See Full-incremental synchronization task for an entire database.

Description: CDC data includes Insert, Update, and Delete operations. Append-only storage systems (such as non-Delta MaxCompute tables) cannot natively apply Update or Delete operations, so writing CDC data to them directly leads to inconsistent data. This mode addresses the issue by maintaining two tables at the destination: a base table (full snapshot) and a log table (incremental changes).
Use cases:
- When the destination is a non-Delta MaxCompute table and the source lacks an auto-incrementing column, batch incremental synchronization is not available. Full-incremental synchronization writes incremental data to the log table within minutes, and the merged state becomes visible on a T+1 basis.
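The base-plus-log idea can be sketched in a few lines of Python: start from the full snapshot and replay the logged changes in order to obtain the merged state. This is a minimal illustration under simplifying assumptions (rows keyed by an integer primary key, log entries already ordered); the names `merge`, `base`, and `log` are hypothetical and do not reflect how DataWorks or MaxCompute implement the merge.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, object]
# (operation, primary key, new row or None for deletes)
LogEntry = Tuple[str, int, Optional[Row]]


def merge(base: Dict[int, Row], log: Iterable[LogEntry]) -> Dict[int, Row]:
    """Replay ordered change-log entries on top of a full snapshot."""
    merged = dict(base)  # copy so the snapshot itself stays untouched
    for op, key, row in log:
        if op in ("insert", "update"):
            merged[key] = row
        elif op == "delete":
            merged.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


base = {1: {"name": "alice"}, 2: {"name": "bob"}}
log = [
    ("update", 1, {"name": "alice2"}),
    ("delete", 2, None),
    ("insert", 3, {"name": "carol"}),
]
merged = merge(base, log)
print(merged)  # {1: {'name': 'alice2'}, 3: {'name': 'carol'}}
```

Because the log table only ever receives appended rows, an append-only destination can store it as is; updates and deletes take effect only when the merge is computed.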

Feature comparison

Task type	Creation location	Workflow orchestration	Scheduling configuration	Data Studio debugging	Data source isolation
Single-table batch synchronization	Data Studio only	Supported	Supported	Supported	Supported
Single-table real-time synchronization	Data Studio / Data Integration	Not supported (Runs as a standalone node only)	Not supported	Not supported (Requires publishing to Operation Center to run)	Not supported
Full-database batch synchronization	Data Studio / Data Integration	Not supported (Runs as a standalone node only)	Supported (You can set separate schedules for sub-tasks)	Not supported (Requires publishing to Operation Center to run)	Not supported
Full-database real-time synchronization	Data Studio / Data Integration	Not supported (Runs as a standalone node only)	Not supported	Not supported (Requires publishing to Operation Center to run)	Not supported
Full-incremental synchronization for an entire database	Data Studio / Data Integration	Not supported (Runs as a standalone node only)	Supported (You can set separate schedules for sub-tasks)	Not supported (Requires publishing to Operation Center to run)	Not supported
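For readers who want to check the comparison programmatically, for example in a review script, the table can be transcribed into a small lookup structure. The sketch below is only a restatement of the rows above; the type names are shortened and the field names are chosen for this example.

```python
from typing import Dict, List, NamedTuple


class Capabilities(NamedTuple):
    workflow_orchestration: bool
    scheduling: bool
    data_studio_debugging: bool
    data_source_isolation: bool


# Transcribed from the feature comparison table, one entry per task type.
CAPABILITIES: Dict[str, Capabilities] = {
    "single-table batch": Capabilities(True, True, True, True),
    "single-table real-time": Capabilities(False, False, False, False),
    "full-database batch": Capabilities(False, True, False, False),
    "full-database real-time": Capabilities(False, False, False, False),
    "full-incremental": Capabilities(False, True, False, False),
}


def types_supporting(feature: str) -> List[str]:
    """List the task types for which the given feature is supported."""
    return [name for name, caps in CAPABILITIES.items() if getattr(caps, feature)]


print(types_supporting("scheduling"))
# ['single-table batch', 'full-database batch', 'full-incremental']
```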

FAQ

See Data Integration FAQ.