DataWorks: Configure a real-time synchronization task in Data Integration

Last Updated: Dec 09, 2025

DataWorks Data Integration provides single-table real-time synchronization tasks designed for low-latency, high-throughput data replication and transfer between different data sources. This feature uses an advanced real-time computing engine to capture real-time data changes, such as inserts, deletes, and updates, at the source and quickly apply them to the destination. This topic uses the synchronization of a single table from Kafka to MaxCompute as an example to show you how to configure a single-table real-time synchronization task.

Preparations

Accessing the feature

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

Configure the task

1. Create a sync task

You can create a sync task in one of the following ways:

  • Method 1: On the sync task page, select a Source and a Destination, and then click Create Sync Task. In this example, Kafka is the source and MaxCompute is the destination. You can select the source and destination as needed.

  • Method 2: On the sync task page, if the task list is empty, click Create.

2. Configure basic information

  1. Configure basic information, such as the task name, description, and owner.

  2. Select a synchronization type. Data Integration displays the supported Synchronization Types based on the source and destination database types. In this topic, Single-table Real-time is selected.

  3. Synchronization steps: Single-table real-time synchronization tasks support only incremental synchronization. The steps are typically Schema Migration and Incremental Synchronization. This process first initializes the source table schema to the destination. After the task starts, it automatically captures data changes from the source and writes them to the destination table.

    If the source is Hologres, full synchronization is also supported. This process first fully synchronizes existing data to the destination table. Then, incremental data synchronization starts automatically.
Note

For more information about supported data sources and synchronization solutions, see Supported data sources and synchronization solutions.

3. Configure network and resources

In this step, select the Resource Group for the sync task. Also select the Source Data Source and Destination Data Source. Then, test the network connectivity.

  • For a Serverless resource group, you can specify the maximum number of compute units (CUs) that a sync task can use. If your sync task fails due to an out-of-memory (OOM) error, increase the CU limit for the resource group.

  • If you have not created a data source, click Add Data Source to create one. For more information, see Data Source Configuration.

4. Configure the synchronization channel

1. Configure the source

At the top of the page, click the Kafka data source. Then, edit the Kafka Source Information.

  1. In the Kafka Source Information section, select the topic to synchronize from the Kafka data source.

    You can use the default values for other configurations or modify them as needed. For more information about the parameters, see the official Kafka documentation.

  2. In the upper-right corner, click Data Sampling.

    In the dialog box that appears, set the Start Time and Number Of Samples, and then click Start Sampling. This action samples data from the specified Kafka topic. You can preview the data in the topic. This preview provides input for the data preview and visualization configurations of subsequent data processing nodes.

  3. In the Output Field Configuration section, select the fields to synchronize as needed.

    By default, Kafka provides six fields.

    • __key__: The key of the Kafka record.

    • __value__: The value of the Kafka record.

    • __partition__: The number of the partition that contains the Kafka record. The partition number is an integer that starts from 0.

    • __headers__: The headers of the Kafka record.

    • __offset__: The offset of the Kafka record in its partition. The offset is an integer that starts from 0.

    • __timestamp__: The 13-digit UNIX timestamp of the Kafka record, in milliseconds.

    You can also perform more field transformations in subsequent data processing nodes.
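
To see outside of DataWorks what sampled records and these six fields look like, you can read a few messages from the topic with a standard Kafka client. The following Python sketch only approximates the Data Sampling step, assuming the kafka-python library; the broker address, topic name, start time, and sample count are hypothetical placeholders and are not part of the DataWorks configuration.

```python
# A minimal sketch, assuming the kafka-python client (pip install kafka-python).
# It approximates what Data Sampling does: read a few records from a topic,
# starting at a given time, and print the six default metadata fields.
# The broker address, topic name, start time, and sample count are placeholders.
from datetime import datetime, timezone

from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "kafka-broker.example.com:9092"   # hypothetical broker endpoint
TOPIC = "orders"                              # hypothetical topic
START = datetime(2025, 1, 1, tzinfo=timezone.utc)
SAMPLES = 5

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, enable_auto_commit=False)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)

# Move every partition to the first offset at or after the start time.
start_ms = int(START.timestamp() * 1000)
for tp, found in consumer.offsets_for_times({tp: start_ms for tp in partitions}).items():
    consumer.seek(tp, found.offset if found else 0)

count = 0
for record in consumer:
    # These attributes correspond to the six default output fields listed above.
    print({
        "__key__": record.key,
        "__value__": record.value,
        "__partition__": record.partition,
        "__headers__": record.headers,
        "__offset__": record.offset,
        "__timestamp__": record.timestamp,  # 13-digit UNIX timestamp in milliseconds
    })
    count += 1
    if count >= SAMPLES:
        break
consumer.close()
```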

2. Edit the data processing node

Click the add icon to add a data processing method. Five methods are available: Data Masking, String Replace, Data Filtering, JSON Parsing, and Edit and Assign Fields. You can arrange these methods in the desired order. At runtime, the data processing methods are executed in the specified order. A conceptual sketch of the JSON Parsing method is provided at the end of this step.

After you configure a data processing node, you can click Preview Data Output in the upper-right corner:

  1. In the table below the input data, you can view the results from the previous Data Sampling step. Click Re-fetch Upstream Output to refresh the results.

  2. If there is no output from the upstream node, you can also click Manually Construct Data to simulate the previous output.

  3. Click Preview to view the output data from the upstream step after it is processed by the data processing component.

Note

The data output preview and data processing features depend on the Data Sampling from the Kafka source. Before you process data, you must complete data sampling in the Kafka source settings.
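
The following Python sketch only illustrates, on one hypothetical record, what the JSON Parsing method conceptually produces: the JSON string carried in __value__ is parsed into separate output columns that downstream nodes and the destination field mapping can reference. The record contents and column names are made up for illustration; in DataWorks the component is configured visually, without code.

```python
# A minimal sketch of what a JSON Parsing step conceptually produces for one
# record: the JSON string in __value__ becomes separate output columns.
# The record contents and column names are hypothetical.
import json

sampled_record = {
    "__key__": b"1001",
    "__value__": b'{"order_id": 1001, "status": "paid", "amount": 42.5}',
    "__partition__": 0,
    "__offset__": 73,
    "__timestamp__": 1733716800000,
}

parsed = json.loads(sampled_record["__value__"])

# The flattened row that downstream nodes and the destination field mapping see.
output_row = {
    "order_id": parsed["order_id"],
    "status": parsed["status"],
    "amount": parsed["amount"],
    "__timestamp__": sampled_record["__timestamp__"],
}
print(output_row)
```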

3. Configure the destination

At the top of the page, click the MaxCompute data destination. Then, edit the MaxCompute destination information.

  1. In the MaxCompute Destination Information section, select a Tunnel Resource Group. The default is the public transfer resource, which is the free Tunnel quota provided by MaxCompute.

  2. Specify whether to Auto-create Table or Use Existing Table for the destination table.

    1. If you choose to auto-create a table, a table with the same name as the source table is created by default. You can manually change the destination table name.

    2. If you choose to use an existing table, select the destination table from the drop-down list.

  3. (Optional) Edit the table schema.

    When you select Auto-create Table, click Edit Table Schema. In the dialog box that appears, edit the destination table schema. You can also click Regenerate Table Schema Based On Upstream Output Columns to automatically generate the table schema based on the output columns of the upstream node. You can select a column in the auto-generated schema and set it as the primary key.

  4. Configure field mapping.

    1. The system automatically maps upstream columns to destination table columns based on the Same Name Mapping principle. You can adjust the mappings as needed. An upstream column can be mapped to multiple destination columns, but multiple upstream columns cannot be mapped to a single destination column. If an upstream column is not mapped to a destination column, its data is not written to the destination table.

    2. For Kafka fields, you can configure custom JSON parsing: use the JSON Parsing data processing component to extract the content of the __value__ field. This allows for more fine-grained field configuration.

  5. (Optional) Configure partitions.

    1. Automatic Time-based Partitioning creates partitions based on the business time (in this case, the __timestamp__ field). The first-level partition is by year, the second-level partition is by month, and so on (see the sketch after this list).

    2. Dynamic Partitioning by Field Content maps a field from the source table to a partition field in the destination MaxCompute table. This ensures that rows containing specific data in the source field are written to the corresponding partition in the MaxCompute table.
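
The following Python sketch illustrates, for one hypothetical row, how the two partitioning modes derive partition values: time-based partitions are computed from the __timestamp__ business time, and field-based partitions copy a source field's value into the destination partition column. The partition column names (pt_year, pt_month, pt_day, region) and the row contents are illustrative assumptions, not values produced by DataWorks.

```python
# A minimal sketch of how partition values can be derived for one row.
# Partition column names (pt_year, pt_month, pt_day, region) and the row
# contents are hypothetical.
from datetime import datetime, timezone

row = {"__timestamp__": 1733716800000, "region": "cn-shanghai", "order_id": 1001}

# Automatic time-based partitioning: derive year/month/day partitions from the
# business time carried in __timestamp__ (13-digit milliseconds).
event_time = datetime.fromtimestamp(row["__timestamp__"] / 1000, tz=timezone.utc)
time_partitions = {
    "pt_year": event_time.strftime("%Y"),
    "pt_month": event_time.strftime("%m"),
    "pt_day": event_time.strftime("%d"),
}

# Dynamic partitioning by field content: copy a source field's value into the
# destination partition column, so the row lands in the matching partition.
field_partition = {"region": row["region"]}

print(time_partitions, field_partition)
```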

5. Other configurations

Alert configuration

To prevent data synchronization latency caused by task errors, you can set an alert policy for the single-table real-time synchronization task.

  1. In the upper-right corner of the page, click Alert Settings to open the alert settings page for the task.

  2. Click Add Alert to configure an alert rule. You can set alert triggers to monitor metrics such as data latency, failover events, task status, Data Definition Language (DDL) changes, and task resource utilization. You can set CRITICAL or WARNING alert levels based on specified thresholds.

  3. Manage alert rules.

    For existing alert rules, you can use the alert switch to enable or disable them. You can also send alerts to different personnel based on the alert level.

Advanced parameter configuration

The sync task provides advanced parameters for fine-grained configuration. The system provides default values, which you do not need to change in most cases. To modify them:

  1. In the upper-right corner of the page, click Advanced Parameters to open the advanced parameter configuration page.

  2. Set Auto-configure Runtime Settings to false.

  3. Modify the parameter values based on the tooltips. The description for each parameter is displayed next to its name.

Important

Modify these parameters only after you fully understand their purpose and potential consequences. Incorrect settings can cause unexpected errors or data quality issues.

Resource group configuration

In the upper-right corner of the page, you can click Resource Group Configuration to view and switch the resource group used by the current task.

6. Test run

After you complete all task configurations, click Test Run in the upper-right corner to debug the task. This simulates how the entire task processes a small amount of sample data. You can then preview the results that would be written to the destination table. If there are configuration errors, exceptions during the test run, or dirty data, the system provides real-time feedback. This helps you quickly assess the correctness of your task configuration and whether it produces the expected results.

  1. In the dialog box that appears, set the sampling parameters (Start Time and Number Of Samples).

  2. Click Start Sampling to retrieve the sample data.

  3. Click Preview to simulate the task run and view the output.

The output of the test run is for preview only. It is not written to the destination data source and does not affect production data.

7. Start the task

  1. After you complete all configurations, click Complete Configuration at the bottom of the page.

  2. On the Data Integration > Sync Tasks page, find the sync task that you created. In the Actions column, click Publish. If you select the Start Running Immediately After Publishing checkbox, the task runs immediately after it is published. Otherwise, you must manually start the task.

    Note

    Data Integration tasks must be published to the production environment to run. Therefore, new or edited tasks take effect only after you perform the Publish operation.

  3. In the Task List, click the Name/ID of the task to view its detailed execution process.

What to do next

After the task starts, you can click the task name to view its running details and perform task operations and maintenance (O&M) and tuning.

FAQ

For answers to frequently asked questions about real-time synchronization tasks, see Real-time synchronization FAQ.

More examples

Real-time synchronization of a single table from Kafka to ApsaraDB for OceanBase

Real-time ingestion of a single table from LogHub (SLS) to Data Lake Formation

Real-time synchronization of a single table from Hologres to Doris

Real-time synchronization of a single table from Hologres to Hologres

Real-time synchronization of a single table from Kafka to Hologres

Real-time synchronization of a single table from LogHub (SLS) to Hologres

Real-time synchronization of a single table from Hologres to Kafka

Real-time synchronization of a single table from LogHub (SLS) to MaxCompute

Real-time synchronization of a single table from Kafka to an OSS data lake

Real-time synchronization of a single table from Kafka to StarRocks

Real-time synchronization of a single table from Oracle to Tablestore