DataWorks: Configure a batch synchronization task in the codeless UI

Data Integration provides a codeless UI that lets you periodically synchronize full or incremental data from a source table, including sharded tables, to a destination table without writing any code. You can configure a synchronization task by selecting the source and destination in the UI and configuring scheduling parameters in DataWorks. This topic describes the general configurations for a batch synchronization task in the codeless UI. The configurations may vary for different data sources. For more information, see Supported data sources and synchronization solutions.

Preparations

Configure the data sources. Before you configure a Data Integration sync task, ensure that you have configured the source and destination databases in Data Source Management in DataWorks. For more information about data source configuration, see Data source list.
Note
- For more information about the data sources supported by batch synchronization and their configurations, see Supported data sources and synchronization solutions.
- For more information about data source features, see Data Source Management.
Purchase a resource group with a suitable specification and attach it to the workspace. For more information, see Use a Serverless resource group for Data Integration and Use an exclusive resource group for Data Integration.
Establish a network connection between the resource group and the data source. For more information, see Configure network connections.

Step 1: Create a batch synchronization node

Data Studio (new version)

Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, choose Data Development & O&M > Data Development. Select the desired workspace from the drop-down list and click Go to Data Studio.
Create a workflow. For more information, see Orchestrate workflows.
Create a batch synchronization node. You can use one of the following methods:
- Method 1: Click the icon in the upper-right corner of the workflow list and choose Create Node > Data Integration > Batch Synchronization.
- Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow editor on the right.
Configure the basic information, source, and destination for the node. Then, click OK.

DataStudio (legacy version)

Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, choose Data Development & O&M > Data Development. Select the desired workspace from the drop-down list and click Go to Data Development.
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node. You can use one of the following methods:
- Method 1: Expand the workflow, right-click Data Integration, and select Create Node > Batch Synchronization.
- Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow editor on the right.
Complete the creation of the batch synchronization node as prompted.

Step 2: Configure the data source and resource group

Select the source and destination data sources for the batch synchronization task.
Select the resource group and resource quota for running the task. For recommended resource quota configurations, see Data Integration performance metrics.
Test the network connectivity between the data source and the resource group. If the connection fails, configure the network connection as prompted or as described in the documentation. For more information, see Configure network connections.

Note

If you created a resource group but it is not displayed, check whether the resource group is attached to the workspace. For more information, see Use a Serverless resource group for Data Integration and Use an exclusive resource group for Data Integration.
Serverless resource groups allow you to specify an upper limit for the computing units (CUs) of a sync task. If your sync task fails due to an out-of-memory (OOM) error because of insufficient resources, you can adjust the CU usage for the resource group.

Step 3: Configure the source and destination

In the source and destination sections, configure the tables from which to read data and to which to write data. You can also specify the data range to synchronize.

Important

Plugin configurations can vary. The following section provides examples of common configurations. To check whether a plugin supports a specific configuration and how to implement it, see the documentation for that plugin. For more information, see Data source list.

Source

Operation	Description
Data Filtering	Some source types support data filtering. Specify a filter condition, which is the body of a `WHERE` clause without the `WHERE` keyword. At runtime, the task synchronizes only the data that meets the condition. If you do not configure a filter condition, the task synchronizes all data from the table by default. To perform incremental synchronization, combine the filter condition with scheduling parameters to make it dynamic. For example, with `gmt_create >= '${bizdate}'`, each run synchronizes only the rows whose `gmt_create` value is on or after the date assigned to `${bizdate}`. You must assign a value to this variable when you configure scheduling properties. For more information, see Supported formats of scheduling parameters and Scenario: Configure a batch synchronization task for incremental data. The method for configuring incremental synchronization varies by data source (plugin).
Sharding Key	Specify the source field to use as the shard key (`splitPk`). The synchronization task splits the data into multiple shards based on this key and reads the shards concurrently. We recommend that you use the table's primary key because primary key values are usually evenly distributed, which helps prevent data hot spots in the shards. Currently, `splitPk` supports only integer fields. Strings, floating-point numbers, dates, and other types are not supported. If you specify a field of an unsupported type, the `splitPk` setting is ignored and the task uses a single channel for synchronization. The task also uses a single channel if `splitPk` is not specified or is empty. Not all plugins support a shard key. The preceding description is an example only. See the documentation for your specific plugin. For more information, see Supported data sources and synchronization solutions.
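
The following Python sketch illustrates the idea behind an integer shard key: a key range is divided into contiguous sub-ranges, and each sub-range becomes a read condition that one channel can process. It is an illustration only. The function names `split_ranges` and `range_conditions` and the even-split strategy are assumptions made for this example and do not describe how any specific reader plugin splits data internally.

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int, channels: int) -> List[Tuple[int, int]]:
    """Split the closed interval [min_id, max_id] into contiguous closed ranges.

    Hypothetical helper: returns at most `channels` ranges of nearly equal size.
    """
    if channels < 1:
        raise ValueError("channels must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    channels = min(channels, total)        # never create empty ranges
    base, extra = divmod(total, channels)  # the first `extra` ranges get one more key
    ranges = []
    start = min_id
    for i in range(channels):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_conditions(column: str, ranges: List[Tuple[int, int]]) -> List[str]:
    """Turn each range into a filter condition that one channel could read."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]


shards = split_ranges(1, 10, 3)
conditions = range_conditions("id", shards)
for condition in conditions:
    print(condition)
```

With evenly distributed keys, each condition selects a similar number of rows. This is why a primary key is usually a better choice than a skewed column.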

Data processing
Important
Data processing is a feature available in the Data Studio (new version). If you are using a previous version, you must upgrade your workspace to use this feature. For information about how to upgrade, see Data Studio Upgrade Guide.
Data processing lets you process data from the source table using methods such as string replacement, AI-assisted processing, and data vectorization before you write the processed data to the destination table.
1. Click the switch to turn on data processing.
2. In the Data Processing List, click Add Node and select a data processing type: Replace String, AI process, or Data Embedding. You can add multiple data processing nodes, which DataWorks will process sequentially.
3. Configure the data processing rules as prompted. For AI-assisted processing and data vectorization, see Intelligent data processing.
  Note
  Data processing requires additional computing resources, which increases the resource overhead and runtime of the data synchronization task. To avoid affecting synchronization efficiency, keep the processing logic as simple as possible.

Destination

Operation	Description
Configure statements to execute before and after synchronization	Some data sources support executing SQL statements on the destination before data is written (pre-sync) and after data is written (post-sync). For example, MySQL Writer supports the `preSql` and `postSql` configuration items, which let you execute MySQL commands before or after data is written to MySQL. You can configure the MySQL command `truncate table tablename` in the Pre-import Preparation Statement (`preSql`) configuration item to clear existing data from the table before synchronization.
Define the write mode for conflicts	Define how to write data to the destination when conflicts, such as path or primary key conflicts, occur. This configuration varies based on the data source attributes and writer plugin support. For configuration details, see the documentation for the specific writer plugin.
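
The following Python sketch shows the effect of pre-sync and post-sync statements and of a conflict write mode, using an in-memory SQLite database as a stand-in destination. It is not MySQL Writer. The `run_sync` function, the `orders` table, and the mode names are assumptions made for this example. SQLite has no `TRUNCATE TABLE` statement, so `DELETE FROM` plays that role here.

```python
import sqlite3


def run_sync(conn, rows, pre_sql=(), post_sql=(), on_conflict="ABORT"):
    """Run pre statements, write rows, then run post statements in one transaction.

    Hypothetical helper. `on_conflict` selects how SQLite resolves
    primary key conflicts: ABORT (fail), REPLACE (overwrite), or IGNORE (skip).
    """
    if on_conflict not in {"ABORT", "REPLACE", "IGNORE"}:
        raise ValueError(f"unsupported conflict mode: {on_conflict}")
    with conn:  # commits on success, rolls back if any statement fails
        for statement in pre_sql:
            conn.execute(statement)
        conn.executemany(
            f"INSERT OR {on_conflict} INTO orders (id, amount) VALUES (?, ?)", rows
        )
        for statement in post_sql:
            conn.execute(statement)


def snapshot(conn):
    """Return all rows of the destination table, ordered by primary key."""
    return conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 10)")
conn.commit()

# REPLACE overwrites the existing row with id 1.
run_sync(conn, [(1, 99), (2, 20)], on_conflict="REPLACE")
after_replace = snapshot(conn)

# IGNORE keeps the existing row with id 1 and writes only the new row.
run_sync(conn, [(1, 5), (3, 30)], on_conflict="IGNORE")
after_ignore = snapshot(conn)

# A pre statement clears the table, so a rerun does not collide with old rows.
run_sync(conn, [(7, 70)], pre_sql=["DELETE FROM orders"])
after_clear = snapshot(conn)
```

Clearing the destination in a pre-sync statement makes a full synchronization safe to rerun, because each run starts from an empty table.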

Step 4: Configure field mappings

After you select the source and destination, you must specify the mapping between the source and destination columns. The task writes data from the source fields to the corresponding destination fields based on these mappings.

During synchronization, mismatched field types between the source and destination can generate dirty data and cause write failures. To set the tolerance for dirty data, refer to the Channel Control settings in the next step.

Note

If a source field is not mapped to a destination field, its data is not synchronized.
If the automatic mapping is not what you expect, you can adjust the mappings manually.
If you do not need a mapping for a specific field, you can manually delete the line that connects the source and destination fields. The data from that source field is not synchronized.

Mapping by name and mapping by row are supported. You can also perform the following operations:

Assign values to destination fields: You can use Add Field to add constants, scheduling parameters, or built-in variables to the destination table, such as '123', '${scheduling_parameter}', or '#{built_in_variable}#'.
Note
When you configure scheduling in the next step, you can assign values to scheduling parameters. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

Add built-in variables: You can manually add built-in variables and map them to destination fields to output them to a downstream node.

The available built-in variables for each plugin are as follows:

Built-in variable	Description	Supported plugins
'`#{DATASOURCE_NAME_SRC}#`'	Source data source name	MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader
'`#{DB_NAME_SRC}#`'	Name of the database where the source table is located	MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader
'`#{SCHEMA_NAME_SRC}#`'	Name of the schema where the source table is located	PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader
'`#{TABLE_NAME_SRC}#`'	Source table name	MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader

Edit Source Fields: Click Manually Edit Mapping to perform the following operations:
- Use functions that are supported by the source database to process fields. For example, you can use `Max(id)` to synchronize only the maximum value.
- Manually edit source fields if not all fields were pulled during the field mapping process.
Note
MaxCompute Reader does not support the use of functions.
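
The two automatic mapping modes in this step, mapping by name and mapping by row, can be sketched in Python as follows. The sketch is illustrative only. The function names and the exact-match comparison of column names are assumptions made for this example, and the matching rules in the UI may differ.

```python
from typing import Dict, List, Tuple

Mapping = List[Tuple[str, str]]  # (source column, destination column) pairs


def map_by_name(source_cols: List[str], dest_cols: List[str]) -> Mapping:
    """Pair each source column with the destination column of the same name."""
    dest = set(dest_cols)
    return [(col, col) for col in source_cols if col in dest]


def map_by_row(source_cols: List[str], dest_cols: List[str]) -> Mapping:
    """Pair columns by position: first with first, second with second, and so on."""
    return list(zip(source_cols, dest_cols))


def project(record: Dict[str, object], mapping: Mapping) -> Dict[str, object]:
    """Build the destination record. Unmapped source fields are dropped."""
    return {dst: record[src] for src, dst in mapping}


source_cols = ["id", "name", "gmt_create"]
dest_cols = ["id", "gmt_create", "remark"]

by_name = map_by_name(source_cols, dest_cols)
by_row = map_by_row(source_cols, dest_cols)

record = {"id": 1, "name": "a", "gmt_create": "2024-01-01"}
written = project(record, by_name)
print(by_name)
print(by_row)
print(written)
```

In this example, mapping by row pairs `name` with `gmt_create`. A positional pairing of this kind can lead to the type mismatches and dirty data described earlier in this step, so review the result before you run the task.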

Step 5: Configure the channel

Important

In the Data Studio (new version), the Configure Channel feature is in the Advanced Settings section on the right side of the task configuration interface.

You can use channel control to configure properties related to the data synchronization process. For more information about the parameters, see Relationship between concurrency and throttling for batch synchronization.

Parameter	Description
Expected Maximum Concurrency	Defines the maximum number of threads for concurrently reading from the source or writing to the destination for the current task. Note Due to factors such as resource specifications, the actual concurrency at runtime may be less than or equal to the value configured here. The fee for the test resource group is based on the actual concurrency. For more information, see Performance metrics. Task scheduling fees are related to the number of batch synchronization tasks, not the concurrency configured for the tasks.
Synchronization Rate	Controls the synchronization rate. Throttling: You can control the synchronization rate with throttling to protect the source database and avoid excessive pressure from high extraction speeds. The minimum speed limit is 1 MB/s. No throttling: Without throttling, the task will deliver the maximum possible transfer performance within the configured concurrency limits and available hardware environment. Note The traffic measure is a metric of Data Integration itself and does not represent actual network interface card (NIC) traffic. Typically, NIC traffic is 1 to 2 times the channel traffic. The actual traffic inflation depends on the data storage system's transfer serialization.
Policy for Dirty Data Records	Dirty data refers to records that fail to be written to the destination due to exceptions such as type conflicts or constraint violations. Batch synchronization supports defining a dirty data policy, which lets you set a tolerance for dirty data and its impact on the task. If not configured, dirty data is allowed by default, meaning it will not affect task execution. If set to 0, no dirty data is allowed. If any dirty data is generated during synchronization, the task will fail. If dirty data is allowed and a threshold is set: If the amount of dirty data is within the threshold, the sync task will ignore the dirty data (it will not be written to the destination) and run normally. If the amount of dirty data exceeds the threshold, the sync task will fail. Important An excessive amount of dirty data can affect the overall speed of the synchronization task.
Enable Distributed Processing Capability	Controls whether to use distributed mode to execute the current task. Enabled: Distributed execution mode can split your task into multiple processes that run concurrently, breaking through single-process bottlenecks and improving synchronization efficiency. Disabled: The task runs as a single process. If you have high requirements for synchronization performance, you can use distributed mode. Distributed mode can also use fragmented machine resources, which improves resource utilization. Important Distributed processing capability can be enabled only when the concurrency is 8 or greater. Enabling distributed processing consumes more resources. If an OOM error occurs at runtime, try disabling this switch.
Time Zone	If the source and destination are in different time zones, you can set the source time zone so that time values are converted during synchronization.
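
The dirty data policy described in the table can be sketched in Python as follows. The sketch only models the three cases from the description: no limit configured, a limit of 0, and a positive threshold. The names `sync_records` and `DirtyDataLimitExceeded` are hypothetical, and the sketch treats a record as dirty when writing it raises `TypeError` or `ValueError`.

```python
from typing import Callable, Iterable, Optional


class DirtyDataLimitExceeded(Exception):
    """Raised when the number of dirty records exceeds the configured threshold."""


def sync_records(
    records: Iterable[object],
    write: Callable[[object], None],
    max_dirty: Optional[int] = None,
) -> int:
    """Write records one by one and return the number of dirty records.

    max_dirty=None: dirty data is always tolerated.
    max_dirty=0:    the first dirty record fails the task.
    max_dirty=n:    the task fails once more than n dirty records are seen.
    """
    dirty = 0
    for record in records:
        try:
            write(record)
        except (TypeError, ValueError):
            dirty += 1  # the record is skipped and not written to the destination
            if max_dirty is not None and dirty > max_dirty:
                raise DirtyDataLimitExceeded(
                    f"{dirty} dirty records exceed the limit of {max_dirty}"
                )
    return dirty


destination = []


def write_int(value: object) -> None:
    """Stand-in writer: the destination column accepts only integers."""
    destination.append(int(value))


dirty_count = sync_records(["1", "x", "3"], write_int)
print(dirty_count, destination)
```

Calling `sync_records(["1", "x", "3"], write_int, max_dirty=0)` raises `DirtyDataLimitExceeded` at the second record, which corresponds to a policy that allows no dirty data.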

Note

In addition to the preceding configurations, the overall synchronization speed is also affected by factors such as source data source performance and the synchronization network environment. For more information about synchronization speed and optimization, see Speed up or limit the speed of batch synchronization tasks.

Step 6: Configure scheduling properties

For a periodically scheduled batch synchronization task, you need to configure its scheduling properties. On the node's edit page, click Scheduling on the right to configure them.

You must configure scheduling parameters, a scheduling policy, a scheduling time, and scheduling dependencies for the sync task. The configuration process is the same as for other Data Studio nodes and is not described in this topic.

For information about scheduling configuration in the Data Studio (new version), see Node scheduling (new version).
For information about scheduling configuration in the DataStudio (legacy version), see Node scheduling configuration (legacy version).

For more information about how to use scheduling parameters, see Common scenarios of scheduling parameters in Data Integration.

Step 7: Test and publish the task

Configure test parameters.

On the batch synchronization task configuration page, you can click Debugging Configurations on the right and configure the following parameters to run a test.

Configuration item	Description
Resource Group	Select a resource group that is connected to the data source.
Script Parameters	Assign values to placeholder parameters in the data synchronization task. For example, if the task is configured with the `${bizdate}` parameter, you need to configure a date parameter in the `yyyymmdd` format.
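
The following Python sketch shows what assigning a value to a placeholder amounts to: each `${name}` placeholder in the task configuration is replaced with the assigned value before the task runs. It is a simplified illustration. The `render` and `check_yyyymmdd` functions are hypothetical and do not represent the scheduling engine's own substitution logic.

```python
import re
from datetime import datetime
from typing import Dict

PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def check_yyyymmdd(value: str) -> str:
    """Return the value if it is a valid eight-digit yyyymmdd date, else raise ValueError."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected yyyymmdd, got {value!r}")
    datetime.strptime(value, "%Y%m%d")  # rejects impossible dates such as 20240230
    return value


def render(template: str, params: Dict[str, str]) -> str:
    """Replace every ${name} placeholder with its assigned value."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned to ${{{name}}}")
        return str(params[name])

    return PLACEHOLDER.sub(replace, template)


filter_condition = "gmt_create >= '${bizdate}'"
rendered = render(filter_condition, {"bizdate": check_yyyymmdd("20240101")})
print(rendered)
```

A placeholder without an assigned value raises an error in this sketch, which mirrors the need to assign every parameter before a test run.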

Run the task.
Click the Run icon in the toolbar to run and test the task in Data Studio. After the task is run, you can create a node of the destination table type to query the destination table data and check whether the synchronized data meets your expectations.
Publish the task.
After the task runs successfully, if it needs to be scheduled periodically, click the icon in the toolbar of the node configuration page to publish the task to the production environment. For more information about how to publish tasks, see Publish tasks.

Limits

Some data sources do not support the configuration of batch synchronization tasks in the codeless UI.
After you select a data source, if a message is displayed indicating that the codeless UI is not supported, click the icon in the toolbar to switch to the code editor and continue to configure the task. For more information, see Configure a task in the code editor.
The codeless UI is easy to use but does not support some advanced features. If you require more fine-grained configuration management, you can click the convert to script icon in the toolbar to switch to the code editor to configure the batch synchronization task.

What to do next

After the task is published to the production environment, you can go to Operation Center in the production environment to view the scheduled task. For more information about how to run and manage batch synchronization tasks, monitor their status, and perform O&M on resource groups, see O&M for batch synchronization tasks.
