Configure a single-table batch synchronization task between data sources in wizard mode, batch synchronization task - DataWorks

Data Integration provides a codeless UI to guide you through task development. Using the codeless UI, you can configure a source and destination, then use DataWorks scheduling parameters to periodically synchronize full or incremental data from a single source table or from sharded databases and tables to a destination table. This topic describes how to configure a single-table batch synchronization task in the codeless UI. The specific configuration varies depending on the data source. For more information, see Supported data sources and synchronization solutions.

Prerequisites

Configure the required source and destination databases in Data Source Management in the DataWorks console. For more information, see Data source list.
Note
- For information about data sources that support single-table batch synchronization and their configurations, see Supported data sources and synchronization solutions.
- For an overview of data source features, see Data Source Management.
Purchase a resource group with appropriate specifications and associate it with the workspace. For more information, see Use serverless resource groups.
Establish network connectivity between the resource group and the data sources. For more information, see Configure network connectivity.
If you need to synchronize a MaxCompute table that is not bound to the current workspace (for example, in a cross-project synchronization), you must first add the target MaxCompute project as a DataWorks data source. This enables you to select the table as a source or destination in the synchronization task. For more information about configuring data sources, see Data Source Management.

Step 1: Create a Data Integration node

Data Studio (new version)

Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click <p><a href={url} target="_blank">Learn more.</a></p>Data Studio.
Create a workflow. For more information, see Workflows.
Create a Data Integration node in one of the following ways:
- Method 1: In the upper-right corner of the workflow list, click , and then choose Create Node > Data Integration.
- Method 2: Double-click the workflow name, and then drag the Data Integration node from the Data Integration directory to the workflow editing panel on the right.
Configure the source and destination types for the node, select Single Table Batch Sync as the specific type, and click OK to complete the creation.

Legacy Data Studio

Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click Data Analytics.
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node in one of the following ways:
- Method 1: Expand the workflow, right-click Data Integration > Create Node > Batch Synchronization.
- Method 2: Double-click the workflow name, and then drag the Batch Synchronization node from the Data Integration directory to the workflow editing panel on the right.
Follow the on-screen instructions to create a batch synchronization node.

Step 2: Configure data sources and runtime resources

In this example, the Source data source type is set to MySQL and the data source is set to mysql. The Destination data source type is set to MaxCompute (ODPS) and the data source is set to own_mc. The Resource Group is set to dwGroup, and CU is set to 0.5 CU.

In the Source Information and Destination sections, select the data source objects from which you want to read data and to which you want to write data.
In the Runtime Resource section, select the Resource Group for the synchronization task, and allocate Resource Group CU to the task. If your synchronization task encounters an out-of-memory (OOM) error due to insufficient resources, increase the CU value for the resource group. For recommended resource quota configurations, see Resource group performance metrics - Data Integration.
Make sure that both the source and destination data sources pass the Connectivity Check. If the network between the data source and the resource group is not connected, follow the on-screen instructions or the documentation to configure network connectivity. For more information, see Configure network connectivity.

Note

If you have created a resource group but it is not displayed here, check whether the resource group has been associated with the workspace. For more information, see Use serverless resource groups.

Step 3: Configure the synchronization solution

In the source and destination sections, configure the tables for reading and writing data, and specify the data scope for synchronization.

Important

The configurations vary depending on the plug-in. The following content uses common configurations as examples. For information about whether a specific plug-in supports a configuration and how the configuration works, see the documentation for that plug-in. For more information, see Data source list.

1. Source

In the source section, configure the data table and fill in the required parameters as prompted.

Operation

Description

Configure data filtering

Some source types support data filtering. You can specify a condition (a WHERE clause, without the WHERE keyword) to filter source data. When the task runs, only data that meets the condition is synchronized. For more information, see Configure data filtering.
Data filtering supports only WHERE clause conditional expressions. It does not support full SELECT, JOIN, or other SQL statements. If you need to run complex SQL queries during synchronization (such as calling UDF functions, performing multi-table JOINs, or applying in-memory SQL transformations), wizard mode does not directly support these operations. We recommend that you use a step-by-step approach: first run complex logic and data cleansing through a MaxCompute SQL node or a PyODPS node, write the results to a temporary table, and then configure a Data Integration node to read data from the temporary table and synchronize it to the final destination.
To implement incremental synchronization, you can use this filter condition together with scheduling parameters so that the condition changes dynamically. For example, if you use gmt_create >= '${bizdate}', the task synchronizes only the data added on the current day each time it runs. You must also assign a value to the variable defined here when you configure the schedule settings. For more information, see Configure scheduling parameters.

The incremental synchronization configuration method varies depending on the data source (plug-in).

If no data filtering condition is configured, all data in the table is synchronized by default.

Configure a split key for relational databases

Specify the column in the source based on which the data to be synchronized is split. When the synchronization task runs, the data is split into multiple tasks based on this column for concurrent, batch data reads.

We recommend that you use the primary key of the table as the splitPk because the primary key is usually evenly distributed, which prevents data hotspots in the resulting shards.

Currently, splitPk supports only integer-type data splitting. String, floating-point, date, and other types are not supported. If you specify a non-supported type, DataWorks ignores the splitPk feature and uses a single channel for synchronization.

If you do not specify a splitPk value, or if the splitPk value is empty, data synchronization uses a single channel to synchronize the table data.

Not all plug-ins support specifying a split key to configure task splitting logic. The preceding information is provided as an example only. Refer to the documentation for the specific plug-in. For more information, see Data source list.

2. Data processing

Important

Data processing is a feature of the new version of Data Studio. In legacy Data Studio, you must select Use New UI (With Data Processing Feature) when you create a task. We recommend that you upgrade legacy workspaces to the new version to use the full range of features: Upgrade to the new version.

Data processing allows you to process data from the source table by using methods such as string replacement, AI-assisted processing, and data vectorization, and then write the processed data to the destination table.

Take string replacement as an example. The configuration items include Name and Description. In the replacement rules, select the Field Name, enter the Content to Replace (which supports Regular Expression Matching and Case-Sensitive Matching), and specify the Replacement Content. You can click Add Rule to add multiple replacement rules, and use Data Output Preview in the upper-right corner to view the processing results.

Click the toggle button to enable data processing.
In the Data Processing List, click Add Node and select a data processing type: Replace String, AI-Assisted Processing, or Data Vectorization. You can add multiple data processing nodes. DataWorks processes them in order.
Configure the data processing rules as prompted. For information about AI-assisted processing and data vectorization, see Data processing.
Note
Data processing requires additional compute resources, which increases the resource consumption overhead. It also increases the duration of the data synchronization task. Minimize the complexity of the processing to avoid affecting synchronization efficiency.

3. Destination

In the destination section, configure the data table and fill in the required parameters as prompted.

Operation

Description

Configure pre-sync and post-sync statements

Some data sources support executing SQL statements on the destination before synchronization (before data is written to the destination) and after synchronization (after data is written to the destination).

Example: MySQL Writer supports preSql and postSql configuration, which allows you to run MySQL commands before or after data is written to MySQL. For example, you can configure the MySQL table truncation command truncate table tablename in the Statement Run Before Writing (preSql) field to clear old data from the table before synchronization.

Define the write conflict resolution mode

Define how data is written to the destination when conflicts occur, such as path or primary key conflicts. This configuration depends on the characteristics of the data source and the capabilities of the writer plug-in. You need to refer to the documentation for the specific writer plug-in for configuration details.

MaxCompute partition table configuration notes

When the destination is a MaxCompute partition table, note the following:

Partition column identification: DataWorks automatically identifies the partition structure of the MaxCompute destination table. If only some partition columns are displayed in the UI, check whether all partition columns are correctly defined in both the development environment and the production environment. If the task fails and prompts you to configure table partition information, complete the partition parameters in the destination configuration.
Column mapping refresh: If new columns are added to the source or destination but are not displayed in the column mapping section, try the following methods to refresh the cache:
1. Make sure that the table schemas in both the development environment and the production environment are updated.
2. On the configuration page, switch to a different table and then switch back to the original table to refresh the cache.
3. If the cache is still not refreshed, restart the browser or use incognito mode to re-open the configuration page.

4. Configure column mappings

After you select the source and destination, you must specify the column mapping between the reader and writer. After you configure the column mappings, the task writes source columns to the corresponding destination columns based on the mappings.

If a source column is not mapped to a destination column, the data in that source column is not synchronized to the destination.
If the automatic mapping does not match your expectations, manually adjust the mappings.
If you do not need a column mapping, you can manually delete the mapping line between the source and destination columns. The data in that source column will not be synchronized to the destination.

During synchronization, column type mismatches between the source and destination may generate dirty data, which prevents data from being written to the destination. For the number of dirty data records that can be tolerated during synchronization, configure the Advanced configuration in the next step.

Same-name mapping, same-row mapping, intelligent mapping, and rule-based mapping are supported. During configuration, you can also:

Intelligent mapping: To improve configuration efficiency and reduce manual matching errors, Data Integration supports the intelligent column mapping feature. The system uses AI semantic analysis to automatically identify column names, data types, and comments of the source and destination tables, and recommends optimal mappings with one click. You only need to confirm the recommendations or make minor adjustments to complete the configuration.

In the column mapping section, click Intelligent Mapping to open the intelligent mapping dialog. You can describe your mapping requirements in natural language.

Applicable scenario	Typical example	Recommended prompt
Global semantic matching	Column names are completely different but have the same meaning (Example: `user_id` ↔ `device_id`)	`Perform semantic matching on all columns of the source and destination tables, and automatically identify columns with the same meaning.`
Specific business domain matching	Only specific business columns need to be mapped (Example: only "user" or "order" related columns)	`Map only the columns in the source table that contain "user information" (such as name, phone number, and ID) to the corresponding columns in the destination table.` (Note: You can replace the keyword with "order", "logistics", "payment", etc.)
Prefix/suffix convention differences	Core names are the same but prefixes/suffixes differ (Example: `src_user_name` ↔ `tgt_user_name`)	`Ignore prefix and suffix differences in column names, and perform semantic matching based on core names only.`
Abbreviation and full name matching	One side uses abbreviations while the other uses full names (Example: `amt` ↔ `amount`)	`Identify common English abbreviation-to-full-name mappings (such as amt=amount, addr=address) and create mappings accordingly.`
Exclude specific columns	Some columns are similar but do not need to be synchronized (Example: `create_time` is not needed)	`Perform semantic matching, but exclude all columns that contain "time" or "log" in their names.`
Complex logic correction	The automatic matching result is incorrect and manual guidance is needed	`Do not map the id column of the source table to the order_id column of the destination table. Regenerate the mapping suggestions.`

After you enter the description, click Generate Preview. The system displays the suggested mappings in the Matching Result Preview section. You can review and select the mappings you need, and then click Apply to add the selected mappings to the column mapping. If you are not satisfied with the results, adjust the description and regenerate the preview.

Rule-based mapping: When column names between the source and destination follow a pattern of differences, you can use the Rule-Based Mapping feature. By configuring rules such as prefix/suffix matching or character replacement, you can create column mappings in bulk. In the column mapping section, click Rule-Based Mapping, select the mapping method and rules, preview the mapping results after configuration, and then click Apply.
Assign values to destination columns: You can add constants, scheduling parameters, and built-in variables to the destination table by clicking Add Fields in the Source Table Field column. For example, '123', '${scheduling parameter}', '#{built-in variable}#'.
Note
For more information about scheduling parameters, see Configure scheduling parameters.

Add built-in variables: You can manually add built-in variables, map them to destination columns, and output the built-in variables to downstream tasks.

The built-in variables available for each plug-in are as follows:

Built-in variable	Description	Supported plug-in
'`#{DATASOURCE_NAME_SRC}#`'	Source data source name	MySQL Reader MySQL (sharded) Reader PolarDB Reader PolarDB (sharded) Reader PostgreSQL Reader PolarDB-O Reader PolarDB-O (sharded) Reader
'`#{DB_NAME_SRC}#`'	Name of the database where the source table resides	MySQL Reader MySQL (sharded) Reader PolarDB Reader PolarDB (sharded) Reader PostgreSQL Reader PolarDB-O Reader PolarDB-O (sharded) Reader
'`#{SCHEMA_NAME_SRC}#`'	Name of the schema where the source table resides	PolarDB Reader PolarDB (sharded) Reader PostgreSQL Reader PolarDB-O Reader PolarDB-O (sharded) Reader
'`#{TABLE_NAME_SRC}#`'	Source table name	MySQL Reader MySQL (sharded) Reader PolarDB Reader PolarDB (sharded) Reader PostgreSQL Reader PolarDB-O Reader PolarDB-O (sharded) Reader
'`#{FILE_NAME_SRC}#`'	File name	OSS Reader HDFS Reader FTP Reader TOS Reader COS Reader S3 Reader Azure Blob Reader
'`#{FILE_PATH_SRC}#`'	Absolute file path	OSS Reader HDFS Reader FTP Reader TOS Reader COS Reader S3 Reader Azure Blob Reader

Edit source columns: You can click Manually Edit Mapping to perform the following operations:
- Use functions supported by the source database to process columns. For example, use Max(id) to synchronize only the maximum value.
- Manually edit source columns if the column mapping does not retrieve all columns.
Note
MaxCompute Reader does not support the use of functions.

Step 4: Advanced configuration

Important

The advanced configuration is equivalent to the Channel feature in the earlier version of data synchronization.

You can use the advanced configuration to control properties related to the data synchronization process. For more information about the parameters, see Channel Control.

Parameter	Description
Expected Maximum Concurrency	Specifies the maximum number of threads that the current task uses to read data from the source or write data to the destination concurrently. Note Due to resource specifications and other factors, the actual concurrency during execution may be less than or equal to the configured concurrency. The resource group is charged based on the actual concurrency during execution. For more information, see Billing details. Task scheduling fees are related to the number of single-table batch synchronization tasks, not the configured concurrency.
Sync Rate	Used to control the synchronization speed. Throttling: You can use throttling to control the synchronization speed to protect the source database and prevent excessive read speed from overloading the source. The minimum throttling value is 1 MB/s. No throttling: When throttling is not enabled, the task provides the maximum transfer performance available in the current hardware environment, subject to the configured concurrency. Note The traffic metric is the metric measured by Data Integration itself and does not represent the actual NIC traffic. Typically, NIC traffic is 1 to 2 times the channel traffic. The actual traffic expansion depends on the transport serialization of the specific data storage system.
Policy for Dirty Data Records	Dirty data refers to data records that fail to be written to the destination due to exceptions such as type conflicts or constraint violations. Single-table batch synchronization supports defining a dirty data policy. You can define the number of dirty data records to tolerate and how they affect the task. If no configuration is specified, dirty data is allowed by default, which means that dirty data does not affect task execution. If the value is set to 0, no dirty data is allowed. If dirty data is generated during synchronization, the task fails. If dirty data is allowed and a threshold is set: If the dirty data generated is within the threshold, the synchronization task ignores the dirty data (it is not written to the destination) and continues to run normally. If the dirty data generated exceeds the threshold, the synchronization task fails. Important Excessive dirty data affects the overall synchronization speed of the task.
Distributed Execution	Used to control whether to enable distributed mode for the current task. Enabled: Distributed execution mode splits your task into multiple processes for concurrent execution, breaking through the single-process execution bottleneck and improving synchronization efficiency. Disabled: The task runs in a single process. If you have high requirements for synchronization performance, use distributed mode. Distributed mode can also utilize fragmented resources on machines, which improves resource utilization. Important The concurrency must be 8 or greater to enable distributed processing. Enabling distributed processing consumes more resources. If an out-of-memory (OOM) error occurs during runtime, try disabling this feature.
Time Zone	If the source and destination require cross-time-zone synchronization, you can configure the source time zone for time zone conversion.

Note

In addition to the preceding configurations, the overall synchronization speed is also affected by the performance of the source data source, the network environment, and other factors. For more information about synchronization speed and tuning, see Synchronization speed and tuning.

Step 5: Configure schedule settings

For a single-table batch synchronization task with periodic scheduling, you must configure the properties for automatic scheduling. Go to the editing page of the node, click Scheduling Settings on the right side, and configure the schedule settings for the node.

You must configure scheduling parameters, scheduling policies, scheduling time, and dependencies for the synchronization task. The configuration method is the same as that for other data development nodes and is not described here.

For schedule settings in the new version of Data Studio, see Node scheduling (new version).
For schedule settings in legacy Data Studio, see Node scheduling (legacy).

For more information about using scheduling parameters, see Typical scenarios of scheduling parameters in Data Integration.

Step 6: Test and deploy the task

Configure run parameters.

On the right side of the single-table batch synchronization task configuration page, click Run Configuration and configure the following parameters for test runs.

Configuration item	Description
Resource Group	Select a resource group that has network connectivity with the data sources.
Script Parameters	Assign values to the placeholder parameters in the data synchronization task. For example, if the Data Integration task uses the `${bizdate}` parameter, configure a date parameter in the `yyyymmdd` format.

Run the task.
Click the Run button on the toolbar to run and debug the task in Data Studio. You can then create a node of the corresponding destination table type to query the destination table data and verify whether the synchronized data meets expectations.
Deploy the task.
After the task passes the test run, if the task needs to be run periodically, click the button at the top of the node editing page to deploy the task to the production environment. For more information about task deployment, see Deploy tasks.

Limitations

Single-table batch synchronization tasks can only be configured in Data Studio.
Some data sources do not support wizard mode for configuring single-table batch synchronization tasks.
After you select a data source, if the system indicates that the current data source does not support wizard mode, click the icon on the toolbar to switch to script mode and continue configuring the task. For more information, see Script mode configuration.
Wizard mode has a low learning curve, but it does not support some advanced features. If you need more granular configuration management, click the convert-to-script icon on the toolbar to switch to script mode and configure the single-table batch synchronization task.
A single-table batch synchronization task in wizard mode supports configuring synchronization for a single table and partial sharded database and table synchronization (sharded database and table synchronization is supported only for certain data source types and requires consistent table schemas). It does not support full-database synchronization (bulk synchronization of table schemas and data). For full-database synchronization, see Full-database batch synchronization tasks.
A batch synchronization task cannot be directly converted to a real-time synchronization task. If you need real-time data synchronization, create a single-table real-time synchronization task node.
If a message indicates that the node name is too long when you deploy the task, modify the node name in the advanced configuration on the deployment page. Make sure that the name does not exceed 128 characters.

Next step

After the task is deployed to the production environment, you can go to the Operation Center in the production environment to view the scheduled task. For more information about running and managing Data Integration tasks, monitoring task status, and managing resource groups, see Data Integration task O&M.

DataWorks:Configure a single-table batch synchronization task in wizard mode

Prerequisites

Step 1: Create a Data Integration node

Data Studio (new version)

Legacy Data Studio

Step 2: Configure data sources and runtime resources

Step 3: Configure the synchronization solution

1. Source

2. Data processing

3. Destination

4. Configure column mappings

Step 4: Advanced configuration

Step 5: Configure schedule settings

Step 6: Test and deploy the task

Limitations

Next step

References