Configure a single-table batch synchronization task between data sources in wizard mode - DataWorks

Prerequisites

Configure the required source and destination databases in Data Source Management. For more information, see Data source list.
Note
- For information about supported data sources and their configurations, see Supported data sources and synchronization solutions.
- For an overview of data source features, see Data Source Management.
Purchase a resource group with appropriate specifications and associate it with the workspace. For more information, see Use serverless resource groups.
Establish network connectivity between the resource group and the data sources. For more information, see Configure network connectivity.
If you need to synchronize a MaxCompute table that is not bound to the current workspace (for example, in a cross-project synchronization), you must first add the target MaxCompute project as a DataWorks data source. This enables you to select the table as a source or destination in the synchronization task. For more information about configuring data sources, see Data Source Management.

Step 1: Create a Data Integration node

Data Studio (new version)

Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click Data Studio.
Create a workflow. For more information, see Workflows.
Create a Data Integration node in one of the following ways:
- Method 1: In the upper-right corner of the workflow list, click the icon, and then choose Create Node > Data Integration.
- Method 2: Double-click the workflow name, and then drag the Data Integration node from the Data Integration directory to the workflow editing panel on the right.
Configure the source and destination types for the node, select Single Table Batch Sync as the specific type, and click OK to complete the creation.

Legacy Data Studio

Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click Data Analytics.
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node in one of the following ways:
- Method 1: Expand the workflow, right-click Data Integration > Create Node > Batch Synchronization.
- Method 2: Double-click the workflow name, and then drag the Batch Synchronization node from the Data Integration directory to the workflow editing panel on the right.
Follow the on-screen instructions to create a batch synchronization node.

Step 2: Configure data sources and runtime resources

In this example, the Source data source type is set to MySQL and the data source is set to mysql. The Destination data source type is set to MaxCompute (ODPS) and the data source is set to own_mc. The Resource Group is set to dwGroup, and CU is set to 0.5 CU.

In the Source Information and Destination sections, select the data source objects from which you want to read data and to which you want to write data.
In the Runtime Resource section, select the Resource Group for the synchronization task, and allocate Resource Group CU to the task. If your synchronization task encounters an out-of-memory (OOM) error due to insufficient resources, increase the CU value for the resource group. For recommended resource quota configurations, see Resource group performance metrics - Data Integration.
Make sure that both the source and destination data sources pass the Connectivity Check. If the network between the data source and the resource group is not connected, follow the on-screen instructions or the documentation to configure network connectivity. For more information, see Configure network connectivity.

Note

If a resource group is not displayed, check whether it has been associated with the workspace. For more information, see Use serverless resource groups.

Step 3: Configure the synchronization solution

In the source and destination sections, configure the tables for reading and writing data, and specify the data scope for synchronization.

Important

The configurations vary depending on the plug-in. The following content uses common configurations as examples. For information about whether a specific plug-in supports a configuration and how the configuration works, see the documentation for that plug-in. For more information, see Data source list.

1. Source

In the source section, configure the data table and fill in the required parameters.

Operation

Description

Configure data filtering

Some source types support data filtering. You can specify a WHERE clause condition (without the WHERE keyword) to filter source data. Only data that meets the condition is synchronized. For more information, see Configure data filtering.
Data filtering supports only WHERE clause conditional expressions. It does not support SELECT, JOIN, or other SQL statements. If you need complex SQL logic during synchronization (such as UDFs, multi-table JOINs, or in-memory SQL transformations), split the work into steps: run the complex logic in a MaxCompute SQL node or a PyODPS node, write the results to a temporary table, and then configure a Data Integration node to read from the temporary table and synchronize the data to the final destination.
To implement incremental synchronization, combine this filter condition with scheduling parameters so the condition changes dynamically. For example, if you use gmt_create >= '${bizdate}', the task synchronizes only the data added on the current day each time it runs. You must also assign a value to the variable when you configure the schedule settings. For more information, see Configure scheduling parameters.

The incremental synchronization configuration method varies depending on the data source (plug-in).

If no data filtering condition is configured, all data in the table is synchronized by default.
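
The following is a minimal sketch of how a filter condition with a scheduling parameter resolves to a concrete condition at run time. It is an illustration only, not the DataWorks implementation: the helper name render_filter and the sample value 20240101 are assumptions, and Python's string.Template is used simply because it happens to share the ${name} placeholder syntax.

```python
from string import Template


def render_filter(condition: str, params: dict) -> str:
    """Substitute ${name} placeholders in a filter condition.

    The condition is written without the WHERE keyword, as in the wizard.
    A missing parameter raises KeyError, which mirrors the requirement that
    every variable must be assigned a value in the schedule settings.
    """
    return Template(condition).substitute(params)


# Hypothetical filter condition and parameter value, for illustration only.
condition = "gmt_create >= '${bizdate}'"
rendered = render_filter(condition, {"bizdate": "20240101"})
print(rendered)  # gmt_create >= '20240101'
```

Because the placeholder value changes with each scheduled run, the same task definition reads a different slice of the source table every time.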

Configure a split key for relational databases

Specify the column used to split the data for concurrent reads during synchronization.

We recommend that you use the primary key as the split key because primary keys are usually evenly distributed, which prevents data hotspots.

Currently, the split key supports only integer types. String, floating-point, date, and other types are not supported. If you specify a column of an unsupported type, DataWorks ignores the split key and uses a single channel for synchronization.

If you do not specify a split key, data synchronization uses a single channel.

Not all plug-ins support a split key. The preceding information is for reference only. For details, see the plug-in documentation. For more information, see Data source list.
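
To make the idea of a split key concrete, the sketch below divides an integer key range into contiguous slices, one per concurrent channel, and prints the range condition each channel would read. This is an assumed, simplified strategy for illustration; the actual splitting logic depends on the plug-in, and the function name split_ranges and the column name id are hypothetical.

```python
def split_ranges(min_id: int, max_id: int, channels: int) -> list:
    """Split the inclusive range [min_id, max_id] into at most `channels`
    contiguous, non-overlapping inclusive ranges of near-equal size."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be less than min_id")
    total = max_id - min_id + 1
    parts = min(channels, total)      # never create empty ranges
    base, extra = divmod(total, parts)
    ranges = []
    start = min_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


ranges = split_ranges(1, 10, 3)
print(ranges)  # [(1, 4), (5, 7), (8, 10)]
for low, high in ranges:
    # Each channel would read only its own slice of the key space.
    print(f"id >= {low} AND id <= {high}")
```

An evenly distributed integer key, such as an auto-increment primary key, gives each slice a similar number of rows, which is why the primary key is the recommended choice.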

2. Data processing

Important

Data processing is a feature of the new version of Data Studio. In legacy Data Studio, you must select Use New UI (With Data Processing Feature) when you create a task. We recommend that you upgrade legacy workspaces to the new version to use the full range of features. For more information, see Upgrade to the new version.

Data processing lets you transform data from the source table by using methods such as string replacement, AI-assisted processing, and data vectorization before writing it to the destination table.

Take string replacement as an example. The configuration items include Name and Description. In the replacement rules, select the Field Name, enter the Content to Replace (which supports Regular Expression Matching and Case-Sensitive Matching), and specify the Replacement Content. You can click Add Rule to add multiple replacement rules, and use Data Output Preview in the upper-right corner to view the processing results.

Click the toggle button to enable data processing.
In the Data Processing List, click Add Node and select a data processing type: Replace String, AI-Assisted Processing, or Data Vectorization. You can add multiple data processing nodes. DataWorks processes them in order.
Configure the data processing rules as prompted. For information about AI-assisted processing and data vectorization, see Data processing.

Note
Data processing requires additional compute resources, which increases resource consumption and task duration. Minimize the complexity of the processing to avoid affecting synchronization efficiency.
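
As a rough illustration of what ordered string replacement rules do to a record, the sketch below applies a list of rules one after another. It is not the DataWorks engine: the ReplaceRule class, its attribute names, and the sample record are assumptions that only mirror the configuration items described above (field name, content to replace, regular expression matching, case-sensitive matching, and replacement content).

```python
import re
from dataclasses import dataclass


@dataclass
class ReplaceRule:
    field_name: str              # column the rule applies to
    content: str                 # content to replace
    replacement: str             # replacement content
    use_regex: bool = False      # treat `content` as a regular expression
    case_sensitive: bool = True  # match case exactly


def apply_rules(record: dict, rules: list) -> dict:
    """Apply the rules in order and return a new record."""
    out = dict(record)
    for rule in rules:
        value = out.get(rule.field_name)
        if not isinstance(value, str):
            continue  # skip missing or non-string columns
        pattern = rule.content if rule.use_regex else re.escape(rule.content)
        flags = 0 if rule.case_sensitive else re.IGNORECASE
        # A function replacement keeps the replacement text literal.
        out[rule.field_name] = re.sub(
            pattern, lambda _m, r=rule.replacement: r, value, flags=flags
        )
    return out


rules = [
    ReplaceRule("phone", r"\d{4}$", "****", use_regex=True),
    ReplaceRule("city", "hangzhou", "Hangzhou", case_sensitive=False),
]
processed = apply_rules({"phone": "138-0000-1234", "city": "HANGZHOU"}, rules)
print(processed)  # {'phone': '138-0000-****', 'city': 'Hangzhou'}
```

Each additional rule adds one more pass over the column value, which is one reason complex processing chains increase task duration.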

3. Destination

In the destination section, configure the data table and fill in the required parameters.

Operation

Description

Configure pre-sync and post-sync statements

Some data sources support executing SQL statements on the destination before synchronization (before writing data) and after synchronization (after writing data).

Example: MySQL Writer supports preSql and postSql configuration, which allows you to run MySQL commands before or after data is written to MySQL. For example, you can configure the MySQL table truncation command truncate table tablename in the Statement Run Before Writing (preSql) field to clear old data from the table before synchronization.
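
The effect of a pre-sync statement can be sketched with an in-memory database. The example below is an assumption-laden stand-in, not MySQL Writer itself: it uses Python's built-in sqlite3 module in place of MySQL, DELETE FROM in place of truncate table (SQLite has no TRUNCATE statement), and a hypothetical helper named write_with_hooks.

```python
import sqlite3


def write_with_hooks(conn, pre_sql, rows, post_sql):
    """Run pre-sync statements, write the rows, then run post-sync statements."""
    cur = conn.cursor()
    for statement in pre_sql:
        cur.execute(statement)
    cur.executemany("INSERT INTO tablename (id, name) VALUES (?, ?)", rows)
    for statement in post_sql:
        cur.execute(statement)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO tablename VALUES (1, 'stale')")  # old data

# Clearing the table first avoids a primary key conflict on id = 1.
write_with_hooks(
    conn,
    pre_sql=["DELETE FROM tablename"],
    rows=[(1, "fresh"), (2, "new")],
    post_sql=[],
)
result = conn.execute("SELECT id, name FROM tablename ORDER BY id").fetchall()
print(result)  # [(1, 'fresh'), (2, 'new')]
```

Without the pre-sync statement, the insert for id 1 would collide with the stale row, which is the kind of situation that the write conflict resolution mode is meant to handle.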

Define the write conflict resolution mode

Define how data is written to the destination when conflicts occur, such as path or primary key conflicts. Refer to the specific writer plug-in documentation for configuration details.

MaxCompute partition table configuration notes

When the destination is a MaxCompute partition table, note the following:

Partition column identification: DataWorks automatically identifies the partition structure of the MaxCompute destination table. If only some partition columns are displayed in the UI, check whether all partition columns are correctly defined in both the development environment and the production environment. If the task fails and prompts you to configure table partition information, complete the partition parameters in the destination configuration.
Column mapping refresh: If new columns are added to the source or destination but are not displayed in the column mapping section, try the following methods to refresh the cache:
1. Make sure that the table schemas in both the development environment and the production environment are updated.
2. On the configuration page, switch to a different table and then switch back to the original table to refresh the cache.
3. If the cache is still not refreshed, restart the browser or use incognito mode to re-open the configuration page.

4. Configure column mappings

After you select the source and destination, specify the column mapping between the reader and writer. The task writes source columns to the corresponding destination columns based on the mappings.

If a source column is not mapped to a destination column, its data is not synchronized.
If the automatic mapping does not match your expectations, manually adjust the mappings.
To remove a column mapping, delete the mapping line between the source and destination columns. The data in that source column will not be synchronized.

Column type mismatches between the source and destination may generate dirty data, preventing data from being written to the destination. To set the number of tolerable dirty data records, configure the Advanced configuration in the next step.
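
The behavior of a column mapping can be pictured as a dictionary from source column names to destination column names. The sketch below is an illustration under that assumption; the function name apply_mapping and the sample columns are hypothetical.

```python
def apply_mapping(row: dict, mapping: dict) -> dict:
    """Build the destination row from the mapped source columns only.

    `mapping` maps a source column name to a destination column name.
    Source columns without an entry are dropped, so they are not synchronized.
    """
    return {dst: row[src] for src, dst in mapping.items()}


source_row = {"id": 1, "name": "alice", "debug_flag": "x"}
mapping = {"id": "user_id", "name": "user_name"}  # debug_flag is not mapped
dest_row = apply_mapping(source_row, mapping)
print(dest_row)  # {'user_id': 1, 'user_name': 'alice'}
```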

Same-name mapping, same-row mapping, intelligent mapping, and rule-based mapping are supported. During configuration, you can also:

Intelligent mapping: Data Integration uses AI semantic analysis to automatically identify column names, data types, and comments of the source and destination tables and recommend optimal mappings. You only need to confirm the recommendations or make minor adjustments.

In the column mapping section, click Intelligent Mapping to open the intelligent mapping dialog. You can describe your mapping requirements in natural language.

Applicable scenario	Typical example	Recommended prompt
Global semantic matching	Column names are completely different but have the same meaning (Example: `user_id` ↔ `device_id`)	`Perform semantic matching on all columns of the source and destination tables, and automatically identify columns with the same meaning.`
Specific business domain matching	Only specific business columns need to be mapped (Example: only "user" or "order" related columns)	`Map only the columns in the source table that contain "user information" (such as name, phone number, and ID) to the corresponding columns in the destination table.` (Note: You can replace the keyword with "order", "logistics", "payment", etc.)
Prefix/suffix convention differences	Core names are the same but prefixes/suffixes differ (Example: `src_user_name` ↔ `tgt_user_name`)	`Ignore prefix and suffix differences in column names, and perform semantic matching based on core names only.`
Abbreviation and full name matching	One side uses abbreviations while the other uses full names (Example: `amt` ↔ `amount`)	`Identify common English abbreviation-to-full-name mappings (such as amt=amount, addr=address) and create mappings accordingly.`
Exclude specific columns	Some columns are similar but do not need to be synchronized (Example: `create_time` is not needed)	`Perform semantic matching, but exclude all columns that contain "time" or "log" in their names.`
Complex logic correction	The automatic matching result is incorrect and manual guidance is needed	`Do not map the id column of the source table to the order_id column of the destination table. Regenerate the mapping suggestions.`

After you enter the description, click Generate Preview. The system displays the suggested mappings in the Matching Result Preview section. You can review and select the mappings you need, and then click Apply to add the selected mappings to the column mapping. If you are not satisfied with the results, adjust the description and regenerate the preview.

Rule-based mapping: When column names between the source and destination follow a pattern, you can use the Rule-Based Mapping feature to create column mappings in bulk. Configure rules such as prefix/suffix matching or character replacement, preview the mapping results, and then click Apply.
Assign values to destination columns: You can add constants, scheduling parameters, and built-in variables to the destination table by clicking Add Fields in the Source Table Field column. For example, you can enter '123', '${scheduling parameter}', or '#{built-in variable}#'.

Note
For more information about scheduling parameters, see Configure scheduling parameters.

Add built-in variables: You can manually add built-in variables, map them to destination columns, and output the built-in variables to downstream tasks.

The built-in variables available for each plug-in are as follows:

Built-in variable	Description	Supported plug-in
'`#{DATASOURCE_NAME_SRC}#`'	Source data source name	MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader
'`#{DB_NAME_SRC}#`'	Name of the database where the source table resides	MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader
'`#{SCHEMA_NAME_SRC}#`'	Name of the schema where the source table resides	PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader
'`#{TABLE_NAME_SRC}#`'	Source table name	MySQL Reader, MySQL (sharded) Reader, PolarDB Reader, PolarDB (sharded) Reader, PostgreSQL Reader, PolarDB-O Reader, PolarDB-O (sharded) Reader
'`#{FILE_NAME_SRC}#`'	File name	OSS Reader, HDFS Reader, FTP Reader, TOS Reader, COS Reader, S3 Reader, Azure Blob Reader
'`#{FILE_PATH_SRC}#`'	Absolute file path	OSS Reader, HDFS Reader, FTP Reader, TOS Reader, COS Reader, S3 Reader, Azure Blob Reader
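
The three kinds of assigned values (quoted constants, ${...} scheduling parameters, and #{...}# built-in variables) can be told apart by their delimiters. The sketch below shows one way such expressions could resolve to concrete values. It is an assumption for illustration, not how DataWorks resolves them internally; the function name resolve_value and the sample values are hypothetical.

```python
import re


def resolve_value(expr: str, sched_params: dict, builtin_vars: dict) -> str:
    """Resolve a quoted value expression to the text written to the column."""
    text = expr.strip()
    # Strip one pair of surrounding single quotes, as in '123'.
    if len(text) >= 2 and text[0] == "'" and text[-1] == "'":
        text = text[1:-1]
    # Built-in variables use the #{NAME}# form.
    text = re.sub(r"#\{(\w+)\}#", lambda m: builtin_vars[m.group(1)], text)
    # Scheduling parameters use the ${name} form.
    text = re.sub(r"\$\{(\w+)\}", lambda m: sched_params[m.group(1)], text)
    return text


sched = {"bizdate": "20240101"}                              # assumed value
builtin = {"TABLE_NAME_SRC": "orders", "DB_NAME_SRC": "shop"}  # assumed values

print(resolve_value("'123'", sched, builtin))                 # 123
print(resolve_value("'${bizdate}'", sched, builtin))          # 20240101
print(resolve_value("'#{TABLE_NAME_SRC}#'", sched, builtin))  # orders
```

Writing the source table name into a destination column in this way is useful when several sharded source tables are merged into one destination table and the origin of each row must remain traceable.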

Edit source columns: You can click Manually Edit Mapping to perform the following operations:
- Use functions supported by the source database to process columns. For example, use Max(id) to synchronize only the maximum value.
- Manually edit source columns if the column mapping does not retrieve all columns.
Note
MaxCompute Reader does not support the use of functions.
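
The same-name, same-row, and rule-based mapping modes described in this section differ only in how the pairs of columns are generated. The sketch below shows one plausible reading of each mode. The function names and column names are hypothetical, and the prefix rule is just one example of a rule-based mapping.

```python
def same_name_mapping(src_cols: list, dst_cols: list) -> dict:
    """Map each source column to the destination column with the same name."""
    dst = set(dst_cols)
    return {col: col for col in src_cols if col in dst}


def same_row_mapping(src_cols: list, dst_cols: list) -> dict:
    """Map columns by position: first to first, second to second, and so on."""
    return dict(zip(src_cols, dst_cols))


def prefix_rule_mapping(src_cols, dst_cols, src_prefix, dst_prefix) -> dict:
    """Rule-based example: swap a source prefix for a destination prefix."""
    dst = set(dst_cols)
    mapping = {}
    for col in src_cols:
        if col.startswith(src_prefix):
            candidate = dst_prefix + col[len(src_prefix):]
            if candidate in dst:
                mapping[col] = candidate
    return mapping


src = ["src_user_name", "src_amt", "id"]
dst = ["tgt_user_name", "tgt_amount", "id"]

print(same_name_mapping(src, dst))  # {'id': 'id'}
print(same_row_mapping(src, dst))
print(prefix_rule_mapping(src, dst, "src_", "tgt_"))  # {'src_user_name': 'tgt_user_name'}
```

In this sample, the prefix rule cannot pair src_amt with tgt_amount because the core names differ, which is the kind of case the intelligent mapping prompts for abbreviations are intended to cover.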

Step 4: Advanced configuration

Important

The advanced configuration is equivalent to the Channel feature in the earlier version of data synchronization.

Use the advanced configuration to control properties of the data synchronization process. For more information, see Channel Control.

Parameter	Description
Expected Maximum Concurrency	Specifies the maximum number of concurrent threads for reading from the source or writing to the destination. Note: Due to resource specifications and other factors, the actual concurrency during execution may be less than or equal to the configured value. The resource group is charged based on the actual concurrency. For more information, see Billing details. Task scheduling fees are related to the number of single-table batch synchronization tasks, not the configured concurrency.
Sync Rate	Controls the synchronization speed. Throttling: Limits the synchronization speed to protect the source database from excessive read load. The minimum throttling value is 1 MB/s. No throttling: The task provides the maximum transfer performance available, subject to the configured concurrency. Note: The traffic metric is measured by Data Integration and does not represent actual NIC traffic. NIC traffic is typically 1 to 2 times the channel traffic, depending on the transport serialization of the data storage system.
Policy for Dirty Data Records	Dirty data refers to data records that fail to be written to the destination due to exceptions such as type conflicts or constraint violations. Single-table batch synchronization supports defining a dirty data policy. You can define the number of dirty data records to tolerate and how they affect the task. If no configuration is specified, dirty data is allowed by default and does not affect task execution. If the value is set to 0, no dirty data is allowed, and the task fails if any dirty data is generated during synchronization. If dirty data is allowed and a threshold is set, the task ignores dirty data that is within the threshold (it is not written to the destination) and continues to run normally, and the task fails if the dirty data exceeds the threshold. Important: Excessive dirty data affects the overall synchronization speed of the task.
Distributed Execution	Controls whether to enable distributed mode for the task. Enabled: Splits the task into multiple processes for concurrent execution, breaking the single-process bottleneck and improving synchronization efficiency. Disabled: The task runs in a single process. Distributed mode is recommended for high-performance requirements. It can also utilize fragmented resources on machines, improving resource utilization. Important: The concurrency must be 8 or greater to enable distributed processing. Enabling distributed processing consumes more resources. If an out-of-memory (OOM) error occurs during runtime, try disabling this feature.
Time Zone	For cross-time-zone synchronization, configure the source time zone for time zone conversion.
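
The three dirty data cases in the table (no limit configured, a limit of 0, and a positive threshold) reduce to one comparison. The sketch below illustrates that rule with a toy synchronization loop in which a failed integer conversion stands in for a type conflict. The names task_fails and sync are hypothetical, and the code is not the Data Integration implementation.

```python
from typing import Optional


def task_fails(dirty_count: int, limit: Optional[int]) -> bool:
    """Return True if the task should fail for the given dirty record count.

    limit=None means no limit is configured, so dirty data never fails the task.
    limit=0 means no dirty data is allowed.
    """
    if limit is None:
        return False
    return dirty_count > limit


def sync(raw_rows: list, limit: Optional[int]):
    """Write rows that convert to int; count the rest as dirty data."""
    written, dirty = [], 0
    for raw in raw_rows:
        try:
            written.append(int(raw))
        except ValueError:
            dirty += 1  # dirty records are not written to the destination
            if task_fails(dirty, limit):
                raise RuntimeError(f"dirty data limit exceeded: {dirty} > {limit}")
    return written, dirty


print(sync(["1", "x", "3"], None))  # ([1, 3], 1)
print(sync(["1", "x", "3"], 1))     # ([1, 3], 1)
try:
    sync(["1", "x", "3"], 0)
except RuntimeError as err:
    print("task failed:", err)
```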

Note

In addition to the preceding configurations, the overall synchronization speed is also affected by the performance of the source data source, the network environment, and other factors. For more information about synchronization speed and tuning, see Synchronization speed and tuning.

Step 5: Configure schedule settings

For a single-table batch synchronization task with periodic scheduling, you must configure the properties for automatic scheduling. Go to the editing page of the node, click Scheduling Settings on the right side, and configure the schedule settings for the node.

Configure scheduling parameters, scheduling policies, scheduling time, and dependencies for the synchronization task. The configuration method is the same as for other data development nodes.

For schedule settings in the new version of Data Studio, see Node scheduling (new version).
For schedule settings in legacy Data Studio, see Node scheduling (legacy).

For more information about using scheduling parameters, see Typical scenarios of scheduling parameters in Data Integration.

Step 6: Test and deploy the task

Configure run parameters.

On the right side of the single-table batch synchronization task configuration page, click Run Configuration and configure the following parameters for test runs.

Configuration item	Description
Resource Group	Select a resource group that has network connectivity with the data sources.
Script Parameters	Assign values to the placeholder parameters in the data synchronization task. For example, if the Data Integration task uses the `${bizdate}` parameter, configure a date parameter in the `yyyymmdd` format.
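
For a test run you need a concrete value in the yyyymmdd format. The sketch below derives one from a run date. It assumes the common convention that the business date is the day before the run date; if your task uses a different offset, adjust the helper accordingly. The function name bizdate_for is hypothetical.

```python
from datetime import date, timedelta


def bizdate_for(run_date: date) -> str:
    """Return the day before `run_date` as a yyyymmdd string.

    Assumption: the business date is one day earlier than the run date.
    """
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


print(bizdate_for(date(2024, 1, 2)))  # 20240101
print(bizdate_for(date(2024, 3, 1)))  # 20240229
```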

Run the task.

Click the Run button on the toolbar to run and debug the task in Data Studio. You can then create a node of the corresponding destination table type to query the destination table data and verify whether the synchronized data meets expectations.

Deploy the task.

After the task passes the test run, if the task needs to be run periodically, click the button at the top of the node editing page to deploy the task to the production environment. For more information about task deployment, see Deploy tasks.

Limitations

Single-table batch synchronization tasks can only be configured in Data Studio.
Some data sources do not support wizard mode for configuring single-table batch synchronization tasks.

After you select a data source, if the system indicates that the current data source does not support wizard mode, click the icon on the toolbar to switch to script mode and continue configuring the task. For more information, see Script mode configuration.
Wizard mode has a low learning curve but does not support some advanced features. For more granular configuration management, click the convert-to-script icon on the toolbar to switch to script mode.
A single-table batch synchronization task in wizard mode supports configuring synchronization for a single table and partial sharded database and table synchronization (sharded database and table synchronization is supported only for certain data source types and requires consistent table schemas). It does not support full-database synchronization (bulk synchronization of table schemas and data). For full-database synchronization, see Full-database batch synchronization tasks.
A batch synchronization task cannot be directly converted to a real-time synchronization task. If you need real-time data synchronization, create a single-table real-time synchronization task node.
If a message indicates that the node name is too long when you deploy the task, modify the node name in the advanced configuration on the deployment page. Make sure that the name does not exceed 128 characters.
