Data Integration provides a codeless UI to guide you through task development. Using the codeless UI, you can configure a source and destination, then use DataWorks scheduling parameters to periodically synchronize full or incremental data from a single source table or from sharded databases and tables to a destination table. This topic describes how to configure a single-table batch synchronization task in the codeless UI. The specific configuration varies depending on the data source. For more information, see Supported data sources and synchronization solutions.
Prerequisites
Configure the required source and destination databases in Data Source Management in the DataWorks console. For more information, see Data source list.
NoteFor information about data sources that support single-table batch synchronization and their configurations, see Supported data sources and synchronization solutions.
For an overview of data source features, see Data Source Management.
Purchase a resource group with appropriate specifications and associate it with the workspace. For more information, see Use serverless resource groups.
Establish network connectivity between the resource group and the data sources. For more information, see Configure network connectivity.
If you need to synchronize a MaxCompute table that is not bound to the current workspace (for example, in a cross-project synchronization), you must first add the target MaxCompute project as a DataWorks data source. This enables you to select the table as a source or destination in the synchronization task. For more information about configuring data sources, see Data Source Management.
Step 1: Create a Data Integration node
Data Studio (new version)
Log on to the DataWorks console. In the left-side navigation pane, choose . Select the desired workspace from the drop-down list and click <p><a href={url} target="_blank">Learn more.</a></p>Data Studio.
Create a workflow. For more information, see Workflows.
Create a Data Integration node in one of the following ways:
Method 1: In the upper-right corner of the workflow list, click
, and then choose .Method 2: Double-click the workflow name, and then drag the Data Integration node from the Data Integration directory to the workflow editing panel on the right.
Configure the source and destination types for the node, select Single Table Batch Sync as the specific type, and click OK to complete the creation.
Legacy Data Studio
Log on to the DataWorks console. In the left-side navigation pane, choose . Select the desired workspace from the drop-down list and click Data Analytics.
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node in one of the following ways:
Method 1: Expand the workflow, right-click .
Method 2: Double-click the workflow name, and then drag the Batch Synchronization node from the Data Integration directory to the workflow editing panel on the right.
Follow the on-screen instructions to create a batch synchronization node.
Step 2: Configure data sources and runtime resources
In this example, the Source data source type is set to MySQL and the data source is set to mysql. The Destination data source type is set to MaxCompute (ODPS) and the data source is set to own_mc. The Resource Group is set to dwGroup, and CU is set to 0.5 CU.
In the Source Information and Destination sections, select the data source objects from which you want to read data and to which you want to write data.
In the Runtime Resource section, select the Resource Group for the synchronization task, and allocate Resource Group CU to the task. If your synchronization task encounters an out-of-memory (OOM) error due to insufficient resources, increase the CU value for the resource group. For recommended resource quota configurations, see Resource group performance metrics - Data Integration.
Make sure that both the source and destination data sources pass the Connectivity Check. If the network between the data source and the resource group is not connected, follow the on-screen instructions or the documentation to configure network connectivity. For more information, see Configure network connectivity.
If you have created a resource group but it is not displayed here, check whether the resource group has been associated with the workspace. For more information, see Use serverless resource groups.
Step 3: Configure the synchronization solution
In the source and destination sections, configure the tables for reading and writing data, and specify the data scope for synchronization.
The configurations vary depending on the plug-in. The following content uses common configurations as examples. For information about whether a specific plug-in supports a configuration and how the configuration works, see the documentation for that plug-in. For more information, see Data source list.
1. Source
In the source section, configure the data table and fill in the required parameters as prompted.
Operation | Description |
Configure data filtering |
The incremental synchronization configuration method varies depending on the data source (plug-in). If no data filtering condition is configured, all data in the table is synchronized by default. |
Configure a split key for relational databases | Specify the column in the source based on which the data to be synchronized is split. When the synchronization task runs, the data is split into multiple tasks based on this column for concurrent, batch data reads. We recommend that you use the primary key of the table as the splitPk because the primary key is usually evenly distributed, which prevents data hotspots in the resulting shards. Currently, splitPk supports only integer-type data splitting. String, floating-point, date, and other types are not supported. If you specify a non-supported type, DataWorks ignores the splitPk feature and uses a single channel for synchronization. If you do not specify a splitPk value, or if the splitPk value is empty, data synchronization uses a single channel to synchronize the table data. Not all plug-ins support specifying a split key to configure task splitting logic. The preceding information is provided as an example only. Refer to the documentation for the specific plug-in. For more information, see Data source list. |
2. Data processing
Data processing is a feature of the new version of Data Studio. In legacy Data Studio, you must select Use New UI (With Data Processing Feature) when you create a task. We recommend that you upgrade legacy workspaces to the new version to use the full range of features: Upgrade to the new version.
Data processing allows you to process data from the source table by using methods such as string replacement, AI-assisted processing, and data vectorization, and then write the processed data to the destination table.
Take string replacement as an example. The configuration items include Name and Description. In the replacement rules, select the Field Name, enter the Content to Replace (which supports Regular Expression Matching and Case-Sensitive Matching), and specify the Replacement Content. You can click Add Rule to add multiple replacement rules, and use Data Output Preview in the upper-right corner to view the processing results.
Click the toggle button to enable data processing.
In the Data Processing List, click Add Node and select a data processing type: Replace String, AI-Assisted Processing, or Data Vectorization. You can add multiple data processing nodes. DataWorks processes them in order.
Configure the data processing rules as prompted. For information about AI-assisted processing and data vectorization, see Data processing.
NoteData processing requires additional compute resources, which increases the resource consumption overhead. It also increases the duration of the data synchronization task. Minimize the complexity of the processing to avoid affecting synchronization efficiency.
3. Destination
In the destination section, configure the data table and fill in the required parameters as prompted.
Operation | Description |
Configure pre-sync and post-sync statements | Some data sources support executing SQL statements on the destination before synchronization (before data is written to the destination) and after synchronization (after data is written to the destination). Example: MySQL Writer supports preSql and postSql configuration, which allows you to run MySQL commands before or after data is written to MySQL. For example, you can configure the MySQL table truncation command |
Define the write conflict resolution mode | Define how data is written to the destination when conflicts occur, such as path or primary key conflicts. This configuration depends on the characteristics of the data source and the capabilities of the writer plug-in. You need to refer to the documentation for the specific writer plug-in for configuration details. |
MaxCompute partition table configuration notes
When the destination is a MaxCompute partition table, note the following:
Partition column identification: DataWorks automatically identifies the partition structure of the MaxCompute destination table. If only some partition columns are displayed in the UI, check whether all partition columns are correctly defined in both the development environment and the production environment. If the task fails and prompts you to configure table partition information, complete the partition parameters in the destination configuration.
Column mapping refresh: If new columns are added to the source or destination but are not displayed in the column mapping section, try the following methods to refresh the cache:
Make sure that the table schemas in both the development environment and the production environment are updated.
On the configuration page, switch to a different table and then switch back to the original table to refresh the cache.
If the cache is still not refreshed, restart the browser or use incognito mode to re-open the configuration page.
4. Configure column mappings
After you select the source and destination, you must specify the column mapping between the reader and writer. After you configure the column mappings, the task writes source columns to the corresponding destination columns based on the mappings.
If a source column is not mapped to a destination column, the data in that source column is not synchronized to the destination.
If the automatic mapping does not match your expectations, manually adjust the mappings.
If you do not need a column mapping, you can manually delete the mapping line between the source and destination columns. The data in that source column will not be synchronized to the destination.
During synchronization, column type mismatches between the source and destination may generate dirty data, which prevents data from being written to the destination. For the number of dirty data records that can be tolerated during synchronization, configure the Advanced configuration in the next step.
Same-name mapping, same-row mapping, intelligent mapping, and rule-based mapping are supported. During configuration, you can also:
Intelligent mapping: To improve configuration efficiency and reduce manual matching errors, Data Integration supports the intelligent column mapping feature. The system uses AI semantic analysis to automatically identify column names, data types, and comments of the source and destination tables, and recommends optimal mappings with one click. You only need to confirm the recommendations or make minor adjustments to complete the configuration.
In the column mapping section, click Intelligent Mapping to open the intelligent mapping dialog. You can describe your mapping requirements in natural language.
Applicable scenario
Typical example
Recommended prompt
Global semantic matching
Column names are completely different but have the same meaning
(Example:user_id↔device_id)Perform semantic matching on all columns of the source and destination tables, and automatically identify columns with the same meaning.Specific business domain matching
Only specific business columns need to be mapped
(Example: only "user" or "order" related columns)Map only the columns in the source table that contain "user information" (such as name, phone number, and ID) to the corresponding columns in the destination table.
(Note: You can replace the keyword with "order", "logistics", "payment", etc.)Prefix/suffix convention differences
Core names are the same but prefixes/suffixes differ
(Example:src_user_name↔tgt_user_name)Ignore prefix and suffix differences in column names, and perform semantic matching based on core names only.Abbreviation and full name matching
One side uses abbreviations while the other uses full names
(Example:amt↔amount)Identify common English abbreviation-to-full-name mappings (such as amt=amount, addr=address) and create mappings accordingly.Exclude specific columns
Some columns are similar but do not need to be synchronized
(Example:create_timeis not needed)Perform semantic matching, but exclude all columns that contain "time" or "log" in their names.Complex logic correction
The automatic matching result is incorrect and manual guidance is needed
Do not map the id column of the source table to the order_id column of the destination table. Regenerate the mapping suggestions.After you enter the description, click Generate Preview. The system displays the suggested mappings in the Matching Result Preview section. You can review and select the mappings you need, and then click Apply to add the selected mappings to the column mapping. If you are not satisfied with the results, adjust the description and regenerate the preview.
Rule-based mapping: When column names between the source and destination follow a pattern of differences, you can use the Rule-Based Mapping feature. By configuring rules such as prefix/suffix matching or character replacement, you can create column mappings in bulk. In the column mapping section, click Rule-Based Mapping, select the mapping method and rules, preview the mapping results after configuration, and then click Apply.
Assign values to destination columns: You can add constants, scheduling parameters, and built-in variables to the destination table by clicking Add Fields in the Source Table Field column. For example, '123', '${scheduling parameter}', '#{built-in variable}#'.
NoteFor more information about scheduling parameters, see Configure scheduling parameters.
Add built-in variables: You can manually add built-in variables, map them to destination columns, and output the built-in variables to downstream tasks.
The built-in variables available for each plug-in are as follows:
Built-in variable
Description
Supported plug-in
'
#{DATASOURCE_NAME_SRC}#'Source data source name
MySQL Reader
MySQL (sharded) Reader
PolarDB Reader
PolarDB (sharded) Reader
PostgreSQL Reader
PolarDB-O Reader
PolarDB-O (sharded) Reader
'
#{DB_NAME_SRC}#'Name of the database where the source table resides
MySQL Reader
MySQL (sharded) Reader
PolarDB Reader
PolarDB (sharded) Reader
PostgreSQL Reader
PolarDB-O Reader
PolarDB-O (sharded) Reader
'
#{SCHEMA_NAME_SRC}#'Name of the schema where the source table resides
PolarDB Reader
PolarDB (sharded) Reader
PostgreSQL Reader
PolarDB-O Reader
PolarDB-O (sharded) Reader
'
#{TABLE_NAME_SRC}#'Source table name
MySQL Reader
MySQL (sharded) Reader
PolarDB Reader
PolarDB (sharded) Reader
PostgreSQL Reader
PolarDB-O Reader
PolarDB-O (sharded) Reader
'
#{FILE_NAME_SRC}#'File name
OSS Reader
HDFS Reader
FTP Reader
TOS Reader
COS Reader
S3 Reader
Azure Blob Reader
'
#{FILE_PATH_SRC}#'Absolute file path
OSS Reader
HDFS Reader
FTP Reader
TOS Reader
COS Reader
S3 Reader
Azure Blob Reader
Edit source columns: You can click Manually Edit Mapping to perform the following operations:
Use functions supported by the source database to process columns. For example, use Max(id) to synchronize only the maximum value.
Manually edit source columns if the column mapping does not retrieve all columns.
NoteMaxCompute Reader does not support the use of functions.
Step 4: Advanced configuration
The advanced configuration is equivalent to the Channel feature in the earlier version of data synchronization.
You can use the advanced configuration to control properties related to the data synchronization process. For more information about the parameters, see Channel Control.
Parameter | Description |
Expected Maximum Concurrency | Specifies the maximum number of threads that the current task uses to read data from the source or write data to the destination concurrently. Note
|
Sync Rate | Used to control the synchronization speed.
Note The traffic metric is the metric measured by Data Integration itself and does not represent the actual NIC traffic. Typically, NIC traffic is 1 to 2 times the channel traffic. The actual traffic expansion depends on the transport serialization of the specific data storage system. |
Policy for Dirty Data Records | Dirty data refers to data records that fail to be written to the destination due to exceptions such as type conflicts or constraint violations. Single-table batch synchronization supports defining a dirty data policy. You can define the number of dirty data records to tolerate and how they affect the task.
Important Excessive dirty data affects the overall synchronization speed of the task. |
Distributed Execution | Used to control whether to enable distributed mode for the current task.
If you have high requirements for synchronization performance, use distributed mode. Distributed mode can also utilize fragmented resources on machines, which improves resource utilization. Important
|
Time Zone | If the source and destination require cross-time-zone synchronization, you can configure the source time zone for time zone conversion. |
In addition to the preceding configurations, the overall synchronization speed is also affected by the performance of the source data source, the network environment, and other factors. For more information about synchronization speed and tuning, see Synchronization speed and tuning.
Step 5: Configure schedule settings
For a single-table batch synchronization task with periodic scheduling, you must configure the properties for automatic scheduling. Go to the editing page of the node, click Scheduling Settings on the right side, and configure the schedule settings for the node.
You must configure scheduling parameters, scheduling policies, scheduling time, and dependencies for the synchronization task. The configuration method is the same as that for other data development nodes and is not described here.
For schedule settings in the new version of Data Studio, see Node scheduling (new version).
For schedule settings in legacy Data Studio, see Node scheduling (legacy).
For more information about using scheduling parameters, see Typical scenarios of scheduling parameters in Data Integration.
Step 6: Test and deploy the task
Configure run parameters.
On the right side of the single-table batch synchronization task configuration page, click Run Configuration and configure the following parameters for test runs.
Configuration item
Description
Resource Group
Select a resource group that has network connectivity with the data sources.
Script Parameters
Assign values to the placeholder parameters in the data synchronization task. For example, if the Data Integration task uses the
${bizdate}parameter, configure a date parameter in theyyyymmddformat.Run the task.
Click the
Run button on the toolbar to run and debug the task in Data Studio. You can then create a node of the corresponding destination table type to query the destination table data and verify whether the synchronized data meets expectations.Deploy the task.
After the task passes the test run, if the task needs to be run periodically, click the
button at the top of the node editing page to deploy the task to the production environment. For more information about task deployment, see Deploy tasks.
Limitations
Single-table batch synchronization tasks can only be configured in Data Studio.
Some data sources do not support wizard mode for configuring single-table batch synchronization tasks.
After you select a data source, if the system indicates that the current data source does not support wizard mode, click the
icon on the toolbar to switch to script mode and continue configuring the task. For more information, see Script mode configuration.Wizard mode has a low learning curve, but it does not support some advanced features. If you need more granular configuration management, click the convert-to-script icon on the toolbar to switch to script mode and configure the single-table batch synchronization task.
A single-table batch synchronization task in wizard mode supports configuring synchronization for a single table and partial sharded database and table synchronization (sharded database and table synchronization is supported only for certain data source types and requires consistent table schemas). It does not support full-database synchronization (bulk synchronization of table schemas and data). For full-database synchronization, see Full-database batch synchronization tasks.
A batch synchronization task cannot be directly converted to a real-time synchronization task. If you need real-time data synchronization, create a single-table real-time synchronization task node.
If a message indicates that the node name is too long when you deploy the task, modify the node name in the advanced configuration on the deployment page. Make sure that the name does not exceed 128 characters.
Next step
After the task is deployed to the production environment, you can go to the Operation Center in the production environment to view the scheduled task. For more information about running and managing Data Integration tasks, monitoring task status, and managing resource groups, see Data Integration task O&M.