Configure batch synchronization using codeless UI 2.0

Prerequisites

Configure the source and destination data sources before you create a batch synchronization task. This lets you select the data sources by name during task configuration.
Note For more information about data sources, see Overview of data sources.
Purchase an exclusive resource group for Data Integration that meets your business requirements. For more information, see Use an exclusive resource group for Data Integration.
Establish network connectivity between the exclusive resource group for Data Integration and the data sources. For more information, see Network connectivity solutions.

Go to the DataStudio page

Log on to the DataWorks console.
In the left-side navigation pane, click Workspace list.
Select the region of your workspace, find the workspace, and then click Data Studio in the Actions column.

Procedure

Step 1: Create a batch synchronization node
Step 2: Configure the batch synchronization task
1. Configure the network connection
2. Select objects to synchronize
3. Configure field mappings
4. Configure channel control
5. Configure scheduling properties
Step 3: Commit and deploy the task

Step 1: Create a batch synchronization node

Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node in one of two ways:
- Method 1: Expand the workflow, right-click Data integration, and then choose Create Node > Batch Synchronization.
- Method 2: Double-click the workflow name. From the Data integration folder, drag the Batch Synchronization node to the workflow editor canvas on the right.
In the dialog box that appears, configure the parameters for the batch synchronization node.

Step 2: Configure the batch synchronization task

Configure the network connection.
Select the source and destination for the batch synchronization task and the resource group to run the task, and then test the connectivity.
- Data Integration also supports synchronizing data from sharded source databases and tables to a single destination table. For more information, see Sharding synchronization.
- If a network connection fails between the data source and the resource group, follow the on-screen instructions or see Network connectivity solutions to configure network connectivity.
Important Configuration options vary by plugin. The following content uses common settings as an example. To check if a plugin supports a specific setting and how to configure it, see the plugin's documentation. For more information, see Supported data sources and Reader/Writer plugins.

Click Next step to configure the synchronization task.

Select objects to synchronize.

In the data source selection area, configure the source and destination tables, and specify the synchronization scope. In the Source area on the left, select a data source type (for example, MySQL), a database, and a table. In the Destination area on the right, select a destination data source (for example, MaxCompute) and a destination table. Then, configure the Clean-up Rule (for example, Clean Existing Data Before Writing (Insert Overwrite)) and Empty String as Null options. You can click Generate Target Table to quickly create a destination table.

Reader

Actions	Description
Configure the synchronization scope	If you specify a filter condition in the Data Filter text box, the task synchronizes only the data that meets the condition. If you do not specify a filter condition, the task synchronizes all data in the table by default. The filter syntax is similar to standard database syntax: during synchronization, the task concatenates the condition into a complete SQL statement to extract data from the source. You can also use scheduling parameters in the filter condition so that the condition changes dynamically with the task scheduling time, which implements incremental data synchronization. The method to configure incremental data synchronization varies by plugin. For more information, see Scenario: Configure a batch synchronization task for incremental data. Note When you click Next to configure scheduling properties, you can assign values to the variables defined in the data filter and destination table settings. This lets you write incremental or full data to specific time-based partitions in the destination table. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
Configure a shard key for a relational database	Specifies the source field (splitPk) used to split data. When the synchronization task runs, it splits the data into multiple sub-tasks based on this key, which enables parallel, batched data reading. Note Use the primary key of the table as the splitPk because primary keys are usually evenly distributed, which helps prevent data hotspots in the resulting shards. Currently, splitPk supports only integer fields. It does not support strings, floats, dates, or other data types. If you specify a field of an unsupported data type, the splitPk setting is ignored, and the task uses a single channel for synchronization. If you do not specify splitPk or leave it empty, the task also synchronizes data in a single channel. Not all plugins support shard keys for task splitting. To check whether your plugin does, see Supported data sources and Reader/Writer plugins.
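
The following Python sketch illustrates the two Reader settings above: a data filter that contains a scheduling variable, and an integer shard key that splits the read into several range queries. It is not the Data Integration implementation. The table name, column names, the variable name `bizdate`, and the helper functions are hypothetical, and the way ranges are computed is an assumption made for illustration.

```python
from string import Template


def split_ranges(min_id, max_id, parts):
    """Split the inclusive integer range [min_id, max_id] into at most
    `parts` contiguous, non-overlapping inclusive ranges."""
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    parts = max(1, min(parts, total))
    base, extra = divmod(total, parts)
    ranges = []
    start = min_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def build_queries(table, columns, split_pk, min_id, max_id, parts, where=None):
    """Build one SELECT per shard; the optional filter is ANDed with the range."""
    queries = []
    for lo, hi in split_ranges(min_id, max_id, parts):
        cond = f"{split_pk} >= {lo} AND {split_pk} <= {hi}"
        if where:
            cond = f"({where}) AND {cond}"
        queries.append(f"SELECT {', '.join(columns)} FROM {table} WHERE {cond}")
    return queries


# Hypothetical filter with a scheduling variable, resolved for one run.
data_filter = Template("ds = '${bizdate}'").substitute(bizdate="20240101")

queries = build_queries(
    table="orders",            # hypothetical source table
    columns=["id", "amount"],  # hypothetical columns
    split_pk="id",             # integer primary key used as the shard key
    min_id=1,
    max_id=10,
    parts=3,
    where=data_filter,
)
for q in queries:
    print(q)
```

With an evenly distributed integer key, each range query returns a similar number of rows, which is why the primary key is the recommended shard key.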

Writer

Actions	Description
Configure pre- and post-synchronization SQL statements	Some data sources allow you to execute SQL statements on the destination before data is written (pre-synchronization) or after data is written (post-synchronization). Example: MySQL Writer supports preSql and postSql. You can run MySQL commands before or after data is written to MySQL. For example, in the Prepare statement before import (preSql) configuration of MySQL Writer, you can enter the command `truncate table tablename` to clear old data from a table before new data is written.
Define the write mode for conflicts	Specifies how to handle write conflicts, such as path or primary key conflicts. The available settings depend on the data source and Writer plugin features. For configuration details, see the documentation for your specific Writer plugin.
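
As a rough illustration of the order in which pre- and post-synchronization statements run, the sketch below uses Python's built-in `sqlite3` module in place of a real destination. The function `write_with_hooks`, the `orders` table, and the statements themselves are hypothetical. SQLite has no `TRUNCATE` statement, so `DELETE FROM` stands in for the `truncate table tablename` example above.

```python
import sqlite3


def write_with_hooks(conn, pre_sql, rows, post_sql):
    """Run pre-SQL statements, write the rows, then run post-SQL statements."""
    cur = conn.cursor()
    for stmt in pre_sql:
        cur.execute(stmt)
    cur.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
    for stmt in post_sql:
        cur.execute(stmt)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.5)")  # stale row from an earlier run

write_with_hooks(
    conn,
    pre_sql=["DELETE FROM orders"],  # clear old data before the write
    rows=[(2, 20.0), (3, 30.0)],
    post_sql=["CREATE INDEX IF NOT EXISTS idx_orders_amount ON orders (amount)"],
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
print(remaining_ids)
```

Because the pre-SQL statement runs first, only the newly written rows remain in the table after the run.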

Configure field mappings.
After you configure field mappings, the task writes data from the source fields to the corresponding destination fields based on the mappings.
During synchronization, data type mismatches between source and destination fields can cause dirty data, which can cause write failures. You can configure the tolerance for dirty data in the Channel step.

Note If a source field is not mapped to a destination field, its data is not synchronized.
You can map fields by name or by row. You can also perform the following actions:
- Assign values to destination fields: Click Add a row to add constants or variables to the destination table, such as '123' or '${variable_name}'.
  Note When you click Next to configure scheduling properties, you can assign values to the variables defined here. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
- Edit source fields: Click the icon next to a data type to perform the following actions:
  - Use functions supported by the source database to process fields. For example, use Max(id) to synchronize only the record with the maximum ID.
  - Manually edit source fields if some fields were not automatically fetched during the mapping process.
  Note MaxCompute Reader does not support functions.
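
The difference between mapping by name and mapping by row, and the effect of adding a constant destination value, can be sketched in Python as follows. The function names and field names are hypothetical, and exact, case-sensitive name matching is an assumption; the codeless UI may match names differently.

```python
def map_by_name(source_fields, dest_fields):
    """Pair each source field with the destination field of the same name."""
    dest = set(dest_fields)
    return [(name, name) for name in source_fields if name in dest]


def map_by_row(source_fields, dest_fields):
    """Pair fields by position: first with first, second with second, and so on."""
    return list(zip(source_fields, dest_fields))


def project(record, mapping, constants=None):
    """Build the destination record; unmapped source fields are dropped."""
    out = {dst: record[src] for src, dst in mapping}
    out.update(constants or {})
    return out


source_fields = ["id", "name", "created_at"]
dest_fields = ["id", "user_name", "created_at"]
record = {"id": 7, "name": "alice", "created_at": "2024-01-01"}

by_name = map_by_name(source_fields, dest_fields)
by_row = map_by_row(source_fields, dest_fields)

# Mapping by name leaves "name" unmapped, so its data is not synchronized.
print(project(record, by_name))
# Mapping by row pairs "name" with "user_name"; a constant fills an extra column.
print(project(record, by_row, constants={"source_tag": "123"}))
```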

Configure channel control.

You can configure channel settings to control various properties of the data synchronization process.

Parameter	Description
Expected Maximum Concurrency	The maximum number of parallel threads for reading from the source or writing to the destination. Note Due to resource constraints and other factors, the actual number of concurrent sessions may not reach the specified value. Charges for a debugging resource group are based on the actual number of concurrent sessions. For more information, see Performance metrics.
Bandwidth Throttling	The data synchronization rate. Rate Limit: You can limit the synchronization rate to protect the source database from excessive load. The minimum rate is 1 MB/s. No Rate Limit: The task runs at the maximum possible speed allowed by the hardware environment and concurrency settings. Note This traffic metric is measured by Data Integration and may not reflect the actual network interface card (NIC) traffic. Typically, NIC traffic is 1 to 2 times the channel traffic. The actual difference depends on the data storage system's serialization method.
Error record count (dirty data control)	The threshold for dirty data and its effect on the task. Important A large amount of dirty data can slow down synchronization. If you do not configure this parameter, dirty data is allowed by default and does not affect task execution. If you set it to 0, no dirty data is allowed, and the task fails as soon as a dirty record is generated. If you set a threshold greater than 0, the task skips dirty records (they are not written to the destination) and continues to run while the number of dirty records is within the threshold, and fails once the number exceeds the threshold. Note What is dirty data? Dirty data is data that is meaningless to your business, incorrectly formatted, or causes an issue during synchronization. Data Integration classifies any record that fails to be written to the destination data source as dirty data. For example, writing a VARCHAR value from the source to an INT column in the destination can cause a conversion error, so the record fails to be written and is counted as dirty data.
Distributed Execution	Specifies whether to run the task in distributed mode. Enabled: Distributed mode splits the task into slices and executes them concurrently across multiple nodes. This enables the synchronization speed to scale horizontally with the size of the execution cluster, overcoming single-machine bottlenecks. Disabled: The configured concurrency applies only to threads on a single machine and does not leverage multi-machine computing. You can use distributed mode if you have high-performance synchronization requirements. This mode also improves resource utilization by using fragmented resources on the machines. Important If your exclusive resource group has only one machine, do not use distributed mode because it cannot leverage multi-machine capabilities. If a single machine already meets your speed requirements, use the single-machine mode for simplicity. You can enable distributed processing capability only if the number of concurrent sessions is 8 or greater. Support for distributed mode varies by data source. For details, refer to the documentation for your specific plugin: Supported data sources and Reader/Writer plugins.

Note The preceding settings, along with the source data source performance and network environment, affect the overall synchronization speed. For more information about speed tuning, see Accelerate or throttle batch synchronization tasks.
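
The dirty data threshold described in the table above can be sketched in Python as follows. This is a simplified model, not the Data Integration implementation: `sync_records`, `DirtyDataLimitExceeded`, and the sample records are hypothetical, and an `int()` conversion error stands in for a failed write to an INT column.

```python
class DirtyDataLimitExceeded(RuntimeError):
    """Raised when the number of dirty records exceeds the configured limit."""


def sync_records(records, write, error_limit=None):
    """Write records one by one and count the ones that fail.

    error_limit=None: dirty data is allowed without limit.
    error_limit=0:    the first dirty record fails the task.
    error_limit=N:    the task fails once more than N records are dirty.
    """
    written = 0
    dirty = 0
    for record in records:
        try:
            write(record)
            written += 1
        except (ValueError, TypeError):
            dirty += 1  # the record is skipped, not written
            if error_limit is not None and dirty > error_limit:
                raise DirtyDataLimitExceeded(
                    f"{dirty} dirty records exceed the limit of {error_limit}"
                )
    return written, dirty


def write_to_int_column(record):
    # Stand-in for a write into an INT column: non-numeric text fails.
    int(record["age"])


records = [{"age": "31"}, {"age": "abc"}, {"age": "45"}]

print(sync_records(records, write_to_int_column))                 # no limit set
print(sync_records(records, write_to_int_column, error_limit=1))  # within the limit
try:
    sync_records(records, write_to_int_column, error_limit=0)
except DirtyDataLimitExceeded as exc:
    print("task failed:", exc)
```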

Click Next step to configure scheduling properties.
For a periodically scheduled batch synchronization task, you must configure its automatic scheduling properties. For information about how to use scheduling parameters, see Use scheduling parameters in Data Integration.
- Configure node scheduling properties: Assign scheduling parameters as values to the variables used in the preceding configurations. You can assign constants or variables.
- Configure time properties: Set the periodic schedule for the task in the production environment. In the time properties section of the scheduling configuration, you can configure properties such as the instance generation method, scheduling type, and scheduling cycle.
- Configure resource properties: In the resource properties section of the scheduling configuration, select the scheduling resource group that dispatches the scheduled task to the execution resource group for Data Integration.
  Note A scheduling resource group dispatches Data Integration batch tasks to an execution resource group, which incurs scheduling fees. For more information about the task dispatch mechanism, see Resource groups in DataWorks.
Click Complete configuration.

Step 3: Commit and deploy the task

If the task needs to run on a periodic schedule, you must deploy it to the production environment. For more information about task deployment, see Deploy tasks.
