Configure a script-mode batch synchronization task 2.0

Prerequisites

Before you configure a batch synchronization task, you must configure the source and destination data sources. This lets you select them by name when you configure the task. For more information about supported data sources and their Reader and Writer plugins, see Supported data sources and Reader and Writer plugins.
Note For more information about data sources, see Data source overview.
You have purchased an exclusive resource group for Data Integration. For more information, see Use an exclusive resource group for Data Integration.
A network connection is established between the exclusive resource group for Data Integration and the data source. For more information, see Network connectivity solutions.

Go to the DataStudio page

Log on to the DataWorks console.
In the left-side navigation pane, click Workspace list.
Select the region of your workspace, find the workspace, and then click Data Studio in the Actions column.

Procedure

Step 1: Create a batch synchronization node
Step 2: Configure the batch synchronization task
1. Configure the synchronization network link
2. Switch to script mode and import a template
3. Edit the script to configure the data source, destination, transmission rate limit, and dirty data handling rules
4. Configure scheduling properties
Step 3: Commit and publish the task

Step 1: Create a batch synchronization node

Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node in one of the following ways:
- Method 1: Expand the workflow, right-click Data Integration, and then choose Create Node > Batch Synchronization.
- Method 2: Double-click the workflow name. From the Data Integration folder, drag the Batch Synchronization node to the workflow editor canvas on the right.
In the dialog box that appears, configure the parameters for the batch synchronization node.

Step 2: Configure the batch synchronization task

Configure the synchronization network link.
Select the source and destination data sources, and the resource group for the task. Then, test the connectivity.
Note

You can synchronize data from sharded databases and tables at the source to a single table at the destination. For more information, see Synchronize data from sharded databases and tables.

If a network connection fails between the data source and the resource group, follow the on-screen instructions or documentation to configure network connectivity. For more information, see Network connectivity solutions.
Switch to script mode.
Click the Switch to script icon in the toolbar.
If no script is configured, click the template import icon in the toolbar and follow the prompts to import a script template.

Edit the script to configure the synchronization task.

Note

The type and version fields have default values and cannot be modified.
You can ignore the Processor configuration in the script. No configuration is required.

The following code shows the general configuration for a script in script mode:

{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "plugin_name",
      "parameter": {...},
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "plugin_name",
      "parameter": {...},
      "name": "Writer",
      "category": "writer"
    },
    {
      "name": "Processor",
      "stepType": null,
      "category": "processor",
      "copies": 1,
      "parameter": {...}
    }
  ],
  "setting": {
    "executeMode": null,
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "concurrent": 2,
      "throttle": false
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
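
The structure of the script above can be checked before you commit it. The following is a minimal sketch, not part of DataWorks: it assumes Python 3.9 or later, and the `check_script` helper, the `mysql` and `odps` plugin names, and the empty `parameter` objects are placeholders chosen for illustration. It verifies that the fixed fields are unchanged, that a reader step and a writer step exist, and that every hop points at an existing step name.

```python
import json

# Skeleton that mirrors the general configuration above. The plugin names
# "mysql" and "odps" and the empty parameter objects are placeholders.
SCRIPT = """
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {"stepType": "mysql", "parameter": {}, "name": "Reader", "category": "reader"},
    {"stepType": "odps", "parameter": {}, "name": "Writer", "category": "writer"}
  ],
  "setting": {
    "errorLimit": {"record": ""},
    "speed": {"concurrent": 2, "throttle": false}
  },
  "order": {"hops": [{"from": "Reader", "to": "Writer"}]}
}
"""


def check_script(script: dict) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    # The type and version fields have fixed values.
    if script.get("type") != "job":
        problems.append('"type" must be "job"')
    if script.get("version") != "2.0":
        problems.append('"version" must be "2.0"')

    steps = script.get("steps", [])
    categories = [step.get("category") for step in steps]
    for required in ("reader", "writer"):
        if required not in categories:
            problems.append(f"missing a step with category {required!r}")

    # Every hop must connect two step names that actually exist.
    names = {step.get("name") for step in steps}
    for hop in script.get("order", {}).get("hops", []):
        for end in ("from", "to"):
            if hop.get(end) not in names:
                problems.append(f"hop {end!r} refers to unknown step {hop.get(end)!r}")
    return problems


problems = check_script(json.loads(SCRIPT))
print(problems or "no structural problems found")
```

A check like this only covers the outer structure. The contents of each `parameter` object depend on the plugin and are described in the plugin documentation.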

Configure the basic information and field mappings for the Reader and Writer.

These parameters let you perform the operations described in the following tables. For parameter details, see the documentation for each plugin in Supported data sources and Reader and Writer plugins.

Reader

Actions	Description
Configure synchronization scope	Some plugins support filter parameters for incremental synchronization. For example, when you use the MySQL Reader plugin to synchronize MySQL data, you can combine the where parameter with DataWorks scheduling parameters to synchronize only incremental data. For more information, see Scenario: Configure an incremental batch synchronization task. Note Support for incremental synchronization, and how it is implemented, varies by plugin. For details, see the documentation for the specific plugin. If you use a plugin that supports incremental synchronization but do not specify a data filter condition, a full data synchronization is performed by default. When you configure scheduling properties, you can assign values to variables defined in the data filter and destination table configurations. This allows you to perform operations such as writing incremental or full data to the corresponding time partitions in the destination table. For more information about scheduling parameters, see Supported formats of scheduling parameters. The syntax of a filter condition is nearly identical to the syntax of the source database. During synchronization, the batch synchronization task assembles the condition into a complete SQL statement to extract data from the data source.
Configure a shard key for a relational database	Specifies the field in the source data to use for sharding. The task splits into multiple subtasks based on this field to read data in parallel. Note We recommend that you use the primary key of the table as the `splitPk` value. A primary key is usually evenly distributed, which helps prevent data hotspots in the shards. Currently, `splitPk` supports only integer columns. It does not support other types such as strings, floats, or dates. If you specify a column of an unsupported type, the `splitPk` parameter is ignored, and the task synchronizes data using a single channel. If you do not specify `splitPk`, or if its value is empty, data is also synchronized through a single channel. Not all plugins support a shard key, and the behavior described here is an example. For details, see the documentation for the specific plugin in Supported data sources and Reader and Writer plugins.
Assign values to destination fields	You can add columns to the output that are populated with constant values or variables, such as '123' or '${variable_name}'. You can assign values to the variables you define here when you configure scheduling properties in the next step. For more information about scheduling parameters, see Supported formats of scheduling parameters.
Edit source table fields	You can use functions supported by the source database to process fields. For example, use `Max(id)` to synchronize only the maximum ID value. Note MaxCompute Reader does not support functions.
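
To make the Reader options above more concrete, the following sketch shows one way a filter condition and an integer shard key could be turned into SQL statements. It is an illustration only and does not reproduce how Data Integration builds its queries. The `build_query`, `split_ranges`, and `shard_queries` helpers, the `orders` table, and the flat `table` key are assumptions; in a real plugin the parameter layout may differ. Only `where` and `splitPk` come from the descriptions above.

```python
def build_query(table, columns, where=None):
    """Assemble a SELECT statement from reader settings (illustrative only)."""
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:  # No filter condition means a full synchronization.
        sql += f" WHERE {where}"
    return sql


def split_ranges(low, high, parts):
    """Split the inclusive integer range [low, high] into contiguous ranges."""
    if high < low:
        return []
    total = high - low + 1
    parts = max(1, min(parts, total))
    size, extra = divmod(total, parts)
    ranges, start = [], low
    for i in range(parts):
        # The first `extra` ranges take one additional value each.
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def shard_queries(reader, pk_min, pk_max, parts):
    """Return one query per shard, or a single query when splitPk is empty."""
    p = reader["parameter"]
    base_where = p.get("where")
    split_pk = p.get("splitPk")
    if not split_pk:
        return [build_query(p["table"], p["column"], base_where)]
    queries = []
    for start, end in split_ranges(pk_min, pk_max, parts):
        shard = f"{split_pk} >= {start} AND {split_pk} <= {end}"
        where = f"({base_where}) AND {shard}" if base_where else shard
        queries.append(build_query(p["table"], p["column"], where))
    return queries


reader = {
    "stepType": "mysql",  # placeholder plugin name
    "name": "Reader",
    "category": "reader",
    "parameter": {
        "table": "orders",                      # hypothetical table
        "column": ["id", "amount", "'123'"],    # '123' is a constant column
        "where": "gmt_create >= '2024-01-01'",  # incremental filter
        "splitPk": "id",                        # integer primary key
    },
}

for query in shard_queries(reader, pk_min=1, pk_max=10, parts=3):
    print(query)
```

Each printed statement covers one contiguous range of `id` values, which is why an evenly distributed integer key keeps the shards balanced.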

Writer

Actions	Description
Configure pre- and post-synchronization SQL statements	Some data sources let you execute SQL statements on the destination before and after data is written. For example, MySQL Writer supports the preSql and postSql parameters, which run MySQL commands before and after data is written to MySQL. In the Pre-SQL Statement (preSql) parameter, you can specify the `truncate table tablename` command to clear old data from a table before a synchronization task begins.
Configure the write mode for conflicts	Define how to handle conflicts, such as path or primary key conflicts, when writing data to the destination. The available options depend on the destination data source and the Writer plugin. For configuration details, see the documentation for the specific Writer plugin.
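
The effect of a pre-synchronization statement and a write mode can be tried out locally. The sketch below uses an in-memory SQLite database as a stand-in for the destination, so it uses `DELETE FROM` where MySQL would use `truncate table`. The `run_writer` helper, the `orders_copy` table, the flat `table` key, and the `writeMode` name with its `insert` and `replace` values are assumptions made for this example. Check the Writer plugin documentation for the parameters that your data source actually accepts.

```python
import sqlite3

writer = {
    "stepType": "mysql",  # placeholder plugin name
    "name": "Writer",
    "category": "writer",
    "parameter": {
        "table": "orders_copy",                 # hypothetical table
        "column": ["id", "amount"],
        "preSql": ["DELETE FROM orders_copy"],  # SQLite stand-in for truncate
        "postSql": [],
        "writeMode": "insert",                  # assumed parameter name
    },
}


def run_writer(conn, writer, rows):
    """Run preSql, write the rows, then run postSql (illustrative only)."""
    p = writer["parameter"]
    for statement in p.get("preSql", []):
        conn.execute(statement)
    # "replace" overwrites rows with a conflicting primary key.
    verb = "INSERT OR REPLACE" if p.get("writeMode") == "replace" else "INSERT"
    columns = ", ".join(p["column"])
    placeholders = ", ".join("?" for _ in p["column"])
    conn.executemany(
        f"{verb} INTO {p['table']} ({columns}) VALUES ({placeholders})", rows
    )
    for statement in p.get("postSql", []):
        conn.execute(statement)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_copy (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders_copy VALUES (1, 9.5)")  # stale row from an earlier run

run_writer(conn, writer, [(1, 10.0), (2, 20.0)])
rows = conn.execute("SELECT id, amount FROM orders_copy ORDER BY id").fetchall()
print(rows)
```

Without the pre-synchronization statement, the plain insert of `id` 1 would hit a primary key conflict. Clearing the table first, or choosing a write mode that overwrites conflicting rows, avoids that failure.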

Configure channel control.

You can configure performance settings in the setting section, such as concurrency, transmission rate, and dirty data handling.

Parameter	Description
executeMode (distributed processing capability)	Determines whether the task runs in distributed mode. `distribute`: Enables distributed mode. In this mode, the task is split and distributed across multiple execution nodes to run in parallel. This allows the synchronization speed to scale horizontally with the size of the cluster, overcoming single-node performance bottlenecks. `null`: Disables distributed mode. The configured concurrency is limited to threads on a single node, and the task cannot leverage the computing power of multiple machines. Important If your exclusive resource group for Data Integration has only one machine, do not use distributed mode. If a single machine already meets your performance requirements, we recommend using the single-node mode to simplify task execution. Distributed mode can be enabled only when the concurrency is set to 8 or greater. Support for distributed mode varies by data source. For details, see the documentation for the specific plugin.
concurrent (expected maximum concurrency)	Specifies the maximum number of concurrent threads for reading from the source and writing to the destination. Note Due to factors such as resource specifications, the actual concurrency during execution may be less than or equal to the configured value. You are billed based on the actual concurrency used. For more information, see Performance metrics.
throttle (transmission rate)	Controls the transmission rate. `true`: Enables throttling. This protects the source database from excessive load by limiting the data extraction speed. The minimum rate is 1 MB/s. Note If you set `throttle` to true, you must also set the mbps parameter to specify the maximum transmission rate. `false`: Disables throttling. The task uses the maximum available bandwidth based on the hardware environment and the configured concurrency. Note The transmission rate is a metric internal to Data Integration and does not represent the actual network interface card (NIC) traffic. Typically, NIC traffic is one to two times the channel traffic, depending on the data serialization of the storage system.
errorLimit (error record control)	Sets the tolerance for dirty data. Important An excessive amount of dirty data can degrade the overall synchronization speed. If this parameter is not configured, dirty data is allowed by default, and the task continues to run even if dirty data is generated. If this parameter is set to `0`, no dirty data is allowed, and the task fails if any dirty data is generated. If this parameter is set to a threshold greater than 0: when the number of dirty data records is within the threshold, the records are ignored (not written to the destination), and the task continues to run normally; when the number exceeds the threshold, the task fails. Note What is dirty data? Dirty data is data that is meaningless to your business, has an invalid format, or causes an error during synchronization. Any record that fails to be written to the destination data source is considered dirty data. For example, if you attempt to write VARCHAR data from a source to an INT column in the destination, a conversion error occurs, and the record is not written. This record is considered dirty data.
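
The errorLimit rules in the table can be summarized in a few lines of code. The following sketch is a simplified model, not the Data Integration implementation. The `dirty_data_outcome` and `write_rows` helpers are hypothetical, and the string-to-integer conversion stands in for writing VARCHAR data to an INT column.

```python
def dirty_data_outcome(record_limit, dirty_count):
    """Return "continue" or "fail" for a given errorLimit record value."""
    # Not configured (missing or empty string): dirty data is allowed.
    if record_limit is None or record_limit == "":
        return "continue"
    # A limit of 0 tolerates nothing; otherwise the task fails only when
    # the number of dirty records exceeds the limit.
    return "fail" if dirty_count > int(record_limit) else "continue"


def write_rows(values, record_limit):
    """Convert string values to integers and count the ones that fail."""
    written, dirty = [], 0
    for value in values:
        try:
            written.append(int(value))  # stand-in for an INT destination column
        except ValueError:
            dirty += 1  # the record is skipped, not written
            if dirty_data_outcome(record_limit, dirty) == "fail":
                raise RuntimeError(f"dirty data limit exceeded after {dirty} record(s)")
    return written, dirty


print(write_rows(["1", "abc", "3"], record_limit="1"))

try:
    write_rows(["1", "abc", "3"], record_limit="0")
except RuntimeError as error:
    print("task failed:", error)
```

With a limit of 1, the single bad record is skipped and the task continues. With a limit of 0, the same input causes the task to fail.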

Note In addition to these settings, the overall synchronization speed is affected by factors such as the performance of the source data source and the network environment. For more information about transmission rates and performance tuning, see Accelerate or limit the speed of a batch synchronization task.
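
The constraints on executeMode, concurrent, and throttle described in the table above can be checked before a task is committed. The sketch below is one possible check, under stated assumptions: the `check_setting` helper and its `machines` argument are hypothetical, and `mbps` is assumed to sit next to `throttle` inside the `speed` object.

```python
def check_setting(setting, machines=1):
    """Return a list of problems found in a "setting" section.

    `machines` is a hypothetical argument: the number of machines in the
    exclusive resource group, which the script itself does not record.
    """
    problems = []
    speed = setting.get("speed", {})
    concurrent = int(speed.get("concurrent", 1))

    if setting.get("executeMode") == "distribute":
        if concurrent < 8:
            problems.append("distributed mode needs a concurrency of 8 or greater")
        if machines < 2:
            problems.append("do not use distributed mode on a single machine")

    if speed.get("throttle"):
        mbps = speed.get("mbps")  # assumed location of the rate limit
        if mbps in (None, ""):
            problems.append("throttle is true but mbps is not set")
        elif float(mbps) < 1:
            problems.append("the minimum rate is 1 MB/s")
    return problems


single_node = {"executeMode": None, "speed": {"concurrent": 2, "throttle": False}}
throttled = {"executeMode": None, "speed": {"concurrent": 2, "throttle": True}}
distributed = {"executeMode": "distribute", "speed": {"concurrent": 4, "throttle": False}}

for name, setting in [
    ("single_node", single_node),
    ("throttled", throttled),
    ("distributed", distributed),
]:
    print(name, check_setting(setting, machines=3))
```

A check like this catches configuration mistakes only. It says nothing about the throughput the task reaches, which also depends on the source system and the network.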

Click Next to configure scheduling properties.
To run a batch synchronization task periodically, you must configure its scheduling properties. For more information about scheduling parameters, see Use scheduling parameters in Data Integration.
- Configure node scheduling properties: Assign scheduling parameters to the variables that you defined in the previous steps. You can assign both constants and variables.
- Configure time properties: Define how the task is periodically scheduled in the production environment. You can configure properties such as the instance generation mode, scheduling type, and scheduling cycle.
- Configure resource properties: Define the scheduling resource group used to submit the task to the Data Integration execution resource group. You can select the resource group for running the task in this section.
  Note A Data Integration batch task is submitted by a scheduling resource group to the corresponding execution resource group for Data Integration. This process incurs scheduling-related fees. For more information about the submission mechanism, see Overview of DataWorks resource groups.
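
Assigning scheduling parameters amounts to replacing `${variable_name}` placeholders in the script with values at run time. The following sketch models that substitution. It is not the DataWorks scheduler: the `substitute` helper, the `bizdate` and `source_tag` variable names, and the `partition` key are assumptions used for illustration, and the value is passed in as a constant.

```python
import re

# Matches placeholders of the form ${variable_name}.
_VARIABLE = re.compile(r"\$\{(\w+)\}")


def substitute(node, values):
    """Recursively replace ${name} placeholders in strings inside `node`.

    Placeholders without an assigned value are left unchanged.
    """
    if isinstance(node, dict):
        return {key: substitute(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [substitute(item, values) for item in node]
    if isinstance(node, str):
        return _VARIABLE.sub(
            lambda match: str(values.get(match.group(1), match.group(0))), node
        )
    return node


task = {
    "reader": {"where": "gmt_create >= '${bizdate}'"},  # data filter
    "writer": {
        "partition": "pt=${bizdate}",                   # destination partition
        "column": ["id", "${source_tag}"],              # no value assigned below
    },
}

resolved = substitute(task, {"bizdate": "20240101"})
print(resolved["reader"]["where"])
print(resolved["writer"]["partition"])
print(resolved["writer"]["column"])
```

Using the same variable in the data filter and in the destination partition is what lets each scheduled run write its slice of data to the matching time partition.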

Step 3: Commit and publish the task

If the task needs to run on a periodic schedule, you must deploy it to the production environment. For more information about task deployment, see Deploy tasks.
