DataWorks: Configure a batch synchronization task in script mode

Last Updated: Jun 30, 2026

For finer-grained control over a batch synchronization task than wizard mode provides, use the Code Editor. In the Code Editor, you write a JSON script for data synchronization and use DataWorks scheduling parameters to periodically synchronize full or incremental data from a single source table or from sharded tables to a destination table. This topic describes the configurations that are common to such tasks. The exact configurations vary by data source. For details, see the list of data sources.

Usage notes

Use the Code Editor to configure a synchronization task in the following scenarios:

  • The data source does not support configuration in wizard mode.

    Note

    The user interface indicates whether a data source supports wizard mode.

    For example, if you select HBase11xsql as the destination data source, a yellow warning message appears on the network and resource configuration page: "The current data source type does not support task editing in wizard mode. The task will be configured in script mode." In this case, click the Code Editor button in the toolbar to switch modes and configure the task.

  • Some data source configuration parameters are available only in the Code Editor.

  • You can use the Code Editor to configure some data sources that cannot be created directly in DataWorks.

Prerequisites

Step 1: Create a batch synchronization node

Data Studio (new version)

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click Data Studio.

  2. Create a workflow. For more information, see Workflows.

  3. Create a Data Integration node in one of the following ways:

    • Method 1: In the upper-right corner of the workflow list, click the icon, and then choose Create Node > Data Integration.

    • Method 2: Double-click the workflow name, and then drag the Data Integration node from the Data Integration directory to the workflow editing panel on the right.

  4. Configure the source and destination types for the node, select Single Table Batch Sync as the specific type, and click OK to complete the creation.

Legacy Data Studio

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > DataStudio. Select the desired workspace from the drop-down list and click Data Analytics.

  2. Create a workflow. For more information, see Create a workflow.

  3. Create a batch synchronization node in one of the following ways:

    • Method 1: Expand the workflow, right-click Data Integration > Create Node > Batch Synchronization.

    • Method 2: Double-click the workflow name, and then drag the Batch Synchronization node from the Data Integration directory to the workflow editing panel on the right.

  4. Follow the on-screen instructions to create a batch synchronization node.

Step 2: Configure the data source and resource group

You can switch from wizard mode to script mode at any step. To ensure a complete script configuration, we recommend the following approach:

  1. First, use the wizard to select the data source and resource group, and test network connectivity.

  2. Then switch to script mode.

The system automatically populates this information into the generated JSON script.

Alternatively, you can switch directly and then manually configure the settings in script mode: specify the data source in the JSON code and set the resource group and the required resource size for the task in the Advanced Settings panel on the right.

Step 3: Switch to script mode and import a template

Click the Convert to Script icon in the toolbar.

If the script has not been configured, you can click the Import Template icon in the toolbar to quickly import a script template by following the on-screen instructions.

Step 4: Edit the script to configure the synchronization task

The common configurations in script mode are as follows:

Note
  • The type and version fields are default values and cannot be modified.

  • You can ignore the Processor-related configuration in the script (no configuration is required).

{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"plugin_name",
            "parameter":{...},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"plugin_name",
            "parameter":{...},
            "name":"Writer",
            "category":"writer"
        },
        {
            "name":"Processor",
            "stepType":null,
            "category":"processor",
            "copies":1,
            "parameter":{...}
        }
    ],
    "setting":{
        "executeMode":null,
        "errorLimit":{
            "record":""
        },
        "speed":{
            "concurrent":2,
            "throttle":false
        },
        "timeZone":"Asia/Shanghai"
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
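
As a rough sanity check on the skeleton above, the following Python sketch builds the same top-level structure as a dictionary and verifies that every hop in order.hops points at a step name that is declared in steps. The stepType values mysql and odps, the empty parameter bodies, and the helper name check_hops are illustrative assumptions, not values taken from this topic.

```python
import json

# Hypothetical plug-in names; replace them with the stepType values
# documented for your own source and destination.
job = {
    "type": "job",
    "version": "2.0",
    "steps": [
        {"stepType": "mysql", "parameter": {}, "name": "Reader", "category": "reader"},
        {"stepType": "odps", "parameter": {}, "name": "Writer", "category": "writer"},
    ],
    "setting": {
        "errorLimit": {"record": ""},
        "speed": {"concurrent": 2, "throttle": False},
    },
    "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
}


def check_hops(job_config):
    """Return the hops whose endpoints are not declared in steps."""
    names = {step["name"] for step in job_config["steps"]}
    return [
        hop for hop in job_config["order"]["hops"]
        if hop["from"] not in names or hop["to"] not in names
    ]


# Serialize to JSON text and collect any hops that reference unknown steps.
script_text = json.dumps(job, indent=4)
dangling = check_hops(job)
```

An empty dangling list means that the from and to values in every hop match a declared step name.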

  1. Configure the basic information and column mappings for the reader and writer.

    Important

    The configurations vary for different plug-ins. The following content describes only common configurations as examples. Whether a plug-in supports a specific configuration and how the configuration is implemented depend on the plug-in. For details, see the Reader Script Demo and Writer Script Demo for each data source in the list of data sources.

    The following parameters are commonly configured:

    • Reader

      Configuration

      Description

      where (configure the synchronization scope)

      Some source types support data filtering. You can specify a condition to filter source data. Write the condition as the body of a WHERE clause, without the where keyword. During task execution, only data that meets the condition is synchronized. For more information, see Configure a filter condition.

      To implement incremental synchronization, you can combine the filter condition with scheduling parameters to make it dynamic. For example, by using gmt_create >= '${bizdate}', the task synchronizes only newly added data each day. You also need to assign a value to the variable defined here when configuring the schedule settings. For more information, see Configure scheduling parameters.

      The incremental synchronization configuration method varies for different data sources (plug-ins).

      If no data filter condition is configured, full data of the table is synchronized by default.
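
To make the substitution concrete, the sketch below mimics in Python how a where value that contains ${bizdate} reads once the variable has a value. It uses string.Template only because that class happens to share the ${name} placeholder syntax. DataWorks performs the real substitution at scheduling time, and the date 20240101 is an arbitrary example value.

```python
from string import Template

# The filter as it would appear in the reader's "where" parameter.
where_template = Template("gmt_create >= '${bizdate}'")

# Arbitrary example value in yyyymmdd format; in DataWorks the value
# comes from the scheduling parameter assigned to bizdate.
where_clause = where_template.substitute(bizdate="20240101")

print(where_clause)  # gmt_create >= '20240101'
```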

      splitPk (configure the split key for relational databases)

      Specifies the column based on which the source data to be synchronized is split. During task execution, the data is split into multiple tasks based on this column to enable concurrent and batch data reading.

      • We recommend that you use the primary key of the table as the splitPk value because primary keys are generally evenly distributed, which helps prevent data hotspots in the resulting shards.

      • Currently, splitPk supports only integer-based splitting. Other types such as strings, floating-point numbers, and dates are not supported. If you specify an unsupported type, the splitPk feature is ignored and the data is synchronized through a single channel.

      • If splitPk is omitted or left empty, the data is synchronized through a single channel.

      • Not all plug-ins support the split key configuration, and the preceding information is provided only as an example. For details, see the documentation for the specific plug-in and Supported data sources and sync solutions.
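
The effect of an integer split key can be pictured with a short Python sketch that cuts a key range into contiguous shards, one per channel. This only illustrates range-based splitting, under the assumption that the minimum and maximum key values are known. It is not the algorithm that any specific reader plug-in uses, and split_ranges is a hypothetical helper.

```python
def split_ranges(min_pk, max_pk, shards):
    """Split the inclusive integer range [min_pk, max_pk] into contiguous shards."""
    if shards < 1 or max_pk < min_pk:
        raise ValueError("need shards >= 1 and max_pk >= min_pk")
    total = max_pk - min_pk + 1
    shards = min(shards, total)          # never create empty shards
    base, extra = divmod(total, shards)  # the first `extra` shards get one more key
    ranges, start = [], min_pk
    for i in range(shards):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each tuple could back one range query on the split column (here called id).
for lo, hi in split_ranges(1, 10, 3):
    print(f"id >= {lo} AND id <= {hi}")
```

With evenly distributed keys, each range covers a similar number of rows, which is why a primary key is the recommended split column.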

      column (define source columns)

      Define the columns to be synchronized from the source in the column array. You can write constants, variables, and functions as custom columns to the destination. For example, '123', '${variable_name}', and 'now()'.

    • Writer

      Configuration

      Description

      preSql & postSql (configure pre-sync and post-sync SQL statements)

      Some data sources support executing SQL statements on the destination before data is written to it (preSql) and after data is written to it (postSql).

      Example: MySQL Writer supports preSql and postSql, which let you run MySQL statements before or after data is written to MySQL. For example, to clear old data from the table before synchronization, set the Pre-import Preparation Statement (preSql) of MySQL Writer to truncate table tablename.
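
The following Python sketch assembles a writer step that uses preSql and prints it in the JSON shape used in script mode. The data source name my_mysql, the table orders_copy, the column names, and the insert value for writeMode are placeholders invented for illustration. Writing preSql and postSql as lists of statements is also an assumption here. Check the Writer Script Demo for your data source to confirm the parameter names and value types that it supports.

```python
import json

target_table = "orders_copy"  # hypothetical destination table

writer_step = {
    "stepType": "mysql",
    "parameter": {
        "datasource": "my_mysql",        # hypothetical data source name
        "table": target_table,
        "column": ["id", "gmt_create"],  # hypothetical columns
        "writeMode": "insert",           # assumed write mode value
        # Runs on the destination before any row is written.
        "preSql": [f"truncate table {target_table}"],
        # Runs after the write completes; left empty in this sketch.
        "postSql": [],
    },
    "name": "Writer",
    "category": "writer",
}

print(json.dumps(writer_step, indent=4))
```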

      writeMode (define the write mode for conflicts)

      Specifies the write mode when conflicts occur, such as path or primary key conflicts. This configuration varies depending on the characteristics of the data source and the capabilities of the writer plug-in. Refer to the specific writer plug-in documentation for configuration.

  2. Configure channel control.

    You can configure efficiency settings in the setting section, including concurrency, synchronization speed, and dirty data handling settings.

    Parameter

    Description

    executeMode (distributed processing)

    Controls whether to enable distributed mode for the current task.

    • distribute: Enables distributed processing. In distributed execution mode, your task is split into slices that are distributed to multiple execution nodes for concurrent execution. This allows the synchronization speed to scale horizontally with the cluster size, breaking through the bottleneck of single-node execution.

    • null: Disables distributed processing. The configured concurrency applies only to process-level concurrency on a single node, and multi-node computing cannot be used.

    Important
    • If you use an exclusive resource group for Data Integration that has only one node, we do not recommend distributed mode because it cannot leverage multi-node resources.

    • If a single node already meets speed requirements, we recommend single-node mode to simplify the task execution mode.

    • The concurrency must be 8 or higher to enable distributed processing.

    • Only some data sources support distributed mode for task execution. For details, refer to the specific plug-in documentation.

    • Enabling distributed processing consumes more resources. If an out-of-memory (OOM) error occurs during execution, try disabling this option.

    concurrent (maximum expected concurrency)

    Defines the maximum number of threads for parallel reading from the source or parallel writing to the destination for the current task.

    Note

    Due to factors such as resource specifications, the actual concurrency during execution may be less than or equal to the configured value. The debug resource group is billed based on the actual concurrency. For more information, see Billing.

    throttle (synchronization speed)

    Controls the synchronization speed.

    • true: Enables throttling. This protects the source database by preventing excessive extraction speeds that could overload the source. The minimum throttling rate is 1 MB/s.

      Note

      When throttle is set to true, you also need to set the mbps (synchronization speed) parameter.

    • false: Disables throttling. Without throttling, the task delivers the maximum transfer performance that the hardware environment allows within the configured concurrency limit.

    Note

    The throughput metric is a Data Integration internal measurement and does not represent the actual NIC traffic. Typically, the NIC traffic is 1 to 2 times the channel traffic. The actual traffic inflation depends on the serialization mechanism of the specific data storage system.

    errorLimit (error count control)

    Defines the dirty data threshold and its impact on the task.

    Important

    Excessive dirty data can affect the overall synchronization speed of the task.

    • If not configured, dirty data is allowed by default, which means dirty data does not affect task execution.

    • If set to 0, no dirty data is allowed. The task fails if dirty data is generated during synchronization.

    • If dirty data is allowed and a threshold is set:

      • If the dirty data is within the threshold, the synchronization task ignores the dirty data (does not write it to the destination) and continues normally.

      • If the dirty data exceeds the threshold, the synchronization task fails.

    Note

    Dirty data criteria: Dirty data is data that has no business significance, has an invalid format, or encounters issues during synchronization. If an exception occurs while writing a single record to the destination data source, that record is considered dirty data. Any data that fails to be written is classified as dirty data.

    For example, writing VARCHAR data from the source to an INT column in the destination results in dirty data due to an invalid type conversion. You can control whether dirty data is allowed during synchronization task configuration, and set a maximum error count so that the task fails when the dirty data count exceeds the specified limit.
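
The three errorLimit cases described above can be summarized in a small Python sketch. The function name task_outcome is hypothetical, and treating a dirty record count equal to the threshold as still within the threshold is an assumption of this sketch rather than a documented guarantee.

```python
def task_outcome(dirty_records, record_limit=None):
    """Return 'success' or 'failed' for a given dirty-record count.

    record_limit=None models an unconfigured errorLimit (dirty data allowed),
    0 allows no dirty data, and a positive value is the tolerated threshold.
    """
    if record_limit is None:
        return "success"
    return "success" if dirty_records <= record_limit else "failed"


print(task_outcome(5))        # success: no limit configured
print(task_outcome(1, 0))     # failed: no dirty data allowed
print(task_outcome(11, 10))   # failed: count exceeds the threshold
```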

    timeZone (time zone setting)

    Specifies the time zone for the synchronization task. This setting takes effect when time-type column conversion is involved on the source or destination. After this parameter is configured, Data Integration reads and writes time columns based on the specified time zone.

    Configuration example: "timeZone":"Asia/Shanghai".

    • This parameter can be configured only in the setting section in script mode. The destination in wizard mode (codeless UI) does not support time zone settings.

    • The time zone value uses the standard IANA time zone format, such as Asia/Shanghai and America/New_York.

    • If not configured, Data Integration uses the system default time zone.

    Note

    The overall synchronization speed is affected not only by the configurations described above but also by the source data source performance, network environment, and other factors. For more information about synchronization speed and tuning, see Tune batch synchronization tasks.
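
Two of the rules in this step lend themselves to a quick pre-submission check: distributed mode needs a concurrency of 8 or higher, and throttling needs an mbps value of at least 1 MB/s. The Python sketch below encodes both. The helper check_setting is hypothetical, and placing mbps next to throttle inside speed is an assumption based on the layout of the setting section shown earlier.

```python
def check_setting(setting):
    """Return a list of human-readable problems found in a 'setting' section."""
    problems = []
    speed = setting.get("speed", {})
    concurrent = speed.get("concurrent", 1)

    # Distributed mode requires a concurrency of 8 or higher.
    if setting.get("executeMode") == "distribute" and concurrent < 8:
        problems.append("executeMode=distribute needs concurrent >= 8")

    # Throttling must be paired with an mbps value of at least 1 MB/s.
    if speed.get("throttle"):
        mbps = speed.get("mbps")
        if mbps is None:
            problems.append("throttle=true needs mbps")
        elif float(mbps) < 1:
            problems.append("mbps must be at least 1")
    return problems


ok = {"executeMode": None, "speed": {"concurrent": 2, "throttle": False}}
bad = {"executeMode": "distribute", "speed": {"concurrent": 4, "throttle": True}}

print(check_setting(ok))   # []
print(check_setting(bad))
```

The second call reports both problems: the concurrency is below 8, and throttling is enabled without an mbps value.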

Step 5: Configure schedule settings

For a single-table batch synchronization task with periodic scheduling, you must configure the properties for automatic scheduling. Go to the editing page of the node, click Scheduling Settings on the right side, and configure the schedule settings for the node.

You must configure scheduling parameters, scheduling policies, scheduling time, and dependencies for the synchronization task. The configuration method is the same as that for other data development nodes and is not described here.

For more information about using scheduling parameters, see Typical scenarios of scheduling parameters in Data Integration.

Step 6: Submit and deploy the task

  • Configure run parameters.

    On the right side of the single-table batch synchronization task configuration page, click Run Configuration and configure the following parameters for test runs.

    • Resource Group: Select a resource group that has network connectivity with the data sources.

    • Script Parameters: Assign values to the placeholder parameters in the data synchronization task. For example, if the Data Integration task uses the ${bizdate} parameter, configure a date parameter in the yyyymmdd format.

  • Run the task.

    Click the Run button on the toolbar to run and debug the task in Data Studio. You can then create a node of the corresponding destination table type to query the destination table data and verify whether the synchronized data meets expectations.

  • Deploy the task.

    After the task passes the test run, if the task needs to run periodically, click the button at the top of the node editing page to deploy the task to the production environment. For more information about task deployment, see Deploy tasks.

Next step

After the task is deployed to the production environment, you can go to the Operation Center in the production environment to view the scheduled task. For more information about running and managing Data Integration tasks, monitoring task status, and managing resource groups, see Data Integration task O&M.
