Configure an offline sync task between data sources in the code editor - DataWorks

For finer control over offline task configuration, you can use the code editor. In the code editor, you write a JSON script for data synchronization and use DataWorks scheduling parameters to periodically sync full or incremental data from a single source table or from sharded tables to a destination table. This topic describes the common configurations for an offline sync task in the code editor. The configurations vary based on the data source. For more information, see the configuration details for each data source in the Data source list.

Scenarios

You can use the code editor to configure a sync task in the following scenarios:

The data source does not support configuration in the codeless UI.
Note
The UI indicates whether a data source supports the codeless UI.
Some configuration parameters for a data source are available only in the code editor.
The data source cannot be created in the DataWorks UI.

Preparations

Configure the required source and destination data sources. Before you set up a Data Integration sync task, you must add the source and destination databases on the Data Source page of the DataWorks console. For more information, see Data source list.
Note
- For more information about the data sources supported by offline sync and their configurations, see Supported data sources and sync solutions.
- For more information about the features of data sources, see Data Source Configuration.
Purchase a resource group with a suitable specification and attach it to the workspace. For more information, see Use a Serverless resource group for Data Integration and Use an exclusive resource group for Data Integration.
Establish a network connection between the resource group and the data source. For more information, see Configure network connections.

Step 1: Create an offline sync node

Data Studio (new version)

Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, choose Data Development & O&M > Data Development. Select the desired workspace from the drop-down list and click Go to Data Studio.
Create a workflow. For more information, see Orchestrate workflows.
Create a batch synchronization node. You can use one of the following methods:
- Method 1: Click the icon in the upper-right corner of the workflow list and choose Create Node > Data Integration > Batch Synchronization.
- Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow editor on the right.
Configure the basic information, source, and destination for the node. Then, click OK.

DataStudio (legacy version)

Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, choose Data Development & O&M > Data Development. Select the desired workspace from the drop-down list and click Go to Data Development.
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization node. You can use one of the following methods:
- Method 1: Expand the workflow, right-click Data Integration, and select Create Node > Batch Synchronization.
- Method 2: Double-click the workflow name and drag the Batch Synchronization node from the Data Integration directory to the workflow editor on the right.
Follow the on-screen prompts to finish creating the batch synchronization node.

Step 2: Configure the data source and resource group

You can switch from the codeless UI to the code editor at any step. However, to ensure that the script is fully configured, we recommend that you perform the following steps:

First, select the data source and resource group in the codeless UI and test the network connectivity.
Then, switch to the code editor.

The system automatically populates the generated JSON script with this information.

Alternatively, you can switch to the code editor directly and then manually configure the settings. To do this, specify the data source in the JSON code, and set the resource group and required resources for the task in the Advanced Settings panel on the right.

Note

If you have created a resource group but it is not displayed, check whether the resource group is attached to the workspace. For more information, see Use a Serverless resource group and Use an exclusive resource group for Data Integration.
For more information about the recommended resource quotas, see Resource group performance metrics - Data Integration.

Step 3: Switch to the code editor and import a template

In the toolbar, click the Code Editor icon.

If the script is not yet configured, click the Import Template icon in the toolbar and follow the on-screen instructions to import a script template.

Step 4: Edit the script to configure the sync task

The following section describes the general configurations in the code editor:

Note

The `type` and `version` fields have default values and cannot be changed.
The script may contain processor-related configurations. You do not need to configure them and can ignore them.
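
As a rough orientation, the overall shape of such a script can be sketched in Python as a dictionary that is serialized to JSON. This is only an illustrative sketch: this topic names the `type` and `version` fields and the `setting` section, while the `steps`, `stepType`, `parameter`, `category`, and `order` keys, the values `job` and `2.0`, and the data source names are assumptions. Use the template that the code editor imports as the authoritative layout.

```python
import json


def build_script(reader_type, reader_params, writer_type, writer_params, setting):
    """Assemble a sync script skeleton (hypothetical layout, for illustration)."""
    return {
        "type": "job",       # assumed default value; leave as generated
        "version": "2.0",    # assumed default value; leave as generated
        "steps": [           # assumed key: one reader step and one writer step
            {"stepType": reader_type, "parameter": reader_params,
             "name": "Reader", "category": "reader"},
            {"stepType": writer_type, "parameter": writer_params,
             "name": "Writer", "category": "writer"},
        ],
        "setting": setting,  # channel control: concurrency, rate, dirty data
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},  # assumed key
    }


# Hypothetical plug-in types and data source names.
script = build_script(
    "mysql", {"datasource": "my_source"},
    "odps", {"datasource": "my_dest"},
    setting={},
)
print(json.dumps(script, indent=4))
```

Building the script as a dictionary and serializing it keeps the JSON syntactically valid, which is the main thing a hand-edited script tends to get wrong.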


Configure the basic information and field mappings for the reader and writer.

Important

The configurations vary based on the plug-in. The following content provides examples of common configurations. To check whether a plug-in supports a specific configuration and how to configure it, see the documentation for that plug-in. For more information, see the Reader Script Demo and Writer Script Demo sections for each data source in the Data source list.

You can use configuration parameters to perform the following operations:

Reader

Operation	Description
where (Configure the sync scope)	Some source types support data filtering. You can specify a condition (a `WHERE` clause without the `where` keyword) to filter source data. At runtime, the task synchronizes only the data that meets the condition. If you do not configure a filter condition, the task synchronizes all data from the table by default. To perform incremental synchronization, combine the filter condition with scheduling parameters to make it dynamic. For example, with `gmt_create >= '${bizdate}'`, the task synchronizes only the new data from the current day each time it runs. You must also assign a value to each variable used in the condition when you configure the scheduling properties. For more information, see Supported formats of scheduling parameters. The method for configuring incremental synchronization varies by data source (plug-in). For more information, see Scenario: Configure a batch synchronization task for incremental data.
splitPk (Configure a shard key for a relational database)	Specifies the source field by which the data is split. During task execution, the data is split into multiple tasks based on this field so that it can be read concurrently in batches. We recommend that you use the primary key of the table for `splitPk` because primary key values are usually evenly distributed, which helps prevent data hot spots in the resulting shards. Currently, `splitPk` supports only integer fields. It does not support strings, floating-point numbers, dates, or other data types. If you specify a field of an unsupported type, `splitPk` is ignored and a single channel is used for synchronization. If you do not specify `splitPk`, or if its value is empty, the data is also synced through a single channel. Not all plug-ins support a shard key for task splitting, so the preceding information is for reference only. For more information, see the documentation for the specific plug-in and Supported data sources and sync solutions.
column (Define source fields)	In the `column` array, define the source fields to be synced. You can use constants, variables, and functions as custom fields to write to the destination. Examples include '123', '${variable_name}', and 'now()'.
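
The reader operations in the preceding table can be sketched as a small Python helper that assembles the reader parameters. The `where`, `splitPk`, and `column` keys come from this topic. The `datasource` and `table` keys, the helper name, the sample table, and the exact quoting of constant columns are assumptions that depend on the plug-in.

```python
def build_reader_parameter(datasource, table, columns, where=None, split_pk=None):
    """Assemble reader parameters (hypothetical layout, for illustration)."""
    # The filter is a WHERE clause written without the `where` keyword.
    if where and where.strip().lower().startswith("where "):
        raise ValueError("write the condition without the `where` keyword")
    parameter = {
        "datasource": datasource,   # assumed key
        "table": table,             # assumed key
        "column": list(columns),    # source fields, constants, variables, functions
    }
    if where:
        parameter["where"] = where  # omitted: the whole table is synchronized
    if split_pk:
        parameter["splitPk"] = split_pk  # omitted or empty: single channel
    return parameter


reader = build_reader_parameter(
    datasource="my_mysql_source",   # hypothetical data source name
    table="orders",                 # hypothetical table
    # Two real fields plus a constant, a variable, and a function as custom
    # fields. The quoting shown here is an assumption; check the plug-in docs.
    columns=["id", "gmt_create", "'123'", "'${variable_name}'", "now()"],
    where="gmt_create >= '${bizdate}'",  # made dynamic by a scheduling parameter
    split_pk="id",                       # integer primary key
)
```

The `${bizdate}` placeholder is left untouched in the script on purpose: it is resolved at run time from the scheduling parameters, not by the script author.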

Writer

Operation	Description
preSql and postSql (Configure statements to execute before and after synchronization)	Some data sources support the execution of SQL statements on the destination before (pre-sync) and after (post-sync) data is written. For example, in the Pre-import Preparation Statement (preSql) configuration item for MySQL Writer, you can configure the truncate table tablename command to clear existing data from the table before the synchronization task starts.
writeMode (Define the write mode for handling conflicts)	This parameter defines how to write data to the destination when conflicts, such as path or primary key conflicts, occur. This configuration varies based on the data source and the writer plug-in. You must configure this parameter based on the requirements of the specific writer plug-in.
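
A matching sketch for the writer side is shown below. The `preSql`, `postSql`, and `writeMode` keys come from this topic. The helper name, the `datasource`, `table`, and `column` keys, the use of lists for the SQL statements, and the `insert` write mode are assumptions. The accepted `writeMode` values differ between writer plug-ins.

```python
def build_writer_parameter(datasource, table, columns, write_mode,
                           pre_sql=(), post_sql=()):
    """Assemble writer parameters (hypothetical layout, for illustration)."""
    return {
        "datasource": datasource,    # assumed key
        "table": table,              # assumed key
        "column": list(columns),     # destination fields, in mapping order
        "writeMode": write_mode,     # conflict handling; plug-in specific
        "preSql": list(pre_sql),     # statements run before data is written
        "postSql": list(post_sql),   # statements run after data is written
    }


writer = build_writer_parameter(
    datasource="my_mysql_dest",      # hypothetical data source name
    table="orders_copy",             # hypothetical table
    columns=["id", "gmt_create"],
    write_mode="insert",             # assumed value; check the writer plug-in
    # Clear existing data before the synchronization task starts.
    pre_sql=["truncate table orders_copy"],
)
```

Because a `truncate` statement removes all existing rows before the write starts, it suits full synchronization better than incremental runs.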

Configure channel control.

You can configure performance settings, such as concurrency, sync rate, and dirty data handling, in the setting section.

Parameter	Description
executeMode (Enable distributed processing capability)	Controls whether distributed mode is enabled for the current task. `distribute`: Enables distributed processing. The task is split into shards that run concurrently on multiple execution nodes, so the sync speed scales horizontally with the size of the execution cluster and is not limited by a single node. `null`: Disables distributed processing. The configured concurrency applies to processes on a single machine and cannot use the computing power of multiple machines. Important: If you use an exclusive resource group for Data Integration that has only one machine, we do not recommend distributed mode because it cannot use multi-machine resources. If a single machine meets your speed requirements, we recommend single-node mode to simplify task execution. Distributed processing requires a concurrency of 8 or more. Only some data sources support distributed execution mode. For more information, see the documentation for the specific plug-in. Distributed processing consumes more resources. If an out-of-memory (OOM) error occurs at runtime, try disabling it.
concurrent (Expected Maximum Concurrency)	Defines the maximum number of threads that the current task uses to read from the source or write to the destination in parallel. Note: Due to factors such as resource specifications, the actual concurrency at runtime may be lower than the configured value. The fee for the test resource group is based on the actual concurrency. For more information, see Performance metrics.
throttle (Synchronization Rate)	Controls the synchronization rate. `true`: Enables throttling. This protects the source database by preventing an excessively high extraction speed from putting too much pressure on it. The minimum throttling rate is 1 MB/s. Note When `throttle` is set to `true`, you must also set the mbps (sync rate) parameter. `false`: Disables throttling. Without throttling, the task uses the maximum transfer performance available in the current hardware environment, within the limits of the configured concurrency. Note The traffic measure is a metric of Data Integration itself and does not represent the actual network interface card (NIC) traffic. Typically, the NIC traffic is 1 to 2 times the channel traffic. The actual traffic inflation depends on the data storage system's transfer serialization.
errorLimit (Policy for Dirty Data Records)	Defines the threshold for dirty data and its impact on the task. Important: An excessive amount of dirty data can affect the overall sync speed of the task. If not configured, dirty data is allowed by default, and the task continues to run even if dirty data is generated. If set to 0, no dirty data is allowed, and the task fails as soon as dirty data is generated during synchronization. If you allow dirty data and set a threshold, the sync task ignores the dirty data (it is not written to the destination) and runs normally while the amount of dirty data is within the threshold, and fails when the amount exceeds the threshold. Note: Dirty data is data that is meaningless to the business, has an invalid format, or causes problems during synchronization. A record is considered dirty data if an exception occurs when it is being written to the destination data source, so any data that fails to be written is classified as dirty data. For example, if data of the VARCHAR type from the source is written to a destination column of the INT type, the write fails because the conversion is invalid, and the record becomes dirty data.
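
The rules stated in the preceding table can be expressed as two small checks. This is a sketch under assumptions rather than the product's own validation logic: the function names are hypothetical, and the nesting of the sample `setting` object (`speed`, `errorLimit.record`) and its values are assumed. Only the `executeMode`, `concurrent`, `throttle`, `mbps`, and `errorLimit` names and the rules themselves come from this topic.

```python
def validate_setting(concurrent, throttle, mbps=None, execute_mode=None):
    """Return the rule violations found for a set of channel-control values."""
    problems = []
    if throttle and mbps is None:
        problems.append("throttle is true, so mbps must also be set")
    if throttle and mbps is not None and mbps < 1:
        problems.append("the minimum throttling rate is 1 MB/s")
    if execute_mode == "distribute" and concurrent < 8:
        problems.append("distributed mode needs a concurrency of 8 or more")
    return problems


def task_fails(dirty_records, error_limit=None):
    """Apply the dirty data policy: None means dirty data is always allowed."""
    if error_limit is None:
        return False                    # not configured: task keeps running
    return dirty_records > error_limit  # 0 means any dirty record fails the task


# Assumed nesting of the `setting` section; compare with the imported template.
setting = {
    "executeMode": None,                # single-node mode
    "errorLimit": {"record": 0},        # no dirty data allowed
    "speed": {"concurrent": 2, "throttle": True, "mbps": 12},
}

print(validate_setting(concurrent=2, throttle=True, mbps=12))
print(validate_setting(concurrent=4, throttle=True, execute_mode="distribute"))
```

The first call returns an empty list. The second returns two problems, because throttling is enabled without `mbps` and distributed mode is requested with a concurrency below 8.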

Note

In addition to the preceding configurations, the overall sync speed is also affected by factors such as the performance of the source data source and the network environment. For more information about how to optimize the sync speed, see Optimize an offline sync task.

Step 5: Configure scheduling properties

For a periodically scheduled batch synchronization task, you need to configure its scheduling properties. On the node's edit page, click Scheduling on the right to configure them.

You must configure scheduling parameters, a scheduling policy, a scheduling time, and scheduling dependencies for the sync task. The configuration process is the same as for other Data Studio nodes and is not described in this topic.

For information about scheduling configuration in the Data Studio (new version), see Node scheduling (new version).
For information about scheduling configuration in the DataStudio (legacy version), see Node scheduling configuration (previous version).

For more information about how to use scheduling parameters, see Common scenarios of scheduling parameters in Data Integration.

Step 6: Submit and publish the task

Configure test parameters.

On the batch synchronization task configuration page, you can click Debugging Configurations on the right and configure the following parameters to run a test.

Configuration item	Description
Resource Group	Select a resource group that is connected to the data source.
Script Parameters	Assign values to placeholder parameters in the data synchronization task. For example, if the task is configured with the `${bizdate}` parameter, you need to configure a date parameter in the `yyyymmdd` format.
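
To preview what a placeholder resolves to before a test run, the substitution can be imitated locally. The sketch below uses Python's `string.Template`, which happens to share the `${name}` placeholder syntax. It only illustrates the idea and is not how DataWorks resolves parameters; the function names and the sample date are hypothetical.

```python
from datetime import datetime
from string import Template


def check_yyyymmdd(value):
    """Return value if it is a valid date in yyyymmdd format, else raise."""
    if len(value) != 8 or not value.isdigit():
        raise ValueError(f"expected 8 digits in yyyymmdd format, got {value!r}")
    datetime.strptime(value, "%Y%m%d")  # rejects impossible dates such as 20241340
    return value


def preview(text, **params):
    """Fill ${name} placeholders locally to preview the resolved text."""
    return Template(text).substitute(**params)


condition = preview("gmt_create >= '${bizdate}'",
                    bizdate=check_yyyymmdd("20240101"))
print(condition)  # gmt_create >= '20240101'
```

Validating the value first catches a common slip, such as entering `2024-01-01` where the filter expects `20240101`.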

Run the task.
Click the Run icon in the toolbar to run and test the task in Data Studio. After the task is run, you can create a node of the destination table type to query the destination table data and check whether the synchronized data meets your expectations.
Publish the task.
After the task runs successfully, if it needs to be scheduled periodically, click the icon in the toolbar of the node configuration page to publish the task to the production environment. For more information about how to publish tasks, see Publish tasks.

What to do next

After the task is published to the production environment, you can go to Operation Center in the production environment to view the scheduled task. For more information about how to run and manage batch synchronization tasks, monitor their status, and perform O&M on resource groups, see O&M for batch synchronization tasks.
