DataWorks: Configure a batch synchronization task by using the code editor

Last Updated: Dec 15, 2023

If you want to make finer-grained configurations for your batch synchronization task, you can configure the task by using the code editor. You can write a JSON script for data synchronization and configure the required scheduling parameters to periodically synchronize full or incremental data from a single source table or tables in sharded source databases to a single destination table. This topic describes how to configure a batch synchronization task by using the code editor. The configurations that are required vary based on the data source type. For more information, see Supported data source types, Readers, and Writers.

Background information

The batch synchronization feature of Data Integration provides Readers and Writers for you to read data from and write data to data sources. You can configure batch synchronization tasks for different types of data sources by using the codeless UI or code editor to synchronize data from a single source table or tables in sharded source databases to a single destination table. For more information, see Overview of the batch synchronization feature.

Use scenarios

You can configure a synchronization task by using the code editor in one of the following scenarios:

  • The data source that you want to use does not support the codeless UI.

    Note

    You can check whether a data source supports the codeless UI in the DataWorks console.

  • Parameters of specific types of data sources support only the code editor.

  • Your synchronization task uses a data source that cannot be added in DataWorks.

Prerequisites

  1. The required data sources are configured. Before you configure a data synchronization task, you must configure the data source from which you want to read data and the data source to which you want to write data. This way, when you configure a data synchronization task, you can select the data sources. For information about the data source types that are supported by batch synchronization, see Supported data source types, Readers, and Writers.

    Note

    For information about the items that you need to understand before you configure a data source, see Overview.

  2. An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.

  3. Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.

Go to the DataStudio page

Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

Procedure

  1. Step 1: Create a batch synchronization task

  2. Step 2: Establish network connections between the exclusive resource group for Data Integration and the data sources

  3. Step 3: Switch from the codeless UI to the code editor and apply a script template

  4. Step 4: Edit the script of the batch synchronization task to configure the task

  5. Step 5: Configure scheduling properties for the batch synchronization task

  6. Step 6: Commit and deploy the batch synchronization task

Step 1: Create a batch synchronization task

  1. Create a workflow. For more information, see Create a workflow.

  2. Create a batch synchronization task.

    You can use one of the following methods to create a batch synchronization task:

    • Method 1: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization task and click DataStudio in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click its name. Right-click Data Integration and choose Create Node > Offline synchronization.

    • Method 2: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization task and click DataStudio in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and double-click its name. In the Data Integration section of the workflow editing tab that appears, drag Offline synchronization to the canvas on the right.

  3. In the Create Node dialog box, configure the parameters to create a batch synchronization task.

Step 2: Establish network connections between the exclusive resource group for Data Integration and the data sources

Select the source, destination, and exclusive resource group for Data Integration, and establish network connections between the resource group and data sources.

Step 3: Switch from the codeless UI to the code editor and apply a script template

Click the Conversion script icon in the top toolbar of the configuration tab to switch from the codeless UI to the code editor.

If no script is configured, you can click the Import Template icon in the top toolbar of the configuration tab to apply a script template.

Step 4: Edit the script of the batch synchronization task to configure the task

The following figure shows the common settings that can be configured for a batch synchronization task in the code editor.

Note
  • DataWorks provides default settings for the type and version parameters. You cannot change the default settings.

  • You can ignore the processor-related settings in the code. The settings are not required.

[Figure: script in the code editor]
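As a rough orientation, the overall shape of such a script can be sketched as a Python dictionary that is serialized to JSON. The key names used below (`steps`, `stepType`, `category`, `setting`, `order`) and the values shown for `type` and `version` follow commonly seen script templates, but they are assumptions in this sketch. The template that DataWorks generates for your data source types is authoritative.

```python
import json


def build_script_skeleton(reader_type, writer_type):
    """Return a minimal, illustrative skeleton of a batch synchronization script."""
    return {
        # Assumed default values; DataWorks sets these and they cannot be changed.
        "type": "job",
        "version": "2.0",
        "steps": [
            {
                "stepType": reader_type,
                "parameter": {},  # Reader-specific settings go here
                "name": "Reader",
                "category": "reader",
            },
            {
                "stepType": writer_type,
                "parameter": {},  # Writer-specific settings go here
                "name": "Writer",
                "category": "writer",
            },
        ],
        "setting": {},  # channel control policies go here
        # Data flows from the step named "Reader" to the step named "Writer".
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


script = build_script_skeleton("mysql", "odps")
print(json.dumps(script, indent=4))
```

The two `parameter` objects and the `setting` object are left empty on purpose. Their contents depend on the Reader, the Writer, and the channel control policies that you choose.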

  1. Configure basic information for the Reader and Writer and configure mappings between source fields and destination fields.

    Important

    The operations that you can perform when you configure a batch synchronization task vary based on the Reader or Writer type. The following tables describe the common operations that you can perform when you configure a batch synchronization task. For information about the operations supported by a Reader or Writer and how to perform the operations, see the topic for the related Reader or Writer. For more information, see Supported data source types, Readers, and Writers.

    The following tables describe the operations that you can perform when you configure a batch synchronization task.

    • Operations related to the source

      Operation

      Description

      Specify a filter condition

      DataWorks allows you to specify a filter condition used to implement incremental synchronization when you configure specific types of Readers. For example, when you configure MySQL Reader, you can specify such a filter condition. You can use a filter condition together with scheduling parameters to implement incremental synchronization. For information about how to configure a batch synchronization task to synchronize incremental data, see Configure a batch synchronization task to synchronize only incremental data.

      Note
      • Whether a Reader supports incremental synchronization, and how incremental synchronization is implemented, vary based on the Reader type. For more information, see the topic for the related Reader.

      • If you do not specify a filter condition when you configure a Reader that supports parameters related to incremental synchronization, full data is synchronized by default.

      • When you configure scheduling properties for a batch synchronization task, you can assign values to the variables that you specified in the filter condition. You can configure scheduling parameters for a batch synchronization task to enable full or incremental data in the source to be written to the related time-based partitions in the destination table. For more information about scheduling parameters, see Supported formats of scheduling parameters.

      • The syntax of the filter condition that is used to implement incremental synchronization is almost the same as the syntax supported by a database. During data synchronization, the batch synchronization task uses a complete SQL statement that is obtained based on the filter condition to extract data from the source.

      Specify a shard key used to shard data in a relational database

      A shard key specifies a field based on which source data is sharded. After you specify a shard key, source data is sharded and distributed to multiple shards. This way, the batch synchronization task can run parallel threads to read the data in batches.

      Note
      • We recommend that you specify the primary key column of the source table as the shard key. This way, data is evenly distributed across shards based on the primary key instead of being concentrated in a few shards.

      • Only fields of an integer data type can be used as a shard key. If you specify a field of an unsupported data type, the batch synchronization task ignores the shard key and uses a single thread to read data.

      • If no shard key is specified, the batch synchronization task uses a single thread to read data.

      • Whether a Reader supports a shard key varies based on the Reader type. The instructions in this topic are for reference only. To check whether a Reader supports a shard key, see the topic for that Reader. For more information, see Supported data source types, Readers, and Writers.

      Add fields to a source table and assign values to the fields

      You can add fields to a source table and assign values to the fields. When the batch synchronization task is run, the added fields are synchronized to the related destination table. The fields can be constants and variables that are enclosed in single quotation marks ('), such as '123' and '${Variable name}'. If you add variables to a source table as fields, you can assign values to the variables when you configure scheduling properties for the batch synchronization task. For information about scheduling parameters, see Supported formats of scheduling parameters.

      Edit fields in a source table

      You can use a function that is supported by the source to process fields in a source table. For example, you can use the Max(id) function to synchronize only the largest ID value in the source table.

      Note

      Functions are not supported if you configure a batch synchronization task that uses MaxCompute Reader.
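The source-side operations described above can be sketched for MySQL Reader as follows. This is a minimal illustration, not a definitive configuration: the parameter names `connection`, `column`, `where`, and `splitPk` are assumptions based on commonly seen MySQL Reader templates, and the data source name, table name, column names, and the `bizdate` variable are hypothetical. Check the topic for your Reader for the parameters that it actually supports.

```python
import json


def build_mysql_reader(datasource, table, columns, where="", split_pk=""):
    """Return an illustrative Reader step for a MySQL source."""
    return {
        "stepType": "mysql",
        "parameter": {
            "connection": [{"datasource": datasource, "table": [table]}],
            "column": list(columns),
            "where": where,       # empty: full data is synchronized
            "splitPk": split_pk,  # empty: a single thread reads the data
        },
        "name": "Reader",
        "category": "reader",
    }


reader = build_mysql_reader(
    datasource="my_mysql_source",  # hypothetical data source name
    table="orders",                # hypothetical table
    columns=[
        "id",
        "amount",
        "'123'",          # constant added as a field
        "'${bizdate}'",   # variable added as a field; assigned in scheduling properties
        "upper(status)",  # function supported by the source applied to a field
    ],
    where="gmt_modified >= '${bizdate}'",  # filter condition for incremental synchronization
    split_pk="id",                         # integer primary key used as the shard key
)
print(json.dumps(reader, indent=4))
```

The filter condition and the added variable field both reference `${bizdate}`, so a single assignment in the scheduling properties would cover both.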

    • Operations related to the destination

      Operation

      Description

      Configure SQL statements that you want to execute before and after data synchronization

      DataWorks allows you to execute SQL statements before and after data is written to specific types of destinations.

      For example, when you configure a batch synchronization task that uses MySQL Writer, you can configure the SQL statement truncate table tablename as a statement to be executed before data is written to the destination. This statement is used to delete existing data in a specified table. You can also configure an SQL statement as a statement to be executed after data is written to the destination.

      Specify the write mode that is used when a conflict occurs

      You can specify the write mode in which you want to write data to the destination when a conflict, such as a path conflict or primary key conflict, occurs. The configuration varies based on the attributes of destinations and the support of Writers. To configure this item, refer to the topic for the related Writer.
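The destination-side operations described above can be sketched for MySQL Writer in the same way. The parameter names `preSql`, `postSql`, and `writeMode` are assumptions based on commonly seen MySQL Writer templates, and the data source and table names are hypothetical. The valid values of the write mode depend on the Writer, so the sketch passes the value through without validating it.

```python
import json


def build_mysql_writer(datasource, table, columns, write_mode,
                       pre_sql=(), post_sql=()):
    """Return an illustrative Writer step for a MySQL destination."""
    return {
        "stepType": "mysql",
        "parameter": {
            "datasource": datasource,
            "table": table,
            "column": list(columns),
            "writeMode": write_mode,      # behavior when a conflict occurs
            "preSql": list(pre_sql),      # executed before data is written
            "postSql": list(post_sql),    # executed after data is written
        },
        "name": "Writer",
        "category": "writer",
    }


writer = build_mysql_writer(
    datasource="my_mysql_destination",  # hypothetical data source name
    table="orders_copy",                # hypothetical table
    columns=["id", "amount"],
    write_mode="insert",                # assumed value; see the topic for your Writer
    pre_sql=["truncate table orders_copy"],  # delete existing data first
)
print(json.dumps(writer, indent=4))
```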

  2. Configure channel control policies.

    You can configure channel control policies in the setting field, such as the maximum number of parallel threads that can be used for data synchronization, the maximum transmission rate, and settings for dirty data records.

    Parameter

    Description

    executeMode

    Specifies whether to enable the distributed execution mode for the batch synchronization task. Valid values:

    • distribute: Enables the distributed execution mode. If you enable the distributed execution mode for the batch synchronization task, the system splits the task into slices and distributes them to multiple Elastic Compute Service (ECS) instances for parallel running. In this case, the more ECS instances, the higher the data synchronization speed.

    • null: Disables the distributed execution mode. If you do not enable the distributed execution mode for the batch synchronization task, the specified maximum number of parallel threads is used only for a single ECS instance to run the task.

    Important
    • If your exclusive resource group contains only one ECS instance, we recommend that you do not run your batch synchronization task in distributed execution mode.

    • If one ECS instance can meet your business requirements for data transmission speed, you do not need to enable the distributed execution mode. This can simplify the execution mode of your task.

    • The distributed execution mode can be enabled only if the maximum number of parallel threads that you specified is greater than or equal to 8.

    • Whether a batch synchronization task supports the distributed execution mode varies based on the data source type. For more information, see the topics for Readers and Writers.

    concurrent

    The maximum number of parallel threads that the batch synchronization task uses to read data from the source or write data to the destination.

    Note

    The actual number of parallel threads that are used during data synchronization may be less than or equal to the specified threshold due to the specifications of the exclusive resource group for Data Integration. You are charged for the exclusive resource group for Data Integration based on the number of parallel threads that are used. For more information, see Performance metrics.

    throttle

    Specifies whether to enable throttling.

    • true: Enables throttling. If you enable throttling, you can specify a maximum transmission rate to prevent heavy read workloads on the source. The minimum value of the maximum transmission rate is 1 MB/s.

      Note

      If you set the throttle parameter to true, you must also configure the mbps parameter. The mbps parameter specifies the maximum transmission rate that is allowed.

    • false: Disables throttling. If you do not enable throttling, data is transmitted at the maximum transmission rate allowed by the hardware based on the specified maximum parallel threads.

    Note

    The bandwidth value is a metric provided by Data Integration and does not represent the actual traffic of an elastic network interface (ENI). In most cases, the ENI traffic is one to two times the channel traffic. The actual ENI traffic depends on the serialization of the data storage system.

    errorLimit

    Specifies whether to allow the generation of dirty data during data synchronization.

    Important

    If a large amount of dirty data is generated during data synchronization, the overall data synchronization speed is affected.

    • If this parameter is not configured, dirty data records are allowed during data synchronization, and the batch synchronization task can continue to run if dirty data records are generated.

    • If you set this parameter to 0, no dirty data records are allowed. If dirty data records are generated during data synchronization, the batch synchronization task fails.

    • If you specify a value that is greater than 0 for this parameter, the following situations occur:

      • If the number of dirty data records that are generated during data synchronization is less than or equal to the value that you specified, the dirty data records are ignored and are not written to the destination, and the batch synchronization task continues to run.

      • If the number of dirty data records that are generated during data synchronization is greater than the value that you specified, the batch synchronization task fails.

    Note

    Dirty data is data that is meaningless to the business, does not match the specified data type, or causes an exception during data synchronization. If an exception occurs when a single data record is written to the destination, the record fails to be written and is considered dirty data.

    For example, when a batch synchronization task attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, and the data fails to be written to the destination. In this case, the data is dirty data. When you configure a batch synchronization task, you can control whether dirty data is allowed. You can also specify the maximum number of dirty data records that are allowed during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specified, the synchronization task fails.
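The errorLimit rules described above can be summarized in a small decision function. This is only a model of the documented behavior, not DataWorks code, and the function and argument names are hypothetical.

```python
def task_outcome(dirty_records, error_limit=None):
    """Return "succeeded" or "failed" based on the documented errorLimit rules.

    dirty_records: number of dirty data records generated during the run.
    error_limit:   None if errorLimit is not configured, otherwise an integer >= 0.
    """
    if dirty_records < 0:
        raise ValueError("dirty_records must be >= 0")
    if error_limit is None:
        # Not configured: dirty data records are allowed and the task keeps running.
        return "succeeded"
    if error_limit < 0:
        raise ValueError("error_limit must be >= 0")
    # Configured: the task fails only if the count exceeds the limit.
    # With a limit of 0, any dirty data record makes the task fail.
    return "succeeded" if dirty_records <= error_limit else "failed"


for limit in (None, 0, 10):
    print(limit, [task_outcome(n, limit) for n in (0, 1, 10, 11)])
```

Dirty data records that are within the limit are ignored and are not written to the destination, so a task that succeeds can still leave the destination with fewer records than the source.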

    Note

    In addition to the preceding configurations, the overall data synchronization speed of a batch synchronization task is also affected by factors such as the performance of the source and the network environment for data synchronization. For information about the data synchronization speed and performance tuning of a batch synchronization task, see Optimize the performance of batch synchronization tasks.
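The channel control policies described above can be combined into a single `setting` object. The sketch below also checks the constraints stated in this topic: throttling requires a maximum transmission rate of at least 1 MB/s, and the distributed execution mode requires at least 8 parallel threads. The nesting of `concurrent`, `throttle`, and `mbps` under a `speed` key, the `record` key under `errorLimit`, and the use of string values for `mbps` and `record` are assumptions based on commonly seen script templates.

```python
import json


def build_setting(concurrent, mbps=None, error_limit=None, distribute=False):
    """Return an illustrative `setting` object for channel control."""
    if concurrent < 1:
        raise ValueError("concurrent must be at least 1")
    if distribute and concurrent < 8:
        raise ValueError("distributed execution mode requires concurrent >= 8")

    # Throttling is enabled only when a maximum transmission rate is given.
    speed = {"concurrent": concurrent, "throttle": mbps is not None}
    if mbps is not None:
        if mbps < 1:
            raise ValueError("the maximum transmission rate must be >= 1 MB/s")
        speed["mbps"] = str(mbps)

    setting = {
        # None is serialized as JSON null, which disables distributed execution.
        "executeMode": "distribute" if distribute else None,
        "speed": speed,
    }
    if error_limit is not None:
        setting["errorLimit"] = {"record": str(error_limit)}
    return setting


setting = build_setting(concurrent=8, mbps=10, error_limit=0, distribute=True)
print(json.dumps(setting, indent=4))
```

Leaving `error_limit` unset omits the errorLimit entry, which corresponds to the case in which dirty data records are allowed.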

Step 5: Configure scheduling properties for the batch synchronization task

If you want DataWorks to periodically schedule your batch synchronization task, you must configure scheduling properties for the task. To do so, go to the configuration tab of the batch synchronization task and click Properties in the right-side navigation pane. For information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization. You can configure the following properties for a batch synchronization task:

  • Configure scheduling parameters: If you use variables in the preceding steps for the batch synchronization task, you can assign scheduling parameters or constants to the variables as values.

  • Configure time properties: The time properties define the mode in which the batch synchronization task is scheduled in the production environment. In the Schedule section of the Properties tab of the task, you can configure properties such as the instance generation mode, scheduling type, and scheduling cycle for the task.

  • Configure the resource property: The resource property defines the exclusive resource group for scheduling that is used to issue the batch synchronization task to the related exclusive resource group for Data Integration. You can select the exclusive resource group for scheduling that you want to use in the Resource Group section of the Properties tab.

    Note

    DataWorks uses resource groups for scheduling to issue batch synchronization tasks in Data Integration to resource groups for Data Integration and uses the resource groups for Data Integration to run the tasks. You are charged for using the resource groups for scheduling to schedule batch synchronization tasks. For information about the task issuing mechanism, see Mechanism for issuing tasks.
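To illustrate the first item in the preceding list, the following sketch mimics how a value assigned in the scheduling properties replaces a variable in a filter condition. DataWorks performs this replacement itself when the task is scheduled. The sketch only imitates the effect by using Python's `string.Template`, and the variable name `bizdate`, the column name, and the date value are hypothetical.

```python
from string import Template


def render_filter(filter_condition, assignments):
    """Replace ${name} placeholders with assigned values.

    Raises KeyError if a variable in the filter condition has no assigned value.
    """
    return Template(filter_condition).substitute(assignments)


where = "gmt_modified >= '${bizdate}'"
rendered = render_filter(where, {"bizdate": "20231214"})
print(rendered)  # gmt_modified >= '20231214'
```

The rendered condition is what ends up in the SQL statement that extracts data from the source, so the assigned value must match the format of the column that it is compared with.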

Step 6: Commit and deploy the batch synchronization task

If you want to periodically run the batch synchronization task, you must deploy the task to the production environment. For more information about how to deploy a task, see Deploy tasks.

What to do next

After the batch synchronization task is deployed to the production environment, you can go to Operation Center in the production environment to view the task. For information about how to perform O&M operations for a batch synchronization task, such as running and managing the task, monitoring the status of the task, and performing O&M for the resource group that is used to run the task, see O&M for batch synchronization tasks.
