This topic describes how to use the codeless user interface (UI) to configure a batch synchronization node that is periodically scheduled and how to commit and deploy the node.

Prerequisites

  1. The required data sources are configured. Before you configure a data synchronization node, you must configure the data sources from which you want to read data and to which you want to write data. This way, you can select the data sources when you configure a batch synchronization node. For information about the data source types, Reader plug-ins, and Writer plug-ins that are supported by batch synchronization, see Supported data source types, Reader plug-ins, and Writer plug-ins.
    Note For information about the items that you must understand before you configure a data source, see Overview.
  2. An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
  3. Network connections between the exclusive resource group for Data Integration and the data sources are established. For more information, see Establish a network connection between a resource group and a data source.

Go to the DataStudio page

  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces.
  3. In the top navigation bar, select the region where the desired workspace resides. On the Workspaces page, find the workspace and click DataStudio in the Actions column. The DataStudio page appears.

Procedure

  1. Step 1: Create a batch synchronization node
  2. Step 2: Configure the batch synchronization node
    1. Establish network connections between the exclusive resource group for Data Integration and the data sources
    2. Select the tables from which you want to read data and the tables to which you want to write data, and specify a filter condition when you configure the source
    3. Configure field mappings
    4. Configure channel control policies, such as the maximum transmission rate and settings for dirty data records
    5. Configure scheduling properties for the batch synchronization node
  3. Step 3: Commit and deploy the batch synchronization node

Step 1: Create a batch synchronization node

  1. Create a workflow. For more information, see Manage workflows.
  2. Create a batch synchronization node.
    You can use one of the following methods to create a batch synchronization node:
    • Method 1: In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click its name. Right-click Data Integration and choose Create Node > Batch Synchronization.
    • Method 2: In the Scheduled Workflow pane of the DataStudio page, find the created workflow and double-click its name. In the Data Integration section of the workflow editing tab that appears, click Batch Synchronization.
  3. In the Create Node dialog box, configure the parameters to create a batch synchronization node.

Step 2: Configure the batch synchronization node

  1. Establish network connections between the exclusive resource group for Data Integration and the data sources.
    Select the source, destination, and exclusive resource group for Data Integration, and establish network connections between the resource group and the data sources.
    Important The items that you must configure vary based on the Reader or Writer plug-in. The following tables describe the common configuration items that are required when you configure a batch synchronization node. For information about the configuration items supported by a Reader or Writer plug-in and how to configure the items, see the topic for the related Reader or Writer plug-in. For more information about the data source types, Reader plug-ins, and Writer plug-ins that are supported by batch synchronization, see Supported data source types, Reader plug-ins, and Writer plug-ins.
  2. Click Next Step to configure the source and destination for the batch synchronization node.
    1. Select the tables from which you want to read data and the tables to which you want to write data.
      In the data source selection section, select the tables from which you want to read data and the tables to which you want to write data, and specify a filter condition when you configure the source.
      • Configuration items for the source
        Filter condition
        • If you specify a filter condition after you select the tables from which you want to read data, only data that meets the filter condition in the tables can be synchronized. You can use a filter condition together with scheduling parameters. This way, the filter condition can dynamically change with the settings of the scheduling parameters, and incremental data can be synchronized. Configurations for incremental synchronization and implementation of incremental synchronization vary based on the Reader plug-in type. For more information, see Configure a batch synchronization node to synchronize only incremental data.
          Note
          • When you configure scheduling properties for the batch synchronization node, you can assign values to the variables that you specified in the filter condition. You can configure scheduling parameters for the batch synchronization node to enable full or incremental data in the source to be written to the related time-based partitions in the destination tables. For more information, see Supported formats of scheduling parameters.
          • The syntax of the filter condition is basically the same as the SQL syntax of the source database. During data synchronization, the batch synchronization node generates a complete SQL statement based on the specified filter condition and uses the statement to extract data from the source.
        • By default, if you do not specify a filter condition, full data in the source is synchronized.
        Shard key for a relational database
        A shard key specifies a field based on which source data is sharded. After you specify a shard key, source data is sharded and distributed to multiple shards. This way, the batch synchronization node can run parallel threads to read the data in batches.
        Note
        • We recommend that you specify the name of the primary key column of a source table as the shard key. This way, data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards.
        • A shard key supports only fields of integer data types. If you specify a field of an unsupported data type as the shard key, the batch synchronization node ignores the shard key and uses a single thread to read data.
        • If no shard key is specified, a data synchronization node uses a single thread to read data.
        • Support of Reader plug-ins for the configuration of a shard key varies based on the Reader plug-in type. The instructions provided in this topic are for reference only. You can refer to the topic for a Reader plug-in to check whether the Reader plug-in supports the configuration of a shard key. For more information about the data source types, Reader plug-ins, and Writer plug-ins that are supported by batch synchronization, see Supported data source types, Reader plug-ins, and Writer plug-ins.
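        As an illustration, the following sketch shows a filter condition that works with a scheduling parameter to implement incremental synchronization. The column name gmt_modified and the variable name bizdate are assumptions; replace them with a time column in your source table and a variable that you assign a value to when you configure scheduling properties.

        ```sql
        -- Hypothetical filter condition entered in the Filter condition field.
        -- gmt_modified is an assumed last-modified column in the source table;
        -- ${bizdate} is a variable that is assigned a value in the scheduling
        -- properties of the node. At run time, the condition is concatenated
        -- into the SQL statement that extracts data from the source, roughly:
        --   SELECT ... FROM source_table WHERE gmt_modified >= '${bizdate}';
        gmt_modified >= '${bizdate}'
        ```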
      • Configuration items for the destination
        SQL statements that you want to execute before and after data synchronization
        DataWorks allows you to execute SQL statements before and after data is written to specific types of destinations.

        For example, when you configure a batch synchronization node that uses MySQL Writer, you can configure the SQL statement truncate table tablename as a statement to be executed before data is written to the destination. This statement is used to delete existing data in a specified table. You can also configure an SQL statement as a statement to be executed after data is written to the destination.

        Write mode that is used when a conflict occurs
        You can specify the write mode that is used to write data to the destination when a conflict, such as a path conflict or primary key conflict, occurs. The configuration varies based on the attributes of destinations and the support of Writer plug-ins. To configure this item, refer to the topic for the related Writer plug-in.
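        To make the preceding example concrete, the following sketch shows a pre-write and a post-write SQL statement for a hypothetical MySQL destination table named user_orders:

        ```sql
        -- Statement to execute before data is written: remove existing data
        -- so that reruns of the node do not produce duplicate rows.
        -- The table name user_orders is hypothetical.
        TRUNCATE TABLE user_orders;

        -- Statement to execute after data is written: for example, refresh
        -- the table statistics used by the query optimizer.
        ANALYZE TABLE user_orders;
        ```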
    2. Configure field mappings.
      After the mappings between source fields and destination fields are configured, the batch synchronization node writes the values of the source fields to the destination fields of the same data type based on the mappings.

      The data type of source fields may be different from that of destination fields. In this case, the values of the source fields cannot be written to the destination fields. The values that fail to be written to the destination are considered as dirty data. You can refer to the operations described in the Configure channel control policies substep to specify the maximum number of dirty data records that are allowed during data synchronization.

      Note If a source field has no mapped destination field, the source field cannot be synchronized to the destination.
      You can map source fields to destination fields that have the same names as the source fields or map fields in a row of the source to the fields in the same row of the destination. When you configure field mappings, you can also perform the following operations:
      • Add fields to a source table and assign values to the fields: You can click Add in the source field list to add fields to the source table. The added fields are synchronized to the destination table during data synchronization. The fields can be constants and variables that are enclosed in single quotation marks ('), such as '123' and '${Variable name}'.
        Note If you add variables to the source table as fields, you can assign values to the variables when you configure scheduling properties for the batch synchronization node. For information about scheduling parameters, see Supported formats of scheduling parameters.
      • Edit fields in a source table: You can click the Edit icon in the source field list to perform the following operations:
        • Use a function that is supported by the source to process fields in the source table. For example, you can use the Max(id) function to implement synchronization of data in the row with the largest ID in the source table.
        • If only some fields in the source table are displayed when you configure field mappings, edit the fields in the source table.
        Note Functions are not supported if you configure a batch synchronization node that uses MaxCompute Reader.
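      For example, after you add a constant and a variable as fields, the edited source field list may look like the following sketch. The field names id and name, the constant 'ODS', and the variable bizdate are assumptions for illustration; the variable is assigned a value when you configure scheduling properties for the node.

      ```
      id
      name
      'ODS'
      '${bizdate}'
      ```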
    3. Configure channel control policies.
      You can configure channel control policies to define attributes for data synchronization.
      Expected Maximum Concurrency
      The maximum number of parallel threads that the batch synchronization node uses to read data from the source or write data to the destination.
      Note The actual number of parallel threads that are used during data synchronization may be less than the specified value due to the specifications of the exclusive resource group for Data Integration. You are charged for the exclusive resource group for Data Integration based on the number of parallel threads that are used. For more information, see Performance metrics.
      Bandwidth Throttling
      Specifies whether to enable throttling.
      • If you enable throttling, you can specify a maximum transmission rate to prevent heavy read workloads on the source. The minimum value of this parameter is 1 MB/s.
      • If you do not enable throttling, data is transmitted at the maximum transmission rate allowed by the hardware based on the specified maximum number of parallel threads.
      Note The bandwidth is a metric provided by Data Integration and does not represent the actual traffic of an elastic network interface (ENI). In most cases, the ENI traffic is one to two times the channel traffic. The actual ENI traffic depends on the serialization of the data storage system.
      Dirty Data Records Allowed
      The maximum number of dirty data records allowed.
      Important If a large amount of dirty data is generated during data synchronization, the overall data synchronization speed is affected.
      • If this parameter is not configured, dirty data records are allowed during data synchronization, and the batch synchronization node can continue to run if dirty data records are generated.
      • If you set this parameter to 0, no dirty data records are allowed. If dirty data records are generated during data synchronization, the batch synchronization node fails.
      • If you specify a value that is greater than 0 for this parameter, the following situations occur:
        • If the number of dirty data records that are generated during data synchronization is less than or equal to the value that you specified, the dirty data records are ignored and are not written to the destination, and the batch synchronization node continues to run.
        • If the number of dirty data records that are generated during data synchronization is greater than the value that you specified, the batch synchronization node fails.
      Note Dirty data indicates data that is meaningless to your business, does not match the specified data type, or causes an exception during data synchronization. If an exception occurs when a data record is written to the destination, or the record otherwise fails to be written, the record is considered dirty data.

      For example, when a batch synchronization node attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, and the data fails to be written to the destination. In this case, the data is dirty data. When you configure a batch synchronization node, you can control whether dirty data is allowed. You can also specify the maximum number of dirty data records that are allowed during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specified, the batch synchronization node fails and exits.

      Distributed Execution
      Specifies whether to enable the distributed execution mode for the batch synchronization node.
      • If you enable the distributed execution mode for a batch synchronization node, the system splits the node into slices and distributes them to multiple Elastic Compute Service (ECS) instances for parallel running. In this case, the more ECS instances, the higher the data synchronization speed.
      • If you do not enable the distributed execution mode for a batch synchronization node, the specified maximum number of parallel threads is used only for a single ECS instance to run the node.
      If you have high requirements for data synchronization performance, you can run your batch synchronization node in distributed execution mode. This mode also utilizes fragmented resources of ECS instances, which improves resource utilization.
      Important
      • If your exclusive resource group contains only one ECS instance, we recommend that you do not run your batch synchronization node in distributed execution mode.
      • If a single ECS instance meets your business requirements for data transmission speed, you do not need to enable the distributed execution mode. This keeps the execution mode of your node simple.
      • The distributed execution mode can be enabled only if the maximum number of parallel threads that you specified is greater than or equal to 8.
      • Whether a batch synchronization node supports the distributed execution mode varies based on the data source type. For more information, see the topics for Reader plug-ins and Writer plug-ins. For more information about the data source types, Reader plug-ins, and Writer plug-ins that are supported by batch synchronization, see Supported data source types, Reader plug-ins, and Writer plug-ins.
      Note In addition to the preceding configurations, the overall data synchronization speed of a batch synchronization node is also affected by factors such as the performance of the source and the network environment for data synchronization. For information about the data synchronization speed and performance tuning of a batch synchronization node, see Scenario: Optimize the performance of batch synchronization nodes.
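      For reference, if you switch the node to the code editor, the channel control policies described above map to a setting block in the node configuration. The following is a sketch based on common DataWorks script-mode configurations; verify the exact field names against the script that DataWorks generates for your node.

      ```json
      {
        "setting": {
          "speed": {
            "concurrent": 2,
            "throttle": true,
            "mbps": 1
          },
          "errorLimit": {
            "record": 0
          }
        }
      }
      ```

      In this sketch, concurrent corresponds to Expected Maximum Concurrency, throttle and mbps to Bandwidth Throttling and the maximum transmission rate in MB/s, and errorLimit.record to Dirty Data Records Allowed.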
  3. Click Next Step to configure scheduling properties for the batch synchronization node.
    If you want DataWorks to periodically schedule your batch synchronization node, you must configure scheduling properties for the node. This substep describes how to configure scheduling properties for a batch synchronization node. For information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization.
    • Configure scheduling parameters: If you use variables in the configurations of the batch synchronization node, you can assign scheduling parameters to the variables as values.
    • Configure time properties: The time properties define the mode in which the batch synchronization node is scheduled in the production environment. In the section in which you configure time properties for the batch synchronization node, you can configure attributes such as the instance generation mode, scheduling type, and scheduling cycle for the node.
    • Configure the resource property: The resource property defines the exclusive resource group for scheduling that is used to issue the batch synchronization node to the related exclusive resource group for Data Integration. You can select the exclusive resource group for scheduling that you want to use.
      Note DataWorks uses resource groups for scheduling to issue batch synchronization nodes in Data Integration to resource groups for Data Integration and uses the resource groups for Data Integration to run the nodes. You are charged for using the resource groups for scheduling to schedule batch synchronization nodes. For more information about the node issuing mechanism, see Mechanism for issuing nodes.
  4. Click Complete Configuration.

Step 3: Commit and deploy the batch synchronization node

If you want DataWorks to periodically run the batch synchronization node, you must deploy the node to the production environment. For more information about how to deploy a node, see Deploy nodes.

What to do next

After the batch synchronization node is deployed to the production environment, you can go to Operation Center in the production environment to view the node. For information about how to perform O&M operations for a batch synchronization node, such as running and managing the node, monitoring the status of the node, and performing O&M for the resource group that is used to run the node, see O&M for batch synchronization nodes.