The batch synchronization feature of Data Integration provides readers and writers for you to read data from and write data to data sources. You can specify a source and a destination for your batch synchronization node, and configure scheduling parameters for the node. This way, you can use the node to synchronize full data or incremental data from the source to the destination. This topic describes the capabilities provided by the batch synchronization feature.

Limits

The batch synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a batch synchronization node reside in a different time zone from the resource group that is used to run the node, errors may occur during data synchronization.

Billing

  • Data synchronization nodes in Data Integration consume Data Integration resources, and you are charged for the resources that you use. In addition, the scheduling system issues batch synchronization nodes to the related exclusive resource groups for scheduling and uses the resource groups to schedule the nodes, which generates scheduling fees. For more information about the billing details of resources for Data Integration, see Billing on resources: Data Integration.
    Note If you configure the public IP address of a data source when you add the data source to DataWorks, and you use the data source for your batch synchronization node, Internet traffic is generated when you run the node. You are charged for the generated Internet traffic. For more information about the billing details of Internet traffic, see Billing of Internet traffic.

Overview

The batch synchronization feature provides the following capabilities.
Data synchronization between heterogeneous data sources: Data Integration supports data synchronization between more than 40 types of data sources, including relational databases, unstructured storage systems, big data storage systems, and message queues. You can specify a source and a destination for your batch synchronization node and use the related reader and writer to synchronize data between the data sources. All structured data sources and semi-structured data sources are supported. For more information, see Supported data source types and read and write operations.
Data synchronization from or to data sources that are deployed in complex network environments: The batch synchronization feature supports data synchronization from or to Alibaba Cloud data sources, data centers, self-managed data sources that are hosted on Elastic Compute Service (ECS) instances, and data sources that do not belong to Alibaba Cloud. You can select appropriate network connectivity solutions to establish network connections between your resource group and data sources based on the network environments in which the data sources are deployed. Before you configure a data synchronization node, you must make sure that network connections are established between your resource group for Data Integration and data sources. For more information about how to establish a network connection between a resource group and a data source, see Establish a network connection between a resource group and a data source.
Data synchronization scenarios: The batch synchronization feature allows you to synchronize data from a single table to another single table or synchronize data from tables in sharded databases to a single table. You can configure scheduling parameters for a batch synchronization node and use the node to periodically synchronize full data and incremental data in the source to the related partition in the destination table. You can also configure scheduling parameters for a batch synchronization node and use the data backfill feature provided in Operation Center to backfill the historical data of a specific period of time for the node. This way, you can use the node to synchronize the historical data to the specified partition or table in the destination database or data warehouse. For more information about scheduling parameters, see Supported formats of scheduling parameters.
Note
  • Data synchronization from tables in sharded databases is supported for database types such as MySQL, SQL Server, Oracle, PostgreSQL, PolarDB, and AnalyticDB. For more information, see Scenario: Configure a batch synchronization node to synchronize data from tables in sharded databases.
  • The batch synchronization feature can be used to synchronize data only from a single table or tables in sharded databases to a single table. If you want to synchronize data from tables in multiple databases to multiple tables, you can use the solution-based synchronization feature, which provides a batch synchronization solution for synchronizing data from all tables in a database. For more information about how to select a data synchronization feature, see Overview.
Node configuration methods: You can configure a batch synchronization node by using the codeless UI or the code editor.
Note For more information about the settings that are supported for configuring a batch synchronization node, see Configurations for a batch synchronization node.
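If you use the code editor, the node is described by a JSON script. The following sketch shows roughly what such a script-mode configuration might look like for a node that synchronizes data from MySQL to MaxCompute. The data source names, tables, and fields are hypothetical, and the exact keys depend on the reader, the writer, and your DataWorks version.

```python
import json

# Illustrative sketch only: rough shape of a script-mode (code editor)
# configuration for a MySQL-to-MaxCompute batch synchronization node.
# Data source names, tables, and fields are hypothetical; exact keys can
# differ by reader, writer, and DataWorks version.
job = {
    "type": "job",
    "steps": [
        {
            "category": "reader",
            "stepType": "mysql",                  # source reader type
            "name": "Reader",
            "parameter": {
                "datasource": "my_mysql_source",  # data source added in DataWorks
                "table": ["orders"],
                "column": ["order_id", "amount", "gmt_created"],
            },
        },
        {
            "category": "writer",
            "stepType": "odps",                   # MaxCompute writer type
            "name": "Writer",
            "parameter": {
                "datasource": "my_odps_dest",
                "table": "ods_orders",
                "partition": "ds=$bizdate",       # destination partition
                "column": ["order_id", "amount", "gmt_created"],
            },
        },
    ],
    "setting": {
        "errorLimit": {"record": "0"},            # maximum number of dirty data records allowed
        "speed": {"concurrent": 2, "throttle": False},
    },
}

print(json.dumps(job, indent=2))
```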
O&M for batch synchronization nodes
  • Monitor the status of a batch synchronization node: You can monitor the status of a batch synchronization node and configure monitoring and alerting settings for the node based on a condition such as Uncompleted, Error, or Completed. You can also configure DataWorks to send alert notifications to the specified alert recipient by email, text message, DingTalk chatbot, or webhook URL. For more information, see Create a custom alert rule.
  • Monitor the quality of table data: You can monitor the quality of table data that is synchronized to a destination. You can configure monitoring rules only for tables of specific types of databases. For more information, see Overview.
  • Isolate the same data source in different environments: The data source isolation feature allows you to add the same data source separately for the development environment and the production environment. When you configure a batch synchronization node, the data source in the development environment is used. When you commit the node to the production environment and run it there, the data source in the production environment is used.

Configurations for a batch synchronization node

Node configuration
You can configure the following settings for a batch synchronization node.
Synchronize full or incremental data: You can configure a filter condition and scheduling parameters when you configure a batch synchronization node to synchronize incremental data from the source. The parameters that need to be configured to implement incremental synchronization vary based on the reader type. For more information, see Configure a batch synchronization node to synchronize only incremental data.
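For example, for a reader that supports a WHERE-style filter, such as the MySQL reader, incremental synchronization is typically implemented by referencing a scheduling parameter in the filter condition. The fragment below is only a sketch; the table, the gmt_modified column, and the ${bizdate} parameter are illustrative assumptions.

```python
import json

# Sketch of a reader "parameter" fragment (script mode) that reads only the
# increment for the data timestamp. The column gmt_modified and the ${bizdate}
# scheduling parameter are illustrative assumptions.
reader_parameter = {
    "datasource": "my_mysql_source",   # hypothetical data source name
    "table": ["orders"],
    "column": ["order_id", "amount", "gmt_modified"],
    "where": "DATE_FORMAT(gmt_modified, '%Y%m%d') = '${bizdate}'",
}

print(json.dumps(reader_parameter, indent=2))
```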
Configure field mappings, and add fields to a source table and assign values to the fields: When you configure a batch synchronization node, you can configure mappings between fields in the source and fields in the destination. The values of the fields in the source are written to the fields of the same data type in the destination based on the mappings.
  • Methods used to configure field mappings:
    • If you configure a batch synchronization node by using the codeless UI, you can map fields in the source to fields with the same names in the destination, map the fields in a row of the source to the fields in the same row of the destination, or customize mappings between all or specific fields in the source and all or specific fields in the destination. Data in source fields that do not have mapped destination fields is not synchronized. Make sure that destination fields that do not have mapped source fields have default values or allow NULL values. Otherwise, data may fail to be written to the destination.
    • If you configure a batch synchronization node by using the code editor, the system maps fields in the source to fields in the destination based on the field lists that you specify for the reader and the writer, in the order in which the fields are specified. The number of fields to which you want to write data must be the same as the number of fields from which you want to read data. If the numbers are different, the batch synchronization node fails. For an illustration, see the sketch after this list.
  • Add fields to a source table and assign values to the fields: You can add fields, such as constants and variables, to a source table. If you add variables to a source table as fields, you can assign values to the variables.
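The following sketch illustrates how positional mapping works in the code editor, assuming a MySQL reader and a MaxCompute writer with hypothetical column names.

```python
# Positional field mapping in script mode: the i-th entry of the reader's
# "column" list is written to the i-th entry of the writer's "column" list,
# so both lists must contain the same number of fields. Column names are
# hypothetical.
reader_columns = ["order_id", "buyer_name", "gmt_created"]
writer_columns = ["order_id", "buyer_name", "created_time"]

# If the counts differ, the batch synchronization node fails.
assert len(reader_columns) == len(writer_columns), "column counts must match"

for src, dst in zip(reader_columns, writer_columns):
    print(f"{src} -> {dst}")
```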
Specify the maximum number of parallel threads that can be used and the maximum transmission rate
  • When you configure a batch synchronization node, you can specify the maximum number of parallel threads that can be used to read data from the source and write data to the destination.
  • When you configure a batch synchronization node, you can specify the maximum transmission rate to prevent heavy read workloads on the source or heavy write workloads on the destination.
    Note If you do not specify the maximum transmission rate when you configure a batch synchronization node, data is transmitted at the maximum transmission rate that is allowed by the hardware.
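In script mode, these settings usually appear in the speed section of the setting block. The sketch below is illustrative; key names and accepted values can vary by DataWorks version.

```python
# Illustrative sketch of the "setting" fragment that controls parallelism and
# rate limiting in script mode. Key names can vary by DataWorks version.
setting = {
    "speed": {
        "concurrent": 4,    # maximum number of parallel threads
        "throttle": True,   # enable rate limiting
        "mbps": "5",        # maximum transmission rate when throttling is enabled
    }
}

print(setting)
```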
Enable the distributed execution mode

Batch synchronization nodes for specific types of data sources can be run in distributed execution mode. If you enable the distributed execution mode for a batch synchronization node when you configure the node, the system splits the node into slices and runs the slices on multiple machines at the same time. The more ECS instances that are available, the higher the data synchronization speed. If you have high requirements for data synchronization performance, you can run your batch synchronization node in distributed execution mode. The distributed execution mode can also utilize fragmented resources on ECS instances, which helps improve resource utilization.

Note Support for the distributed execution mode varies based on the data source type. For more information about whether a data source supports the distributed execution mode, see the topic for the related reader or writer and parameters displayed in the DataWorks console.
Specify the maximum number of dirty data records that are allowed
Data Integration allows the generation of dirty data records by default. You can specify the maximum number of dirty data records that are allowed during data synchronization and determine how dirty data records affect the batch synchronization node.
  • If you do not allow the generation of dirty data records and dirty data records are generated during data synchronization, the batch synchronization node fails.
  • If you allow the generation of dirty data records and specify the maximum number of dirty data records that are allowed during data synchronization, one of the following situations occurs:
    • If the number of dirty data records that are generated does not exceed the upper limit, the dirty data records are ignored and not written to the destination, and the batch synchronization node continues to run.
    • If the number of dirty data records that are generated exceeds the upper limit, the batch synchronization node fails.
Note Dirty data is data that is meaningless to business, does not match the specified data type, or causes an exception during data synchronization. If an exception occurs when a single data record is written to the destination, the record is considered dirty data. Therefore, data records that fail to be written to a destination are considered dirty data. For example, if a batch synchronization node attempts to write VARCHAR-type data from a source to an INT-type field in a destination, a data conversion error occurs and the data fails to be written to the destination. This data is dirty data. When you configure a batch synchronization node, you can specify whether dirty data records are allowed and the maximum number of dirty data records that are allowed during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specify, the batch synchronization node fails.
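In script mode, the dirty data threshold is typically expressed in the errorLimit section of the setting block. The sketch below is illustrative.

```python
# Illustrative sketch of the dirty data threshold in script mode.
# With "record" set to "10", up to 10 dirty data records are ignored and not
# written to the destination; if more than 10 are generated, the node fails.
setting = {
    "errorLimit": {
        "record": "10",
    }
}

print(setting)
```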

Use scheduling parameters in data synchronization

Batch synchronization of data in a table

If you use a batch synchronization node to synchronize data, you can configure scheduling parameters for the node to specify the path and scope of the data that you want to synchronize and the location to which you want to write the data. The method used to configure scheduling parameters for a batch synchronization node is the same as the method used to configure scheduling parameters for other types of nodes.

When a batch synchronization node is run, the scheduling parameters configured for the node are replaced with the actual values based on the value formats of the scheduling parameters. Then, the batch synchronization node synchronizes data based on the values.

Example: You want to configure a batch synchronization node to synchronize the order data that is generated on the previous day in an order table in a MySQL data source to the partition for the current day in a destination MaxCompute table every day. The source order table contains the gmt_created field, which specifies the time when an order is created, and the destination MaxCompute table contains the partition field ds.

The incremental order data in the source order table is obtained by using a WHERE filter condition.
  • bizdate_yesterday specifies the date on which an incremental order is created. The date is one day earlier than the date on which the node is scheduled to run. The value format of the parameter is ${yyyy-mm-dd}.
  • bizdate_today specifies the date on which data of an incremental order is synchronized. The node is scheduled to run on the day indicated by this date. The value format of the parameter is $[yyyy-mm-dd].
  • bizdate_today and bizdate_yesterday are the names of scheduling parameters. You can specify the names based on your business requirements. When the node is run, the bizdate_today and bizdate_yesterday parameters are replaced with actual values based on the value formats of the parameters.
The partition in the destination MaxCompute table is also specified by a scheduling parameter. $bizdate specifies the data timestamp of the node. When the node is run, the partition filter expression configured for the node is replaced with the data timestamp specified by the scheduling parameter. For more information about how to configure and use scheduling parameters, see Configure and use scheduling parameters.
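The following sketch shows how the filter condition and the destination partition might be written for this example, and how the scheduling parameters resolve for a node whose scheduled run date is 2024-01-02. The WHERE clause and the comparison logic are assumptions for illustration; only the parameter value formats follow the definitions above.

```python
from datetime import date, timedelta

# Mimics how the scheduling parameters defined above resolve at run time for
# a node that is scheduled to run on 2024-01-02 (illustrative date).
run_date = date(2024, 1, 2)                     # date on which the node runs
data_timestamp = run_date - timedelta(days=1)   # data timestamp (bizdate), one day earlier

params = {
    "bizdate_yesterday": data_timestamp.strftime("%Y-%m-%d"),  # ${yyyy-mm-dd}
    "bizdate_today": run_date.strftime("%Y-%m-%d"),            # $[yyyy-mm-dd]
    "bizdate": data_timestamp.strftime("%Y%m%d"),              # $bizdate
}

# A possible WHERE filter for the MySQL reader and the destination partition
# expression (both are assumptions for illustration):
where = (
    f"gmt_created >= '{params['bizdate_yesterday']}' "
    f"AND gmt_created < '{params['bizdate_today']}'"
)
partition = f"ds={params['bizdate']}"

print(where)      # gmt_created >= '2024-01-01' AND gmt_created < '2024-01-02'
print(partition)  # ds=20240101
```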

Batch synchronization of all data in a database

For a batch synchronization node that is used to synchronize all data in a database, only the following scheduling parameters can be configured:

bizdate=${yyyymmdd} year=$[yyyy] month=$[mm] day=$[dd] hour=$[hh24]

When you configure the node, the following variables must be defined: ${bizdate}, ${year}, ${month}, ${day}, and ${hour}.

Example: If you want to configure a data synchronization solution that synchronizes full data from a source database once and then periodically synchronizes incremental data from the source database to MaxCompute, you can configure the filter condition STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), interval 1 day) to obtain the daily incremental data that you want to periodically synchronize to MaxCompute.
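For reference, the sketch below shows how this filter condition resolves for a specific data timestamp; the incremental column name is a placeholder for columnName.

```python
# Illustrative resolution of the filter condition above for bizdate = 20240101.
bizdate = "20240101"          # value of ${bizdate} at run time
column_name = "gmt_modified"  # placeholder for your incremental column (columnName)

condition = (
    f"STR_TO_DATE('{bizdate}', '%Y%m%d') <= {column_name} "
    f"AND {column_name} < DATE_ADD(STR_TO_DATE('{bizdate}', '%Y%m%d'), interval 1 day)"
)
print(condition)
# STR_TO_DATE('20240101', '%Y%m%d') <= gmt_modified
#   AND gmt_modified < DATE_ADD(STR_TO_DATE('20240101', '%Y%m%d'), interval 1 day)
```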