Configure a synchronization task in Data Integration - DataWorks

After you prepare data sources, network environments, and resources, you can select a synchronization task type based on your business requirements, and then create and run a synchronization task to synchronize data between the data sources. This topic describes the general procedure for configuring a synchronization task. The detailed procedure varies based on the synchronization task type that you select. You can view the configuration details for each type of synchronization task in the DataWorks console.

Prerequisites

- The required data sources are configured. Before you configure a synchronization task, you must configure the source from which you want to read data and the destination to which you want to write data, so that you can select them when you configure the task. For information about the supported data source types and how to configure a data source, see Supported data source types and read and write operations.
  Note
  For information about the items that you need to understand before you configure a data source, see Overview.
- The data source environments are prepared. Before you configure a synchronization task, you must create an account that can access the database and grant the account the permissions that your synchronization configuration requires. For more information, see Overview.

Background information

Data Integration provides various types of synchronization tasks, including the following:
- Batch synchronization of all data in a database: one-time full synchronization, periodical full synchronization, one-time full synchronization followed by periodical incremental synchronization, one-time incremental synchronization, and periodical incremental synchronization
- One-click real-time synchronization: one-time full synchronization followed by real-time incremental synchronization
Different sources and destinations support different types of synchronization tasks. You can view the synchronization task types that are supported by each type of data source in the DataWorks console. For information about the capabilities provided by synchronization tasks, see Overview of the full and incremental synchronization feature.

Limits

Data synchronization across time zones
Synchronizing data across time zones is not supported. If the data sources used for a synchronization task reside in a different time zone from the resource group that you use, errors occur in fields of date or time data types during data synchronization.
Number of databases from which data can be synchronized
- If you run a synchronization task that is used for batch synchronization of all data in a database, the task can read data only from the default database in the source that is used for the task.
- If you run a one-click real-time synchronization task to synchronize data from an ApsaraDB RDS data source, the task can read data from all databases on which the account configured for the data source has permissions.
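
The following Python sketch illustrates one way the time zone limit described above can surface. It assumes that a date and time value is stored without time zone information and is interpreted in the local time zone of whichever system reads it. The UTC+8 and UTC offsets, the interpret helper, and the sample value are hypothetical, and the actual behavior depends on the data source type.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical setup: the source database runs at UTC+8,
# and the resource group runs at UTC.
SOURCE_TZ = timezone(timedelta(hours=8))
RESOURCE_GROUP_TZ = timezone.utc


def interpret(naive_text: str, tz: timezone) -> datetime:
    """Attach a time zone to a date and time string that carries none."""
    parsed = datetime.strptime(naive_text, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=tz)


# Value as stored in the source table, with no time zone attached.
stored = "2024-03-01 00:30:00"

as_source = interpret(stored, SOURCE_TZ)
as_resource_group = interpret(stored, RESOURCE_GROUP_TZ)

# The same text denotes two different instants.
drift = as_resource_group - as_source
print(drift)  # 8:00:00

# Expressed in UTC, the two readings even fall on different calendar dates.
print(as_source.astimezone(timezone.utc).date(), as_resource_group.date())
```

Under these assumptions, the same stored text is read as two instants that are eight hours apart, which is enough to move a value near midnight to a different date.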

Go to Data Integration

You can create a synchronization task in Data Integration.

Go to the Data Integration page.
Log on to the DataWorks console. In the left-side navigation pane, click Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

Procedure

Step 1: Create a synchronization task
Step 2: Select a synchronization task type
Step 3: Configure network connectivity between an exclusive resource group for Data Integration and the data sources
Step 4: Configure the synchronization task
Step 5: Start the synchronization task

Step 1: Create a synchronization task

You can use one of the following methods to create a synchronization task:

Method 1: Go to the Data Integration page. On this page, select the source type and destination type and click Create.
Method 2: Go to the Data Integration page. If no synchronization tasks are displayed in the Nodes section of this page, click Create.

Step 2: Select a synchronization task type

On the Create Data Synchronization Solution page, select the source and destination based on your business requirements. After you select the data source types, the system displays the supported synchronization task types in the Synchronization Method drop-down list. You can select a synchronization task type based on your business requirements.

Note

For information about the supported data source types and synchronization task types, see Supported data source types and data synchronization solutions.

Step 3: Configure network connectivity between an exclusive resource group for Data Integration and the data sources

Select the source, destination, and an exclusive resource group for Data Integration that you want to use. Then, test network connectivity between the resource group and data sources.

Note

If no data sources are available, click New to add a data source. For more information about data sources, see Overview.

Step 4: Configure the synchronization task

Configure the synchronization task based on the instructions that are displayed.

Step 5: Start the synchronization task

Run the created synchronization task and view the running details of the task.

Go to the Nodes section of the Data Integration page and find the newly created synchronization task.
Click Start or Commit and Run in the Actions column to run the synchronization task.
Click Running Details in the Actions column to view the running details of the synchronization task.

Appendix: Advanced settings

Select source databases and tables and configure mapping rules

After you select a source database and table, data is written by default to the destination schema and table that have the same names as the source database and table. If no such destination schema or table exists, the system automatically creates it in the destination. In the Set Mapping Rules for Table/Database Names section, you can configure a mapping rule to specify the name of the destination schema or table to which you want to write data. For example, you can specify a destination table name in a mapping rule to write data from multiple source tables to the same table. You can also specify prefixes in a mapping rule to write data to a database whose name starts with a different prefix from that of the source database, or to tables whose names start with a different prefix from that of the source tables.

Conversion Rule for Table Name: This type of mapping rule uses a regular expression to convert the names of source tables into the names of the destination tables to which you want to write data.
- Example 1: Synchronize data from the source tables whose names start with the prefix doc_ to the destination tables whose names start with the prefix pre_.
- Example 2: Synchronize data from multiple source tables to the same destination table.
  To synchronize data from table_01, table_02, and table_03 to my_table, you can configure a mapping rule of the Conversion Rule for Table Name type, and set Source to table.* and Destination to my_table.
Rule for Destination Table Name: This type of mapping rule allows you to use built-in variables to specify the names of the destination tables to which you want to write data and add a prefix and a suffix to the names of the destination tables. The following built-in variables are supported:
- ${db_table_name_src_transed}: the name of the destination table that is mapped based on a mapping rule of the Conversion Rule for Table Name type
- ${db_name_src_transed}: the name of the destination schema that is mapped based on a mapping rule of the Rule for Conversion Between Source Database Name and Destination Schema Name type
- ${ds_name_src}: the name of the source
Example: Configure pre_${db_table_name_src_transed}_post to convert the table name my_table that is generated in the previous example to pre_my_table_post.
Rule for Conversion Between Source Database Name and Destination Schema Name: This type of mapping rule allows you to use a regular expression to specify the names of the destination schemas to which you want to write data.
Example: Synchronize data from the source schemas whose names start with the prefix doc_ to the destination schemas whose names start with the prefix pre_.
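
The mapping rules above can be sketched in Python as follows. This is only an illustration of the described behavior, not the implementation used by Data Integration. The helper functions convert_name and destination_table_name are hypothetical, the doc_(.*) and pre_\1 expressions for the prefix example are an assumed way to write that rule, and the regular expression syntax accepted in the console may differ.

```python
import re
from string import Template


def convert_name(name: str, source_pattern: str, destination: str) -> str:
    """Apply a conversion rule: if the whole name matches the source
    pattern, return the destination expression with any captured groups
    filled in; otherwise return the name unchanged."""
    match = re.fullmatch(source_pattern, name)
    return match.expand(destination) if match else name


def destination_table_name(rule: str, converted_table: str,
                           converted_schema: str = "",
                           source_name: str = "") -> str:
    """Fill the built-in variables of a Rule for Destination Table Name."""
    return Template(rule).substitute(
        db_table_name_src_transed=converted_table,
        db_name_src_transed=converted_schema,
        ds_name_src=source_name,
    )


# Example 2 from the text: several source tables map to one destination table.
merged = {t: convert_name(t, r"table.*", "my_table")
          for t in ["table_01", "table_02", "table_03"]}
print(merged)

# Example 1 from the text, written with an assumed capture-group syntax.
renamed = convert_name("doc_orders", r"doc_(.*)", r"pre_\1")
print(renamed)

# Add a prefix and a suffix to the converted table name.
final_name = destination_table_name(
    "pre_${db_table_name_src_transed}_post", merged["table_01"])
print(final_name)
```

In this sketch the conversion rule runs first and the destination table name rule is applied to its result, which matches the order implied by the description of ${db_table_name_src_transed}.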

Configure the destination tables

You can configure the properties related to the destination tables. For example, you can specify the write mode, whether to write data to partitioned destination tables, and the name of the partition key column. You can choose to write data to the existing destination tables or newly created destination tables. You can also add additional fields to the destination tables and assign values to the additional fields.

Note

After you click Refresh Source Table and Destination Table Mapping, the system maps the source tables and destination tables based on the mapping rules that you configured in the Set Mapping Rules for Table/Database Names section.
The items that are required when you configure the destination tables vary based on the destination type. You can view the items that you need to configure in the DataWorks console. For more information, see Supported data source types and data synchronization solutions.

Configure rules to process DDL or DML messages and configure synchronization rules

The parameters that you must configure vary based on the synchronization task type that you selected.

Configure rules to process DDL or DML messages for a one-click real-time synchronization task
DDL or DML operations may be performed on the source. To ensure that data synchronized to the destination meets your business requirements, you can configure rules to process DDL or DML messages from the source based on the destination type. For information about how to configure rules to process DDL messages or DML messages, see Configure rules to process DDL messages or Configure rules to process DML messages.
Configure synchronization rules for a synchronization task that is used for batch synchronization of all data in a database
When you configure a synchronization task that is used for batch synchronization of all data in a database, you must configure synchronization rules for the task. For example, you must specify a filter condition to implement incremental synchronization and configure scheduling settings for the task.
- Implement incremental synchronization: You can use a filter condition to extract only incremental data from the source tables. In the Condition for Incremental Synchronization field, enter only the condition of the WHERE clause and omit the WHERE keyword. You can use built-in variables in the condition. For example, ${bdp.system.bizdate} represents the data timestamp of the task, and ${bdp.system.cyctime} represents the scheduling time of the task.
  Note
  You can use scheduling parameters to specify the source tables from which you want to read data and the destination tables to which you want to write data. For more information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization.
- Configure scheduling settings: A synchronization task that is used for batch synchronization of all data in a database must be scheduled on a regular basis. Before you run such a synchronization task to synchronize data, you must configure scheduling settings for the task such as the scheduling cycle, the effective date, and whether to pause scheduling. The scheduling settings for a synchronization task that is used for batch synchronization of all data in a database in Data Integration are the same as the scheduling settings for a synchronization node in DataStudio. For more information about the parameters that are required, see Configure time properties.
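
The following Python sketch shows how a condition that uses the built-in variables might be resolved for one scheduled run. It assumes that ${bdp.system.bizdate} resolves to the day before the scheduling time in yyyymmdd format and that ${bdp.system.cyctime} resolves to the scheduling time in yyyymmddhh24miss format. The resolve_condition helper, the orders table, and the update_day column are hypothetical.

```python
from datetime import datetime, timedelta


def resolve_condition(condition: str, scheduled_at: datetime) -> str:
    """Replace the two built-in variables in a filter condition.

    Assumed formats: bizdate is the day before the scheduling time
    (yyyymmdd); cyctime is the scheduling time (yyyymmddhh24miss).
    """
    bizdate = (scheduled_at - timedelta(days=1)).strftime("%Y%m%d")
    cyctime = scheduled_at.strftime("%Y%m%d%H%M%S")
    return (condition
            .replace("${bdp.system.bizdate}", bizdate)
            .replace("${bdp.system.cyctime}", cyctime))


# Text as it would be entered in the Condition for Incremental
# Synchronization field: only the condition, without the WHERE keyword.
condition = "update_day = '${bdp.system.bizdate}'"

# Hypothetical scheduling time of one run.
scheduled_at = datetime(2024, 3, 2, 0, 30, 0)

resolved = resolve_condition(condition, scheduled_at)
print(resolved)

# Illustration only: how the resolved condition could appear in a read query.
query = f"SELECT * FROM orders WHERE {resolved}"
print(query)
```

Because the variables are resolved again for each scheduled run, the same condition selects a different slice of data every cycle, which is what makes the synchronization incremental.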

Configure resources required to run the synchronization task

When you configure a synchronization task, you must configure names or naming rules for subtasks that will be generated by the synchronization task, and specify the resource groups that you want to use to run the subtasks. Data Integration provides default settings for the maximum numbers of connections and parallel threads that are allowed. You can click Advanced configuration to modify the default settings to meet your business requirements.

If you create and run a one-click real-time synchronization task, the task will generate batch synchronization subtasks that are used to synchronize full data and real-time synchronization subtasks that are used to synchronize incremental data. You must configure properties related to the batch synchronization subtasks and real-time synchronization subtasks in the Run resource settings step. The properties include the names or naming rules of the real-time synchronization subtasks and batch synchronization subtasks, the exclusive resource groups for Data Integration used to run the real-time synchronization subtasks and batch synchronization subtasks, and the exclusive resource group for scheduling used to run the batch synchronization subtasks.
If you create and run a synchronization task that is used for batch synchronization of all data in a database, you must specify a naming rule for the batch synchronization subtasks that will be generated by the task, and select the exclusive resource group for Data Integration and exclusive resource group for scheduling used to run the batch synchronization subtasks.

Note

DataWorks uses resource groups for scheduling to issue the generated batch synchronization subtasks to resource groups for Data Integration and runs the subtasks on the resource groups for Data Integration. Therefore, a batch synchronization subtask also occupies the resources of a resource group for scheduling. You are charged fees for using the exclusive resource group for scheduling to schedule the batch synchronization subtasks. For information about the task issuing mechanism, see Mechanism for issuing nodes.
We recommend that you use different resource groups to run the generated batch synchronization subtasks and real-time synchronization subtasks. If you use the same resource group to run both types of subtasks, the subtasks compete for CPU, memory, and network resources and affect each other. In this case, the batch synchronization subtasks may slow down, or the real-time synchronization subtasks may be delayed. If resources are insufficient, the subtasks may even be terminated by the out of memory (OOM) killer.

What to do next

After a synchronization task is configured, you can manage the task. For example, you can add source tables to or remove source tables from the task, configure alerting and monitoring settings for the subtasks that are generated by the task, and view information about the running of the subtasks. For more information, see Perform O&M on a full and incremental synchronization task.