DataWorks: Configure a synchronization task in Data Integration

Last Updated: Jan 17, 2025

After you prepare data sources, network environments, and resources, you can select a synchronization task type based on your business requirements, and then create and run a synchronization task to synchronize data between the data sources. This topic describes the general procedure for configuring a synchronization task. The detailed steps vary based on the synchronization task type that you select. You can view the configuration details for each type of synchronization task in the DataWorks console.

Prerequisites

  1. The data sources that you want to use are prepared. Before you configure a synchronization task, you must add the source from which you want to read data and the destination to which you want to write data, so that you can select them when you configure the task. For information about the supported data source types and how to add data sources, see Supported data source types and synchronization operations.

    Note

    For information about the items that you need to understand before you add a data source, see Overview.

  2. The data source environments are prepared. Before you configure a synchronization task, you must create an account that can be used to access a database and grant the account the permissions required to perform specific operations on the database based on your configurations for data synchronization. For more information, see Overview.

Background information

Data Integration provides various types of synchronization tasks, including batch synchronization of all data in a database (one-time full synchronization, periodic full synchronization, one-time full synchronization and periodic incremental synchronization, one-time incremental synchronization, and periodic incremental synchronization) and real-time synchronization (one-time full synchronization and real-time incremental synchronization). Different sources and destinations support different types of synchronization tasks. You can view the synchronization task types that are supported by each type of data source in the DataWorks console. For information about the capabilities provided by synchronization tasks, see Overview of the full and incremental synchronization feature.

Limits

  • You cannot run a synchronization task to synchronize data across time zones. If the data sources used for a synchronization task reside in a different time zone from the resource group that you use, errors occur on fields of a date or time data type during data synchronization.

  • If you run a synchronization task that is used for batch synchronization of all data in a database, the task can read data only from the default database in the source that is used for the task.
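The time zone limit above can be illustrated with a minimal sketch. It is not DataWorks code. The two fixed offsets (UTC+8 for the source, UTC for the resource group) and the helper name interpret are assumptions chosen only to show why a date or time value without zone information resolves to different instants on the two sides.

```python
from datetime import datetime, timedelta, timezone

# Assumed setup for illustration only: the source writes wall-clock values
# in UTC+8, while the resource group interprets the same values as UTC.
SOURCE_TZ = timezone(timedelta(hours=8))
RESOURCE_GROUP_TZ = timezone.utc


def interpret(wall_clock: str, tz: timezone) -> datetime:
    """Attach a time zone to a zone-less 'YYYY-MM-DD HH:MM:SS' value."""
    naive = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz)


value = "2025-01-17 00:30:00"
as_written = interpret(value, SOURCE_TZ)       # what the source meant
as_read = interpret(value, RESOURCE_GROUP_TZ)  # what the reader assumes

# The two interpretations are distinct instants.
skew = as_read - as_written
print("skew between interpretations:", skew)

# Expressed in UTC, the source value even falls on a different calendar day.
print(as_written.astimezone(timezone.utc).date(), "vs", as_read.date())
```

Keeping the data sources and the resource group in the same time zone avoids this ambiguity.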

Precautions

In the following situations, you must manually specify a synchronization offset for a real-time synchronization task:

  • If you want to resume a real-time synchronization task after the task is interrupted, you must manually set the point in time at which the task was interrupted as the synchronization offset of the task. This way, after the real-time synchronization task is restarted, the task resumes synchronizing data from that point in time.

  • If data is lost or an exception occurs while a real-time synchronization task is running, you must manually reset the synchronization offset of the task to a point in time that is earlier than the point in time at which data started to be written to the destination. This helps ensure data integrity.

  • After you modify the configurations of a real-time synchronization task, such as modifying configurations related to destination tables or field mappings, you must manually specify a synchronization offset for the task to ensure the accuracy of the synchronized data.

If the system reports an error indicating that the synchronization offset is incorrect or does not exist while a synchronization task is running, you can use one of the following methods to resolve the issue:

  • Reset the synchronization offset: If the synchronization task is a real-time synchronization task, you can reset the synchronization offset for the task and select the earliest synchronization offset that is available in the source database.

  • Modify the retention period of binary logs: If the synchronization offset of the source database used in the synchronization task is expired, you can modify the retention period of binary logs for the source database. For example, you can set the retention period to seven days.

  • Synchronize data again: If data is lost during the running of the synchronization task, you can run the synchronization task to synchronize full data again or configure a batch synchronization task to manually synchronize the lost data.
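The decision logic behind these methods can be sketched as follows. This is a simplified model and not a DataWorks API: the function resolve_offset, the OffsetDecision type, and the idea that the earliest available offset equals the current time minus the log retention period are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class OffsetDecision(NamedTuple):
    offset: datetime      # offset to configure for the task
    needs_backfill: bool  # True if data before `offset` must be resynchronized


def resolve_offset(desired: datetime, now: datetime,
                   retention: timedelta) -> OffsetDecision:
    """Pick a usable offset under an assumed log retention window.

    Assumption: logs older than `now - retention` have been purged, so the
    earliest offset that still exists in the source is `now - retention`.
    """
    earliest_available = now - retention
    if desired >= earliest_available:
        # The desired offset still exists; resume from it directly.
        return OffsetDecision(desired, False)
    # The desired offset has expired; fall back to the earliest available
    # offset and flag the gap for a full or batch resynchronization.
    return OffsetDecision(earliest_available, True)


now = datetime(2025, 1, 17, 12, 0, 0)
retention = timedelta(days=7)  # matches the seven-day example above

interrupted_at = datetime(2025, 1, 15, 9, 30, 0)
print(resolve_offset(interrupted_at, now, retention))

expired = datetime(2025, 1, 5, 0, 0, 0)
print(resolve_offset(expired, now, retention))
```

In this model, a longer retention period widens the window in which an interrupted task can resume without a backfill.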

Go to the Data Integration page

You can create a synchronization task in Data Integration.

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

Procedure

  1. Step 1: Create a synchronization task

  2. Step 2: Select a synchronization task type

  3. Step 3: Configure network connectivity between an exclusive resource group for Data Integration and the data sources

  4. Step 4: Configure the synchronization task

  5. Step 5: Start the synchronization task

Step 1: Create a synchronization task

You can use one of the following methods to create a synchronization task:

  • Method 1: Go to the Synchronization Task page in Data Integration. On this page, select the desired source type and destination type and click Create Synchronization Task.

  • Method 2: Go to the Synchronization Task page in Data Integration. If no synchronization tasks are displayed in the Tasks section of this page, click Create.

image

Step 2: Select a synchronization task type

On the Create Data Synchronization Solution page, select the source type and destination type based on your business requirements. After you select the data source types, the system displays supported synchronization task types in the Synchronization Method drop-down list. You can select a synchronization task type based on your business requirements.

Note

For information about the supported data source types and synchronization task types, see Supported data source types and data synchronization solutions.

Step 3: Configure network connectivity between an exclusive resource group for Data Integration and the data sources

Select the source, destination, and a resource group that you want to use. Then, test network connectivity between the resource group and data sources.

Note
  • If no data sources are available, you can click Add Data Source to add a data source. For more information about data sources, see Overview.

  • If you use a serverless resource group to run a synchronization task, you can specify an upper limit for the number of CUs that can be used to run the synchronization task. If an out of memory (OOM) error is reported for the synchronization task due to insufficient resources, you can increase the upper limit as needed.

Step 4: Configure the synchronization task

Click Next and configure the synchronization task based on the instructions that are displayed.

Step 5: Start the synchronization task

Run the created synchronization task and view the running details of the task.

  1. Go to the Tasks section of the Data Integration page and find the newly created synchronization task.

  2. Click Start in the Actions column to start the synchronization task.

  3. Click the blank area next to each stage displayed in the Execution Overview column to view the running details of the synchronization task.

Appendix: Advanced settings

Select source databases and tables and configure mapping rules

After you select a source database and a table, data is written to the destination schema and table that are named the same as the source database and table by default. If no such destination schema or table exists, the system automatically creates the schema or table in the destination. You can configure the Customize Mapping Rules for Destination Schema Names or Customize Mapping Rules for Destination Table Names parameter to specify the name of the destination schema or table to which you want to write data. You can specify a destination table name in a mapping rule to write data in multiple source tables to the same table. You can also specify prefixes in a mapping rule to write data to a database whose name starts with a different prefix from the source database or to tables whose names start with a different prefix from the source tables.

Note

When you specify a destination schema or table name, you must follow the naming conventions and must not use periods (.), so that the system can correctly identify and parse the specified name.

Customize Mapping Rules for Destination Schema Names

  • Replace the prefix for the name of a source database with a string: You can use a regular expression to specify the names of the destination schemas to which you want to write data synchronized from specific source databases or schemas.

    Example: Synchronize data from the source databases whose names start with the prefix doc_ to the destination schemas whose names start with the prefix pre_.

    image

  • Generate the name of a destination schema: You can concatenate built-in variables and a string to specify the names of the destination schemas to which you want to write data.

    Example: Append a string to the names of the destination schemas that are obtained in the previous example. Use Source database name to represent the processing result obtained in the previous example and add a suffix to the processing result, such as Source database name_d.

    image
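The two schema-name rules above can be sketched in a few lines. This is only an illustration of the naming logic, not how DataWorks implements it. The function map_schema_name and the sample database names are hypothetical; the doc_ and pre_ prefixes and the _d suffix come from the examples above.

```python
import re


def map_schema_name(source_db: str, suffix: str = "_d") -> str:
    """Derive a destination schema name from a source database name."""
    # Rule 1: replace a leading "doc_" prefix with "pre_".
    replaced = re.sub(r"^doc_", "pre_", source_db)
    # Rule 2: treat the result as "Source database name" and append a suffix.
    name = f"{replaced}{suffix}"
    # Periods are not allowed in destination schema or table names.
    if "." in name:
        raise ValueError(f"destination name must not contain periods: {name!r}")
    return name


for db in ("doc_sales", "doc_hr", "inventory"):
    print(db, "->", map_schema_name(db))
```

In this sketch, a database whose name does not start with doc_ keeps its name and only receives the suffix.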

Customize Mapping Rules for Destination Table Names

  • Replace the prefix for the name of a source table with a string: You can use a regular expression to specify the names of the destination tables to which you want to write data synchronized from specific source tables.

    • Example 1: Synchronize data from the source tables whose names start with the prefix doc_ to the destination tables whose names start with the prefix pre_.

      image

    • Example 2: Synchronize data from multiple source tables to the same destination table.

      To synchronize data from table_01, table_02, and table_03 to my_table, use a regular expression to configure the mapping rule: set Source to table_* and Destination to my_table.

      image

  • Generate the name of a destination table: You can concatenate built-in variables and a string to specify the names of destination tables to which you want to write data.

    You can configure a string replacement rule separately on the Source Datasource Name, Source Database Name, and Source Table Name tabs in the Edit Built-in Variable section, and use the built-in variables specified in the rules when you configure the Destination Table Name parameter.

    Example: Concatenate strings to the names of destination tables that are obtained in Example 2. Use Source table name to represent my_table, which is the processing result obtained in Example 2. Then, add a prefix and a suffix to the processing result. For example, pre_Source table name_post can be mapped to the destination table name pre_my_table_post.

    image
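The table-name rules above (many-to-one mapping followed by a prefix and suffix) can be sketched as follows. The function map_table_name and the template placeholder {name} are hypothetical. The sketch also assumes the pattern table_.* with a full match; how the console itself interprets the value table_* is not specified in this topic.

```python
import re

# Assumption: "table_" followed by any characters, matched against the
# whole table name. This stands in for the table_* rule in Example 2.
SOURCE_PATTERN = re.compile(r"table_.*")


def map_table_name(source_table: str,
                   template: str = "pre_{name}_post") -> str:
    """Map a source table to a destination table name."""
    # Step 1: tables that match the pattern are merged into my_table.
    if SOURCE_PATTERN.fullmatch(source_table):
        name = "my_table"
    else:
        name = source_table
    # Step 2: {name} plays the role of the "Source table name" variable;
    # the template adds the prefix and suffix around it.
    return template.format(name=name)


for table in ("table_01", "table_02", "table_03", "orders"):
    print(table, "->", map_table_name(table))
```

Because all three source tables resolve to the same destination name, their data is written to one table, which is the behavior that Example 2 describes.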

Configure the destination tables

You can define the properties of destination tables. For example, you can specify whether to write data to an existing table or a new table, and the fields, field descriptions, partition field, and data lifecycle of a destination table.

image

Note
  • After you configure the properties of destination tables and click Apply and Refresh Mapping, source tables are automatically mapped to destination tables based on the table rules that you configure.

  • The items that are required when you configure the destination tables vary based on the destination type. You can view the items that you need to configure in the DataWorks console. For more information, see Supported data source types and data synchronization solutions.

Configure rules to process DDL or DML messages and configure synchronization rules

The parameters that you must configure vary based on the synchronization task type that you selected.

  • Configure rules to process DDL or DML messages for a real-time synchronization task

    The binary logs of relational databases may contain DDL statements. When you configure a real-time synchronization task to synchronize data from a relational database, you can click Configure DDL Capability in the upper-right corner of the configuration page of the real-time synchronization task to configure rules to process the related DDL messages. Support for synchronizing data changes generated by DDL operations varies based on the destination type. For more information, see Supported DML and DDL operations.

    You can also configure processing rules for a specific destination type. To do so, perform the following steps: In the left-side navigation pane of the Data Integration page, choose Configuration Options > Processing Policy for DDL Messages in Real-time Sync. On the Processing Policy for DDL Messages in Real-time Sync page, configure DDL processing rules.

    If no DDL processing rules are configured when you configure a real-time synchronization task, the DDL processing rules that are configured on the Processing Policy for DDL Messages in Real-time Sync page are used by default.

  • Configure synchronization rules for a synchronization task that is used for batch synchronization of all data in a database

    When you configure a synchronization task that is used for batch synchronization of all data in a database, you must configure synchronization rules for the task. For example, you must specify a filter condition to implement incremental synchronization and configure scheduling settings for the task.

    • Implement incremental synchronization: You can use a WHERE clause to extract only incremental data from the source tables. In the Condition for Incremental Synchronization field, enter only the filter condition and do not include the WHERE keyword. You can use built-in variables in the condition. For example, you can use ${bdp.system.bizdate} to represent the data timestamp of the task or use ${bdp.system.cyctime} to represent the scheduling time of the task.

      image

      Note

      You can use scheduling parameters to specify the source tables from which you want to read data and the destination tables to which you want to write data. For more information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization.

    • Configure scheduling settings: A synchronization task that is used for batch synchronization of all data in a database must be scheduled on a regular basis. Before you run such a task, you must configure its scheduling settings, such as the scheduling cycle, the effective date, and whether to pause scheduling. These settings are the same as the scheduling settings for a synchronization node in DataStudio. For more information about the required parameters, see Configure time properties.

      image
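To make the incremental synchronization condition described above more concrete, the following sketch shows how a condition that contains built-in variables might expand at run time. It is a local simulation, not DataWorks code. The function render_condition, the column update_date, and the table orders are hypothetical, and the sketch assumes that ${bdp.system.bizdate} resolves to the day before the scheduled time in yyyymmdd format and that ${bdp.system.cyctime} resolves to the scheduled time in yyyymmddhh24miss format.

```python
from datetime import datetime, timedelta


def render_condition(condition: str, scheduled_time: datetime) -> str:
    """Substitute the two built-in variables in a filter condition."""
    # Assumed semantics: bizdate is the day before the scheduled time.
    bizdate = (scheduled_time - timedelta(days=1)).strftime("%Y%m%d")
    # Assumed semantics: cyctime is the scheduled time itself.
    cyctime = scheduled_time.strftime("%Y%m%d%H%M%S")
    return (condition
            .replace("${bdp.system.bizdate}", bizdate)
            .replace("${bdp.system.cyctime}", cyctime))


# The condition is entered without the WHERE keyword.
condition = "update_date = '${bdp.system.bizdate}'"
scheduled_time = datetime(2025, 1, 17, 0, 30, 0)

rendered = render_condition(condition, scheduled_time)
print(rendered)

# Illustration of where the condition would sit in a read query.
print(f"SELECT * FROM orders WHERE {rendered}")
```

Each scheduled run then reads only the rows that match the data timestamp of that run, which is what makes the periodic synchronization incremental.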

Configure resources and advanced parameters for the synchronization task

You can perform the following operations on the synchronization task:

  • Click Configure Resource Group in the upper-right corner of the configuration page of the synchronization task and select a resource group for the synchronization task.

    Note

    DataWorks uses resource groups for scheduling to issue batch synchronization tasks to resource groups for Data Integration and runs the tasks on the resource groups for Data Integration. Therefore, a batch synchronization task also occupies the resources of a resource group for scheduling.

    • If you use a serverless resource group to run the synchronization task, you do not need to distinguish between resource group types, because a serverless resource group can be used for both data synchronization and task scheduling.

    • If you use an exclusive resource group for scheduling to run the synchronization task, you are charged for scheduling instances.

    For more information, see Overview of DataWorks resource groups.

  • Click Configure Advanced Parameters in the upper-right corner of the configuration page of the synchronization task, and configure items such as the maximum number of connections that are allowed for the source database and the number of parallel threads.

    Note

    The advanced parameters that you can configure vary based on the data source type. You can view the advanced parameters that you must configure for different types of data sources in the DataWorks console.

What to do next

After a synchronization task is configured, you can manage the task. For example, you can add source tables to or remove source tables from the task, configure alerting and monitoring settings for the task, and view information about the running of the subtasks. For more information, see Perform O&M on a full and incremental synchronization task.