After you prepare data sources, network environments, and resources, you can select a data synchronization solution type based on your business requirements, and create and run a data synchronization solution to synchronize data between the data sources. This topic describes the common procedure that is required to configure a data synchronization solution. The detailed procedure varies based on the data synchronization solution type that you select. You can view the configuration details for each type of data synchronization solution in the DataWorks console.

Prerequisites

  1. The required data sources are configured. Before you configure a data synchronization solution, you must configure the data sources from which you want to read data and to which you want to write data. This way, you can select the data sources when you configure a data synchronization solution. For information about the data source types that support the solution-based synchronization feature and the configuration of a data source, see Supported data source types and read and write operations.
    Note For information about the items that you need to understand before you configure a data source, see Overview.
  2. An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
  3. Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
  4. The data source environments are prepared. Before you configure a data synchronization solution, you must create an account that can be used to access a database and grant the account the permissions required to perform specific operations on the database based on your configurations for data synchronization. For more information, see Overview.

Background information

Data Integration provides various types of data synchronization solutions that you can use to synchronize data among data sources. The following types are available:
  • Batch synchronization of all data in a database: one-time full synchronization, periodical full synchronization, one-time full synchronization and periodical incremental synchronization, one-time incremental synchronization, and periodical incremental synchronization.
  • One-click real-time synchronization: one-time full synchronization and real-time incremental synchronization.

Different sources and destinations support different types of data synchronization solutions. You can view the data synchronization solution types that are supported for each type of data source in the DataWorks console. For information about the capabilities provided by the solution-based synchronization feature, see Overview of the solution-based synchronization feature.

Limits

  • Data synchronization across time zones

    You cannot use a data synchronization solution to synchronize data across time zones. If a data source that is used for a data synchronization solution resides in a different time zone from the resource group that you use, errors occur in fields of date or time data types during data synchronization.

  • Number of databases from which data can be synchronized
    • If you run a data synchronization solution that is used for batch synchronization of all data in a database, the solution can read data only from the default database in the source that is used for the solution.
    • If you run a one-click real-time synchronization solution to synchronize data from an ApsaraDB RDS instance, you can synchronize data in all databases on which the account configured for the data source has permissions.
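
One way the time zone limit above can surface is sketched below in Python. The sketch assumes a hypothetical source that stores zone-less timestamps in UTC+8 and a resource group that interprets the same values as UTC. The offsets and the sample value are illustrative only and are not taken from DataWorks.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical offsets, chosen only for illustration.
SOURCE_TZ = timezone(timedelta(hours=8))   # time zone of the data source
RESOURCE_GROUP_TZ = timezone.utc           # time zone of the resource group


def read_as(wall_clock: str, tz: timezone) -> datetime:
    """Interpret a zone-less timestamp string in the given time zone."""
    naive = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz)


stored = "2024-03-01 00:30:00"
intended = read_as(stored, SOURCE_TZ)          # what the source means
misread = read_as(stored, RESOURCE_GROUP_TZ)   # what the reader assumes

# The two interpretations are different instants in time.
drift = misread - intended

# Seen from the source time zone, the misread value no longer says 00:30.
misread_in_source_tz = misread.astimezone(SOURCE_TZ)

print("drift:", drift)
print("misread value in source time zone:", misread_in_source_tz)
```

The same stored string maps to two different instants, which is why keeping the data sources and the resource group in one time zone avoids this class of problem.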

Go to the related page in Data Integration

You can create a data synchronization solution only in Data Integration.

  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces.
  3. In the top navigation bar, select the region where the workspace that you want to manage resides. Find the workspace and click Data Integration in the Actions column.

Procedure

  1. Step 1: Create a data synchronization solution
  2. Step 2: Select a data synchronization solution type
  3. Step 3: Establish network connections between the exclusive resource group for Data Integration and the data sources
  4. Step 4: Select the source databases and tables and configure mapping rules
  5. Step 5: Configure the destination tables
  6. Step 6: Configure rules to process DDL or DML messages and configure synchronization rules
  7. Step 7: Configure resources required to run the data synchronization solution
  8. Step 8: Run the data synchronization solution

Step 1: Create a data synchronization solution

You can use one of the following methods to create a data synchronization solution:
  • Method 1: Go to the homepage of Data Integration. On the homepage, click Create Data Synchronization Solution.
  • Method 2: Go to the Tasks page in Data Integration. On the Tasks page, click Create Node.

Step 2: Select a data synchronization solution type

On the Create Data Synchronization Solution page, select a source type and a destination type based on your business requirements. After you select the data source types, the system displays the supported data synchronization solution types. You can select a data synchronization solution type based on your business requirements.
Note For information about the destination types that support the solution-based synchronization feature and the available data synchronization solution types, see Supported data source types and data synchronization solutions.

Step 3: Establish network connections between the exclusive resource group for Data Integration and the data sources

Select the source, destination, and the exclusive resource group for Data Integration that you want to use. Then, establish network connections between the resource group and data sources.
Note If no data sources are available, click Create Data Source to create a data source. For more information, see Overview.

Step 4: Select the source databases and tables and configure mapping rules

Select the source databases and tables. By default, data is written to the destination schema and table that are named the same as the source database and table. If no such destination schema or table exists, the system automatically creates the schema or table in the destination. You can configure a mapping rule to specify the name of the destination schema or table to which you want to write data in the Set Mapping Rules for Table/Database Names section. You can configure a mapping rule to synchronize data from multiple tables in the source to the same table in the destination. You can also configure a mapping rule to synchronize data from source tables whose names start with a specified prefix to the destination tables whose names start with another specified prefix.
  • Conversion Rule for Table Name: This type of mapping rule allows you to use a regular expression to match the names of source tables and map them to the names of the destination tables to which you want to write data.
    • Example 1: Synchronize data from the source tables whose names start with the prefix doc_ to the destination tables whose names start with the prefix pre_.
    • Example 2: Synchronize data from multiple source tables to the same destination table.
      To synchronize data from table_01, table_02, and table_03 to my_table, configure a mapping rule of the Conversion Rule for Table Name type, set Source to table.*, and set Target to my_table.
  • Rule for Destination Table name: This type of mapping rule allows you to use a built-in variable to specify the names of the destination tables to which you want to write data and add a prefix and a suffix to the names of the destination tables. The following built-in variables are supported:
    • ${db_table_name_src_transed}: the name of the destination table that is mapped based on a mapping rule of the Conversion Rule for Table Name type
    • ${db_name_src_transed}: the name of the destination schema that is mapped based on a mapping rule of the Rule for Conversion Between Source Database Name and Destination Schema Name type
    • ${ds_name_src}: the name of the source

    For example, you can configure pre_${db_table_name_src_transed}_post to convert the table name my_table that is generated in the previous example to pre_my_table_post.

  • Rule for Conversion Between Source Database Name and Destination Schema Name: This type of mapping rule allows you to use a regular expression to specify the names of the destination schemas to which you want to write data.
    Example: Synchronize data from the source schemas whose names start with the prefix doc_ to the destination schemas whose names start with the prefix pre_.
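
The mapping rules above can be approximated with a short Python sketch. It is not the implementation used by Data Integration. The function names, the ^doc_ pattern, and the sample table, schema, and source names are hypothetical. The sketch also assumes that the Source pattern behaves like a regular-expression substitution, which may differ from the behavior of the console.

```python
import re
from string import Template


def convert_table_name(name: str, source: str, target: str) -> str:
    """Mimic a Conversion Rule for Table Name mapping.

    Assumption: the part of the name matched by the Source pattern is
    replaced with Target, as re.sub does.
    """
    return re.sub(source, target, name, count=1)


def destination_table_name(rule: str, table: str, schema: str, source: str) -> str:
    """Expand the built-in variables in a Rule for Destination Table name."""
    return Template(rule).substitute(
        db_table_name_src_transed=table,   # table name after conversion
        db_name_src_transed=schema,        # schema name after conversion
        ds_name_src=source,                # name of the source
    )


# Example 1: replace the prefix doc_ with the prefix pre_.
renamed = convert_table_name("doc_orders", r"^doc_", "pre_")

# Example 2: map several source tables to one destination table.
merged = [
    convert_table_name(t, r"table.*", "my_table")
    for t in ("table_01", "table_02", "table_03")
]

# Add a prefix and a suffix by using a built-in variable.
final_name = destination_table_name(
    "pre_${db_table_name_src_transed}_post",
    table=merged[0],
    schema="my_schema",
    source="my_source",
)

print(renamed, merged, final_name)
```

In this sketch, the conversion rule runs first and the destination table rule then decorates the converted name, which matches the order implied by the pre_my_table_post example.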

Step 5: Configure the destination tables

Specify the attributes related to the destination tables. For example, you can specify the write mode, whether to write data to partitioned destination tables, and the name of the partition key column. You can choose to write data to the existing destination tables or newly created destination tables. You can also add additional fields to the destination tables and assign values to the additional fields.
Note
  • After you click Refresh Source Table and Destination Table Mapping, the system maps the source tables and destination tables based on the mapping rules that you configured in the Set Mapping Rules for Table/Database Names section.
  • The items that you need to configure in this step vary based on the destination type. You can view the items that you need to configure in the DataWorks console. For more information, see Supported data source types and data synchronization solutions.

Step 6: Configure rules to process DDL or DML messages and configure synchronization rules

The items that you need to configure in this step vary based on the data synchronization solution type that you selected.
  • Configure rules to process DDL or DML messages for a one-click real-time synchronization solution

    DDL or DML operations may be performed on the source. To ensure that data synchronized to the destination meets your business requirements, you can configure rules to process DDL or DML messages from the source based on the destination type. For information about how to configure rules to process DDL messages or DML messages, see Synchronize incremental data in a database in real time or Synchronize data to Hologres in real time.

  • Configure synchronization rules for a data synchronization solution that is used for batch synchronization of all data in a database
    When you configure a data synchronization solution that is used for batch synchronization of all data in a database, you must configure synchronization rules for the solution. For example, you must specify a filter condition to implement incremental synchronization and configure scheduling settings for the solution.
    • Implement incremental synchronization: You can specify a filter condition to extract only incremental data from the source tables. In the Condition for Incremental Synchronization field, enter only the condition of the WHERE clause and omit the WHERE keyword. You can use built-in variables in the condition. For example, you can use ${bdp.system.bizdate} to represent the data timestamp of the solution or use ${bdp.system.cyctime} to represent the scheduling time of the solution.
      Note You can use scheduling parameters to specify the scope of the data that you want to synchronize and the location to which you want to write the data. For more information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization.
    • Configure scheduling settings: A data synchronization solution that is used for batch synchronization of all data in a database must be scheduled on a regular basis. Before you run such a data synchronization solution to synchronize data, you must configure scheduling settings for the solution, such as Recurrence, Run At, and Pause Scheduling. The scheduling settings for a data synchronization solution that is used for batch synchronization of all data in a database are the same as the scheduling settings for a data synchronization node. For more information, see Configure time properties.
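
The incremental filter described above can be pictured with the following Python sketch. It only illustrates how a condition without the WHERE keyword might be expanded into a query. The function names, the orders table, and the ds column are hypothetical. The sketch also assumes that ${bdp.system.bizdate} resolves to the day before the scheduled time in yyyymmdd format and that ${bdp.system.cyctime} resolves to the scheduled time in yyyymmddhh24miss format. The query that Data Integration actually generates is not described in this topic.

```python
from datetime import datetime, timedelta


def render_condition(condition: str, scheduled_at: datetime) -> str:
    """Replace the built-in variables in a filter condition."""
    # Assumption: bizdate is the day before the scheduled time.
    bizdate = (scheduled_at - timedelta(days=1)).strftime("%Y%m%d")
    cyctime = scheduled_at.strftime("%Y%m%d%H%M%S")
    return (
        condition
        .replace("${bdp.system.bizdate}", bizdate)
        .replace("${bdp.system.cyctime}", cyctime)
    )


def build_query(table: str, condition: str, scheduled_at: datetime) -> str:
    """Prepend the WHERE keyword, which the user does not enter."""
    return f"SELECT * FROM {table} WHERE {render_condition(condition, scheduled_at)}"


# The condition is entered without the WHERE keyword.
query = build_query(
    "orders",
    "ds = '${bdp.system.bizdate}'",
    datetime(2024, 3, 2, 0, 30, 0),
)
print(query)
```

Because the variables are resolved on each scheduled run, the same condition selects a different slice of data every cycle.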

Step 7: Configure resources required to run the data synchronization solution

In this step, specify names for the nodes that will be generated by the data synchronization solution, and select the exclusive resource groups for Data Integration and the exclusive resource group for scheduling that you want to use to run the nodes. Data Integration provides default settings for the maximum number of connections and the maximum number of parallel threads. You can click Advanced Configuration to modify the default settings to meet your business requirements.
  • If you create and run a one-click real-time synchronization solution, the solution will generate batch synchronization nodes that are used to synchronize full data and a real-time synchronization node that is used to synchronize incremental data. You must configure attributes related to the batch synchronization nodes and the real-time synchronization node in the Configure Resources step. The attributes include the name of the real-time synchronization node and naming formats of the batch synchronization nodes, the exclusive resource groups for Data Integration used to run the real-time synchronization node and batch synchronization nodes, and the exclusive resource group for scheduling used to run the batch synchronization nodes.
  • If you create and run a data synchronization solution that is used for batch synchronization of all data in a database, you must specify a naming format for the batch synchronization nodes that will be generated by the solution, and select the exclusive resource group for Data Integration and exclusive resource group for scheduling used to run the batch synchronization nodes.
Note
  • DataWorks uses resource groups for scheduling to issue batch synchronization nodes to resource groups for Data Integration and runs the nodes on the resource groups for Data Integration. Therefore, a batch synchronization node also occupies the resources of a resource group for scheduling. You are charged fees for using the resource group for scheduling to schedule the batch synchronization nodes. For information about the node issuing mechanism, see Mechanism for issuing nodes.
  • We recommend that you use different resource groups to run batch and real-time synchronization nodes. If you use the same resource group to run both types of nodes, the nodes compete for CPU, memory, and network resources and affect each other. In this case, the batch synchronization nodes may slow down, or the real-time synchronization node may be delayed. Out of memory (OOM) errors may also occur due to insufficient resources.

Step 8: Run the data synchronization solution

Run the created data synchronization solution and view the execution details of the solution.

  1. Go to the Tasks page in Data Integration and find the created data synchronization solution.
  2. Click Submit and Run in the Actions column to run the data synchronization solution.
  3. Click Execution details in the Actions column to view the execution details of the data synchronization solution.

What to do next

After a data synchronization solution is configured, you can manage the solution. For example, you can add tables to or remove tables from the solution, configure alerting and monitoring settings for the nodes that are generated by the solution, and view the running status of the nodes. For more information, see Perform O&M for a data synchronization solution.