DataWorks: Create a real-time synchronization solution to synchronize data to DataHub

Last Updated: Dec 19, 2023

You can create a real-time synchronization solution to synchronize full and incremental data to DataHub. This topic describes how to create such a solution.

Prerequisites

  1. The required data sources are configured. Before you configure a data synchronization solution, configure both the data sources from which you want to read data and the data sources to which you want to write data so that you can select them during configuration. For information about the data source types that support the solution-based synchronization feature and the configuration of a data source, see Supported data source types and read and write operations.
    Note For information about the items that you need to understand before you configure a data source, see Overview.
  2. An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
  3. Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
  4. The data source environments are prepared. Before you configure a data synchronization solution, you must create an account that can access the database and grant the account the permissions that your data synchronization configuration requires. For more information, see Overview.

Background information

Item

Description

Number of tables from which you can read data

  • You can read data from multiple source tables and write the data to multiple destination topics.

  • You can configure mapping rules for the source and destination. This way, you can read data from multiple tables and write the data to the same destination topic.

Nodes

A real-time synchronization solution generates batch synchronization nodes to synchronize full data and real-time synchronization nodes to synchronize incremental data. The number of batch synchronization nodes that the solution generates varies based on the number of tables from which you read data.

Data write

After you run a real-time synchronization solution, full data in the source is written to the destination by using batch synchronization nodes. Then, incremental data in the source is written to the destination by using real-time synchronization nodes. When you write data to DataHub, take note of the following points:

  • You can write data only to topics of the TUPLE type. For more information about the data types that are supported by a TUPLE topic, see Data types.

  • When you use a real-time synchronization solution to synchronize data to DataHub, five additional fields are added to the destination topic by default. You can also add other fields to the destination topic based on your business requirements. For more information about the DataHub message formats, see Appendix: DataHub message formats.

Procedure

  1. Step 1: Select a synchronization solution

  2. Step 2: Configure network connections for data synchronization

  3. Step 3: Configure the source and synchronization rules

  4. Step 4: Configure a destination topic

  5. Step 5: Configure the resources required by the synchronization solution

  6. Step 6: Run the synchronization solution

Step 1: Select a synchronization solution

Go to the Data Integration page in the DataWorks console and click Create Data Synchronization Solution. On the Create Data Synchronization Solution page, select a source and a destination for data synchronization from the drop-down lists. Then, select One-click real-time synchronization to DataHub. For more information, see Create a synchronization solution.

Step 2: Configure network connections for data synchronization

Select a source, a destination, and a resource group that is used to run nodes. Test the network connectivity to make sure that the resource group is connected to the source and destination. For more information, see Configure network connections for data synchronization.

Step 3: Configure the source and synchronization rules

  1. In the Basic Configuration section, configure the parameters, such as the Solution Name and Location parameters, based on your business requirements.
  2. In the Data Source section, confirm the information about the source.
  3. In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the add icon to move the tables to the Selected Source Table list.

    The Source Table list displays all tables in the source. You can select all or specific tables.

  4. In the Conversion Rule for Table Name section, click Add Rule, select a rule type, and then configure a mapping rule of the selected type.

    By default, data in a source table is written to a DataHub topic that has the same name as the source table. A mapping rule lets you change this behavior. You can specify a destination topic name to write data from multiple source tables to the same DataHub topic. You can replace a prefix so that data in source tables whose names start with a specific prefix is written to topics whose names match the source table names except for the prefix. You can use regular expressions to convert the names of the destination topics, and you can use built-in variables to add prefixes and suffixes to the names of destination topics. For more information, see Configure the source and synchronization rules.
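The effect of table name conversion rules can be illustrated with a small Python sketch. It is not DataWorks code: the `Rule` class, the `topic_for` function, and the sample table names are hypothetical, and the actual rule syntax in the console may differ. The sketch only assumes that a rule pairs a regular expression with a replacement, that the first matching rule wins, and that an unmatched table keeps its own name as the topic name (the default described above).

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical stand-in for one table name conversion rule."""
    pattern: str      # regular expression matched against the whole table name
    replacement: str  # destination topic name; may reference groups such as \1


def topic_for(table: str, rules: List[Rule]) -> str:
    """Return the destination topic name for a source table.

    The first rule whose pattern matches the whole table name is applied.
    If no rule matches, the topic keeps the source table name.
    """
    for rule in rules:
        match = re.fullmatch(rule.pattern, table)
        if match:
            # expand() substitutes group references in the replacement
            return match.expand(rule.replacement)
    return table


rules = [
    Rule(r"orders_\d+", "orders_all"),  # several tables -> one topic
    Rule(r"src_(.+)", r"ods_\1"),       # replace the src_ prefix with ods_
]

for name in ["orders_01", "orders_02", "src_users", "payments"]:
    print(name, "->", topic_for(name, rules))
```

In this sketch, `orders_01` and `orders_02` both resolve to one topic, `src_users` keeps its base name under a new prefix, and `payments` is left unchanged because no rule matches it.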

Step 4: Configure a destination topic

  1. Configure the DataHub write mode parameter.

    You can write only incremental data of the source tables to topics of the TUPLE type in DataHub in real time.

  2. Configure the Source tables without primary keys can be synchronized parameter.

    This parameter specifies whether to synchronize data in a source table that does not have a primary key to the destination DataHub topic.

  3. Configure mappings between the source table and the destination DataHub topic.

    Click Refresh source table and DataHub Topic mapping to create a destination topic based on the rules you configured in the Conversion Rule for Table Name section in Step 3. If no mapping rule is configured in Step 3, data in the source table is written to the destination topic that has the same name as the source table. If no destination topic that has the same name as the source table exists, the system automatically creates such a destination topic. You can also change the method of creating the destination topic and add additional fields to the destination topic.

    Note

    The name of the destination topic is generated based on the mapping rules that you configured in the Conversion Rule for Table Name section.

    Operation

    Description

    Select a primary key for a source table

    If Source tables without primary keys can be synchronized is not selected in the Set Destination Topic step but you need to synchronize source tables that do not have primary keys, click the Edit icon in the Synchronized Primary Key column to specify primary keys for those tables.

    Select the method of creating a destination topic

    You can set the Topic creation method parameter to Create Topic or Use Existing Topic.

    • Use Existing Topic: If you select this method, you must select the desired destination topic from the drop-down list in the DataHub Topic column.

    • Create Topic: If you select this method, the name of the topic that is automatically created appears in the DataHub Topic column.

    Edit additional fields

    You can click Edit additional fields in the Actions column to add additional fields to a destination topic and assign values to the fields. The values can be constants or variables.

    Note

    You can add additional fields only if you select Create Topic from the drop-down list in the Topic creation method column.

    Edit destination topics

    By default, topics that are created by using the Create Topic method have a lifecycle of seven days. Field type conversion may also occur: if the data types of the fields in a destination topic differ from the data types of the fields in a source table, the synchronization solution converts the source fields to data types that can be written to the destination topic. To modify the lifecycle or the mappings between the source fields and the destination fields, click the name of a topic in the DataHub Topic column.

    Note

    You can edit a destination topic only if you set Topic creation method to Create Topic.
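The decisions in this step can be summarized in a short Python sketch. It is an illustration only, not part of DataWorks: `TopicPlan`, `plan_topic`, and all parameter names are hypothetical. It assumes that a table without a primary key is held back until you either enable the option or select a key, and that an existing topic with the resolved name is reused while a missing one is created.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class TopicPlan:
    table: str
    topic: Optional[str]
    action: str  # "Use Existing Topic", "Create Topic", or "Select a primary key"


def plan_topic(table: str, topic: str, existing_topics: Set[str],
               has_primary_key: bool,
               allow_without_primary_key: bool) -> TopicPlan:
    """Decide what happens to one source table (hypothetical logic)."""
    # A table without a primary key needs the option enabled or a key selected.
    if not has_primary_key and not allow_without_primary_key:
        return TopicPlan(table, None, "Select a primary key")
    # Reuse a topic that already exists under the resolved name.
    if topic in existing_topics:
        return TopicPlan(table, topic, "Use Existing Topic")
    # Otherwise a topic with that name is created.
    return TopicPlan(table, topic, "Create Topic")


existing = {"users"}
print(plan_topic("users", "users", existing, True, False))
print(plan_topic("orders", "orders", existing, True, False))
print(plan_topic("logs", "logs", existing, False, False))
```

As stated in the notes above, only topics planned with Create Topic can receive additional fields or be edited from the DataHub Topic column.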

Step 5: Configure the resources required by the synchronization solution

After you create a synchronization solution, the synchronization solution generates batch synchronization nodes for full data synchronization and real-time synchronization nodes for incremental data synchronization. You must configure the parameters in the Configure Resources step.

You can configure the exclusive resource groups for Data Integration that you want to use to run real-time synchronization nodes and batch synchronization nodes, and the resource groups for scheduling that you want to use to run batch synchronization nodes. You can also click Advanced Configuration to configure the Number of concurrent writes on the target side and Allow Dirty Data Records parameters.

Note
  • DataWorks uses resource groups for scheduling to issue the generated batch synchronization subtasks to resource groups for Data Integration and runs the subtasks on the resource groups for Data Integration. Therefore, a batch synchronization subtask also occupies the resources of a resource group for scheduling. You are charged fees for using the exclusive resource group for scheduling to schedule the batch synchronization subtasks. For information about the task issuing mechanism, see Mechanism for issuing nodes.

  • We recommend that you use different resource groups to run the generated batch and real-time synchronization subtasks. If you use the same resource group, the two types of subtasks compete for CPU, memory, and network resources. In this case, the batch synchronization subtasks may slow down, or the real-time synchronization subtasks may be delayed. The batch or real-time synchronization subtasks may even be terminated by the out of memory (OOM) killer due to insufficient resources.

Step 6: Run the synchronization solution

  1. Go to the Tasks page in Data Integration and find the created data synchronization solution.
  2. Click Submit and Run in the Actions column to run the data synchronization solution.
  3. Click Execution details in the Actions column to view the execution details of the data synchronization solution.

What to do next

After a data synchronization solution is configured, you can manage the solution. For example, you can add tables to or remove tables from the solution, configure alerting and monitoring settings for the nodes that are generated by the solution, and view information about the running of the nodes. For more information, see Perform O&M on a full and incremental synchronization task.