After you configure data sources, network environments, and resource groups, you can create and run a real-time sync solution to synchronize all data in a database. This topic describes how to create a real-time sync solution to synchronize data in some or all tables in a database to DataHub in batch mode and then synchronize incremental data in the database to DataHub in real time. This topic also describes how to view the statuses of the nodes generated by the real-time synchronization solution.

Background information

DataWorks provides a sync solution that can be used to synchronize all data in a database to DataHub in real time. The synchronization solution synchronizes all data in the database to DataHub in batch mode and then synchronizes incremental data in the database to DataHub in real time. You can view the details of the sync solution, the statuses of the nodes generated by the solution, and data updates in the database in real time. This facilitates subsequent data searches, analysis, and development.
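The two-phase behavior described above (a full copy in batch mode, followed by incremental changes in real time) can be pictured with a minimal sketch. This is not DataWorks or DataHub code. The functions `full_sync` and `incremental_sync`, the record layout, and the plain list that stands in for a destination topic are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
Record = Dict[str, object]


def full_sync(source_rows: Iterable[Row], topic: List[Record]) -> None:
    """Phase 1 (batch): write every row that already exists in the source."""
    for row in source_rows:
        topic.append({"phase": "full", "op": "snapshot", "row": dict(row)})


def incremental_sync(changes: Iterable[Tuple[str, Row]], topic: List[Record]) -> None:
    """Phase 2 (real time): write each change in the order it arrives."""
    for op, row in changes:
        topic.append({"phase": "incremental", "op": op, "row": dict(row)})


# The list below is an assumed, in-memory stand-in for a destination topic.
topic: List[Record] = []
full_sync([{"id": 1}, {"id": 2}], topic)
incremental_sync([("insert", {"id": 3}), ("update", {"id": 1})], topic)
```

In this sketch the destination is append-only, so the full copy always precedes the incremental records.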

Real-time sync solutions that are used to synchronize all data in a database provide the following benefits:
  • Synchronizes the full data of a database.

    You do not need to create multiple batch data synchronization nodes to synchronize source tables one by one. Instead, you can create a single sync solution to synchronize some or all of the tables in a database at a time.

  • You can configure synchronization rules in a flexible manner.
    • You can configure synchronization rules for different data definition language (DDL) messages based on your business requirements. For example, if you select Ignore for DDL messages that drop a table, the system ignores such a message when it is received from the source and does not drop the corresponding table in the destination.
    • You can add or remove source tables for a sync solution that is running.
    • You can configure synchronization rules for destination DataHub topics to determine whether to synchronize the incremental data in source tables to destination DataHub topics based on your business requirements. After the incremental data is synchronized, the incremental data can be searched in destination DataHub topics.
  • Requires only simple configurations.

    You do not need to perform complex operations, such as creating synchronization nodes, databases, and tables, configuring dependencies for nodes, and configuring mappings between sources and destinations. Instead, you need only to configure the sync solution in a configuration wizard.

  • Large amounts of data can be updated in real time. This improves the efficiency of automated O&M.
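The Ignore rule for DDL messages mentioned in the list above can be illustrated with a small sketch. Only the option name Ignore comes from this topic. The `handle_ddl` function, the message layout, the message type names, and the `PROCESS` placeholder for any handling other than Ignore are assumptions made for illustration and do not reflect the actual DataWorks implementation.

```python
from typing import Dict, Set

IGNORE = "Ignore"    # option named in this topic
PROCESS = "Process"  # hypothetical placeholder for any handling other than Ignore


def handle_ddl(message: Dict[str, str], rules: Dict[str, str], destination: Set[str]) -> bool:
    """Apply one DDL message to a set of destination names.

    Returns False if the message is ignored, True if it is processed.
    """
    policy = rules.get(message["type"], PROCESS)  # assumed default
    if policy == IGNORE:
        return False  # the destination is left untouched
    if message["type"] == "drop_table":
        destination.discard(message["table"])
    elif message["type"] == "create_table":
        destination.add(message["table"])
    return True


destination = {"orders"}
rules = {"drop_table": IGNORE}
handle_ddl({"type": "drop_table", "table": "orders"}, rules, destination)
# "orders" is still present because the drop message was ignored.
```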

Scenarios

If you want the system to monitor data updates in business databases in real time, you can use real-time sync solutions to synchronize all data in the databases. This way, upper-layer applications can search for, analyze, and develop data in real time.

Limits

  • A real-time sync solution that is used to synchronize all data in a database can synchronize data only from MySQL, PolarDB, or Oracle to DataHub.

  • A real-time sync solution that is used to synchronize all data in a database can be run only on exclusive resource groups.
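The two limits above can be expressed as a simple pre-check. This is an illustrative sketch only. The `check_limits` function and the string label `"exclusive"` for the resource group type are hypothetical, while the supported source types and the destination come from the limits listed above.

```python
from typing import List

SUPPORTED_SOURCES = {"MySQL", "PolarDB", "Oracle"}  # source types listed above
SUPPORTED_DESTINATION = "DataHub"


def check_limits(source_type: str, destination_type: str, resource_group_kind: str) -> List[str]:
    """Return the violated limits; an empty list means the setup is allowed."""
    problems = []
    if source_type not in SUPPORTED_SOURCES:
        problems.append(f"unsupported source type: {source_type}")
    if destination_type != SUPPORTED_DESTINATION:
        problems.append(f"unsupported destination type: {destination_type}")
    if resource_group_kind != "exclusive":  # hypothetical label for the group type
        problems.append("an exclusive resource group is required")
    return problems


print(check_limits("MySQL", "DataHub", "exclusive"))
print(check_limits("PostgreSQL", "DataHub", "shared"))
```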

Create a real-time sync solution to synchronize all data in a database

  1. Go to the Data Integration page and choose Sync Solutions > Nodes to go to the Task list page.
    For more information, see Select a data synchronization solution.
  2. On the Task list page, click New task in the upper-right corner.
  3. On the Create Data Synchronization Solution page, click One-click real-time synchronization to DataHub.
  4. In the Set synchronization sources and rules step, configure basic information such as the solution name for the data synchronization solution.
    In the Basic configuration section, configure the following parameters:
    • Scheme name: the name of the data synchronization solution. The name can be a maximum of 50 characters in length.
    • Description: the description of the data synchronization solution. The description can be a maximum of 50 characters in length.
    • Destination task storage location: the Automatically establish workflow check box is selected by default. This indicates that DataWorks automatically creates a workflow named in the format of clone_database_Source data source name+to+Destination data source name in the Data Integration directory. All synchronization nodes generated by the data synchronization solution are placed in the directory of this workflow.

    If you clear the Automatically establish workflow check box, select a directory from the Select Location drop-down list. All synchronization nodes generated by the data synchronization solution are placed in the specified directory.

  5. Select a source and configure synchronization rules.
    1. In the Data Source section, specify the Type and Data source parameters.
      Note

      A real-time sync solution that is used to synchronize all data in a database can synchronize data only from MySQL, PolarDB, or Oracle to DataHub.

    2. In the Source Table section, select the tables whose data you want to synchronize from the Source Table list. Then, click the Move icon to move the tables to the Selected Source Table list.

      The Source Table list displays all tables in the selected source. You can choose to synchronize data in some or all tables in the source.

    3. In the Conversion Rule for Table Name section, click Add rule to select a rule.
      Supported options include Conversion Rule for Table Name and Rule for Destination Topic.
      • Conversion Rule for Table Name: the rule for converting the names of source tables to those of destination topics.
      • Rule for Destination Topic: the rule for adding prefixes and suffixes to destination topics.
    4. Click Next Step.
  6. Select a data source as the destination and configure the destination topics.
    1. In the Set Destination Topic step, select the destination DataHub data source.
    2. Click Refresh source table and DataHub Topic mapping to configure the mappings between the source tables and destination DataHub topics.
    3. View the mapping progress, source tables, and mapped destination DataHub topics.
      The following information is displayed:
      • Mapping progress: the progress of mapping the source tables to destination DataHub topics.
        Note The mapping may require a long period of time if you synchronize data from a large number of tables.
      • Primary keys of the source tables:
        • If the tables in the source database contain primary keys, the system removes duplicate data based on the primary keys during the synchronization.
        • If the tables in the source database do not contain primary keys, you can click the Edit icon to customize primary keys. You can use one field or a combination of several fields as the primary keys of the tables. This way, the system removes duplicate data based on the primary keys during the synchronization.
      • Topic creation method: the method that is used to create each destination DataHub topic.
        • If you set Topic creation method to Create Topic for a destination DataHub topic, the DataHub topic is automatically created. The name of the DataHub topic is displayed in the DataHub Topic column.
        • If you set Topic creation method to Use Existing Topic for a destination DataHub topic, select the topic that you want to use from the drop-down list in the DataHub Topic column.
      If you set Topic creation method to Create Topic for a destination DataHub topic, you can click the name of the DataHub topic to modify the following configurations of the topic based on your business requirements:
      • Create Topic in Production Environment: indicates whether to create the topic in the production environment. This option is displayed for a DataWorks workspace in standard mode and is selected by default.
      • Life cycle: the lifecycle of the topic. Unit: days. Default value: 7.
      • Data field structure: the fields and their data types in the topic.
      Note If you do not change the values of the parameters related to a topic after the topic is created, the system synchronizes data based on the default values of the parameters.
    4. Click Next.
  7. Configure the resources required by the data sync solution.
    In the Set Resources for Solution Running step, set the parameters.
    • Offline Full synchronization
      • Offline task name rules: the name of the batch sync node that is used to synchronize the full data of the source. After a data sync solution is created, DataWorks first generates a batch sync node to synchronize full data, and then generates real-time sync nodes to synchronize incremental data.
      • Resource Groups for Full Batch Sync Nodes: the exclusive resource group for Data Integration that is used to run the batch sync node.

      Only exclusive resource groups for Data Integration can be used to run sync solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Full Batch Scheduling
      • Select scheduling Resource Group: the resource group for scheduling that is used to run the nodes.

      Only exclusive resource groups for Data Integration can be used to run sync solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Real-time Incremental synchronization
      • Select an exclusive resource group for real-time tasks: the exclusive resource group that is used to run the real-time sync nodes.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Channel Settings
      • Maximum number of connections supported by source read: the maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. Specify an appropriate number based on the resources of the source. Default value: 15.
  8. Click Complete Configuration. The real-time sync solution used to synchronize all data in a database is created.
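The deduplication based on primary keys that is mentioned in the mapping step of the procedure above can be sketched as follows. The `deduplicate` function and the sample field names are hypothetical. Keeping the last row seen for each key is an assumption made for this sketch, because this topic does not state which duplicate is retained.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, object]


def deduplicate(rows: Iterable[Row], key_fields: Sequence[str]) -> List[Row]:
    """Keep one row per primary-key value (assumption: the last row seen wins).

    key_fields may contain one field or a combination of several fields.
    """
    latest: Dict[Tuple[object, ...], Row] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        latest[key] = row  # a later row with the same key replaces the earlier one
    return list(latest.values())


rows = [
    {"id": 1, "region": "a", "amount": 10},
    {"id": 1, "region": "a", "amount": 20},
    {"id": 1, "region": "b", "amount": 30},
]
single_key = deduplicate(rows, ["id"])               # one row remains
composite_key = deduplicate(rows, ["id", "region"])  # two rows remain
```

The second call shows why a composite key matters: rows that share the same `id` but differ in `region` are treated as distinct records.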

Run the real-time sync solution

On the Tasks page, find the newly created data sync solution and choose More > Submit and Run in the Operation column to run the data sync solution.

View the statuses and running results of the sync nodes

  • On the Tasks page, find the solution that is run and click Execution details in the Operation column. Then, you can view the execution details of all nodes generated by the sync solution.
  • Find a node whose execution details you want to view and click Execution details in the Status column. Then, you can click the link provided in the dialog box that appears to go to the DataStudio page.

Manage the real-time sync solution

  • View the configurations of the sync solution.

    On the Tasks page, find the newly created sync solution and choose More > View Configuration. Then, you can view the configurations of the sync solution.

  • Modify the sync solution.

    On the Tasks page, find the newly created sync solution and choose More > Modify Configuration. Then, you can modify the configurations of the sync solution.

    For a sync solution that is successfully run, you can choose More > Modify Configuration to add or remove source tables. Procedure:

    In the Source Table section of the Set Synchronization Sources and Rules step, add or remove source tables for the sync solution. Then, save the modification and run the sync solution.

  • Change the priority of the sync solution.
    Find the newly created sync solution and choose More > Change Priority in the Operation column. In the Change Priority dialog box, enter the desired priority and click Confirm. You can set the priority to an integer from 1 to 8. A larger value indicates a higher priority.
    Note If multiple sync solutions have the same priority, the system runs them in the order in which they are committed.
  • Delete the sync solution.
    Find the sync solution that you want to delete and choose More > Delete in the Operation column. In the Delete message, click OK.
    Note After you click OK, only the configuration record of the sync solution is deleted. The synchronization nodes generated by the solution and data tables generated by the synchronization nodes are not affected.
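
The priority rule described above (an integer from 1 to 8, a larger value runs first, and solutions with equal priority run in commit order) can be sketched as follows. The `run_order` function and the tuple layout are hypothetical and only illustrate the ordering. They are not part of DataWorks.

```python
from typing import List, Tuple

# (solution name, priority from 1 to 8, commit sequence number)
Solution = Tuple[str, int, int]


def run_order(solutions: List[Solution]) -> List[str]:
    """Return solution names ordered by priority (high first), then commit order."""
    for name, priority, _ in solutions:
        if not 1 <= priority <= 8:
            raise ValueError(f"priority of {name} must be an integer from 1 to 8")
    ordered = sorted(solutions, key=lambda s: (-s[1], s[2]))
    return [name for name, _, _ in ordered]


print(run_order([("a", 3, 1), ("b", 8, 2), ("c", 3, 0)]))  # ['b', 'c', 'a']
```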