After you configure the data sources, network environments, and resource groups, you can create and run a real-time synchronization solution to synchronize all data in a database. This topic describes how to create such a solution and view the running status of the nodes that it generates. The solution first synchronizes the existing data in some or all tables of the database to Elasticsearch in offline mode, based on your business requirements. It then synchronizes incremental data to Elasticsearch in real time.

Background information

You can use the fully managed hot-warm architecture provided by Elasticsearch to store the real-time data of enterprises. Based on this architecture, DataWorks provides real-time synchronization solutions that synchronize all data in a database to Elasticsearch. Such a solution first synchronizes all existing data in the database to Elasticsearch in offline mode and then synchronizes incremental data in real time. You can also view the details of the solution, the running status of the nodes generated by the solution, and data updates in the database in real time. This facilitates subsequent data searches, analysis, and development.
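
The two phases described above (an offline full synchronization followed by real-time incremental synchronization) can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: a plain dictionary stands in for an Elasticsearch index, the change events are hand-written tuples, and the names full_sync and apply_change are hypothetical. It does not show how DataWorks implements the solution.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def full_sync(rows: Iterable[Row], key: str) -> Dict[object, Row]:
    """Offline phase: copy every existing source row into the index once."""
    return {row[key]: dict(row) for row in rows}


def apply_change(index: Dict[object, Row], op: str, row: Row, key: str) -> None:
    """Real-time phase: apply a single change event to the index."""
    if op in ("insert", "update"):
        index[row[key]] = dict(row)  # upsert by primary key
    elif op == "delete":
        index.pop(row[key], None)  # a missing key is silently skipped
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical source rows and change events, for illustration only.
source_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
index = full_sync(source_rows, key="id")

changes = [
    ("update", {"id": 2, "name": "B"}),
    ("insert", {"id": 3, "name": "c"}),
    ("delete", {"id": 1}),
]
for op, row in changes:
    apply_change(index, op, row, key="id")

print(sorted(index))  # [2, 3]
```

Keying every write by the primary key means that replaying an event overwrites the same entry instead of creating a duplicate.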

Real-time synchronization solutions used to synchronize all data in a database have the following benefits:
  • Synchronization rules can be configured in a flexible manner.

    You can configure rules to process different data definition language (DDL) statements based on your business requirements. For example, if you select Ignore for the DDL message that drops a table, the system ignores the statement when it arrives from the source and does not drop the table in the destination.

  • Large amounts of data can be updated in real time. This makes automated O&M and management more efficient.
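
As an illustration of rule-based DDL handling, the following minimal Python sketch dispatches a DDL message according to the four policies that this topic describes for the DDL step (Normal, Ignore, Alert, and Error). The function name handle_ddl, the logger name, and the forward callback are hypothetical and are not part of DataWorks. The sketch only mirrors the documented behavior of each policy.

```python
import logging
from enum import Enum
from typing import Callable, Dict

logger = logging.getLogger("ddl_policy_sketch")  # hypothetical logger name


class Policy(Enum):
    NORMAL = "Normal"
    IGNORE = "Ignore"
    ALERT = "Alert"
    ERROR = "Error"


def handle_ddl(
    message_type: str,
    policies: Dict[str, Policy],
    forward: Callable[[str], None],
) -> bool:
    """Process one DDL message; return True if it was sent to the destination."""
    policy = policies[message_type]
    if policy is Policy.NORMAL:
        forward(message_type)  # the destination decides how to process it
        return True
    if policy is Policy.IGNORE:
        return False  # dropped silently
    if policy is Policy.ALERT:
        logger.warning("Ignored DDL message: %s", message_type)
        return False  # dropped, but recorded in the log
    # Policy.ERROR: stop the run
    raise RuntimeError(f"DDL message {message_type} terminated the solution")


# Example configuration (an assumption for illustration only).
policies = {
    "AddColumn": Policy.NORMAL,
    "DropTable": Policy.IGNORE,
    "RenameColumn": Policy.ALERT,
    "TruncateTable": Policy.ERROR,
}
forwarded = []
for message in ["AddColumn", "DropTable", "RenameColumn"]:
    handle_ddl(message, policies, forwarded.append)

print(forwarded)  # ['AddColumn']
```

With this configuration, a DropTable message from the source never reaches the destination, which matches the Ignore example above.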

Scenarios

Use a real-time synchronization solution for an entire database when you want the system to monitor data updates in business databases in real time. This way, upper-layer applications can perform searches, analysis, and development on real-time data.

Limits

  • Only data in a MySQL database can be synchronized to Elasticsearch in real time.
  • A real-time synchronization solution used to synchronize all data in a database can be run only on exclusive resource groups.

Create a real-time synchronization solution used to synchronize all data in a database

  1. Go to the Data Integration page and choose Sync Solutions > Nodes to go to the Task list page.
    For more information, see Go to the Sync Solutions page.
  2. On the Task list page, click New task in the upper-right corner.
  3. In the New synchronization solution dialog box, click One-click realtime synchronization to Elasticsearch.
  4. In the Set synchronization sources and rules step, configure basic information for the data synchronization solution, such as the solution name.
    In the Basic configuration section, configure the following parameters.
    Parameter | Description
    Scheme name | The name of the data synchronization solution. The name can be a maximum of 50 characters in length.
    Description | The description of the data synchronization solution. The description can be a maximum of 50 characters in length.
    Destination task storage location | The Automatically establish workflow check box is selected by default. This indicates that DataWorks automatically creates a workflow named in the format of clone_database_Source data source name+to+Destination data source name in the Data Integration directory. All synchronization nodes generated by the data synchronization solution are placed in the directory of this workflow. If you clear the Automatically establish workflow check box, select a directory from the Select Location drop-down list. All synchronization nodes generated by the data synchronization solution are then placed in the specified directory.


  5. Select a source and configure synchronization rules.
    1. In the Select the source table for synchronization section, select the tables whose data you want to synchronize from the Source Table list. Then, click the icon to move the tables to the Selected Source table list.
      The Source Table list displays all tables in the selected source. You can choose to synchronize data in some or all tables in the source.
      Notice: If a selected table has no primary key, you must define a custom primary key when you create the mapping between the table and the destination Elasticsearch index. This primary key is used to deduplicate data during synchronization. You can use one field or a combination of several fields as the primary key of the table. For more information about how to configure mappings between the source tables and the destination Elasticsearch indexes, see Step 6.
    2. In the Set synchronization rules section, click Add rule and select an option to configure the naming rules for destination tables.
      Supported options include Conversion Rule for Table Name and Rule for Destination Table name.
      • Conversion Rule for Table Name: the rule used to convert the names of source tables to those of destination tables.
      • Rule for Destination Table name: the rule used to add a prefix and a suffix to the converted names of destination tables.
  6. Select the destination and configure the destination index.
    1. In the Set Destination Index step, specify Target Elasticsearch data source.
    2. Click Refresh source table and Elasticsearch Index mapping to configure the mappings between the source tables and destination Elasticsearch indexes.
    3. View the mapping progress, source tables, and mapped destination indexes.
      No. | Description
      1 | The progress of mapping between the source tables and destination indexes. Note: The mapping may require a long time if you want to synchronize data from a large number of tables.
      2 |
      3 | The method used to create an index. Valid values:
    4. Click Next Step.
  7. Configure rules to process DDL messages.
    The source may generate DDL messages. Before you synchronize data, you can configure a processing rule for each type of DDL message based on your business requirements.
    Note: The rules that you configure in this step take effect the first time the real-time synchronization solution is run. To modify the rules later, go to the configuration page of the real-time synchronization solution. For more information, see Manage the real-time synchronization solution.
    1. In the Processing Policy for DDL Messages step, configure the policies to process DDL messages for the real-time synchronization nodes generated by the synchronization solution.
      You can configure a policy for each of the following DDL message types: CreateTable, DropTable, AddColumn, DropColumn, RenameTable, RenameColumn, ChangeColumn, and TruncateTable.
      After DataWorks receives a DDL message of one of these types, it processes the message based on the policy that you select:
      • Normal: sends the message to the destination, which then processes the message. DataWorks only forwards the message, and each destination processes DDL messages based on its own business logic. For example, an AddColumn DDL message is an invalid instruction for MaxCompute but a valid instruction for Hologres.
      • Ignore: ignores the message and does not send it to the destination.
      • Alert: ignores the message and records an alert in the real-time synchronization logs. The alert indicates that the message was ignored because of a running error.
      • Error: returns an error while the real-time synchronization solution is running and terminates the solution.
    2. Click Next Step.
  8. Configure the resources required for the data synchronization solution.
    In the Run resource settings step, configure the following parameters.
    • Offline Full synchronization
      Parameter | Description
      Offline task name rules | The name of the batch synchronization node that is used to synchronize the full data of the source. After a data synchronization solution is created, DataWorks first generates a batch synchronization node to synchronize full data, and then generates real-time synchronization nodes to synchronize incremental data.
      Resource Group for Full Batch Sync Nodes | The exclusive resource group for data integration that is used to run the batch synchronization node.

    • Full Batch Scheduling
      Parameter | Description
      Select scheduling Resource Group | The resource group for scheduling that is used to run the nodes. Only exclusive resource groups for data integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for data integration that you purchased. Note: If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Real-time incremental synchronization
      Parameter | Description
      Select an exclusive resource group for real-time tasks | The exclusive resource group that is used to run the real-time synchronization nodes.

    • Channel Settings
      Parameter | Description
      Maximum number of connections supported by source read | The maximum number of Java Database Connectivity (JDBC) connections that are allowed by the source database. Specify an appropriate number based on the resources of the source database. The default value is 20.
  9. Click Complete configuration. The real-time synchronization solution used to synchronize all data in a database is created.
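
The naming rules in step 5 and the custom primary key described in the Notice can be pictured with the following minimal Python sketch. It rests on assumptions that this topic does not confirm: the conversion rule is modeled as a regular-expression replacement, the prefix and suffix are applied to the converted name, and a composite primary key is modeled by joining field values with a separator. The function names destination_index_name and document_id are hypothetical.

```python
import re
from typing import Dict, Optional, Sequence


def destination_index_name(
    source_table: str,
    pattern: Optional[str] = None,
    replacement: str = "",
    prefix: str = "",
    suffix: str = "",
) -> str:
    """Convert a source table name, then add a prefix and a suffix."""
    # Assumption: the conversion rule behaves like a regex substitution.
    converted = re.sub(pattern, replacement, source_table) if pattern else source_table
    return f"{prefix}{converted}{suffix}"


def document_id(row: Dict[str, object], key_fields: Sequence[str], sep: str = "|") -> str:
    """Build one identifier from one field or a combination of several fields."""
    return sep.join(str(row[field]) for field in key_fields)


# Hypothetical table name and row, for illustration only.
print(destination_index_name("t_orders", pattern=r"^t_", prefix="es_", suffix="_v1"))
# es_orders_v1

row = {"region": "eu", "order_no": 42, "amount": 9.5}
print(document_id(row, ["region", "order_no"]))  # eu|42
```

Two rows that share the same values in the key fields produce the same identifier, which is what allows duplicated data to be detected when a table has no primary key of its own.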

Run the real-time synchronization solution

On the Task list page, find the created data synchronization solution and click Submit execution in the Operation column to run the data synchronization solution.

View the running status and result of the synchronization nodes

  • On the Task list page, find the solution that is running and click Execution details in the Operation column. Then, you can view the running details of all nodes generated by the synchronization solution.
  • Find a node whose running details you want to view and click Execution details in the Status column. In the dialog box that appears, click the provided link to go to the DataStudio page.

Manage the real-time synchronization solution