After you configure data sources, network environments, and resource groups, you can create and run a real-time synchronization solution to synchronize all data in a database. This topic describes how to create a real-time synchronization solution to synchronize data in some or all tables in a database to Elasticsearch in offline mode and then synchronize incremental data in the database to Elasticsearch in real time. This topic also describes how to view the statuses of the nodes generated by the real-time synchronization solution.

Prerequisites

Before you create a data synchronization solution, make sure that the following operations are performed:

Background information

You can use the hot-warm architecture that is provided by Elasticsearch in a fully managed manner to store the real-time data of enterprises. DataWorks provides real-time synchronization solutions that can be used to synchronize all data in a database to Elasticsearch in real time based on the architecture. You can use such a solution to first synchronize data in some or all tables in a database to Elasticsearch in offline mode and then synchronize incremental data in the database to Elasticsearch in real time. You can also view the details of the solution, the statuses of the nodes generated by the solution, and data updates in the database in real time. This facilitates subsequent data searches, analysis, and development.

Real-time synchronization solutions used to synchronize all data in a database provide the following benefits:
  • Synchronizes the full data of a database.

    You do not need to create multiple batch data synchronization nodes to synchronize source tables one by one. You can directly create a batch synchronization solution to synchronize some or all of the tables in a database at a time.

  • You can configure synchronization rules in a flexible manner.
    • You can configure synchronization rules for different data definition language (DDL) messages based on your business requirements. For example, if you select Ignore for a DDL message that is specified in the source and used to drop a table in the destination, the system ignores the message and does not drop the table in the destination when the system receives the message.
    • You can add or remove source tables for a synchronization solution that is running.
    • You can configure synchronization rules for destination Elasticsearch indexes to determine whether to synchronize the incremental data in source tables to destination Elasticsearch indexes based on your business requirements. After the incremental data is synchronized, the incremental data can be searched in destination Elasticsearch indexes.
  • Requires only simple configurations.

    You do not need to perform complex operations, such as creating synchronization nodes, databases, and tables, configuring dependencies for nodes, and configure mappings between sources and destinations. Instead, you need only to configure a batch synchronization solution in a configuration wizard.

  • Large amounts of data can be updated in real time. This makes automated operations and maintenance (O&M) and management more efficient.

Scenarios

If you want the system to monitor data updates in business databases in real time, you can use real-time synchronization solutions to synchronize all data in the databases. This way, upper-layer applications can search for, analyze, and develop data in real time.

Limits

  • You can use a real-time synchronization solution to synchronize all data only in a MySQL, SQL Server, or PolarDB database to Elasticsearch.

  • A real-time synchronization solution used to synchronize all data in a database can be run only on exclusive resource groups.

Create a real-time synchronization solution to synchronize all data in a database

  1. Go to the Data Integration page and choose Sync Solutions > Nodes to go to the Task list page.
    For more information, see Go to the Sync Solutions page.
  2. On the Task list page, click New task in the upper-right corner.
  3. In the New synchronization solution dialog box, click One-click realtime synchronization to Elasticsearch.
  4. In the Set synchronization sources and rules step, configure basic information such as the solution name for the data synchronization solution.
    In the Basic configuration section, configure the parameters. Basic configuration
    Parameter Description
    Scheme name The name of the data synchronization solution. The name can be a maximum of 50 characters in length.
    Description The description of the data synchronization solution. The description can be a maximum of 50 characters in length.
    Destination task storage location The Automatically establish workflow check box is selected by default. This indicates that DataWorks automatically creates a workflow named in the format of clone_database_Source data source name+to+Destination data source name in the Data Integration directory. All synchronization nodes generated by the data synchronization solution are placed in the directory of this workflow.

    If you clear the Automatically establish workflow check box, select a directory from the Select Location drop-down list. All synchronization nodes generated by the data synchronization solution are placed in the specified directory.

  5. Select a source and configure synchronization rules.
    1. In the Data Source section, specify the Type and Data source parameters.
      Note

      You can use a real-time synchronization solution to synchronize all data only in a MySQL, SQL Server, or PolarDB database to Elasticsearch.

    2. In the Source Table section, select the tables whose data you want to synchronize from the Source Table list. Then, click the Icon icon to move the tables to the Selected Source Table list.
      Select tables from the source
      The Source Table list displays all tables in the selected source. You can choose to synchronize data in some or all tables in the source.
      Notice If a selected table has no primary key, you must customize a primary key when you map the table to a destination Elasticsearch index. This primary key is used to remove duplicate data during synchronization. For example, you can use one field or a combination of several fields as the primary key of the table. For more information, see Step 6 in this topic.
    3. In the Conversion Rule for Table Name section, click Add rule to select a rule.
      Supported options include Conversion Rule for Table Name and Rule for Destination Index Name.
      • Conversion Rule for Table Name: the rule for converting the names of source tables to those of destination Elasticsearch indexes.
      • Rule for Destination Index Name: the rule for adding a prefix and a suffix to the converted names of destination Elasticsearch indexes.
    4. Click Next Step.
  6. Select the destination and configure destination Elasticsearch indexes.
    1. In the Set Destination Index step, specify Destination.
    2. Click Refresh source table and Elasticsearch Index mapping to configure the mappings between the source tables and destination Elasticsearch indexes.
    3. View the mapping progress, source tables, and mapped destination Elasticsearch indexes.
      Progress
      No. Description
      1 The progress of mapping the source tables to destination Elasticsearch indexes.
      Note The mapping may require a long period of time if you synchronize data from a large number of tables.
      2
      • If the tables in the source database contain primary keys, the system removes duplicate data based on the primary keys during the synchronization.
      • If the tables in the source database do not contain primary keys, you can click the Edit icon to customize primary keys. You can use one field or a combination of several fields as the primary keys of the tables. This way, the system removes duplicate data based on the primary keys during the synchronization.
      3 The method used to create an index. Valid values:
      • Create Index: If you select this method, the name of the Elasticsearch index that is automatically created appears in the Elasticsearch Index Name column. You can click the name of the index to change the values of the parameters related to the index.
      • Use Existing Index: If you select this method, select the name of the desired index from the drop-down list in the Elasticsearch Index Name column. Then, you can click View Field Mapping to view the mappings between the source tables and destination Elasticsearch indexes.
      If you select Create Index for Index creation method, you can click the Elasticsearch index name that appears in the Elasticsearch Index Name column to change the values of the parameters related to the destination Elasticsearch index based on your business requirements. Configure parameters for the destination index
      • Dynamic Mapping Status: specifies whether to dynamically synchronize new fields in the source tables to the destination Elasticsearch indexes during synchronization. Valid values:
        • true: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes, and the fields can be searched in the indexes after synchronization. Default value: true.
        • false: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes, but the fields cannot be searched in the indexes after synchronization.
        • strict: If the system detects that the source tables contain new fields, the system does not synchronize the fields to the mapped destination Elasticsearch indexes, and an error is reported. You can view the details of the error in the node logs.
        For more information about dynamic mappings, see the description of the dynamic parameter for open source Elasticsearch.
      • Shards and Replicas: the number of primary shards for the destination Elasticsearch index and the number of replica shards for each primary shard. The shards are distributed on different nodes in an Elasticsearch cluster to support distributed searches. This improves the query efficiency of Elasticsearch. For more information, see Terms.
        Note The values of Shards and Replicas cannot be changed after you specify them and the real-time synchronization solution starts to run. The default values of Shards and Replicas are 1.
      • Partition settings: You can use a column in a source table as a partition key column. This parameter must be used together with the Shards and Replicas parameters. By default, the Enable Partitioning for Elasticsearch Indexes check box is not selected.
      • Data field structure: This section allows you to configure the types and extended attributes of the fields in the mapped destination Elasticsearch indexes. For more information, see Field data types in open source Elasticsearch.
      Note If you do not change the values of the parameters related to the destination Elasticsearch indexes after the indexes are created, the system synchronizes data based on the default values of the parameters.
    4. Click Next Step.
  7. Configure rules to process DDL messages.
    Sources, such as MySQL, PolarDB, and SQL Server, may contain multiple DDL messages. Before you synchronize data, you can configure different rules to process DDL messages based on your business requirements.
    Note The rules apply the first time a real-time synchronization solution is run. If you want to modify the rules in subsequent operations, go to the configuration page of the real-time synchronization solution to perform the operation. For more information, see Manage the real-time synchronization solution.
    1. In the Set Processing Policy for DDL Messages step, configure rules to process DDL messages during data synchronization. Processing rules for DDL messages
      The following table describes the processing rules for different DDL messages.
      DDL message Rule
      CreateTable DataWorks processes a DDL message of the related type based on the following rules after it receives the message:
      • Normal: sends the message to the destination. Then, the destination processes the message. Each destination may process DDL messages based on its own business logic. If you select Normal for CreateTable, DataWorks only forwards the messages.
      • Ignore: ignores the message and does not send it to the destination.
      • Alert: ignores the message and records the alert in real-time synchronization logs. In addition, the alert contains information about the reason that a message is ignored because of a running error.
      • Error: returns an error during the running of the real-time synchronization solution and terminates the real-time synchronization solution.
      DropTable
      AddColumn
      DropColumn
      RenameTable
      RenameColumn
      ChangeColumn
      TruncateTable
    2. Click Next Step.
  8. Configure the resources required by the data synchronization solution.
    In the Set Resources for Solution Running step, configure the parameters. Set Resources for Solution Running
    • Offline Full synchronization
      Parameter Description
      Offline task name rules The name of the batch synchronization node that is used to synchronize the full data of the source. After a data synchronization solution is created, DataWorks first generates a batch synchronization node to synchronize full data, and then generates real-time synchronization nodes to synchronize incremental data.
      Resource Groups for Full Batch Sync Nodes

      The exclusive resource group for Data Integration that is used to run the batch synchronization node.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Full Batch Scheduling
      Parameter Description
      Select scheduling Resource Group

      The resource group for scheduling that is used to run the nodes.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Real-time Incremental synchronization
      Parameter Description
      Select an exclusive resource group for real-time tasks

      The exclusive resource group that is used to run the real-time synchronization nodes.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Channel Settings
      Parameter Description
      Maximum number of connections supported by source read The maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. Specify an appropriate number based on the resources of the source. Default value: 20.
  9. Click Complete Configuration. The real-time synchronization solution used to synchronize all data in a database is created.

Run the real-time synchronization solution

On the Tasks page, find the newly created data synchronization solution and choose More > Submit and Run in the Operation column to run the data synchronization solution.

View the statuses and running results of the synchronization nodes

  • On the Tasks page, find the solution that is run and click Execution details in the Operation column. Then, you can view the running details of all nodes generated by the synchronization solution.Statuses of the synchronization nodes
  • Find a node whose running details you want to view and click Execution details in the Status column. Then, you can click the link provided in the dialog box that appears to go to the DataStudio page.

Manage the real-time synchronization solution

  • View the configurations of the synchronization solution.

    On the Tasks page, find the newly created synchronization solution and choose More > View Setting. Then, you can view the configurations of the synchronization solution.

  • Modify the synchronization solution.

    On the Tasks page, find the newly created synchronization solution and choose More > Modify Configuration. Then, you can modify the configurations of the synchronization solution.

    For a synchronization solution that is successfully run, you can choose More > Modify Configuration to add or remove source tables, or set AddColumn to Normal to synchronize new fields in the source tables to Elasticsearch in real time.
    • Add or remove source tables for the synchronization solution.

      In the Source Table section of the Set Synchronization Sources and Rules step, add or remove source tables for the synchronization solution. Then, save the modification and run the synchronization solution.

    • Synchronize new fields in the source tables to Elasticsearch in real time.
      • In the Set Processing Policy for DDL Messages step, set AddColumn to Normal. Then, the real-time synchronization solution automatically monitors the column changes in the source tables. If the solution detects that the source tables contain new fields, it automatically synchronizes data in the new fields to the destination Elasticsearch indexes. For more information, see Step 7 in this topic.
      • Set Dynamic Mapping Status to true. Then, the synchronization solution synchronizes new fields in the source tables to the destination Elasticsearch indexes in real time, and the fields can be searched in the future. For more information, see Step 6 in this topic.
  • Change the priority for the batch synchronization solution
    Find the newly created batch synchronization solution and choose More > Change Priority in the Operation column. In the Change Priority dialog box, enter the desired priority and click Confirm. You can set the priority to an integer from 1 to 8. A larger value indicates a higher priority.
    Note If multiple batch synchronization solutions have the same priority, the system runs them based on the order they are committed.
  • Delete the batch synchronization solution.
    Find the batch synchronization solution that you want to delete and choose More > Delete in the Operation column. In the Delete message, click OK.
    Note After you click OK, only the configuration record of the batch synchronization solution is deleted. The synchronization nodes generated by the solution and data tables generated by the synchronization nodes are not affected.