After you configure data sources, network environments, and resource groups, you can create and run a real-time sync solution to synchronize all data in a database. This topic describes how to create a real-time sync solution to synchronize data in specific or all tables in a database to Elasticsearch in offline mode and then synchronize incremental data in the database to Elasticsearch in real time. This topic also describes how to view the statuses of the nodes generated by the real-time sync solution.

Prerequisites

Before you create a data synchronization solution, make sure that the following operations are performed:

Background information

You can use the hot-warm architecture that is provided by Elasticsearch in a fully managed manner to store the real-time data of enterprises. DataWorks provides real-time sync solutions that can be used to synchronize all data in a database to Elasticsearch in real time based on the architecture. You can use such a solution to first synchronize data in specific or all tables in a database to Elasticsearch in offline mode and then synchronize incremental data in the database to Elasticsearch in real time. You can also view the details of the solution, the statuses of the nodes generated by the solution, and data updates in the database in real time. This facilitates subsequent data searches, analysis, and development.

Real-time sync solutions that are used to synchronize all data in a database provide the following benefits:
  • Synchronizes the full data of a database.

    You do not need to create multiple batch data synchronization nodes to synchronize source tables one by one. You can directly create a batch synchronization solution to synchronize some or all of the tables in a database at a time.

  • You can configure synchronization rules in a flexible manner.
    • You can configure synchronization rules for different DDL messages based on your business requirements. For example, if you select Ignore for a DDL message that is specified in the source and used to drop a table in the destination, the system ignores the message and does not drop the table in the destination when the system receives the message.
    • You can add or remove source tables for a sync solution that is running.
    • You can configure synchronization rules for destination Elasticsearch indexes to determine whether to synchronize the incremental data in source tables to destination Elasticsearch indexes based on your business requirements. After the incremental data is synchronized, the incremental data can be searched in destination Elasticsearch indexes.
  • Requires only simple configurations.

    You do not need to perform complex operations, such as creating synchronization nodes, databases, and tables, configuring dependencies for nodes, and configure mappings between sources and destinations. Instead, you need only to configure a batch synchronization solution in a configuration wizard.

  • Large amounts of data can be updated in real time. This improves the efficiency of automated O&M.

Scenarios

If you want the system to monitor data updates in business databases in real time, you can use real-time sync solutions to synchronize all data in the databases. This way, upper-layer applications can search for, analyze, and develop data in real time.

Limits

  • You can use a real-time sync solution to synchronize all data only from a MySQL or PolarDB database to Elasticsearch.

  • A real-time sync solution that is used to synchronize all data in a database can be run only on exclusive resource groups.

Create a real-time sync solution to synchronize all data in a database

  1. Go to the Data Integration page and choose Sync Solutions > Tasks to go to the Tasks page.
    For more information, see Select a data synchronization solution.
  2. On the Tasks page, click New task in the upper-right corner. In the Select Source and Destination section, specify the types of the source and the destination data sources by setting the Select source parameters in the Source and the Destination parts.
  3. In the Create Data Synchronization Solution dialog box, click One-click realtime synchronization to Elasticsearch.
  4. In the Set Synchronization Sources and Rules step, configure basic information such as the solution name for the data synchronization solution.
    In the Basic Configuration section, set the parameters. Basic Configuration
    Parameter Description
    Solution Name The name of the sync solution. The name can be up to 50 characters in length.
    Description The description of the sync solution. The description can be up to 50 characters in length.
    Location If you select Automatic Workflow Creation, DataWorks automatically creates a workflow named in the format of clone_database_Source name+to+Destination name. All sync nodes generated by the sync solution are placed in the Data Integration folder of this workflow.

    If you clear Automatically establish workflow, select a directory from the Select Location drop-down list. All sync nodes generated by the sync solution are placed in the specified directory.

  5. Select a source and configure synchronization rules.
    1. In the Data Source section, specify the Type and Data source parameters.
      Note

      You can use a real-time sync solution to synchronize all data only from a MySQL or PolarDB database to Elasticsearch.

    2. In the Source Table section, select the tables whose data you want to synchronize from the Source Table list. Then, click the Move icon icon to move the tables to the Selected Source Table list.
      Select tables from the source
      The Source Table list displays all tables in the selected source. You can synchronize data in specific or all tables in the source.
      Notice If a selected table has no primary key, you must customize a primary key when you map the table to a destination Elasticsearch index. This primary key is used to remove duplicate data during synchronization. For example, you can use one field or a combination of several fields as the primary key of the table. For more information, see Step 6 in this topic.
    3. In the Conversion Rule for Table Name section, click Add rule to select a rule.
      Supported options include Conversion Rule for Table Name and Rule for Destination Index Name.
      • Conversion Rule for Table Name: the rule for converting the names of source tables to those of destination Elasticsearch indexes.
      • Rule for Destination Index Name: the rule for adding a prefix and a suffix to the converted names of destination Elasticsearch indexes.
    4. Click Next Step.
  6. Select the destination and configure destination Elasticsearch indexes.
    1. In the Set Destination Index step, specify the Destination parameter.
    2. Click Refresh source table and Elasticsearch Index mapping to configure the mappings between the source tables and destination Elasticsearch indexes.
    3. View the mapping progress, source tables, and mapped destination Elasticsearch indexes.
      Mapping progress
      No. Description
      1 The progress of mapping the source tables to destination Elasticsearch indexes.
      Note The mapping may require a long period of time if you synchronize data from a large number of tables.
      2
      • If the tables in the source database contain primary keys, the system removes duplicate data based on the primary keys during the synchronization.
      • If the tables in the source database do not contain primary keys, you can click the Edit icon to customize primary keys. You can use one field or a combination of several fields as the primary keys of the tables. This way, the system removes duplicate data based on the primary keys during the synchronization.
      3 The method that is used to create an index. Valid values:
      • Create Index: If you select this method, the name of the Elasticsearch index that is automatically created appears in the Elasticsearch Index Name column. You can click the name of the index to change the values of the parameters related to the index.
      • Use Existing Index: If you select this method, select the name of the desired index from the drop-down list in the Elasticsearch Index Name column. Then, you can click View Field Mapping to view the mappings between the source tables and destination Elasticsearch indexes.
      If you set the Index creation method parameter to Create Index , you can click the Elasticsearch index name that appears in the Elasticsearch Index Name column to change the values of the parameters related to the destination Elasticsearch index based on your business requirements. Configure parameters for the destination index
      • Dynamic Mapping Status: specifies whether to dynamically synchronize new fields in the source tables to the destination Elasticsearch indexes during synchronization. Valid values:
        • true: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes, and the fields can be searched in the indexes after synchronization. Default value: true.
        • false: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes, but the fields cannot be searched in the indexes after synchronization.
        • strict: If the system detects that the source tables contain new fields, the system does not synchronize the fields to the mapped destination Elasticsearch indexes, and an error is reported. You can view the details of the error in the node logs.
        For more information about dynamic mappings, see the description of the dynamic parameter for open source Elasticsearch.
      • Shards and Replicas: the number of primary shards for the destination Elasticsearch index and the number of replica shards for each primary shard. The shards are distributed on different nodes in an Elasticsearch cluster to support distributed searches. This improves the query efficiency of Elasticsearch. For more information, see Terms.
        Note The values of the Shards and Replicas parameters cannot be changed after you specify them and the real-time sync solution starts to run. The default values of the Shards and Replicas parameters are 1.
      • Partition settings: You can use a column in a source table as a partition key column. This parameter must be used together with the Shards and Replicas parameters. By default, the Enable Partitioning for Elasticsearch Indexes check box is not selected.
      • Data field structure: This section allows you to configure the types and extended attributes of the fields in the mapped destination Elasticsearch indexes. For more information, see Field data types in open source Elasticsearch.
      Note If you do not change the values of the parameters related to the destination Elasticsearch indexes after the indexes are created, the system synchronizes data based on the default values of the parameters.
    4. Click Next Step.
  7. Configure rules to process DDL messages.
    Sources, such as MySQL, PolarDB, may contain multiple DDL messages. Before you synchronize data, you can configure different rules to process DDL messages based on your business requirements.
    Note The rules apply when a real-time sync solution is run for the first time. If you want to modify the rules in subsequent operations, go to the configuration page of the real-time sync solution to perform the operation. For more information, see Manage the real-time sync solution.
    1. In the Set Processing Policy for DDL Messages step, configure rules to process DDL messages during data synchronization. Processing Policy for DDL Messages
      The following table describes the processing rules for different DDL messages.
      DDL message Rule
      CreateTable DataWorks processes a DDL message of the related type based on the following rules after it receives the message:
      • Normal: sends the message to the destination. Then, the destination processes the message. Each destination may process DDL messages based on its own business logic. If you select Normal for CreateTable, DataWorks only forwards the messages.
      • Ignore: ignores the message and does not send it to the destination.
      • Alert: ignores the message and records the alert in real-time synchronization logs. In addition, the alert contains information about the reason indicating that a message is ignored because of a running error.
      • Error: returns an error when the real-time sync solution is running and terminates the real-time sync solution.
      DropTable
      AddColumn
      DropColumn
      RenameTable
      RenameColumn
      ChangeColumn
      TruncateTable
    2. Click Next Step.
  8. Configure the resources required by the sync solution.
    In the Set Resources for Solution Running step, set the parameters as required. Set Resources for Solution Running
    • Offline Sync
      Parameter Description
      Offline task name rules The name of the batch sync node that is used to synchronize the full data of the source. After a sync solution is created, DataWorks first generates a batch sync node to synchronize full data, and then generates real-time sync nodes to synchronize incremental data.
      Resource Groups for Full Batch Sync Nodes

      The exclusive resource group for Data Integration that is used to run the batch sync node.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Scheduling Settings
      Parameter Description
      Select scheduling Resource Group

      The resource group for scheduling that is used to run the nodes.

      Only exclusive resource groups for Data Integration can be used to run sync solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Incremental Sync
      Parameter Description
      Resource Groups for Incremental Batch Sync Nodes

      The exclusive resource group that is used to run the real-time sync nodes.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Channel Settings
      Parameter Description
      Maximum number of connections supported by source read The maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. Specify an appropriate number based on the resources of the source. Default value: 20.
  9. Click Complete Configuration. The real-time sync solution used to synchronize all data in a database is created.

Run the real-time sync solution

On the Tasks page, find the newly created sync solution and choose More > Submit and Run in the Operation column to run the sync solution.

View the statuses and results of the sync nodes

  • On the Tasks page, find the solution that has been run and click Execution details in the Operation column. Then, you can view the execution details of all nodes generated by the sync solution. Status of the sync nodes
  • Find a node whose execution details you want to view and click Execution details in the Status column. Then, you can click the link provided in the dialog box that appears to go to the DataStudio page.

Manage the real-time sync solution

  • View the configurations of the sync solution.

    On the Tasks page, find the newly created sync solution and choose More > View Configuration. Then, you can view the configurations of the sync solution.

  • Modify the sync solution.

    On the Tasks page, find the newly created sync solution and choose More > Modify Configuration. Then, you can modify the configurations of the sync solution.

    For a sync solution that is successfully run, you can choose More > Modify Configuration to add or remove source tables, or set the AddColumn parameter to Normal to synchronize new fields in the source tables to Elasticsearch in real time.
    • Add or remove source tables for the sync solution.

      In the Source Table section of the Set Synchronization Sources and Rules step, add or remove source tables for the sync solution. Then, save the modification and run the sync solution.

    • Synchronize new fields in the source tables to Elasticsearch in real time.
      • In the Set Processing Policy for DDL Messages step, set the AddColumn parameter to Normal. Then, the real-time sync solution automatically monitors the column changes in the source tables. If the sync solution detects that the source tables contain new fields, the sync solution automatically synchronizes data in the new fields to the destination Elasticsearch indexes. For more information, see Step 7 in this topic.
      • Set the Dynamic Mapping Status parameter to true. Then, the sync solution synchronizes new fields in the source tables to the destination Elasticsearch indexes in real time, and the fields can be searched in the future. For more information, see Step 6 in this topic.
  • Change the priority for the batch synchronization solution
    Find the newly created batch synchronization solution and choose More > Change Priority in the Operation column. In the Change Priority dialog box, enter the desired priority and click Confirm. You can set the priority to an integer from 1 to 8. A larger value indicates a higher priority.
    Note If multiple batch synchronization solutions have the same priority, the system runs them based on the order they are committed.
  • Delete the batch synchronization solution.
    Find the batch synchronization solution that you want to delete and choose More > Delete in the Operation column. In the Delete message, click OK.
    Note After you click OK, only the configuration record of the batch synchronization solution is deleted. The synchronization nodes generated by the solution and data tables generated by the synchronization nodes are not affected.