After you configure data sources, network environments, and resource groups, you can create and run a batch synchronization solution to synchronize all data in a database. This topic describes how to create a batch synchronization solution to synchronize data in some or all tables in a database to Elasticsearch. This topic also describes how to view the statuses of the nodes generated by the batch synchronization solution.

Prerequisites

Before you create a data synchronization solution, make sure that the data sources, network environments, and resource groups that the solution requires are configured.

Background information

In most cases, the real-time data of enterprises is stored in big data engines, and a large volume of unstructured log data may be generated for the real-time data. You can use the fully managed hot-warm architecture that Elasticsearch provides to store the log data and offline data of enterprises. DataWorks provides batch synchronization solutions that synchronize all data in a database to Elasticsearch based on this architecture. You can view the details of a solution and the statuses of the nodes that the solution generates, which makes automated operations and maintenance (O&M) and management more efficient.

You can use a batch synchronization solution to synchronize the full or incremental data in your business database to Elasticsearch. Then, the data can be searched, analyzed, and developed in Elasticsearch. A batch synchronization solution used to synchronize all data in a database has the following benefits:
  • Synchronizes the full data of a database.

    You do not need to create multiple batch data synchronization nodes to synchronize source tables one by one. You can directly create a batch synchronization solution to synchronize some or all of the tables in a database at a time.

  • Supports various data synchronization methods.

    You can use one of the following methods to synchronize data: full data synchronization, incremental data synchronization, and a combination of full and incremental data synchronization. In addition, you can configure properties for your batch synchronization solution.

  • Requires only simple configurations.

    You do not need to perform complex operations, such as creating synchronization nodes, databases, and tables, configuring dependencies for nodes, and configuring mappings between sources and destinations. Instead, you only need to configure a batch synchronization solution in a configuration wizard.

  • Reduces costs and improves O&M efficiency.

Limits

  • You can use a batch synchronization solution to synchronize all data in a database to Elasticsearch only if the source is a MySQL, SQL Server, or PolarDB database.
  • A batch synchronization solution used to synchronize all data in a database can be run only on resources in exclusive resource groups for Data Integration.

Create a batch synchronization solution to synchronize all data in a database

  1. Go to the Data Integration page and choose Sync Solutions > Nodes to go to the Task list page.
    For more information, see Go to the Sync Solutions page.
  2. On the Task list page, click New task in the upper-right corner.
  3. In the New synchronization solution dialog box, click One-click batch synchronization to Elasticsearch.
  4. In the Set synchronization sources and rules step, configure basic information such as the solution name for the data synchronization solution.
    In the Basic configuration section, configure the following parameters:
    • Scheme name: the name of the data synchronization solution. The name can be a maximum of 50 characters in length.
    • Description: the description of the data synchronization solution. The description can be a maximum of 50 characters in length.
    • Destination task storage location: The Automatically establish workflow check box is selected by default. This indicates that DataWorks automatically creates a workflow named in the format of clone_database_Source data source name+to+Destination data source name in the Data Integration directory. All synchronization nodes generated by the data synchronization solution are placed in the directory of this workflow.

    If you clear the Automatically establish workflow check box, select a directory from the Select Location drop-down list. All synchronization nodes generated by the data synchronization solution are placed in the specified directory.

  5. Select a data source as the source and configure synchronization rules.
    1. In the Data source section, specify the Type and Data source parameters.
      Note You can select MySQL, SQL Server, or PolarDB as the source.
    2. In the Source Table section, select the tables whose data you want to synchronize from the Source Table list. Then, click the icon to move the tables to the Selected Source Table list.
      The Source Table list displays all tables in the selected source. You can choose to synchronize data in some or all tables in the source.
      Notice If a selected table has no primary key, you must customize a primary key when you map the table to a destination Elasticsearch index. This primary key is used to remove duplicate data during synchronization. For example, you can use one field or a combination of several fields as the primary key of the table. For more information, see Step 6 in this topic.
    3. In the Conversion Rule for Table Name section, click Add rule to select a rule.
      Supported options include Conversion Rule for Table Name and Rule for Destination Index Name.
      • Conversion Rule for Table Name: the rule for converting the names of source tables to those of destination Elasticsearch indexes.
      • Rule for Destination Index Name: the rule for adding a prefix and a suffix to the converted names of destination Elasticsearch indexes.
    4. Click Next Step.
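
The two naming rules in this step can be pictured as a two-stage transformation: first convert the source table name, then add a prefix and a suffix. The following Python sketch illustrates that idea only. The function name, the regular-expression form of the conversion rule, and the sample table names are assumptions made for illustration and are not taken from the DataWorks wizard. The final lowercasing reflects the Elasticsearch requirement that index names be lowercase.

```python
import re


def destination_index_name(table_name, pattern=None, replacement="",
                           prefix="", suffix=""):
    """Derive a destination index name from a source table name.

    pattern/replacement model a hypothetical "Conversion Rule for Table Name".
    prefix/suffix model a hypothetical "Rule for Destination Index Name".
    """
    name = table_name
    if pattern is not None:
        # Stage 1: convert the source table name.
        name = re.sub(pattern, replacement, name)
    # Stage 2: add the prefix and suffix to the converted name.
    name = f"{prefix}{name}{suffix}"
    # Elasticsearch index names must be lowercase.
    return name.lower()


# Hypothetical source tables.
tables = ["Orders_2023", "Orders_2024", "Users"]
indexes = [
    destination_index_name(t, pattern=r"^Orders_", replacement="order_",
                           prefix="ods_", suffix="_idx")
    for t in tables
]
print(indexes)
```

Tables that do not match the conversion pattern keep their original name and still receive the prefix and suffix.
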
  6. Select a destination cluster and configure destination Elasticsearch indexes.
    1. In the Set Destination Index step, specify Destination.
    2. Click Refresh source table and Elasticsearch Index mapping to configure the mappings between the source tables and destination Elasticsearch indexes.
    3. View the mapping progress, source tables, and mapped destination Elasticsearch indexes.
      Progress
      No. Description
      1 The progress of mapping the source tables to destination Elasticsearch indexes.
      Note The mapping may require a long time if you want to synchronize data from a large number of tables.
      2
      • If the tables in the source database contain primary keys, the system removes duplicate data based on the primary keys during the synchronization.
      • If the tables in the source database do not contain primary keys, you can click the Edit icon to customize primary keys. You can use one field or a combination of several fields as the primary keys of the tables. This way, the system removes duplicate data based on the primary keys during the synchronization.
      Note In the following cases, you must configure the primary keys:
      • You use an incremental data synchronization method to synchronize data.
      • You use a full data synchronization method to synchronize data and set Write Policy to Update.
      For more information about synchronization methods, see the synchronization methods described in Step 7 in this topic.
      3 The method used to create an index. Valid values:
      • Create Index: If you select this method, the name of the Elasticsearch index that is automatically created appears in the Elasticsearch Index Name column. You can click the name of the index to change the values of the parameters related to the index.
      • Use Existing Index: If you select this method, select the name of the desired index from the drop-down list in the Elasticsearch Index Name column. Then, you can click View Field Mapping to view the mappings between the source tables and destination Elasticsearch indexes.
      If you select Create Index, you can click the index name in the Elasticsearch Index Name column to change the following parameters of the destination Elasticsearch index based on your business requirements:
      • Dynamic Mapping Status: specifies whether to dynamically synchronize new fields in the source tables to the destination Elasticsearch indexes during synchronization. Valid values:
        • true: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes, and the fields can be searched in the indexes after synchronization. Default value: true.
        • false: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes, but the fields cannot be searched in the indexes after synchronization.
        • strict: If the system detects that the source tables contain new fields, the system does not synchronize the fields to the mapped destination Elasticsearch indexes, and an error is reported. You can view the details of the error in the node logs.
        For more information about dynamic mappings, see the description of the dynamic parameter for open source Elasticsearch.
      • Shards and Replicas: the number of primary shards for the destination Elasticsearch index and the number of replica shards for each primary shard. The shards are distributed on different nodes in an Elasticsearch cluster to support distributed searches. This improves the query efficiency of Elasticsearch. For more information, see Terms.
        Note The values of Shards and Replicas cannot be changed after you specify them and the batch synchronization solution starts to run. The default values of Shards and Replicas are 1.
      • Partition settings: You can use a column in a source table as a partition key column. This parameter must be used together with the Shards and Replicas parameters. By default, the Enable Partitioning for Elasticsearch Indexes check box is not selected.
      • Data field structure: This section allows you to configure the types and extended attributes of the fields in the mapped destination Elasticsearch indexes. For more information, see Field data types in open source Elasticsearch.
      Note If you do not change the values of the parameters related to the destination Elasticsearch indexes after the indexes are created, the system synchronizes data based on the default values of the parameters.
    4. Click Next Step.
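
The index parameters described in this step correspond to standard settings in open source Elasticsearch: number_of_shards, number_of_replicas, and the dynamic mapping parameter. The following Python sketch builds a create-index request body from those values. How DataWorks translates the wizard settings into a request is not documented here, so treat this as an illustration only. The function name and the sample fields are hypothetical, and the typeless mappings layout assumes Elasticsearch 7 or later.

```python
import json

# The wizard values map to the values accepted by the Elasticsearch
# "dynamic" mapping parameter.
DYNAMIC_VALUES = {"true": True, "false": False, "strict": "strict"}


def build_index_body(shards=1, replicas=1, dynamic="true", properties=None):
    """Build a create-index body (hypothetical helper, not a DataWorks API)."""
    if dynamic not in DYNAMIC_VALUES:
        raise ValueError(f"unsupported dynamic mapping status: {dynamic!r}")
    if shards < 1 or replicas < 0:
        raise ValueError("shards must be >= 1 and replicas must be >= 0")
    return {
        "settings": {
            "number_of_shards": shards,      # fixed once the index is created
            "number_of_replicas": replicas,
        },
        "mappings": {
            "dynamic": DYNAMIC_VALUES[dynamic],
            "properties": dict(properties or {}),
        },
    }


# Hypothetical fields for a destination index.
body = build_index_body(
    dynamic="strict",
    properties={"order_id": {"type": "keyword"}, "created_at": {"type": "date"}},
)
print(json.dumps(body, indent=2))
```

With dynamic set to strict, Elasticsearch rejects documents that contain fields missing from properties, which matches the error behavior described for the strict value.
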
  7. Configure synchronization rules.
    1. In the Sync Rules step, select one of the following synchronization methods:
      • Only One-time Full Sync: You only need to perform the synchronization once to synchronize all data in the source to Elasticsearch.
      • Only One-time Incremental Sync: You only need to perform the synchronization once to synchronize incremental data in the source to Elasticsearch based on the specified filter conditions.
      • Periodic Full Sync: You must configure a scheduling cycle for the batch synchronization solution. Then, the system synchronizes all data in the source to Elasticsearch each time the system runs the solution based on the specified scheduling cycle.
      • Periodic Incremental Sync: The system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.
      • Incremental Sync after One-time Full Sync: The system first synchronizes all data to Elasticsearch. Then, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.
    2. Configure parameters for the selected synchronization method.
      The parameters that you need to specify in the Full Sync, Incremental Sync, and Recurrence sections vary based on the synchronization method that you selected. The parameters in each section are described below.
      • Full Sync
        The parameters in this section are required only if you set Solution to Only One-time Full Sync, Periodic Full Sync, or Incremental Sync after One-time Full Sync.
        • Clear Index Data Before Writing: specifies whether to delete the original data in the destination Elasticsearch indexes before data is written. Valid values:
          • Yes: The original data in the destination Elasticsearch indexes is deleted before data in the source is written to the indexes.
          • No: The original data in the destination Elasticsearch indexes is retained when data in the source is written to the indexes.
          Notice If you set this parameter to Yes, all the original data in the destination Elasticsearch indexes is deleted before data in the source is written to the indexes. Exercise caution when you set this parameter.
        • Write Policy: Valid values:
          • Insert: The system inserts data to the destination Elasticsearch indexes during data synchronization. This is the default value of this parameter.
          • Update: If the primary key value of a source record already exists in a destination Elasticsearch index, the system first deletes the existing document from the index and then inserts the data. Otherwise, the system directly inserts the data to the index.
        • Batch Size: the number of data records that can be written to Elasticsearch at a time. Default value: 1000. You can set this parameter to an appropriate value based on actual network conditions and the data volume that you want to synchronize. This can reduce network overheads.
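
The interaction between Write Policy, Batch Size, and Clear Index Data Before Writing can be sketched with a small in-memory model. In the following Python sketch, a dict stands in for a destination index. The helper names are hypothetical. The use of auto-generated document IDs for Insert and of a key built from the primary key fields for Update is an assumption made for illustration and does not describe how DataWorks writes to Elasticsearch.

```python
from itertools import count, islice

_auto_ids = count(1)  # stand-in for auto-generated document IDs (assumption)


def batches(records, batch_size=1000):
    """Yield lists that contain at most batch_size records each."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def doc_id(record, key_fields):
    """Build a document ID from one field or a combination of fields."""
    return "|".join(str(record[field]) for field in key_fields)


def write(index, records, key_fields=(), policy="insert",
          batch_size=1000, clear_first=False):
    """Write records to a dict-based index and return the number of batches."""
    if policy not in ("insert", "update"):
        raise ValueError(f"unknown write policy: {policy!r}")
    if policy == "update" and not key_fields:
        raise ValueError("the update policy requires a primary key")
    if clear_first:
        index.clear()  # models Clear Index Data Before Writing = Yes
    batch_count = 0
    for chunk in batches(records, batch_size):
        batch_count += 1
        for record in chunk:
            if policy == "update":
                key = doc_id(record, key_fields)
                index.pop(key, None)  # delete the existing document, if any
                index[key] = record   # then insert the new data
            else:
                index[f"auto-{next(_auto_ids)}"] = record
    return batch_count


# Hypothetical rows; two of them share the primary key value 1.
rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
updated, inserted = {}, {}
print(write(updated, rows, key_fields=["id"], policy="update", batch_size=2))
print(write(inserted, rows, policy="insert"))
print(len(updated), len(inserted))
```

In this model, Update leaves one document per primary key value, whereas Insert keeps every record, which is why a primary key is needed when duplicate data must be removed.
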

      • Incremental Sync
        The parameters in this section are required only if you set Solution to Only One-time Incremental Sync, Periodic Incremental Sync, or Incremental Sync after One-time Full Sync.
        • Write Policy: Valid values:
          • Insert: The system inserts data to the destination Elasticsearch indexes during data synchronization. This is the default value of this parameter.
          • Update: If the primary key value of a source record already exists in a destination Elasticsearch index, the system first deletes the existing document from the index and then inserts the data. Otherwise, the system directly inserts the data to the index.
        • Batch Size: the number of data records that can be written to Elasticsearch at a time. Default value: 1000. You can set this parameter to an appropriate value based on actual network conditions and the data volume that you want to synchronize. This can reduce network overheads.
        • Incremental Condition: the filter conditions that are used to filter data in the source so that only incremental data is synchronized. You can configure filter conditions based on the descriptions in Configure scheduling parameters.
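
An incremental condition typically combines a time column in the source table with a scheduling parameter that is resolved at run time. The following Python sketch shows how such a condition could be rendered. It assumes a ${bizdate}-style placeholder that resolves to the day before the scheduled run in yyyymmdd format, and a hypothetical gmt_modified column in a MySQL source. Check Configure scheduling parameters for the placeholders that your environment actually supports.

```python
from datetime import date, timedelta


def bizdate(scheduled):
    """Return the day before the scheduled date in yyyymmdd format (assumption)."""
    return (scheduled - timedelta(days=1)).strftime("%Y%m%d")


def render_condition(template, scheduled):
    """Replace the ${bizdate} placeholder in a filter condition template."""
    return template.replace("${bizdate}", bizdate(scheduled))


# Hypothetical filter on a modification-time column of a MySQL source table.
template = "DATE_FORMAT(gmt_modified, '%Y%m%d') = '${bizdate}'"
condition = render_condition(template, date(2024, 3, 2))
print(condition)
```

Each scheduled run then selects only the rows modified on the previous day, so consecutive runs do not overlap.
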
      • Recurrence
        • Recurrence: the scheduling cycle of the batch synchronization solution. Valid values: Minute, Hour, Daily, Weekly, and Monthly. For more information about how to configure a scheduling cycle, see Configure a scheduling cycle.
        • Scheduling Period: The batch synchronization solution is run only within the scheduling period that you specify.
        • Pausing Scheduling: If you select this check box, the batch synchronization solution is paused and does not run based on the scheduling cycle until you cancel the pause. You can select this check box if you do not need to run the solution for a period of time.
        • Rerun: Valid values:
          • Allow Regardless of Running Status: You can set Rerun to this value if the batch synchronization solution can be rerun multiple times and the reruns do not affect synchronization results.
          • Disallow Regardless of Running Status: You can set Rerun to this value if a rerun can affect synchronization results regardless of whether the previous run of the batch synchronization solution succeeded or failed. If you set Rerun to this value, the system does not automatically rerun the synchronization solution after the system recovers from an exception.

    3. Click Next Step.
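
The relationship between the synchronization method and the parameter sections in this step can be summarized as a lookup. The following Python sketch encodes it. The Full Sync and Incremental Sync entries follow the conditions stated in this step. The Recurrence entries are an assumption inferred from the method descriptions, which mention a scheduling cycle only for the periodic methods and for Incremental Sync after One-time Full Sync. The names in the code are illustrative.

```python
FULL = "Full Sync"
INCREMENTAL = "Incremental Sync"
RECURRENCE = "Recurrence"

# Parameter sections to configure for each synchronization method.
# The RECURRENCE entries are inferred from the method descriptions (assumption).
REQUIRED_SECTIONS = {
    "Only One-time Full Sync": [FULL],
    "Only One-time Incremental Sync": [INCREMENTAL],
    "Periodic Full Sync": [FULL, RECURRENCE],
    "Periodic Incremental Sync": [INCREMENTAL, RECURRENCE],
    "Incremental Sync after One-time Full Sync": [FULL, INCREMENTAL, RECURRENCE],
}


def sections_for(method):
    """Return the parameter sections to fill in for a synchronization method."""
    try:
        return list(REQUIRED_SECTIONS[method])
    except KeyError:
        raise ValueError(f"unknown synchronization method: {method!r}") from None


for method in REQUIRED_SECTIONS:
    print(f"{method}: {', '.join(sections_for(method))}")
```
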
  8. Configure the resources required for the synchronization solution.
    In the Set Resources for Solution Running step, configure the parameters.
    • Full Sync
      The parameters in this section are required only if you set Solution to Only One-time Full Sync, Periodic Full Sync, or Incremental Sync after One-time Full Sync in the Sync Rules step.
      • Offline task name rules: the name of the batch synchronization node that is used to synchronize the full data of the source. After the synchronization solution is created, DataWorks generates a batch synchronization node to synchronize the full data of the source.
      • Resource Group for Full Batch Sync Nodes: Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
        Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Incremental Sync
      The parameters in this section are required only if you set Solution to Only One-time Incremental Sync, Periodic Incremental Sync, or Incremental Sync after One-time Full Sync in the Sync Rules step.
      • Naming Rule for Incremental Sync Nodes: the name of the batch synchronization node that is used to synchronize the incremental data of the source. After the synchronization solution is created, DataWorks generates a batch synchronization node to synchronize the incremental data of the source.
      • Resource Group for Incremental Batch Sync Nodes: Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
        Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Scheduling Settings
      • Select scheduling Resource Group: the resource group for scheduling that is used to run the nodes generated by the batch synchronization solution. Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
        Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
      • Maximum number of connections supported by source read: the maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. Specify an appropriate number based on the resources of the source. Default value: 15.
  9. Click Complete configuration. The batch synchronization solution used to synchronize all data in a database is created.

Run the batch synchronization solution

On the Tasks page, find the created data synchronization solution and click Submit and Run in the Operation column to run the solution.

View the statuses and results of the synchronization nodes

  • On the Tasks page, find the solution that is run and click Execution details in the Operation column. Then, you can view the details of all nodes generated by the batch synchronization solution.
  • Find a node whose details you want to view and click Execution details in the Status column. In the dialog box that appears, click the provided link to go to the DataStudio page.

Manage the batch synchronization solution

  • View or edit the data synchronization solution.
    On the Tasks page, find the newly created synchronization solution and choose More > View Setting or choose More > Modify Configuration in the Operation column. Then, you can view or modify the configurations of the batch synchronization solution.
    Note You can choose More > Modify Configuration in the Operation column that corresponds to a batch synchronization solution in the Not Running state to edit the batch synchronization solution. If you click Modify Configuration in the Operation column that corresponds to a batch synchronization solution in another state, you can only view information about the solution.
  • Change the priority for the batch synchronization solution.
    Find the newly created batch synchronization solution and choose More > Change Priority in the Operation column. In the Change Priority dialog box, enter the desired priority and click Confirm. You can set the priority to an integer from 1 to 8. A larger value indicates a higher priority.
    Note If multiple batch synchronization solutions have the same priority, the system runs them in the order in which they are committed.
  • Delete the batch synchronization solution.
    Find the batch synchronization solution that you want to delete and choose More > Delete in the Operation column. In the Delete message, click OK.
    Note After you click OK, only the configuration record of the batch synchronization solution is deleted. The synchronization nodes generated by the solution and data tables generated by the synchronization nodes are not affected.
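
The priority rule described under Change the priority for the batch synchronization solution can be expressed as a sort: a larger value runs first, and solutions with the same priority run in commit order. The following Python sketch is an illustration under that reading. The function name and the sample solution names are hypothetical, and the actual scheduler may take other factors into account.

```python
def run_order(solutions):
    """Order solutions by priority (larger first), then by commit order.

    solutions: list of (name, priority) tuples in the order of commit.
    """
    keyed = []
    for position, (name, priority) in enumerate(solutions):
        if not isinstance(priority, int) or not 1 <= priority <= 8:
            raise ValueError(f"priority of {name!r} must be an integer from 1 to 8")
        # Negate the priority so that larger values sort first.
        # The commit position breaks ties between equal priorities.
        keyed.append((-priority, position, name))
    return [name for _, _, name in sorted(keyed)]


# Hypothetical solutions listed in the order in which they were committed.
committed = [("mysql_to_es", 3), ("polardb_to_es", 8), ("sqlserver_to_es", 3)]
print(run_order(committed))
```
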