After you configure data sources, network environments, and resource groups, you can create and run a sync solution. This topic describes how to create a sync solution and view the status of the nodes that are generated by the sync solution.

Prerequisites

Before you create a sync solution, make sure that the following operations are performed:
  • Plan and configure resources
  • Configure a source MySQL data source
  • Configure a source Oracle data source
  • Configure a source PolarDB data source
  • Add data sources
  • To run sync nodes on an exclusive resource group for Data Integration, make sure that the version of the DataX plug-in that is used to run batch sync nodes is 20210726203000 or later and that the version of the StreamX plug-in that is used to run real-time sync nodes is 202107121400 or later. Otherwise, a data format error may occur, or the sync nodes that synchronize incremental or full data to Kafka may fail to run.
    View the version of DataX: Go to the Operation Center page and click Patch Data under Cycle Task Maintenance in the left-side navigation pane. In the directed acyclic graph (DAG) of the batch sync node, right-click the node and click View Runtime Log. On the page that appears, search for Detail log url in the log area and click the link to go to the page that displays the details of the batch sync node. Then, search the log area on that page for the version information, which is in the format of DataX ( ... ), From Alibaba !. For example, the log entry DataX (20210709_keyindex-20210709144909), From Alibaba ! indicates the version of DataX.
    View the version of StreamX: Go to the Operation Center page and click Real Time DI under RealTime Task Maintenance in the left-side navigation pane. On the Real Time DI page, click the real-time sync node. Then, click the Log tab and search the log area for the version information, which is in the format of StreamX ( ... ), From Alibaba !. For example, the log entry StreamX (202107290000_20210729121213), From Alibaba ! indicates the version of StreamX.
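    If you need to check plug-in versions across many nodes, the version string can be pulled out of a log line programmatically. The following sketch assumes log lines in the formats quoted above; the actual log content may differ slightly.

    ```python
    import re

    # Hypothetical log lines in the formats quoted above.
    logs = [
        "DataX (20210709_keyindex-20210709144909), From Alibaba !",
        "StreamX (202107290000_20210729121213), From Alibaba !",
    ]

    # Capture the plug-in name and the version string inside the parentheses.
    pattern = re.compile(r"(DataX|StreamX)\s*\(([^)]+)\)")

    for line in logs:
        match = pattern.search(line)
        if match:
            plugin, version = match.groups()
            print(f"{plugin}: {version}")
    ```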

Create a sync solution

  1. Go to the Data Integration page and choose Sync Solutions > Nodes to go to the Task list page.
    For more information, see Select a data synchronization solution.
  2. On the Task list page, click New task in the upper-right corner.
  3. Configure a sync solution.
    1. Specify a source data source and a destination data source.
      Select data sources from the Select source drop-down lists in the Source and Destination sections.
      Note The One-click real-time synchronization to Kafka sync solution allows you to synchronize data to Kafka only from MySQL, Oracle, and PolarDB data sources.
    2. In the Select Synchronization Solution section, click One-click real-time synchronization to Kafka.
    3. Click Next Step.
  4. Configure the network connections for data synchronization.
    1. Select data sources from the Connection Name drop-down lists for the source and the destination. If no data source is available in the drop-down list, click Add Data Source to add one. For more information, see Add a MySQL data source, Add an Oracle data source, and Add a PolarDB data source.
    2. In the resource group section, select an exclusive resource group for Data Integration from the drop-down list. If no exclusive resource group for Data Integration is available in the drop-down list, click Create Exclusive Resource Group for Data Integration to purchase one. In the dialog box that appears, set the Specifications, Number of resources, and Billing cycle parameters and click Confirm purchase to go to the payment page. For more information, see Plan and configure resources.
      Note

      By default, the Regions parameter is configured based on the region where the workspace resides.

      After you purchase an exclusive resource group for Data Integration, it is associated with this workspace by default.

    3. Click Test Connectivity to check the network connections between the exclusive resource group for Data Integration and the data sources. For more information, see Select a network connectivity solution. If the network connectivity test fails, find the cause by following the operations on the Network Connectivity Diagnostic Tool tab.
    4. Click Next Step.
  5. Select a source and configure synchronization rules.
    1. In the Set Synchronization Sources and Rules step, configure basic information such as the solution name for the sync solution.
      In the Basic Configuration section, set the parameters.
      Parameter Description
      Solution Name The name of the sync solution. The name can be up to 50 characters in length.
      Description The description of the sync solution. The description can be up to 50 characters in length.
      Location If you select Automatic Workflow Creation, DataWorks automatically creates a workflow named in the format of clone_database_Source name+to+Destination name. All sync nodes generated by the sync solution are placed in the Data Integration folder of this workflow.

      If you clear Automatic Workflow Creation, you must select a directory from the Select Location drop-down list. All sync nodes generated by the sync solution are placed in the specified directory.

    2. In the Data Source section, select the encoding format for the data source from the Encoding drop-down list.
    3. In the Source Table section, select the tables whose data you want to synchronize from the Source Table list. Then, click the icon to move the tables to the Selected Table list.
      All tables in the source data source are listed in the Source Table section. You can select all tables or specific tables to synchronize at a time.
    4. In the Mapping Rules for Table Names section, click Add rule to select a rule.
      Supported options include Conversion Rule for Table Name and Rule for Destination Table name.
      • Conversion Rule for Table Name: the rule that is used to convert the names of source tables into those of destination tables.
      • Rule for Destination Table name: the rule that is used to add prefixes and suffixes to the names of destination tables.
    5. Click Next Step.
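    Conceptually, the two mapping rules above are a name conversion followed by a prefix and suffix. The following sketch illustrates that behavior with a hypothetical helper; it assumes the conversion rule is a simple regex search-and-replace, which may not match the exact rule semantics in the console.

    ```python
    import re

    def map_table_name(source_name, conversion=None, prefix="", suffix=""):
        """Apply a hypothetical name-conversion rule (a regex search/replace
        pair), then add the destination prefix and suffix."""
        name = source_name
        if conversion:
            pattern, replacement = conversion
            name = re.sub(pattern, replacement, name)
        return f"{prefix}{name}{suffix}"

    # Example: strip a source-side "ods_" prefix, then add "topic_" for Kafka.
    print(map_table_name("ods_orders", conversion=(r"^ods_", ""), prefix="topic_"))
    # topic_orders
    ```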
  6. Configure the destination topic.
    1. By default, the Destination parameter is set to the destination data source that you configure.
    2. Click Refresh source table and Kafka Topic mapping to configure the mappings between the source tables and destination Kafka topics.
    3. View the mapping progress, source tables, and mapped destination topics.
      Serial number Description
      1 The progress of mapping the source tables to destination tables.
      Note The mapping may require an extended period of time if you want to synchronize data from a large number of tables.
      2
      • If you select Source tables without primary keys can be synchronized, a source table that does not contain a primary key can be synchronized to the destination. However, duplicate data may be generated during synchronization.
      • If you select Send heartbeat record, the real-time sync node writes a record that contains the current timestamp to Kafka every 5 seconds. This way, you can view the updates of the timestamp for the latest record written to Kafka and check the progress of the data synchronization even if no new records are written to Kafka.
      3
      • If the tables in the source database contain primary keys, the system removes duplicate data based on the primary keys during the synchronization.
      • If you select Source tables without primary keys can be synchronized and a source table does not contain a primary key, click the edit icon to specify a primary key. You can select one or more columns to serve as the primary key. The values of these columns are used to remove duplicate data during synchronization.
      4 The method that is used to create a destination topic. Valid values: Use Existing Topic and Create Topic.
      5

      The value in the Kafka Topic column varies with the value that you set for Topic creation method.

      • If you set the Topic creation method parameter to Use Existing Topic, you can select the destination topic from the drop-down list in the Kafka Topic column.
      • If you set the Topic creation method parameter to Create Topic, the name of the topic that is automatically created appears in the Kafka Topic column. You can click the automatically created topic to view and modify the name and description of the topic.
      6 You can click Batch Edit Additional Fields in Destination Topic and add fields for multiple Kafka topics in the dialog box that appears. You can also click Edit additional fields in the Actions column to add additional fields for a single Kafka topic.
      Note The Batch Edit Additional Fields in Destination Topic feature takes effect only if you select Create Topic for the Topic creation method parameter.
    4. Click Next Step.
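    The primary-key-based deduplication described above can be sketched as keeping only the latest record per key. This is a conceptual illustration, not the sync engine's actual implementation, and the sample rows are hypothetical.

    ```python
    def deduplicate(records, primary_key):
        """Keep only the latest record for each primary-key value,
        preserving the order in which keys first appear."""
        latest = {}
        for record in records:
            key = tuple(record[col] for col in primary_key)
            latest[key] = record  # a later record overwrites an earlier one
        return list(latest.values())

    rows = [
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b"},
        {"id": 1, "name": "a-updated"},  # duplicate key: replaces the first row
    ]
    print(deduplicate(rows, primary_key=["id"]))
    # [{'id': 1, 'name': 'a-updated'}, {'id': 2, 'name': 'b'}]
    ```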
  7. Configure the resources required by the sync solution.
    In the Set Resources for Solution Running step, set the parameters that are described in the following table.
    • Offline Full synchronization
      Parameter Description
      Offline task name rules The name of the batch synchronization node that is used to synchronize the full data of the source. After a data synchronization solution is created, DataWorks first generates a batch synchronization node to synchronize full data, and then generates real-time synchronization nodes to synchronize incremental data.
      Resource Groups for Full Batch Sync Nodes

      The exclusive resource group for Data Integration that is used to run the batch synchronization node.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Full Batch Scheduling
      Parameter Description
      Select scheduling Resource Group

      The resource group for scheduling that is used to run the nodes.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Real-time Incremental synchronization
      Parameter Description
      Select an exclusive resource group for real-time tasks

      The exclusive resource group that is used to run the real-time synchronization nodes.

      Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
      Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
    • Channel Settings
      Parameter Description
      Maximum number of connections supported by source read The maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. Specify an appropriate number based on the resources of the source. Default value: 20.
  8. Click Complete Configuration. The real-time synchronization solution used to synchronize all data in a database is created.

Run the sync solution

On the Solution task list tab of the Tasks page, find the configured sync solution and click Submit and Run in the Operation column to run the sync solution.

View the status and result of the sync nodes

  • On the Solution task list tab of the Tasks page, find the solution that is run and choose Execution details in the Operation column. Then, you can view the running details of all nodes.
  • Find a node whose execution details you want to view and click Execution details in the Status column. In the dialog box that appears, click the provided link to go to the DataStudio page.

Manage the sync solution

  • View or edit the sync solution.

    On the Solution task list tab of the Tasks page, find the created sync solution and choose More > View Configuration or choose More > Modify Configuration in the Operation column. Then, you can view or modify the configurations of the sync solution.

  • Delete the real-time sync solution.
    Find the real-time sync solution that you want to delete and choose More > Delete in the Operation column. In the Delete message, click OK.
    Note After you click OK, only the configuration record of the real-time sync solution is deleted. The sync nodes generated by the sync solution and data tables generated by the sync nodes are not affected.
  • Change the priority for the batch synchronization solution
    Find the newly created batch synchronization solution and choose More > Change Priority in the Operation column. In the Change Priority dialog box, enter the desired priority and click Confirm. You can set the priority to an integer from 1 to 8. A larger value indicates a higher priority.
    Note If multiple batch synchronization solutions have the same priority, the system runs them in the order in which they were committed.

Set the formats of messages written to Kafka

If you run a real-time sync node after you configure a real-time sync solution, the node reads all the existing data from the source database and writes it to the destination Kafka topics in JSON format. The node then reads incremental data and writes it to Kafka in real time. It also synchronizes incremental DDL-based data changes from the source database to Kafka in JSON format in real time. For more information about the formats of messages written to Kafka, see Appendix: Message formats.
Note If you run a batch sync node to synchronize data to Kafka, the payload.sequenceId, payload.timestamp.eventTime, and payload.timestamp.checkpointTime fields in the messages written to Kafka are set to -1. The messages are in JSON format.
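The note above can be illustrated with a minimal message sketch. Only the -1 values for payload.sequenceId, payload.timestamp.eventTime, and payload.timestamp.checkpointTime come from this topic; the remaining field names and structure are illustrative assumptions, and Appendix: Message formats is the authoritative reference.

```python
import json

# Sketch of a message that a batch sync node might write to Kafka.
# The -1 values are stated in the note above; "before"/"after" and the
# overall layout are hypothetical placeholders.
message = {
    "payload": {
        "sequenceId": -1,
        "timestamp": {
            "eventTime": -1,
            "checkpointTime": -1,
        },
        "before": None,
        "after": {"id": 1, "name": "a"},
    },
}
print(json.dumps(message))
```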