After you prepare data sources, network environments, and resources, you can create a real-time data sync node to synchronize data to DataHub. This topic describes how to create a real-time data sync node and view the status of the node.

Limits

  • DataWorks allows you to use only exclusive resource groups for Data Integration to run real-time data sync nodes.

  • You can run real-time data sync nodes to synchronize data only from PolarDB, ApsaraDB for OceanBase, MySQL, or Oracle to DataHub.

Create a real-time data sync node

  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces.
  3. Select the region in which the workspace that you want to manage resides, find the workspace, and click Data Analytics in the Actions column.
  4. Create a workflow.
    If you have a workflow, skip this step.
    1. Move the pointer over the Create icon and select Workflow.
    2. In the Create Workflow dialog box, set the Workflow Name parameter.
    3. Click Create.
  5. Create a real-time data sync node.
    1. On the DataStudio page, move the pointer over the Create icon and choose Data Integration > Real-time synchronization.
      Alternatively, find the workflow in which you want to create a real-time data sync node and right-click Data Integration. From the shortcut menu, choose Create > Real-time synchronization.
    2. In the Create Node dialog box, set the parameters that are described in the following table.
      Parameter Description
      Node Type The type of the node. Default value: Real-time synchronization.
      Sync Method Set the value to Migration to DataHub. In this case, some or all tables in the source database are migrated to DataHub.
      Node Name The name of the node. The name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
      Location The directory in which the real-time data sync node is stored.
    3. Click Commit. You are navigated to the configuration tab of the real-time data sync node.
  6. Select a resource group.
    1. On the right side of the configuration tab, click the Basic Configuration tab.
    2. In the panel that appears, select the resource group that you want to use from the Resource Group drop-down list.
      Note

      DataWorks allows you to use only exclusive resource groups for Data Integration to run real-time data sync nodes.

      If no exclusive resource group for Data Integration exists, click Create Exclusive Resource Group for Data Integration to create a resource group. For more information, see Exclusive resource groups for Data Integration.
  7. Select a source and configure synchronization rules.
    1. In the Data Source section, specify the Type and Data source parameters.
      Note You can set the Type parameter only to MySQL, ApsaraDB for OceanBase, or PolarDB.
    2. In the Source Table section, select the tables whose data you want to synchronize from the Source Table list. Then, click the Move icon to move the tables to the Selected Source Table list.
      All tables in the source data source are listed in the Source Table section. You can select all tables or specific tables and synchronize them at the same time.
    3. In the Conversion Rule for Table Name section, click Add rule to select a rule.
      Supported options include Conversion Rule for Table Name and Rule for Destination Topic.
      • Conversion Rule for Table Name: the rule for converting the names of source tables to those of destination topics.
      • Rule for Destination Topic: the rule for adding prefixes and suffixes to destination topics.
    4. Click Next Step.
  8. Select a data source as the destination and configure the formats for the destination topics.
    1. In the Set Destination Topic step, set the Destination, DataHub write mode, and Sharding Strategy parameters.
      If you want to synchronize a source table that does not contain a primary key, you can select Source tables without primary keys can be synchronized.
    2. Optional: Add fields to the destination tables.
      If you want to add fields to all the tables to be synchronized, click New field in the Fields In Destination Table section.
    3. Click Refresh source table and DataHub Topic mapping to configure the mappings between the source tables and destination DataHub topics.
    4. View the mapping progress, source tables, and mapped destination topics.
      No. Description
      1 The progress of mapping the source tables to the destination topics.
      Note The mapping may take a long time if you synchronize data from a large number of tables.
      2 The source tables.
      • If the tables in the source database contain primary keys, the system removes duplicate data based on the primary keys during the synchronization.
      • If the tables in the source database do not contain primary keys, the result depends on your setting in the Set Destination Topic step:
        • If you select Source tables without primary keys can be synchronized, tables without primary keys can be synchronized. You can click the Edit icon to customize primary keys by specifying one field or a combination of several fields to serve as the primary key. The system then removes duplicate data based on the custom primary keys during the synchronization.
        • If you do not select Source tables without primary keys can be synchronized, errors occur when you synchronize tables without primary keys. In this case, you must delete these tables before the data synchronization can continue.
      3 The method that is used to create a destination topic. The message that appears in the DataHub Topic column varies based on the method that you select.
      • If you set the Topic creation method parameter to Create Topic, the name of the topic that is automatically created appears in the DataHub Topic column. You can click the topic name to modify the topic information.
      • If you set the Topic creation method parameter to Use Existing Topic, you must select a topic from the drop-down list in the DataHub Topic column.
    5. Click Next Step.
      If you set the Topic creation method parameter to Create Topic, you must click Start table building in the Create tables automatically dialog box to create DataHub topics.
  9. Configure the resources required by the data sync node.
    1. In the Set Resources for Solution Running step, set the parameters that are described in the following table.
      Parameter Description
      Maximum number of connections supported by source read The maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. Specify an appropriate number based on the resources of the source. Default value: 15.
      Maximum number of parallel threads allowed to read by destination The maximum number of parallel threads that the sync node uses to read data from the source table or write data to the destination. Maximum value: 32. Specify an appropriate number based on the resources of the source and the destination.
    2. Click Complete Configuration.
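
The deduplication behavior described in the mapping step can be pictured with a small sketch. The following Python code is an illustration only and not how DataWorks implements synchronization. It assumes that when two records share the same key, the later record wins; the source text states only that duplicates are removed based on the primary key. The function name deduplicate and the sample field names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, object]


def deduplicate(rows: Iterable[Row], key_fields: Sequence[str]) -> List[Row]:
    """Remove duplicate rows based on a primary key.

    key_fields can be a real primary key or, for a table without one,
    a custom key made of one field or a combination of fields.
    Assumption: a later row with the same key replaces the earlier one.
    """
    latest: Dict[Tuple[object, ...], Row] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        latest[key] = row  # overwrite keeps the position of the first occurrence
    return list(latest.values())


if __name__ == "__main__":
    records = [
        {"id": 1, "region": "a", "value": 1},
        {"id": 1, "region": "b", "value": 2},
        {"id": 1, "region": "a", "value": 3},
    ]
    # Custom key that combines two fields, as for a table without a primary key.
    print(deduplicate(records, ["id", "region"]))
```

With the combined key (id, region), the first and third sample records are treated as the same record and only the third is kept. With id alone as the key, all three records collapse into one.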

Commit and deploy the real-time data sync node

To commit and deploy the real-time data sync node, perform the following steps:
  1. Click the Save icon in the top toolbar to save the node.
  2. Click the Submit icon in the top toolbar to commit the node.
  3. In the Commit Node dialog box, enter your comments in the Change description field.
  4. Click OK.
If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. Click Deploy in the upper-right corner. For more information, see Deploy nodes.

Start the real-time data sync node

  1. Go to the Operation Center page.
    After you commit and deploy the real-time data sync node, click Operation Center in the upper-right corner of the DataStudio page to manage the node on the Real Time DI page.
  2. View the details of a real-time data sync node.
    On the Real Time DI page, find the real-time data sync node that you want to view and click the node name.
  3. Start the real-time data sync node.
    1. Go back to the previous page and click Start in the Operation column of the node that you want to start.
    2. In the Start dialog box, set the parameters as required.
      Parameter Description
      Whether to reset the site Specifies whether to set the point in time for the next startup. If you select the Reset site parameter, the Start time point and Time zone parameters are required.
      Start time point The date and time for starting the real-time data sync node.
      Time zone The time zone in which the real-time data sync node is run. You can select a time zone from the Time zone drop-down list.
      Failover The maximum number of failovers allowed within the specified time range.
      Note If this parameter is not specified, the system automatically stops the node if the number of failovers exceeds 100 within 5 minutes. This prevents excessive resource consumption caused by the frequent starts of the node.
      Dirty data policy The policy for handling dirty data. Valid values:
      • Zero tolerance, not allowed: The real-time data sync node is automatically stopped if dirty data is generated.
      • No limit: The real-time data sync node continues to run regardless of whether dirty data is generated.
      • Limited control: The real-time data sync node is automatically stopped if the amount of dirty data exceeds a specified value.
    3. Click Confirm.
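
The Failover and Dirty data policy settings in the Start dialog box both describe stop conditions. The following Python sketch illustrates that logic under stated assumptions and is not the DataWorks implementation. The class FailoverLimiter, the function should_stop_for_dirty_data, and the policy identifiers are hypothetical names. The defaults mirror the documented fallback of 100 failovers within 5 minutes, and the exact handling of the window boundary is an assumption.

```python
from collections import deque
from typing import Deque, Optional


class FailoverLimiter:
    """Counts failovers in a sliding time window."""

    def __init__(self, max_failovers: int = 100, window_seconds: float = 300.0) -> None:
        self.max_failovers = max_failovers
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record a failover and return True if the node should be stopped."""
        self._events.append(timestamp)
        # Drop failovers that are older than the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        # Stop only when the count exceeds the limit.
        return len(self._events) > self.max_failovers


def should_stop_for_dirty_data(policy: str, dirty_count: int,
                               limit: Optional[int] = None) -> bool:
    """Apply one of the three dirty data policies (identifiers are hypothetical)."""
    if policy == "zero_tolerance":
        return dirty_count > 0
    if policy == "no_limit":
        return False
    if policy == "limited_control":
        if limit is None:
            raise ValueError("limited_control requires a limit")
        return dirty_count > limit
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    limiter = FailoverLimiter(max_failovers=2, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0):
        print(t, limiter.record(t))
    print(should_stop_for_dirty_data("limited_control", 6, limit=5))
```

The sliding window is one reasonable reading of "within 5 minutes"; a fixed window that resets every 5 minutes would behave differently near the boundaries.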

Manage the real-time data sync node

  • Stop a real-time data sync node that is running.

    Find the real-time data sync node that you want to stop and click Stop in the Operation column. In the message that appears, click Stop.

  • Undeploy a real-time data sync node that is not running.

    Find the real-time data sync node that you want to undeploy and click Undeploy in the Operation column. In the message that appears, click Undeploy.

  • View the alert information of a real-time data sync node.

    Find the real-time data sync node that you want to view and click Alert settings in the Actions column. In the Alert settings dialog box, view the alert events and alert rules.

  • Configure alert rules for a real-time data sync node.
    1. Find the real-time data sync node for which you want to configure alert rules and click Configure Alert Rule in the lower part of the Real Time DI page.
    2. In the New rule dialog box, set the parameters that are described in the following table.
      Parameter Description
      Name The name of the alert rule.
      Description The description of the alert rule.
      Indicators The metric for which an alert is reported. Valid values:
      • Status
      • Business delay
      • Failover
      • Dirty Data
      • Not Supported by DDL Statement
      Threshold The threshold for reporting an alert. Specify the WARNING In and CRITICAL In parameters. The default values of the parameters are 5 minutes.
      Alarm interval The interval at which an alert is reported. The default value is 5 minutes.
      WARNING and CRITICAL The method that is used to send alert notifications for each alert level. You can specify one or more methods. Valid values: Mail, SMS, and DingTalk.
      Note The SMS notification method is supported only in the Singapore, Malaysia (Kuala Lumpur), and Germany (Frankfurt) regions. To use the SMS notification method in other regions, submit a ticket to contact DataWorks technical support.
      Receiver (Non-DingTalk) The recipient of alert notifications.
    3. Click Confirm.
  • Modify alert rules for multiple real-time data sync nodes at a time.
    1. Select one or more real-time data sync nodes for which you want to modify alert rules and click Operation alarm in the lower part of the Real Time DI page.
    2. In the Operation alarm dialog box, modify the values of the Type and Indicators parameters.
    3. Click Confirm.
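
The relationship between the Threshold and Alarm interval parameters of an alert rule can be sketched as follows. This Python code is an illustration only. It assumes that WARNING In and CRITICAL In are durations in minutes after which a condition raises an alert of that level, and that Alarm interval is the minimum time between two notifications. The console does not document these semantics on this page, so treat them as assumptions. The function names alert_level and should_send are hypothetical.

```python
from typing import Optional


def alert_level(minutes_in_state: float,
                warning_in: float = 5.0,
                critical_in: float = 5.0) -> Optional[str]:
    """Return the alert level for a condition that has lasted minutes_in_state.

    Assumption: the CRITICAL threshold is checked first, so it takes
    precedence when both thresholds are reached.
    """
    if minutes_in_state >= critical_in:
        return "CRITICAL"
    if minutes_in_state >= warning_in:
        return "WARNING"
    return None


def should_send(now_minute: float,
                last_sent_minute: Optional[float],
                alarm_interval: float = 5.0) -> bool:
    """Return True if enough time has passed since the last notification."""
    if last_sent_minute is None:
        return True
    return now_minute - last_sent_minute >= alarm_interval


if __name__ == "__main__":
    print(alert_level(7.0, warning_in=5.0, critical_in=10.0))
    print(should_send(now_minute=3.0, last_sent_minute=0.0))
```

With both thresholds left at the default of 5 minutes, this sketch never reports WARNING, which suggests setting CRITICAL In higher than WARNING In if you want both levels to be used.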