
DataWorks:Configure a real-time synchronization node in DataStudio

Last Updated: Aug 16, 2023

After you prepare data sources, network environments, and resources, you can create a real-time synchronization node to synchronize incremental data from a single table or an entire database in the source to the destination in real time. This topic describes how to create and configure such a node and how to view its status.

Prerequisites

  1. The data sources that you want to use are prepared. Before you configure a data synchronization node, you must prepare the data sources from which you want to read data and to which you want to write data. This way, when you configure a data synchronization node, you can select the data sources. For information about the data source types, readers, and writers that are supported by real-time synchronization, see Data source types that support real-time synchronization.

    Note

    For information about the items that you need to understand before you prepare a data source, see Overview.

  2. An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.

  3. Network connections are established between the exclusive resource group for Data Integration and the data sources. For more information, see Establish a network connection between a resource group and a data source.

  4. The data source environments are prepared. You must create an account that can be used to access a database in the source and an account that can be used to access a database in the destination. You must also grant the accounts the permissions required to perform specific operations on the databases based on your configurations for data synchronization. For more information, see Overview.

Go to the DataStudio page

You must go to the DataStudio page in the DataWorks console to create and configure a real-time synchronization node.

Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

Procedure

  1. Step 1: Create a real-time synchronization node

  2. Step 2: Configure a resource group

  3. Step 3: Configure the real-time synchronization node

  4. Step 4: Commit and deploy the real-time synchronization node

Step 1: Create a real-time synchronization node

  1. Create a workflow. For more information, see Create a workflow.

  2. Create a real-time synchronization node.

    1. You can use one of the following methods to create a real-time synchronization node:

      • Method 1: In the Scheduled Workflow pane of the DataStudio page, find the desired workflow in the Business Flow section and click the name of the workflow. Then, right-click Data Integration and choose Create Node > Real-time synchronization.

      • Method 2: In the Scheduled Workflow pane of the DataStudio page, find the desired workflow in the Business Flow section and double-click the name of the workflow. In the Data Integration section of the workflow configuration tab that appears, drag Real-time synchronization to the canvas on the right.

    2. In the Create Node dialog box, configure the following parameters:

      • Node Type: the type of the node. Default value: Real-time synchronization.

      • Sync Method: the synchronization method.

        • If you want to create a real-time synchronization node that is used to synchronize incremental data from a single table, set this parameter to End-to-end ETL. This method allows you to synchronize incremental data from one or more source tables to a single destination table.

          Note

          If you use this synchronization method, data can be written to only one destination table. If you want to write data to multiple destination tables, you can use one of the following solutions:

          • If you want to filter data, replace strings, or mask data during data synchronization, you can create multiple real-time synchronization nodes and use each node to synchronize incremental data from a single table in real time.

          • If you want to synchronize incremental data from multiple source tables to multiple destination tables, you can create multiple real-time synchronization nodes. For specific types of data sources, you can also create a single real-time synchronization node to synchronize all incremental data from a database.

          • If you want to synchronize full data once and then synchronize incremental data in real time, you can create a data synchronization solution based on your business requirements. For information about how to create a data synchronization solution, see Configure a synchronization task in Data Integration.

        • If you want to create a real-time synchronization node that is used to synchronize all incremental data from a database, select a synchronization method that is used to synchronize database changes, such as Migration to MaxCompute.

      • Path: the directory in which the real-time synchronization node is stored.

      • Name: the name of the node. The name can be a maximum of 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).
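The Name constraint can be expressed as a simple pattern. The following Python sketch is an illustration only, not an official DataWorks validator; the function name is hypothetical:

```python
import re

# Illustrative check for the Name constraint described above:
# at most 128 characters; letters, digits, underscores, and periods only.
NODE_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_.]{1,128}$")

def is_valid_node_name(name: str) -> bool:
    """Return True if the name satisfies the documented constraint."""
    return NODE_NAME_PATTERN.fullmatch(name) is not None
```

A pre-check like this can be useful if you create nodes in bulk through automation instead of the console.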

Step 2: Configure a resource group

You can use only exclusive resource groups for Data Integration to run real-time synchronization nodes. To configure a resource group for a real-time synchronization node, perform the following operations: Double-click the name of the created node. In the right-side navigation pane of the configuration tab of the node, click the Basic Configuration tab. On the Basic Configuration tab, select the exclusive resource group for Data Integration that is connected to the data source from the Resource Group drop-down list.

Note

We recommend that you run a real-time synchronization node and a batch synchronization node on different resource groups. If you run the nodes on the same resource group, they compete for CPU, memory, and network resources and affect each other. In this case, the batch synchronization node may slow down, or the real-time synchronization node may be delayed. In severe cases, out of memory (OOM) errors may occur due to insufficient resources.

Step 3: Configure the real-time synchronization node

Configure a real-time synchronization node to synchronize incremental data from a single table

  1. Configure the source.

    1. In the Input section of the configuration tab of the real-time synchronization node, drag the desired source type to the canvas on the right.

    2. Click the source type name. In the panel that appears, configure the parameters.

      For information about the source types that are supported for a real-time synchronization node used to synchronize incremental data from a single table and how to configure the related sources, see the following topics:

  2. Optional: Configure a data conversion component.

    If you want to convert data during data synchronization, you can configure a data conversion component.

    1. In the Conversion section of the configuration tab of the real-time data synchronization node, drag the desired data conversion component to the canvas on the right.

      The following data conversion components are supported for a real-time synchronization node that is used to synchronize incremental data from a single table:

      • Data filtering: You can use the data filtering component to filter data in a source based on specific rules, such as the field size. Only data that meets the rules is retained.

      • String replacement: You can use the string replacement component to replace field values of the STRING data type.

      • Data masking: You can use the data masking component to mask sensitive data in a single source table specified in a real-time synchronization node and write the masked data to a specified database.

    2. Click the component name. In the panel that appears, configure the parameters.
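Conceptually, the three conversion components behave like stages in a record pipeline: filter first, then transform, then mask. The following Python sketch illustrates the semantics with hypothetical rules and records; it is not how DataWorks implements the components:

```python
def filter_records(records, predicate):
    """Data filtering: keep only records that satisfy the rule."""
    return [r for r in records if predicate(r)]

def replace_strings(records, field, old, new):
    """String replacement: rewrite values of a STRING field."""
    return [{**r, field: r[field].replace(old, new)} for r in records]

def mask_field(records, field):
    """Data masking: hide sensitive values (full mask for illustration)."""
    return [{**r, field: "*" * len(r[field])} for r in records]

# Hypothetical records and rules: filter, then replace, then mask.
records = [
    {"id": "1", "city": "hz", "phone": "13512345678"},
    {"id": "2", "city": "bj", "phone": "13987654321"},
]
staged = filter_records(records, lambda r: r["city"] == "hz")
staged = replace_strings(staged, "city", "hz", "hangzhou")
staged = mask_field(staged, "phone")
# staged now holds one record with city "hangzhou" and a masked phone number
```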

  3. Configure the destination.

    1. In the Output section of the configuration tab of the real-time synchronization node, drag the desired destination type to the canvas on the right.

    2. Click the destination type name. In the panel that appears, configure the parameters.

      For information about the destination types that are supported for a real-time synchronization node used to synchronize incremental data from a single table and how to configure the related destinations, see the following topics:

  4. Connect the source to the destination.

    After the source and destination are configured, you can connect them by drawing lines. This way, data can be synchronized between the data sources based on the configurations.

Configure a real-time synchronization node to synchronize incremental data from a database

  1. Select the tables from which you want to read data and configure mapping rules.

    1. In the Data Source section of the Configure Source and Synchronization Rules step, configure the Type and Data source parameters.

    2. Select the tables from which you want to read data.

      In the Source Table section, all tables in the selected data source are displayed in the Source Table list. You can select all or some tables from the Source Table list and click the icon to move the tables to the Selected Source Table list.

      Important

      If a selected table does not have a primary key, the table cannot be synchronized in real time.

    3. Configure mapping rules for the names of the source tables and the names of the destination tables.

      After you select the source database and tables from which you want to synchronize incremental data, the real-time synchronization node automatically writes the data to the destination schema and table that have the same names as the source database and table. If no such destination schema or table exists, the system automatically creates the schema or table in the destination.

      You can configure a mapping rule in the Set Mapping Rules for Table/Database Names section to specify the name of the destination schema or table to which you want to write data. For example, you can specify a destination table name in a mapping rule to write data from multiple source tables to the same destination table. You can also specify prefixes in a mapping rule to write data to a destination database or to destination tables whose name prefixes differ from those of the source.

      • Conversion Rule for Table Name: This type of mapping rule allows you to use a regular expression to map the names of the destination tables to which you want to write data to the names of source tables.

        • Example 1: Synchronize data from source tables whose names start with the prefix doc_ to destination tables whose names start with the prefix pre_.

        • Example 2: Synchronize data from multiple source tables to the same destination table.

          To synchronize incremental data from table_01, table_02, and table_03 to my_table, you can configure a mapping rule of the Conversion Rule for Table Name type, and set Source to table.* and Target to my_table.

      • Rule for Destination Table name: This type of mapping rule allows you to use a built-in variable to specify the names of the destination tables to which you want to write data and add a prefix and a suffix to the names of the destination tables. The following built-in variables are supported:

        • ${db_table_name_src_transed}: the name of the destination table that is mapped based on a mapping rule of the Conversion Rule for Table Name type

        • ${db_name_src_transed}: the name of the destination schema that is mapped based on a mapping rule of the Rule for Conversion Between Source Database Name and Destination Schema Name type

        • ${ds_name_src}: the name of the source

        For example, you can configure pre_${db_table_name_src_transed}_post to convert the table name my_table that is generated in the previous example to pre_my_table_post.

      • Rule for Conversion Between Source Database Name and Destination Schema Name: This type of mapping rule allows you to use a regular expression to specify the names of the destination schemas to which you want to write data.

        Example: Synchronize data from the source schemas whose names start with the prefix doc_ to the destination schemas whose names start with the prefix pre_.
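The mapping rules above combine regular-expression renaming with ${...} variable substitution. The following Python sketch reproduces the two examples with hypothetical helper functions; the real mapping engine is internal to DataWorks:

```python
import re

def map_table_name(src_table: str, pattern: str, target: str) -> str:
    """Conversion Rule for Table Name: regex-rename a source table."""
    return re.sub(pattern, target, src_table)

def apply_destination_rule(template: str, variables: dict) -> str:
    """Rule for Destination Table name: expand ${...} built-in variables."""
    for name, value in variables.items():
        template = template.replace("${" + name + "}", value)
    return template

# Example 1: names with the doc_ prefix map to the pre_ prefix.
renamed = map_table_name("doc_orders", r"^doc_", "pre_")      # pre_orders

# Example 2: table_01, table_02, table_03 all map to my_table.
merged = map_table_name("table_02", r"^table.*$", "my_table")  # my_table

# Destination rule: pre_${db_table_name_src_transed}_post -> pre_my_table_post.
result = apply_destination_rule(
    "pre_${db_table_name_src_transed}_post",
    {"db_table_name_src_transed": "my_table"},
)
```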

  2. Select a destination and configure destination tables or destination topics.

    1. In the Set Destination Table/Topic step, configure the basic information for the Destination parameter, such as the Write Mode and Automatic Partitioning by Time parameters. The required configurations vary based on the data source type. The parameters that are displayed in the DataWorks console prevail.

    2. Click Refresh source table and destination table/topic mapping to map the source tables to destination tables or topics.

      You can specify custom names for destination schemas and destination tables. You can also click Edit additional fields in the Actions column to add additional fields to a destination table or topic and assign constants or variables to the additional fields as values. The required configurations vary based on the data source type. The parameters that are displayed in the DataWorks console prevail.

      Note

      The mapping may require a long period of time if data is synchronized from a large number of tables.

  3. Optional: Configure rules to process DML messages.

    Only some real-time synchronization nodes support processing rules for DML messages. You can configure processing rules for DML messages in the Set Table-level Sync Rule step. This way, if an insert, update, or delete operation is performed on a source table, the real-time synchronization node processes the related DML message based on the processing rule that you configured.

    Note

    The support for synchronization of data changes caused by DML operations varies based on the destination type. The parameters that are displayed in the DataWorks console prevail. For more information, see Supported DML and DDL operations.

  4. Configure rules to process DDL messages.

    DDL operations may be performed on a source table. When you configure a real-time synchronization node to synchronize data in real time, you can configure rules to process different DDL messages based on your business requirements. The support for synchronization of data changes caused by DDL operations varies based on the destination type. For more information, see Supported DML and DDL operations.

    Note

    You can also configure processing rules for a specific destination type. To configure processing rules for a specific destination type, perform the following steps: In the left-side navigation pane of the Data Integration page, choose Configuration > Policy for Processing DDL Messages. On the Processing Policy for DDL Messages in Real-time Sync page, configure the rules. The following rules can be used to process different types of DDL messages.

    The same processing rules apply to each type of DDL message: CreateTable, DropTable, AddColumn, DropColumn, RenameTable, RenameColumn, ChangeColumn, and TruncateTable. For each DDL message type, you can select one of the following processing rules:

    • Normal: DataWorks sends the DDL message to the destination, and the destination processes the message. Different destinations respond to DDL messages in different ways. Therefore, DataWorks only sends the message to the destination.

    • Ignore: DataWorks discards the DDL message and does not deliver it to the destination.

    • Alert: DataWorks discards the DDL message and generates an alert in the real-time synchronization logs. The alert indicates that the message is discarded due to an execution error.

    • Error: DataWorks terminates the real-time synchronization node and sets the node status to Failed.
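The four processing rules can be modeled as a simple dispatch on the rule name. The following Python sketch illustrates the documented semantics with hypothetical function names; it is not actual DataWorks behavior:

```python
class SyncNodeFailed(Exception):
    """Raised when the Error rule terminates the node."""

def process_ddl_message(message: str, rule: str, deliver, log_alert):
    """Apply one of the documented processing rules to a DDL message."""
    if rule == "Normal":
        deliver(message)  # forward; the destination decides how to react
    elif rule == "Ignore":
        pass              # discard silently
    elif rule == "Alert":
        log_alert(f"DDL message discarded due to an execution error: {message}")
    elif rule == "Error":
        raise SyncNodeFailed(message)  # node status becomes Failed
    else:
        raise ValueError(f"unknown rule: {rule}")

# Hypothetical usage: collect delivered messages and alerts in lists.
delivered, alerts = [], []
process_ddl_message("ALTER TABLE t ADD COLUMN c INT", "Normal",
                    delivered.append, alerts.append)
process_ddl_message("DROP TABLE t", "Alert", delivered.append, alerts.append)
```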

  5. Configure the resources that are required to run the real-time synchronization node.

    • You can specify the maximum number of parallel threads that can be used to read data from the source and write data to the destination.

    • You can specify whether dirty data is allowed during data synchronization.

      • If you do not allow the generation of dirty data and dirty data records are generated during data synchronization, the real-time synchronization node fails.

      • If you allow the generation of dirty data and specify the maximum number of dirty data records that are allowed, the number of generated dirty data records determines whether the node fails. If the number of generated dirty data records does not exceed the specified limit, the dirty data is ignored and the node continues to run. If the number of generated dirty data records exceeds the specified limit, the node fails.
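The dirty-data settings follow a simple threshold rule. The following Python sketch summarizes the documented behavior with hypothetical names; it is not the actual Data Integration implementation:

```python
def check_dirty_data(allowed: bool, limit: int, dirty_count: int) -> str:
    """Decide the node outcome from the documented dirty-data rules."""
    if not allowed and dirty_count > 0:
        return "failed"   # dirty data is not allowed at all
    if allowed and dirty_count > limit:
        return "failed"   # the specified limit is exceeded
    return "running"      # dirty records (if any) are ignored
```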

  6. Click Complete Configuration.

Step 4: Commit and deploy the real-time synchronization node

  1. Click the Save icon in the top toolbar to save the node.

  2. Click the Submit icon in the top toolbar to commit the node.

  3. In the Commit Node dialog box, configure the Change description parameter.

  4. Click Confirm.

    If you use a workspace in standard mode, you must deploy the node to the production environment after you commit the node. In the top navigation bar, click Deploy. For more information about how to deploy a node, see Deploy nodes.

What to do next

After the real-time synchronization node is configured, you can start and manage the node on the Real Time DI page in Operation Center. To go to the Real Time DI page, perform the following operations: Log on to the DataWorks console and go to the Operation Center page. In the left-side navigation pane of the Operation Center page, choose RealTime Task > Real Time DI. For more information, see O&M for real-time synchronization nodes.