This topic describes how to use Data Integration to synchronize data from tables in sharded databases to Hologres. In this example, data is synchronized from tables in sharded MySQL databases.

Prerequisites

  • A Hologres data source and a MySQL data source are added to DataWorks. You must add data sources to DataWorks. This way, you can select the data sources to synchronize data between the data sources when you configure a data synchronization solution. In this example, the MySQL data source doc_mysql1 is used. For more information about how to add a MySQL data source and a Hologres data source to DataWorks, see Add a MySQL data source and Add a Hologres data source.
    Note For information about the items that you need to understand before you add a data source, see Overview.
  • An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
  • Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
  • The data source environments are prepared.
    • Hologres: In this example, schemas need to be created in the Hologres data source. To create the schemas, you must grant the account that is configured when you add the Hologres data source to DataWorks the permissions to create schemas in the Hologres data source. For more information, see Overview.
    • MySQL: You must enable the binary logging feature for the MySQL data source. For more information, see Prepare a MySQL environment.

Background information

In actual business scenarios, data cannot be synchronized by using only one or more simple batch or real-time synchronization nodes. Instead, multiple batch synchronization nodes, real-time synchronization nodes, and data processing nodes are required to synchronize data. In this case, complex configurations are required. The synchronization of data from tables in sharded MySQL databases to a single Hologres table requires more complex configurations that involve multiple nodes. O&M operations on the nodes are also complex.

To resolve this issue, DataWorks Data Integration provides data synchronization solutions that can be used to synchronize data between various data sources in different business scenarios. The solutions improve the efficiency of data synchronization.

Precautions

When you synchronize data to Hologres, you can write data only to child partitioned tables.

Sample scenario

  • Scenario: The MySQL data source has the following sharded databases: order_db01, order_db02, and order_db03. The three sharded databases contain 12 physical tables. Two logical tables t_order and t_user exist in the Hologres instance. t_order is an order table, and t_user is a user table. Two physical tables in each sharded database map to one logical table. DatabaseYou need to synchronize full data from the 12 physical tables to the two logical tables in the Hologres data source at a time. You also need to make sure that incremental data in the 12 physical tables can be synchronized to the logical tables in real time to meet requirements such as data analytics in the Hologres data source.

    To mark the source, source sharded database, and source physical table to which a data record in a logical table belongs, the following additional fields need to be added to the logical table: src_datasource, src_database, and src_table. The additional fields and the primary key in the physical table constitute a unique key in the logical table. This ensures the uniqueness of the data record in the logical table. If you want to synchronize full data and incremental data from the source, you must assign values to the additional fields.

  • Requirements in the scenario:
    • Data: Synchronize full data from physical tables in the MySQL data source to the two logical tables in the Hologres data source at a time and incremental data from the physical tables to the two logical tables in real time.
    • Table: Synchronize data from physical tables in sharded MySQL databases to a single table in the Hologres data source and add a prefix to the name of the destination table.
      • Synchronize data from physical tables whose names match the regular expression t_order.* in the MySQL data source to the logical table t_order in the Hologres data source, and synchronize data from physical tables whose names match the regular expression t_user.* in the MySQL data source to the logical table t_user in the Hologres data source.
      • Add the prefix cdo_ to the names of the logical tables.
    • Schema: Synchronize data from all sharded databases whose names match the regular expression order_db.* in the MySQL data source to the schema order_db in the Hologres data source.
    • Field: Add the additional fields src_datasource, src_database, and src_table to the logical tables to record the source, source sharded databases, and source physical tables to which source data belongs.
    The following table describes mappings between the source physical tables and destination logical tables.
    Source sharded databaseSource physical tableDestination logical tableAdditional field
    order_db01t_order_01cdo_t_order

    src_datasource

    src_database

    src_table

    t_order_02
    t_user_01cdo_t_user
    t_user_02
    order_db02t_order_03cdo_t_order
    t_order_04
    t_user_03cdo_t_user
    t_user_04
    order_db03t_order_05cdo_t_order
    t_order_06
    t_user_05cdo_t_user
    t_user_06

Procedure

  1. Step 1: Select a data synchronization solution type
  2. Step 2: Establish network connections
  3. Step 3: Select source tables and configure a mapping rule
  4. Step 4: Configure destination tables
  5. Step 5: Configure rules for processing DML messages
  6. Step 6: Configure rules for processing DDL messages
  7. Step 7: Configure resources required to run the data synchronization solution

Step 1: Select a data synchronization solution type

Log on to the DataWorks console and go to the homepage or Tasks page in Data Integration to create a data synchronization solution. Select the source type MySQL and destination type Hologres and select One-click Real-time Synchronization to Hologres.

Step 2: Establish network connections

Select doc_mysql1 as the source and default Hologres data source as the destination. The default Hologres data source is generated after you associate a Hologres instance with a DataWorks workspace as a compute engine instance. Establish and test network connections between the exclusive resource group for Data Integration and the data sources.

Step 3: Select source tables and configure a mapping rule

  1. In the Basic Configuration section, configure the parameters, such as the Solution Name and Location parameters, based on your business requirements.
  2. In the Data Source section, confirm the information about the source.
  3. In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the Icon icon to add the tables to the Selected Source Table list.

    The Selected Source Table list displays all tables in the source. You can select all or specific tables.

  4. In the Set Mapping Rules for Table/Database Names section, click Add Rule, select a rule type, and then configure a mapping rule of the selected type.
    By default, data is written to the destination schema and table that are named the same as the source database and table. You can specify a destination schema name and a destination table name in a mapping rule to write data in multiple source tables to the same Hologres table. You can also specify a prefix in a mapping rule to write data to Hologres tables whose names start with a different prefix from the source tables. You can use regular expressions to convert the names of the schemas or tables. You can also use built-in variables to add prefixes and suffixes to the names of destination tables. For more information, see the Select source tables and configure a mapping rule section in Configure a data synchronization solution.
    In the preceding sample scenario, data in 12 physical tables in the MySQL data source needs to be synchronized to the two logical tables in the Hologres data source. To synchronize the data, you can configure the following types of mapping rules.Destination table names
    • Conversion Rule for Table Name: This type of mapping rule allows you to map a source table to a destination table by using a regular expression.

      In the preceding figure, two mapping rules of this type are configured. In the mapping rules, Source is set to t_order.* and t_user.*, and Target is set to t_order and t_user. This way, data in the source tables whose names match the regular expression t_order.* is synchronized to the destination logical table t_order, and data in the source tables whose names match the regular expression t_user.* is synchronized to the destination logical table t_user.

    • Rule for Destination Table Name: This type of mapping rule allows you to use a built-in variable and a prefix or suffix to specify the name of the destination table. In the preceding figure, the prefix cdo_ is added to the destination table name that is mapped based on the mapping rule of the Conversion Rule for Table Name type.
    • Rule for Conversion Between Source Database Name and Destination Schema Name: This type of mapping rule allows you to map multiple source databases to a destination schema. This way, data in multiple source databases can be synchronized to the same destination schema. By default, data in a source database is synchronized to the destination schema that is named the same as the source database. In the preceding figure, data in all source databases whose names match the regular expression order_db.* is synchronized to the destination schema order_db.

Step 4: Configure destination tables

  1. Confirm the policy for writing data to Hologres.
    Data Integration supports only the Replay (replay operation log to restore data) policy for writing data to Hologres. This policy indicates that the system performs the same operation on a Hologres table as that performed on a source table. For example, if a data record is inserted to a source table, the system inserts the same data record to the destination table. If data records are updated in or deleted from a source table, the system updates the same data records in or deletes the same data records from the destination table.
  2. Map source tables to destination tables.
    Click Refresh source table and Hologres Table mapping. The system maps the source tables to destination tables based on the mapping rules that you configured in the Set Mapping Rules for Table/Database Names section. If no mapping rule is configured, data in the source tables is synchronized to the destination tables that are named the same as the source tables by default. If no such destination tables exist, the system automatically creates the tables in the destination. You can change the table generation method and add additional fields to a destination table.
    After you click Refresh source table and Hologres Table mapping, Data Integration automatically extracts the source tables whose data you want to synchronize and maps the source tables to the destination tables based on the mapping rules that you configured. Click Refresh source table and Hologres Table mapping. The following figure shows the mappings between the source tables and destination Hologres tables.Results
  3. Add additional fields to multiple destination tables at a time.
    To distinguish source tables, add additional fields to the destination tables.
    1. Select all established mappings and click Batch Edit Additional Fields in Destination Table.
      Note You can add additional fields to a destination table only if you select Create Table from the drop-down list in the Table creation method column of the destination table.
    2. In the Batch Edit Additional Fields in Destination Table dialog box, click New field, add src_datasource, src_database, and src_table as additional fields, and assign values to the fields.
      Note
      Data Integration allows you to assign the following variables to additional fields as values:
      EXECUTE_TIME: the execution time
      UPDATE_TIME: the update time
      DB_NAME_SRC: the name of the source database
      DB_NAME_SRC_TRANSED: the name of the mapped database
      DATASOURCE_NAME_SRC: the name of the source
      DATASOURCE_NAME_DEST: the name of the destination
      DB_NAME_DEST: the name of the destination database
      TABLE_NAME_DEST: the name of the destination table
      TABLE_NAME_SRC: the name of the source table

Step 5: Configure rules for processing DML messages

In this step, you can configure rules for processing DML messages that are generated for insert, update, or delete operations performed on the source. The following types of processing rules are supported:
  • Normal: If you configure this type of processing rule, a DML message from the source is sent to the destination, and the system performs the same operation on the destination as the operation that is performed on the source.
  • Ignore: If you configure this type of processing rule, a DML message from the source is ignored and not sent to the destination. The system does not perform any operation on the destination.
  • Conditional Normal Processing: If you configure this type of processing rule, you can specify a filter condition. The system first filters data based on the filter expression. Data that meets the filter expression is normally processed. Data that does not meet the filter expression is ignored.
Note If no processing rule is configured in this step, the Normal processing rule is used by default.

Step 6: Configure rules for processing DDL messages

DDL operations are performed on a source. Data Integration provides default rules to process DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Rules for processing DDL messages.

Step 7: Configure resources required to run the data synchronization solution

After the data synchronization solution is created, the solution generates batch synchronization nodes and a real-time synchronization node. The batch synchronization nodes synchronize full data in the source in offline mode, and the real-time synchronization node synchronizes incremental data in the source in real time. You must configure resources that are required to run the batch synchronization nodes and real-time synchronization node and the attributes related to these nodes in the Configure Resources step.

You must select the exclusive resource group for Data Integration that you want to use to run the batch synchronization nodes and the exclusive resource group for Data Integration that you want to use to run the real-time synchronization node. In addition, you can click Advanced Configuration in the Offline Full synchronization section and that in the Real-time incremental synchronization section to configure settings such as those related to dirty data, the maximum number of parallel threads, and the maximum number of connections allowed for the source.

Note
  • DataWorks uses resource groups for scheduling to issue batch synchronization nodes to resource groups for Data Integration and runs the nodes on the resource groups for Data Integration. Therefore, a batch synchronization node also occupies the resources of a resource group for scheduling. You are charged fees for using the resource group for scheduling to schedule the batch synchronization nodes. For information about the node issuing mechanism, see Mechanism for issuing nodes.
  • We recommend that you use different resource groups to run batch and real-time synchronization nodes. If you use the same resource group to run batch and real-time synchronization nodes, the nodes compete for resources and affect each other. For example, CPU resources, memory resources, and networks used by the two types of nodes may affect each other. In this case, the batch synchronization nodes may slow down, or the real-time synchronization node may be delayed. Out of memory (OOM) errors may also occur due to insufficient resources.

Step 8: Run the data synchronization solution

  1. Go to the Tasks page in Data Integration and find the created data synchronization solution.
  2. Click Submit and Run in the Actions column to run the data synchronization solution.
  3. Click Execution details in the Actions column to view the execution details of the data synchronization solution.

What to do next

After a data synchronization solution is configured, you can manage the solution. For example, you can add tables to or remove tables from the solution, configure alerting and monitoring settings for the nodes that are generated by the solution, and view information about the running of the nodes. For more information, see Perform O&M for a data synchronization solution.