Synchronize data from tables in sharded MySQL databases to Hologres

This topic describes how to use DataWorks Data Integration or Realtime Compute for Apache Flink to synchronize data from tables in sharded MySQL databases to a Hologres table. You can select a service based on your business requirements.

Background information

In real-world business scenarios, one or two simple batch or real-time synchronization tasks are often not enough to synchronize data. Instead, multiple batch synchronization tasks, real-time synchronization tasks, and data processing tasks must work together, which requires complex configurations. Synchronizing data from tables in sharded MySQL databases to a single Hologres table is even more complex because it involves multiple tasks, and O&M operations on those tasks are also complex.

To address these pain points, the Synchronization Task feature in DataWorks Data Integration provides a configuration-based solution for these scenarios. It supports one-click synchronization from various data sources to simplify and accelerate data synchronization. In addition, Realtime Compute for Apache Flink provides powerful real-time data warehousing and data lakehouse ingestion capabilities, allowing you to efficiently write data from multiple sources to Hologres.

Prerequisites

A Hologres instance is purchased. For more information, see Purchase a Hologres instance.
An ApsaraDB RDS for MySQL instance is created. For more information, see Create an ApsaraDB RDS for MySQL instance.
DataWorks is activated, and an exclusive resource group for Data Integration is purchased and configured. This prerequisite must be met if you want to use DataWorks to synchronize data. For more information about how to activate DataWorks, see . For more information about how to purchase an exclusive resource group for Data Integration, see Use an exclusive resource group for Data Integration.
Realtime Compute for Apache Flink is activated. This prerequisite must be met if you want to use Realtime Compute for Apache Flink to synchronize data. For more information, see Activate Realtime Compute for Apache Flink.

Note

Make sure that the activated services reside in the same region.

Use DataWorks to synchronize data from tables in sharded MySQL databases to Hologres

Hologres is deeply integrated with DataWorks. You can use DataWorks Data Integration to synchronize data from multiple source tables to Hologres in a single synchronization task. To synchronize data from tables in sharded MySQL databases to Hologres, perform the following steps:

Prepare data in sharded MySQL databases.
Prepare data in tables of sharded MySQL databases. In this example, data in three tables of two sharded databases is prepared. The following table describes the details.
Database name	Table name
hmtest1	product_20220420
hmtest1	product_20220421
hmtest2	product_20220422
The three tables have the same schema. However, duplicate data may exist in the tables. The following code shows the DDL statement that is used to create the table product_20220420:
```
CREATE TABLE product_20220420 (
    value_id int8,
    attribute_id int8 NOT NULL,
    id_card int8,
    name text,
    potion text,
    ds text,
    PRIMARY KEY (`value_id`)
);
```
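The other two tables can be prepared in the same way. The following MySQL statements are a minimal sketch that assumes the databases hmtest1 and hmtest2 already exist and that product_20220420 was created in hmtest1 by using the preceding DDL statement. The inserted rows are hypothetical sample data; they deliberately reuse the same value_id in different tables to reproduce the duplicate-key situation described above.

```sql
-- Run in MySQL. Assumes hmtest1.product_20220420 already exists.
-- CREATE TABLE ... LIKE copies the column definitions and the primary key.
CREATE TABLE hmtest1.product_20220421 LIKE hmtest1.product_20220420;
CREATE TABLE hmtest2.product_20220422 LIKE hmtest1.product_20220420;

-- Hypothetical sample rows. value_id = 1 appears in all three tables on purpose.
INSERT INTO hmtest1.product_20220420 (value_id, attribute_id, id_card, name, potion, ds)
VALUES (1, 100, 1001, 'item_a', 'p1', '20220420');

INSERT INTO hmtest1.product_20220421 (value_id, attribute_id, id_card, name, potion, ds)
VALUES (1, 101, 1002, 'item_b', 'p2', '20220421');

INSERT INTO hmtest2.product_20220422 (value_id, attribute_id, id_card, name, potion, ds)
VALUES (1, 102, 1003, 'item_c', 'p3', '20220422');
```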
Create a synchronization task to synchronize data to Hologres in real time.
1. Configure a real-time synchronization task.
  1. In the DataWorks console, go to Data Integration and select Synchronization Task from the left-side navigation pane to create a synchronization task. For more information, see Select a synchronization solution. Select MySQL as the source and Hologres as the destination.
  2. Click Create Data Synchronization Solution.
2. Establish network connections.
  1. Specify a New Node Name.
  2. For Provision Type, select Real-time Database Sync.
  3. Select a resource group, the source and destination data sources, and then click Test Connectivity for All Resource Groups and Data Sources.
    
    If no resource group or data source is available, create a resource group or add required data sources.
  4. When the status for both the source and destination is Network Connected, click OK and then click Next.
3. Select source tables and configure mapping rules.
  1. Specify source tables.
  2. Configure Mapping Rules for Table Names.
    - If a destination table already exists in the Hologres database, select its name and schema in the Destination Table Name and Destination Schema Name columns for the source table.
    - If you have not created a destination table in the Hologres database, you can Customize Mapping Rules for Destination Table Names to define a new table. The system automatically creates the table and completes the mapping.
      
      Click Edit to the right of Customize Mapping Rules for Destination Table Names.
      
      In the Design table name mapping rule dialog box, click Add.
      
      In the Edit Rule dialog box, specify a Rule Name, select Source Table Name, click Actions, and then enter the source table name in the Source field and the new table name in the Destination field.
      
      Click Confirm to apply the rule. In the row for the corresponding source table, select the rule you defined and refresh the mapping. In the Destination Schema Name column, select the corresponding schema.
4. Configure the destination table.
  1. Click Refresh Source Table and Hologres Table Mapping.
    
    Note
    Mappings display relationships between the source and destination tables. If the source tables are mapped to the same destination table, data in the source tables is synchronized to the same destination table.
  2. Add additional fields to the destination table.
    To distinguish source tables, add additional fields to the destination table.
    
    Select all tasks, click Batch Modify, and then select Destination Table Schema - Batch Modify and Add Field.
    
    In the Destination Table Schema - Batch Modify and Add Field dialog box, click Add Field and add two fields: db_name and table_name.
    
    After you add the fields, click Apply and Refresh Mapping.
    
    In this example, the db_name field is assigned the value DB_NAME_SRC (the name of the source database), and the table_name field is assigned the value TABLE_NAME_SRC (the name of the source table).
    
    Optional. Specify an additional field as the primary key.
    
    If you want to synchronize a large amount of data from many source tables, we recommend that you add an additional field to the primary key of the destination table. The additional field and the primary key of the source tables then form a composite primary key, which prevents conflicts between rows that have the same primary key value in different source tables. You can also specify the additional field as part of the distribution key so that rows from the same source table are written to the same shard. This improves the performance of data synchronization.
    
    Click the icon next to the Destination Table Name and go to the Preview Table Creation Statement dialog box.
    
    In the Table Creation Statement dialog box, modify the table creation statement to add the table_name additional field to the primary key and the distribution key of the destination table, and then click Determine.
    
    Note
    We recommend that you specify the table_name field as one of the fields that compose a composite primary key.
    You can also create more indexes on the destination table to improve data synchronization performance. For more information, see CREATE TABLE.
    
    BEGIN;
    CREATE TABLE IF NOT EXISTS hologres1.product1 (
        value_id BIGINT NOT NULL,
        attribute_id BIGINT NOT NULL,
        id_card BIGINT,
        "name" TEXT,
        potion TEXT,
        ds TEXT,
        table_name TEXT NOT NULL,
        db_name TEXT,
        PRIMARY KEY (value_id, table_name)
    );
    CALL SET_TABLE_PROPERTY('hologres1.product1', 'distribution_key', '"table_name","value_id"');
    CALL SET_TABLE_PROPERTY('hologres1.product1', 'time_to_live_in_seconds', '3153600000');
    CALL SET_TABLE_PROPERTY('hologres1.product1', 'orientation', 'column');
    CALL SET_TABLE_PROPERTY('hologres1.product1', 'binlog.level', 'none');
    CALL SET_TABLE_PROPERTY('hologres1.product1', 'bitmap_columns', '"name","potion","ds"');
    CALL SET_TABLE_PROPERTY('hologres1.product1', 'dictionary_encoding_columns', '"name":auto,"potion":auto,"ds":auto,"table_name":auto,"db_name":auto');
    COMMIT;
    
    Click Apply and Refresh Mapping.
5. Configure rules for processing DML messages.
  After you configure fields for the destination table, configure rules for processing DML messages. You can configure rules for one or more tables at a time based on your business requirements.
  1. Select all tasks, click Batch Modify, and then select Configure DML Rule.
  2. In the Configure DML Rule dialog box, select Normal treatment as the processing policy.
    
    This configures INSERT, UPDATE, and DELETE operations to use Normal processing.
  3. Click OK to save the DML policy.
6. Configure rules for processing DDL messages.
  1. In Configure DDL Capability, configure the DDL message processing policies for the task based on your business requirements. In this example, the policies are set as follows:
    - Create table: Ignore
    - Drop table: Warn
    - Add column: Normal processing
    - Drop column: Ignore
    - Rename table: Error
    - Rename column: Warn
    - Modify column type: Warn
    - Truncate table: Normal processing
  2. Click OK.
7. Configure advanced parameters.
  1. Configure advanced parameters based on your business requirements, including settings in Reader Config, Writer Config, and Runtime Config.
  2. Click OK.
8. After the configuration is complete, click Complete.
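The effect of the composite primary key recommended in step 4 can be tried out separately from the synchronization task. The following Hologres statements are a minimal sketch on a scratch table; the table name hologres1.product1_pk_demo and the inserted values are hypothetical and exist only for this illustration.

```sql
-- Run in Hologres. Scratch table with the same key layout as the destination table.
CREATE TABLE IF NOT EXISTS hologres1.product1_pk_demo (
    value_id BIGINT NOT NULL,
    attribute_id BIGINT NOT NULL,
    table_name TEXT NOT NULL,
    PRIMARY KEY (value_id, table_name)
);

-- Two rows share value_id = 1 but come from different source tables.
-- With PRIMARY KEY (value_id) alone, the second row would conflict with the first.
INSERT INTO hologres1.product1_pk_demo (value_id, attribute_id, table_name)
VALUES (1, 100, 'product_20220420'),
       (1, 102, 'product_20220422');

-- An upsert for one source table touches only the row of that table.
INSERT INTO hologres1.product1_pk_demo (value_id, attribute_id, table_name)
VALUES (1, 150, 'product_20220420')
ON CONFLICT (value_id, table_name)
DO UPDATE SET attribute_id = EXCLUDED.attribute_id;

-- Both rows are still present; only the product_20220420 row was updated.
SELECT value_id, table_name, attribute_id
FROM hologres1.product1_pk_demo
ORDER BY table_name;

-- Clean up the scratch table.
DROP TABLE hologres1.product1_pk_demo;
```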
Run the synchronization task.

After you complete the configuration, find the task on the Synchronization Task page. In the Actions column, click Start. To view the task details, choose More > View.
Query data.

After you run the synchronization task, full data in the source tables is first synchronized to Hologres, and then incremental data in the source tables is synchronized to Hologres in real time. After full data is synchronized, you can query the data in Hologres. In this example, the query result is displayed in the following figure.

For example, run the statement SELECT * FROM hologres1.product1; to query the synchronized data.

The query results show that the additional columns contain the source database and table names. This confirms that data from the sharded source databases and tables has been consolidated into the same table in Hologres.
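To check the consolidation per source, you can aggregate on the additional columns. The following Hologres queries are a sketch that assumes the destination table hologres1.product1 and the db_name and table_name fields from this example; the row counts depend on your own data.

```sql
-- Run in Hologres. Number of synchronized rows per source database and table.
SELECT db_name, table_name, COUNT(*) AS row_count
FROM hologres1.product1
GROUP BY db_name, table_name
ORDER BY db_name, table_name;

-- value_id values that occur in more than one source table.
-- These are the rows that would have conflicted without table_name in the primary key.
SELECT value_id, COUNT(*) AS source_table_count
FROM hologres1.product1
GROUP BY value_id
HAVING COUNT(*) > 1
ORDER BY value_id;
```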

If incremental data is written to the source tables, the incremental data is synchronized to Hologres in real time. This example describes how to synchronize data from tables in sharded MySQL databases to a Hologres table by using a synchronization task. You can configure a synchronization task to perform other synchronization operations based on your business requirements.
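One way to observe incremental synchronization is to change a row in a source table and then look it up in Hologres. The following statements are a sketch under the assumptions of this example (DML rules set to normal processing, destination table hologres1.product1). The value_id 9001 and the column values are hypothetical test data.

```sql
-- Run in MySQL: write, change, and later delete a test row in one source table.
INSERT INTO hmtest1.product_20220420 (value_id, attribute_id, id_card, name, potion, ds)
VALUES (9001, 1, 1, 'sync_test', 'p0', '20220420');

UPDATE hmtest1.product_20220420
SET name = 'sync_test_updated'
WHERE value_id = 9001;

-- Run in Hologres after a short delay: the row should reflect the latest change.
SELECT value_id, name, db_name, table_name
FROM hologres1.product1
WHERE value_id = 9001 AND table_name = 'product_20220420';

-- Run in MySQL to remove the test row. With normal processing of DELETE messages,
-- the row is expected to disappear from the Hologres table as well.
DELETE FROM hmtest1.product_20220420 WHERE value_id = 9001;
```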

Use Realtime Compute for Apache Flink to synchronize data from tables in sharded MySQL databases to Hologres

For more information about how to use Realtime Compute for Apache Flink to synchronize data from tables in sharded MySQL databases to Hologres, see Real-time database ingestion.
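As a rough orientation only, a Flink SQL job for this scenario could look like the following sketch. It is not taken from the referenced topic and may differ from the method described there. It assumes the open source mysql-cdc connector (including its database_name and table_name metadata columns) and the Hologres connector with the options shown. All connection values are placeholders, and the regular expressions are assumptions chosen to match the three example tables. Check connector names and options against the documentation of your Flink version before use.

```sql
-- Source: all sharded tables matched by regular expressions (assumed mysql-cdc connector).
CREATE TEMPORARY TABLE mysql_products (
    db_name STRING METADATA FROM 'database_name' VIRTUAL,
    table_name STRING METADATA FROM 'table_name' VIRTUAL,
    value_id BIGINT NOT NULL,
    attribute_id BIGINT NOT NULL,
    id_card BIGINT,
    `name` STRING,
    potion STRING,
    ds STRING,
    PRIMARY KEY (value_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = '<mysql-host>',          -- placeholder
    'port' = '3306',
    'username' = '<mysql-user>',          -- placeholder
    'password' = '<mysql-password>',      -- placeholder
    'database-name' = 'hmtest[0-9]+',     -- matches hmtest1 and hmtest2
    'table-name' = 'product_[0-9]+'       -- matches product_20220420 and so on
);

-- Sink: the single Hologres table with the composite primary key from this topic.
CREATE TEMPORARY TABLE holo_product1 (
    value_id BIGINT NOT NULL,
    attribute_id BIGINT NOT NULL,
    id_card BIGINT,
    `name` STRING,
    potion STRING,
    ds STRING,
    table_name STRING,
    db_name STRING,
    PRIMARY KEY (value_id, table_name) NOT ENFORCED
) WITH (
    'connector' = 'hologres',
    'endpoint' = '<hologres-endpoint>',   -- placeholder
    'dbname' = '<hologres-database>',     -- placeholder
    'tablename' = 'hologres1.product1',
    'username' = '<access-id>',           -- placeholder
    'password' = '<access-key>',          -- placeholder
    'mutatetype' = 'insertorupdate'       -- assumed option: update rows on key conflict
);

-- Merge all shards into the Hologres table, keeping the source database and table names.
INSERT INTO holo_product1
SELECT value_id, attribute_id, id_card, `name`, potion, ds, table_name, db_name
FROM mysql_products;
```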
