Synchronize data from multiple tables in an ApsaraDB for ClickHouse database to Hologres in offline mode - DataWorks

This topic describes how to create and configure a batch synchronization solution to synchronize data from multiple tables in an ApsaraDB for ClickHouse database to Hologres.

Limits

Batch synchronization from ApsaraDB for ClickHouse supports only ApsaraDB for ClickHouse data sources of V20.8 or V21.8.
Batch synchronization from ApsaraDB for ClickHouse supports only exclusive resource groups for Data Integration.

Create an exclusive resource group for Data Integration and establish network connections between the resource group and data sources

Before you run a batch synchronization solution, you must establish network connections between your exclusive resource group for Data Integration and data sources. For more information, see Configure network connectivity.

If your exclusive resource group for Data Integration and a data source reside in the same region, you can use a virtual private cloud (VPC) in that region to establish a network connection between the resource group and the data source. To establish such a network connection, perform the following operations:
1. Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group.
2. Add the required IP address or CIDR block to the IP address whitelist of the data source.
If your exclusive resource group for Data Integration and a data source reside in different regions, you can establish a network connection between the resource group and the data source over the Internet. To establish such a network connection, add the required IP address or CIDR block to the IP address whitelist of the data source.

Step 1: Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group

Note: If you establish a network connection between the exclusive resource group for Data Integration and a data source over the Internet, you can skip this step.

Associate the exclusive resource group for Data Integration with a VPC.
1. Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
2. On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, configure the parameters for associating the resource group with a VPC.
  - VPC: Select the VPC in which the ApsaraDB for ClickHouse or Hologres data source resides.
  - Zone and VSwitch: We recommend that you select the zone and vSwitch in which the ApsaraDB for ClickHouse or Hologres data source resides. If the zone in which the data source resides is not displayed in the drop-down list, you can select any zone and vSwitch, provided that the selected vSwitch can connect to the data source.
  - Security Groups: Select a security group based on your business requirements. The selected security group must meet the following requirements:
    - An inbound rule of the security group allows traffic on the HTTP port of the ApsaraDB for ClickHouse cluster. In most cases, the HTTP port of an ApsaraDB for ClickHouse cluster is 8123. You can check whether the port is allowed on the Security Groups page in the ECS console.
    - The CIDR block of the vSwitch with which the exclusive resource group for Data Integration is associated is included in the CIDR blocks that are specified as the authorization object in the security group rule.
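The CIDR requirement above can be checked offline before you save the binding. The following is a minimal sketch that uses the Python standard ipaddress module. The function name and the sample CIDR blocks are hypothetical and are not part of DataWorks or ECS.

```python
import ipaddress


def vswitch_is_authorized(vswitch_cidr, authorized_cidrs):
    """Return True if the vSwitch CIDR block lies inside any authorized CIDR block."""
    vswitch = ipaddress.ip_network(vswitch_cidr)
    for cidr in authorized_cidrs:
        allowed = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError for mixed IPv4/IPv6, so compare versions first.
        if vswitch.version == allowed.version and vswitch.subnet_of(allowed):
            return True
    return False


# Hypothetical values: a vSwitch CIDR block and the authorization objects of a rule.
print(vswitch_is_authorized("192.168.1.0/24", ["10.0.0.0/8", "192.168.0.0/16"]))
```

If the function returns False, the security group rule does not cover the vSwitch of the resource group, and the authorization object likely needs to be widened.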
Add a custom route for the exclusive resource group for Data Integration.
Note: If you select the zone and vSwitch in which the data source resides in the preceding substep, you can skip this substep. If you select another zone and another vSwitch, you must perform the operations in this substep to add a custom route for the exclusive resource group for Data Integration.
1. Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
2. On the VPC Binding tab of the page that appears, find the VPC association record and click Custom Route in the Actions column.
3. In the Custom Route panel, click Add Route. In the Add Route dialog box, configure the parameters to add a custom route for the exclusive resource group for Data Integration.
  - Destination VPC: Select the region and VPC in which the data source resides.
  - Destination VSwitch: Select the vSwitch in which the data source resides.

Step 2: Configure the IP address whitelist for the data source

Obtain required IP addresses or CIDR blocks.
- If you use a VPC to establish a network connection between the exclusive resource group for Data Integration and the data source, you must enter the CIDR block of the vSwitch that is specified when you associate the resource group with a VPC. You can find the resource group on the Exclusive Resource Groups tab of the Resource Groups page in the DataWorks console and click Network Settings in the Actions column to view the CIDR block of the vSwitch.
- If you establish a network connection between the exclusive resource group for Data Integration and the data source over the Internet, you must enter the elastic IP address (EIP) of the resource group in the IP address whitelist. You can find the resource group on the Exclusive Resource Groups tab of the Resource Groups page in the DataWorks console and click View Information in the Actions column to view the EIP of the resource group.
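The rule above (vSwitch CIDR block for a VPC connection, EIP for an Internet connection) can be sketched as a small helper that also validates the value before you paste it into a whitelist. This is an illustrative sketch only: the function name, the "vpc" and "internet" labels, and the sample addresses are assumptions, not DataWorks identifiers.

```python
import ipaddress
from typing import Optional


def whitelist_entry(connection: str,
                    vswitch_cidr: Optional[str] = None,
                    eip: Optional[str] = None) -> str:
    """Return the value to add to the data source whitelist for a connection type."""
    if connection == "vpc":
        if vswitch_cidr is None:
            raise ValueError("a VPC connection requires the vSwitch CIDR block")
        # ip_network() raises ValueError if the text is not a valid CIDR block.
        return str(ipaddress.ip_network(vswitch_cidr))
    if connection == "internet":
        if eip is None:
            raise ValueError("an Internet connection requires the EIP")
        # ip_address() raises ValueError if the text is not a valid IP address.
        return str(ipaddress.ip_address(eip))
    raise ValueError(f"unknown connection type: {connection!r}")


# Hypothetical sample values.
print(whitelist_entry("vpc", vswitch_cidr="192.168.1.0/24"))
print(whitelist_entry("internet", eip="203.0.113.10"))
```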
Add the IP address or CIDR block to the IP address whitelist configured for the data source.
1. Add the IP address or CIDR block to the IP address whitelist configured for the ApsaraDB for ClickHouse data source.
  Log on to the ApsaraDB for ClickHouse console. On the Default Instances tab of the Clusters page, find the ApsaraDB for ClickHouse cluster from which you want to read data and click the name of the cluster. In the left-side navigation pane of the page that appears, click Data Security. On the Data Security page, click Create Whitelist Group. In the Create Whitelist Group panel, enter the IP address or CIDR block that you want to add in the IP Addresses field and click OK.
2. Add the IP address or CIDR block to the IP address whitelist configured for the Hologres data source.
  Log on to the Hologres console. In the left-side navigation pane, click Go to HoloWeb to log on to the HoloWeb console. Find the Hologres data source to which you want to write data and go to the Security Center page. In the left-side navigation pane, click IP Address Whitelist. On the IP Address Whitelist page, click Add IP Address to Whitelist. In the Add IP Address to Whitelist dialog box, enter the IP address or CIDR block that you want to add in the IP Address field and click OK.

Prepare data sources

Add an ApsaraDB for ClickHouse data source

In the upper-right corner of the Data Source page in the DataWorks console, click Add data source. In the Add data source dialog box, add an ApsaraDB for ClickHouse data source as prompted. You must configure the following parameters.

Note: You can log on to the ApsaraDB for ClickHouse console, find the ApsaraDB for ClickHouse cluster from which you want to read data, and then click the name of the cluster to view the following information on the Cluster Information page: internal and public endpoints, HTTP port number, vSwitch ID, and zone.

JDBC URL: Configure this parameter in the jdbc:clickhouse://<ip>:<port>/<dbname> format. Before you configure this parameter, take note of the following items:
- <ip>: You must replace this item with the public or internal endpoint of the ApsaraDB for ClickHouse cluster.
  - If you use a VPC to establish a network connection between the exclusive resource group for Data Integration and the ApsaraDB for ClickHouse cluster, you must replace <ip> with the internal endpoint of the cluster.
  - If you establish a network connection between the exclusive resource group for Data Integration and the ApsaraDB for ClickHouse cluster over the Internet, you must replace <ip> with the public endpoint of the cluster.
- <port>: You must replace this item with the HTTP port of the ApsaraDB for ClickHouse cluster. In most cases, the HTTP port number is 8123.
- <dbname>: You must replace this item with the name of the ApsaraDB for ClickHouse database from which you want to read data.
Username and Password: Specify the username and password of the ApsaraDB for ClickHouse database.
Test connectivity: Select the exclusive resource group for Data Integration that you want to use and test the network connectivity between the resource group and the ApsaraDB for ClickHouse data source. Make sure that the network connectivity test is successful.
Note: The preceding configurations establish a network connection only between the exclusive resource group for Data Integration and the ApsaraDB for ClickHouse data source. If you want to establish a network connection between a resource group for DataService Studio or a resource group for scheduling and the ApsaraDB for ClickHouse data source, you must configure the required network settings and test the network connectivity.
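As an illustration of the JDBC URL format described above, the following sketch assembles the URL from its three parts. The function name and the sample endpoint and database name are hypothetical. The default port 8123 is the common HTTP port mentioned above, so confirm the actual value on the Cluster Information page.

```python
def build_clickhouse_jdbc_url(endpoint: str, dbname: str, port: int = 8123) -> str:
    """Build a URL in the jdbc:clickhouse://<ip>:<port>/<dbname> format."""
    if not endpoint or not dbname:
        raise ValueError("endpoint and dbname must be non-empty")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:clickhouse://{endpoint}:{port}/{dbname}"


# Hypothetical internal endpoint (VPC connection) and database name.
print(build_clickhouse_jdbc_url("clickhouse.internal.example", "sales"))
```

Use the internal endpoint for a VPC connection and the public endpoint for an Internet connection, as described for the <ip> item.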

Add a Hologres data source

You can associate a Hologres compute engine with the workspace that you want to use to enable the system to generate a Hologres data source. You can also directly add a Hologres data source to the workspace that you want to use. For more information, see Associate a Hologres compute engine with a workspace or Add a Hologres data source.

Create and configure a synchronization solution

Select a data synchronization solution type.
On the Tasks page in Data Integration, click Create Node. On the Create Data Synchronization Solution page, select ClickHouse as the source and Hologres as the destination for the Data Source field in the Synchronization Type section. The system displays the Hologres Offline synchronization solution type. By default, the Hologres Offline synchronization solution type is selected. You cannot change the type.
Test the network connectivity between the exclusive resource group for Data Integration and the data sources.
In the Network and Resource Configuration section, select the ApsaraDB for ClickHouse data source and the Hologres data source that you added to DataWorks, select the exclusive resource group for Data Integration that you purchased, and then click Test Connectivity for All Resource Groups and Data Sources to test the network connectivity between the two data sources and the resource group. If the system reports that the network connections between the data sources and the resource group are established, click Next.
Select the tables from which you want to read data.
In the Select Data Sources and Tables for Data Synchronization section of the page that appears, select the tables from which you want to read data in the Source Table area on the left side and click the icon to add the selected tables to the Selected Tables area on the right side.

In the Mapping Rules for Destination Tables section, select all items in this section and click Batch Refresh Mapping Results.

You can also select specific items and click Batch Modify to modify the items based on your business requirements. The following table describes the options under Batch Modify.


Option	Description
Value assignment	You can add constants and variables to destination tables.
Customize Mapping Rules for Destination Schema Names	You can concatenate built-in variables and specified strings into a final destination schema name. You can edit built-in variables. For example, you can specify strings as the value of built-in variables.
Customize Mapping Rules for Destination Table Names	You can concatenate built-in variables and specified strings into a final destination table name. You can edit built-in variables. For example, you can specify strings as the value of built-in variables.
Have Primary Key	You can use the primary key information of destination tables to implement automatic mapping. If destination tables are created in a visualized manner, you can click the icon in the Destination Table Name column to edit the table schemas and select primary keys. Then, refresh the mappings. If destination tables contain primary keys, the system uses the new data to overwrite the original data. That is, for each matching row, the data in all columns is completely overwritten, and the fields for which column mappings are not configured are forcefully set to NULL.
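The full-row overwrite described for Have Primary Key can be easy to misread as a partial update. The following sketch models the described behavior with plain Python dictionaries. It is not how DataWorks or Hologres implements the write, and the table, column names, and function name are hypothetical.

```python
def overwrite_row(table, primary_key, destination_columns, column_mapping, source_row):
    """Replace the destination row that shares a primary key with source_row.

    column_mapping maps each mapped destination column to its source column.
    """
    new_row = {}
    for column in destination_columns:
        source_column = column_mapping.get(column)
        # Unmapped destination columns become None (NULL); old values are not kept.
        new_row[column] = source_row.get(source_column) if source_column else None
    table[new_row[primary_key]] = new_row
    return new_row


# Hypothetical destination table keyed by primary key, with an existing row.
destination = {1: {"id": 1, "name": "old", "note": "keep me"}}
overwrite_row(
    destination,
    primary_key="id",
    destination_columns=["id", "name", "note"],
    column_mapping={"id": "id", "name": "name"},  # "note" has no mapping
    source_row={"id": 1, "name": "new"},
)
print(destination[1])
```

In this model the existing value of the unmapped note column is lost after the overwrite, which is why every column that must keep its data should have a mapping.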

Modify mappings between field types.

If the destination Hologres tables are in the pending state, the system provides default mappings between data types of fields in ApsaraDB for ClickHouse and Hologres tables. The following table lists the default mappings. You can also click Edit Mapping of Field Data Types in the upper-right corner of the Mapping Rules for Destination Tables section to customize field type mappings. After you customize field type mappings, click Apply and Refresh Mapping.


Category	Data type of fields in ApsaraDB for ClickHouse data source	Data type of fields in Hologres data source
Date	Date	Date
	DateTime	TIMESTAMPTZ
	DateTime(timezone)	TIMESTAMPTZ
	DateTime64	TIMESTAMPTZ
Numeric	Int8	SMALLINT
	Int16	SMALLINT
	Int32	INTEGER
	Int64	BIGINT
	UInt8	INTEGER
	UInt16	INTEGER
	UInt32	BIGINT
	UInt64	BIGINT
	Float32	FLOAT
	Float64	DOUBLE PRECISION
	Decimal(P, S)	DECIMAL
	Decimal32(S)	DECIMAL
	Decimal64(S)	DECIMAL
	Decimal128(S)	DECIMAL
Boolean	None (UInt8 is used instead.)	BOOLEAN
String	String	TEXT
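For scripted checks, the default mappings in the preceding table can be expressed as a lookup. This is a sketch under stated assumptions: the dictionary copies the table rows, the function name is hypothetical, and parameters such as precision, scale, or time zone are ignored. The Boolean row has no distinct source type in the table, so UInt8 follows the Numeric row here. Wrapper types such as Nullable are out of scope.

```python
import re

# Default ClickHouse -> Hologres mappings, copied from the table above.
DEFAULT_TYPE_MAP = {
    "Date": "Date",
    "DateTime": "TIMESTAMPTZ",
    "DateTime64": "TIMESTAMPTZ",
    "Int8": "SMALLINT",
    "Int16": "SMALLINT",
    "Int32": "INTEGER",
    "Int64": "BIGINT",
    "UInt8": "INTEGER",
    "UInt16": "INTEGER",
    "UInt32": "BIGINT",
    "UInt64": "BIGINT",
    "Float32": "FLOAT",
    "Float64": "DOUBLE PRECISION",
    "Decimal": "DECIMAL",
    "Decimal32": "DECIMAL",
    "Decimal64": "DECIMAL",
    "Decimal128": "DECIMAL",
    "String": "TEXT",
}


def hologres_type(clickhouse_type: str) -> str:
    """Look up the default Hologres type for a ClickHouse type such as 'Decimal(10, 2)'."""
    match = re.match(r"\s*([A-Za-z]\w*)", clickhouse_type)
    if match is None:
        raise ValueError(f"cannot parse type: {clickhouse_type!r}")
    base = match.group(1)  # type name without parameters, e.g. 'DateTime64'
    if base not in DEFAULT_TYPE_MAP:
        raise ValueError(f"no default mapping for type: {clickhouse_type!r}")
    return DEFAULT_TYPE_MAP[base]


print(hologres_type("DateTime64(3)"))
print(hologres_type("Decimal(10, 2)"))
```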

Configure advanced parameters.
You can click Configure Advanced Parameters in the upper-right corner of the configuration page to perform finer-grained configurations for the source and destination for data synchronization. For example, you can configure the maximum number of connections and the parameters related to throttling.
Configure a resource group.
You can click Configure Resource Group in the upper-right corner of the configuration page and modify the exclusive resource group for Data Integration that you want to use to run the data synchronization solution.
After the preceding configuration is complete, click Complete.

Run the data synchronization solution

Go to the Tasks page in Data Integration and find the created data synchronization solution.
Click Submit and Run in the Actions column to run the data synchronization solution.
Click Execution details in the Actions column to view the execution details of the data synchronization solution.

Perform O&M operations for the data synchronization solution

View the status of the data synchronization solution

After the data synchronization solution is created, you can go to the Tasks page to view all data synchronization solutions that are created in the workspace and the basic information of each data synchronization solution.

You can find the desired data synchronization solution and start, stop, modify, or view the details of the data synchronization solution.
You can also click Execution details in the Actions column of the desired data synchronization solution to view the running details of the solution, and click different sections on the Execution details page to view the related information.

Rerun the data synchronization solution

If you add or remove tables, or modify the schemas or names of destination tables, click Rerun in the Actions column of the desired data synchronization solution. The system reruns the data synchronization solution to synchronize the changes.