
ApsaraDB for ClickHouse:Migrate data from a self-managed ClickHouse instance to an ApsaraDB for ClickHouse Community-Compatible Edition cluster

Last Updated:Dec 11, 2025

Self-managed ClickHouse instances often present stability risks, operational challenges such as poor cluster extensibility and difficult version updates, and weak disaster recovery capabilities. For these reasons, many customers are choosing to migrate their self-managed ClickHouse clusters to a cloud Platform as a Service (PaaS) solution. This topic describes how to migrate data from a self-managed ClickHouse instance to an ApsaraDB for ClickHouse Community-Compatible Edition cluster.

Prerequisites

  • Destination cluster:

    • You have created an ApsaraDB for ClickHouse Community-Compatible Edition cluster.

  • Self-managed cluster:

    • You have a database account and password.

    • The account must have read permissions on databases and tables, and permissions to execute SYSTEM commands.

  • The destination cluster and the self-managed cluster can communicate with each other over the network.

    If the self-managed cluster and the destination cluster are in the same VPC, add the IP addresses of all nodes in the destination cluster and the IPv4 CIDR block of the destination cluster's vSwitch to the self-managed cluster's whitelist.

    • For more information about how to configure a whitelist in ApsaraDB for ClickHouse, see Configure a whitelist.

    • For information about how to add IP addresses to the whitelist of a self-managed cluster, see the relevant product documentation.

    • To view the IP addresses of all nodes in an ApsaraDB for ClickHouse cluster, run the SELECT * FROM system.clusters; statement.

    • To obtain the IPv4 CIDR associated with the VSwitch ID of the ApsaraDB for ClickHouse cluster, follow these steps:

      1. In the ApsaraDB for ClickHouse console, navigate to the Cluster Information page of the destination cluster and obtain the VSwitch ID from the Network Information section.

      2. In the vSwitch list, search for the destination vSwitch by its VSwitch ID and view its IPv4 CIDR.

    If the self-managed cluster and the cloud cluster are in different VPCs, or if the self-managed cluster is in a local data center or on another cloud platform, you must first resolve the network connectivity issues. For more information, see How do I establish a network connection between a destination cluster and a data source?.
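Before you create the migration task, you can confirm that every node endpoint is reachable over TCP from a machine in the destination network. The following Python sketch is an illustration only: the endpoint list uses the IP:PORT,IP:PORT format described later in this topic, and the sample addresses are placeholders.

```python
import socket

def parse_endpoints(endpoint_list: str):
    """Split an 'IP:PORT,IP:PORT' list into (host, port) tuples."""
    pairs = []
    for item in endpoint_list.split(","):
        host, _, port = item.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Placeholder node list; replace it with your shard endpoints.
    for host, port in parse_endpoints("192.168.0.5:9000,192.168.0.6:9000"):
        print(host, port, is_reachable(host, port))
```

Run the script from a host in the destination VPC; any `False` result points to a whitelist or routing problem to fix before migration.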

Migration validation

Before you start the data migration, we strongly recommend that you create a test environment. In this environment, you can verify business compatibility and performance, and confirm that the migration can be completed successfully. After you complete the migration validation, you can perform the data migration in your production environment. This is a crucial step that helps you identify and resolve potential issues in advance. This ensures a smooth migration process and prevents unnecessary impacts on your production environment.

  1. Create a migration task to perform the data migration. For detailed instructions, see the Procedure section of this topic.

  2. For information about cloud migration compatibility, performance bottleneck analysis, and how to determine whether the migration can be completed, see Analysis and solutions for compatibility and performance bottlenecks in migrating self-managed ClickHouse to the cloud.

Choose a solution

Console migration

  • Pros: Provides a visual interface. You do not need to manually migrate metadata.

  • Cons: Supports only full and incremental migration of the entire cluster's data. You cannot migrate only specific databases, tables, or historical data.

  • Scenarios: Migrating data from an entire cluster.

Manual migration

  • Pros: Lets you control which database and table data to migrate.

  • Cons: The operations are complex. You must manually migrate metadata.

  • Scenarios:

    • Migrating data from some databases and tables.

    • Cold data exceeds 1 TB.

    • Hot data exceeds 10 TB.

    • Migrating data from an entire cluster that does not meet the conditions for console migration.
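The choice above can be summarized as a small decision helper. The following Python sketch encodes the limits stated in this topic (whole-cluster scope, cold data up to 1 TB, and hot data up to 10 TB for console migration); the function name is illustrative.

```python
def choose_solution(whole_cluster: bool, cold_tb: float, hot_tb: float) -> str:
    """Pick a migration solution based on the limits described in this topic:
    console migration handles whole-cluster migration with cold data <= 1 TB
    and hot data <= 10 TB; everything else falls back to manual migration."""
    if whole_cluster and cold_tb <= 1 and hot_tb <= 10:
        return "console"
    return "manual"

print(choose_solution(True, 0.5, 4))    # whole cluster within limits -> console
print(choose_solution(False, 0.5, 4))   # only some databases/tables -> manual
print(choose_solution(True, 2, 4))      # cold data over 1 TB -> manual
```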

Procedure

Console migration

Limits

The version of the destination cluster must be 21.8 or later.

Notes

  • During migration:

    • The merge process for the databases and tables being migrated in the destination cluster is paused. The merge process in the self-managed cluster is not paused.

      Note

      If the data migration takes too long, an excessive amount of metadata accumulates in the destination cluster. We recommend that a migration task does not run for more than 5 days.

    • The destination cluster must use default as its cluster name. If your self-managed cluster uses a different cluster name, the cluster name in the definitions of distributed tables is automatically converted to default.

  • Migration content:

    • Supported content for migration:

      • Databases, data dictionaries, and materialized views.

      • Table schema: All table schemas except for Kafka and RabbitMQ engine tables.

      • Data: Incremental migration of data from MergeTree family tables.

    • Unsupported content for migration:

      • Tables and data of Kafka and RabbitMQ engine tables.

      • Data from non-MergeTree tables, such as external tables and Log tables.

      Important

      During the migration process, you must manually handle the unsupported content by following the steps provided.

    • Data volume for migration:

      • Cold data: The migration of cold data is relatively slow. We recommend that you clear as much cold data as possible from your self-managed cluster to ensure the total volume does not exceed 1 TB. Otherwise, the long migration time may cause the task to fail.

      • Hot data: If the volume of hot data exceeds 10 TB, the migration task has a high probability of failure. We do not recommend using this solution for migration in this case.

      If your data volume exceeds these limits, use the Manual migration solution instead.

Impact on clusters

  • Self-managed cluster:

    • Reading data from the self-managed cluster increases its CPU and memory usage.

    • DDL operations are not allowed.

  • Destination cluster:

    • Writing data to the destination cluster increases its CPU and memory usage.

    • DDL operations are not allowed on the databases and tables that are being migrated. This restriction does not apply to other databases and tables.

    • The merge process is stopped for databases and tables that are being migrated. The merge process is not stopped for other databases and tables.

    • The cluster restarts before the migration task begins and restarts again after the task ends.

    • After the migration is complete, the cluster undergoes a period of high-frequency merge operations. This increases I/O usage and the latency of business requests. Plan for this impact and estimate how long the merge operations will take. For more information about this calculation, see Calculate the merge time after migration.
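To get a first rough idea of the post-migration merge window, you can divide the migrated data volume by a measured merge throughput. The following Python sketch is only an illustration with assumed inputs; it is not the authoritative method from the linked topic.

```python
def estimate_merge_hours(migrated_tb: float, merge_throughput_mb_s: float) -> float:
    """Rough post-migration merge duration: data volume divided by sustained
    merge throughput. Both inputs are assumptions you measure for your cluster;
    see 'Calculate the merge time after migration' for the authoritative method."""
    total_mb = migrated_tb * 1024 * 1024  # TB -> MB
    return total_mb / merge_throughput_mb_s / 3600  # seconds -> hours

# Example: 2 TB migrated, 150 MB/s sustained merge throughput.
print(round(estimate_merge_hours(2, 150), 1))  # -> 3.9
```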

Procedure

Step 1: Check the self-managed cluster and enable system tables

Before you migrate data, you must modify the config.xml file in your self-managed cluster to enable incremental migration. The modifications depend on whether the `system.part_log` and `system.query_log` tables are already enabled.

If system.part_log and system.query_log are not enabled

If you have not enabled system.part_log and system.query_log, add the following configurations to the config.xml file.

system.part_log

<part_log>
    <database>system</database>
    <table>part_log</table>
    <partition_by>event_date</partition_by>
    <order_by>event_time</order_by>
    <ttl>event_date + INTERVAL 15 DAY DELETE</ttl>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</part_log>

system.query_log

<query_log>
    <database>system</database>
    <table>query_log</table>
    <partition_by>event_date</partition_by>
    <order_by>event_time</order_by>
    <ttl>event_date + INTERVAL 15 DAY DELETE</ttl>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>

If system.part_log and system.query_log are already enabled

  1. Compare the configurations of system.part_log and system.query_log in the config.xml file with the following content. If there are any inconsistencies, modify the configurations to match. Otherwise, the migration may fail or proceed slowly.

    system.part_log

    <part_log>
        <database>system</database>
        <table>part_log</table>
        <partition_by>event_date</partition_by>
        <order_by>event_time</order_by>
        <ttl>event_date + INTERVAL 15 DAY DELETE</ttl>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </part_log>

    system.query_log

    <query_log>
        <database>system</database>
        <table>query_log</table>
        <partition_by>event_date</partition_by>
        <order_by>event_time</order_by>
        <ttl>event_date + INTERVAL 15 DAY DELETE</ttl>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </query_log>
  2. After you modify the configuration, run the `DROP TABLE system.part_log` and `DROP TABLE system.query_log` statements. The system.part_log and system.query_log tables are automatically recreated when you insert data into a business table.
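You can script the comparison of the `part_log` settings against the expected values above. The following Python sketch parses a config.xml snippet with the standard library; the helper name and the sample snippet are illustrative.

```python
import xml.etree.ElementTree as ET

# Expected <part_log> settings from this topic.
EXPECTED = {
    "database": "system",
    "table": "part_log",
    "partition_by": "event_date",
    "order_by": "event_time",
    "ttl": "event_date + INTERVAL 15 DAY DELETE",
    "flush_interval_milliseconds": "7500",
}

def check_part_log(config_xml: str) -> list:
    """Return the names of <part_log> settings that are missing or differ
    from the expected values shown above."""
    root = ET.fromstring(config_xml)
    node = root.find(".//part_log")
    if node is None:
        return ["part_log"]
    bad = []
    for key, want in EXPECTED.items():
        child = node.find(key)
        if child is None or (child.text or "").strip() != want:
            bad.append(key)
    return bad

# Illustrative snippet with a non-matching flush interval.
sample = """<clickhouse><part_log>
  <database>system</database><table>part_log</table>
  <partition_by>event_date</partition_by><order_by>event_time</order_by>
  <ttl>event_date + INTERVAL 15 DAY DELETE</ttl>
  <flush_interval_milliseconds>500</flush_interval_milliseconds>
</part_log></clickhouse>"""
print(check_part_log(sample))  # -> ['flush_interval_milliseconds']
```

The same check applies to `<query_log>` with the table name changed to query_log.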

Step 2: Configure the destination cluster to be compatible with the self-managed cluster version.

Configure the destination cluster to be compatible with the self-managed cluster. This minimizes the business modifications required after the migration.

  1. Obtain the version numbers of the destination cluster and the self-managed cluster and compare them.

    Log on to the destination cluster and the self-managed cluster and run the following statement to obtain their version numbers. For more information about how to log on to ApsaraDB for ClickHouse, see Connect to a database.

    SELECT version();
  2. If the versions are different, log on to the destination cluster and modify the compatibility parameter to match the version number of the self-managed cluster. This ensures that their features are as consistent as possible. The following is an example.

    SET GLOBAL compatibility = '22.8';
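The compatibility value is the major.minor prefix of the version string that SELECT version() returns on the self-managed cluster. A Python sketch that builds the statement (the version string is an example):

```python
def compatibility_statement(source_version: str) -> str:
    """Build the SET GLOBAL compatibility statement from the full version
    string returned by SELECT version(), keeping only the major.minor part."""
    major, minor = source_version.split(".")[:2]
    return f"SET GLOBAL compatibility = '{major}.{minor}';"

print(compatibility_statement("22.8.5.29"))
# -> SET GLOBAL compatibility = '22.8';
```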

(Optional) Step 3: Enable the MaterializedMySQL engine on the destination cluster.

If your self-managed cluster contains tables that use the MaterializedMySQL engine, run the following statement to enable this engine.

SET GLOBAL allow_experimental_database_materialized_mysql = 1;
Note

The ClickHouse community no longer maintains the MaterializedMySQL engine. After you migrate to the cloud, we recommend that you use DTS to synchronize MySQL data.

To address the issue of the unmaintained MaterializedMySQL engine, DTS uses ReplacingMergeTree tables instead of MaterializedMySQL tables when it synchronizes MySQL data to ApsaraDB for ClickHouse. For more information, see MaterializedMySQL engine compatibility.

For more information about how to use DTS to migrate MySQL data to ApsaraDB for ClickHouse, see the DTS documentation.

Step 4: Create a migration task

  1. Log on to the ApsaraDB for ClickHouse console.

  2. On the Clusters page, select Clusters of Community-compatible Edition and click the ID of the target cluster.

  3. In the navigation pane on the left, click Data Migration and Synchronization > Migrate from ClickHouse.

  4. On the Migration Tasks page, click Create Migration Task.

    1. Configure the source and destination instances.

      Configure the following connection settings and click Test Connection and Proceed.

      Note

      If the connection test is successful, proceed to the next step. If the connection test fails, reconfigure the source and destination instances based on the prompts.


      Source cluster configuration items

      • Source Access Method: Select Express Connect, VPN Gateway, Smart Access Gateway, or Self-managed ClickHouse Clusters on an ECS Instance.

      • Cluster Name: The name of the source cluster. It can contain only digits and lowercase letters. Example: source.

      • Source Cluster Name: Run SELECT * FROM system.clusters; on the source cluster to obtain the cluster name. Example: default.

      • VPC IP Address: The IP address and TCP port of each shard in the cluster, separated by commas. Format: IP:PORT,IP:PORT,... Example: 192.168.0.5:9000,192.168.0.6:9000.

        Important

        Do not use the VPC domain name or SLB address of ApsaraDB for ClickHouse.

        The method used to obtain the cluster IP addresses and ports varies based on the migration scenario:

        • Cross-account or cross-region migration of an Alibaba Cloud ClickHouse instance: Run the following SQL statement to obtain the IP address and port of each shard in the source cluster.

          SELECT shard_num, replica_num, host_address as ip, port FROM system.clusters WHERE cluster = 'default' and replica_num = 1;

          Here, replica_num = 1 selects the first replica of each shard. You can also choose other replicas, or select one replica from each shard yourself.

        • Migration of a non-Alibaba Cloud ClickHouse instance: If the IP addresses cannot be directly mapped to Alibaba Cloud, run the following SQL statement to obtain the IP address and port of each shard in the source cluster.

          SELECT shard_num, replica_num, host_address as ip, port FROM system.clusters WHERE cluster = '<cluster_name>' and replica_num = 1;

          Parameter description:

          • cluster_name: The cluster name of the self-managed cluster.

          • replica_num = 1 selects the first replica of each shard. You can also choose other replicas, or select one replica from each shard yourself.

          If the IP addresses and ports are translated before they are mapped to Alibaba Cloud, configure the translated IP addresses and ports based on the network connection.

      • Database Account: The database account of the source cluster. Example: test.

      • Database Password: The password of the source cluster's database account. Example: test******.

      Destination cluster configuration items

      • Database Account: The database account of the destination cluster. Example: test.

      • Database Password: The password of the destination cluster's database account. Example: test******.
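The VPC IP Address value can be assembled from the rows returned by the system.clusters queries above. A Python sketch that keeps one replica per shard (the rows are sample data):

```python
def build_vpc_ip_address(rows) -> str:
    """Assemble the 'IP:PORT,IP:PORT' value from (shard_num, replica_num,
    ip, port) rows, keeping the first replica seen for each shard."""
    chosen = {}
    for shard_num, replica_num, ip, port in rows:
        chosen.setdefault(shard_num, f"{ip}:{port}")
    return ",".join(chosen[s] for s in sorted(chosen))

# Sample rows as returned by the system.clusters query.
rows = [
    (1, 1, "192.168.0.5", 9000),
    (1, 2, "192.168.0.7", 9000),
    (2, 1, "192.168.0.6", 9000),
]
print(build_vpc_ip_address(rows))  # -> 192.168.0.5:9000,192.168.0.6:9000
```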

    2. Confirm the migration content.

      Carefully review the data migration information, and then click Next: Pre-detect and Start Synchronization.

    3. The system performs a pre-check on the migration link and then starts the task.

      The system runs an Instance Status Detection, Storage Space Detection, and Local Table and Distributed Table Detection on the destination and self-managed clusters.

      • If the check is successful:

        The following figure shows the interface after a successful check.


        1. Carefully read the prompts on the page about the impacts on the instances during the migration.

        2. Click Completed.

          Important
          • After you click Completed, the task is created and started, and its status changes to Running. You can view the task in the task list.

          • After you create the task, you must also monitor it. In the final stage of the migration, you must stop write operations to the self-managed cluster and migrate the remaining database and table schemas. For more information about how to monitor the migration task, see Monitor the migration task and stop writes to the self-managed cluster.

      • If the check fails: You must follow the prompts and perform the data migration again. The check items and their requirements are as follows. For error messages and solutions, see Error messages and solutions for migration checks.

        • Instance Status Detection: When the migration is initiated, no management tasks (such as scale-out, upgrade, or downgrade) can be running on the self-managed cluster or the destination cluster. If management tasks are running, the migration task cannot be initiated.

        • Storage Space Detection: Before the migration, a storage space check is performed. Make sure that the storage space of the destination cluster is at least 1.2 times the used space of the self-managed cluster.

        • Local Table and Distributed Table Detection: If a local table in the self-managed cluster does not have a corresponding distributed table, or if the distributed table is not unique, the check fails. Delete the extra distributed tables or create a unique distributed table.
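The Storage Space Detection rule above can be checked in advance. A minimal Python sketch of the 1.2x rule (the values are examples):

```python
def destination_storage_ok(used_source_gb: float, capacity_dest_gb: float) -> bool:
    """Storage Space Detection rule: the destination cluster needs at least
    1.2 times the used space of the self-managed cluster."""
    return capacity_dest_gb >= 1.2 * used_source_gb

print(destination_storage_ok(1000, 1500))  # -> True
print(destination_storage_ok(1000, 1100))  # -> False
```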

Step 5: Assess whether the migration can be completed

If the write speed of the source cluster is less than 20 MB/s, you can skip this step.

If the write speed of the source cluster is greater than 20 MB/s, the write speed of a single node in the destination cluster must also be greater than 20 MB/s. To confirm that the destination cluster can keep up with the source cluster and that the migration can succeed, check the actual write speed of the destination cluster. Follow these steps:

  1. You can view the Disk throughput of the destination cluster to determine its actual write speed. For more information about how to view Disk throughput, see View cluster monitoring information.

  2. Determine the relationship between the write speeds of the destination and source clusters.

    1. If the write speed of the destination cluster is greater than the write speed of the source cluster: The migration has a high probability of success. Continue to Step 6.

    2. If the write speed of the destination cluster is less than the write speed of the source cluster: The migration has a high probability of failure. We recommend that you cancel the migration task and use manual migration to migrate the data.
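The assessment in this step reduces to a simple comparison. A Python sketch of the rule (the function name and return values are illustrative):

```python
def migration_feasible(source_write_mb_s: float, dest_write_mb_s: float) -> str:
    """Apply the rule from this step: below 20 MB/s on the source, no check
    is needed; otherwise the destination must write faster than the source."""
    if source_write_mb_s < 20:
        return "proceed"
    if dest_write_mb_s > source_write_mb_s:
        return "proceed"
    return "use manual migration"

print(migration_feasible(10, 0))    # source below 20 MB/s -> proceed
print(migration_feasible(50, 80))   # destination faster -> proceed
print(migration_feasible(50, 30))   # destination slower -> use manual migration
```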

Step 6: Monitor the migration task and estimate when to stop writes to the self-managed cluster

  1. Log on to the ApsaraDB for ClickHouse console.

  2. On the Clusters page, select Clusters of Community-compatible Edition and click the ID of the destination cluster.

  3. In the navigation pane on the left, click Data Migration and Synchronization > Migrate from ClickHouse.

  4. On the instance migration list page, you can perform the following operations:

    • View the status and running phase information of the migration task.

      Important

      Closely monitor the Running Information of the destination task. Based on the estimated remaining time in the Running Information column, follow Step 7 to stop writes to the self-managed cluster and handle Kafka and RabbitMQ engine tables.

    • Click View Details in the Actions column to open the task details page. The page includes the following details:

      Note

      If the migration task ends (its status is Completed or Canceled), the content on the View Details page in the console is cleared. You can view the list of migrated table schemas in the destination cluster by running the following SQL statement.

      SELECT `database`, `name`, `engine_full` FROM `system`.`tables` WHERE `database` NOT IN ('system', 'INFORMATION_SCHEMA', 'information_schema');
      • All migrated table schemas and their migration success status.

      • All migrated database structures and their migration success status.

      • All error messages for failed database and table migrations.

    The following describes each migration task status:

    • Running: Preparing the environment and resources for the migration.

    • Initialization: Initializing the migration task.

    • Configuration Migration: Migrating the cluster configuration.

    • Schema Migration: Migrating all databases, along with MergeTree family tables and Distributed tables.

    • Data Migration: Incrementally migrating data from MergeTree family tables.

    • Other Schema Migration: Migrating materialized views and non-MergeTree table schemas.

    • Data Check: Checking whether the data volume of the migrated tables in the destination cluster is consistent with the data volume in the self-managed cluster. If it is not, the task may not complete. We recommend that you restart the migration.

    • Post-configuration: Configuring the destination cluster after the data migration is complete, for example, cleaning up intermediate migration state and re-enabling writes.

    • Completed: The migration task is complete.

    • Canceled: The migration task has been canceled.

Step 7: Stop writes to the self-managed cluster and handle Kafka and RabbitMQ engine tables

Before you switch your business services to the new cluster, you must ensure that no new data is written to the self-managed instance to guarantee data integrity after the migration. To do this, you must stop business write operations and delete the Kafka and RabbitMQ tables. Follow these steps:

  1. Log on to the self-managed cluster and run the following statement to query the tables that you need to process.

    SELECT * FROM system.tables WHERE engine IN ('RabbitMQ', 'Kafka');
  2. Run the following statement to view the table creation statement of each table that you need to process.

    SHOW CREATE TABLE <target_table_name>;
  3. Log on to the destination cluster and execute the table creation statement that you obtained in the previous step. For more information about how to log on to the destination cluster, see Connect to a ClickHouse cluster using DMS.

  4. Log on to the self-managed cluster and delete the migrated Kafka and RabbitMQ engine tables.

    Important

    When you delete a Kafka table, you must also delete the materialized views that reference it. Otherwise, the materialized view migration cannot be completed, which causes the entire migration task to fail.

Step 8: Complete the migration task

The Complete operation finalizes the migration after you stop writes to the self-managed cluster. This operation completes the remaining data migration, performs a data volume check, and migrates the remaining database and table schemas. You can view the migrated data in the task details.

Important
  • If the check fails, the migration task remains in the data volume check phase. We recommend that you cancel the migration and create a new migration task. For more information about how to cancel a migration task, see Other operations.

  • A long data migration can cause an excessive amount of metadata to accumulate in the destination cluster, which can affect the migration speed. We recommend that you complete this operation within 5 days of creating the migration task.

  1. Log on to the ApsaraDB for ClickHouse console.

  2. On the Clusters page, select Clusters of Community-compatible Edition and click the ID of the target cluster.

  3. In the navigation pane on the left, click Data Migration and Synchronization > Migrate from ClickHouse.

  4. In the Actions column for the destination migration task, click Complete Migration.

  5. In the Complete Migration dialog box, click OK.

Step 9: Migrate business data from non-MergeTree tables

The migration task supports migrating only the table schemas of non-MergeTree tables, such as external tables and Log tables. After the migration task is complete, these tables in the destination cluster contain only a table schema and no business data. You must migrate the business data manually. Follow these steps:

  1. Log on to the self-managed cluster and run the following statement to view the non-MergeTree tables whose data you need to migrate.

    SELECT
        `database` AS database_name,
        `name` AS table_name,
        `engine`
    FROM `system`.`tables`
    WHERE (`engine` NOT LIKE '%MergeTree%') AND (`engine` != 'Distributed') AND (`engine` != 'MaterializedView') AND (`engine` NOT IN ('Kafka', 'RabbitMQ')) AND (`database` NOT IN ('system', 'INFORMATION_SCHEMA', 'information_schema')) AND (`database` NOT IN (
        SELECT `name`
        FROM `system`.`databases`
        WHERE `engine` IN ('MySQL', 'MaterializedMySQL', 'MaterializeMySQL', 'Lazy', 'PostgreSQL', 'MaterializedPostgreSQL', 'SQLite')
    ))
  2. Log on to the destination cluster and migrate the table data using the remote function. For more information, see Migrate data using the remote function.

Other operations

When a migration task completes, its Migration Status changes to Completed. However, the task list does not update immediately. We recommend that you periodically refresh the page to check the task status.

Cancel Migration

  • Function: Forcibly cancels the task, skips the data volume check, and does not migrate the remaining database and table schemas.

  • Impact:

    • The destination cluster restarts.

    • The migration task is forcibly terminated. The database and table schemas and configuration of the destination instance may be incomplete and cannot be used for normal business operations.

    • Before you restart the migration, you must clear the migrated data from the destination cluster to avoid data duplication.

  • Scenario: The migration task has affected the self-managed cluster, and you want to end the migration and enable writes as soon as possible.

Stop Migration

  • Function: Immediately stops the data migration, skips the data volume check, and migrates the remaining database and table schemas.

  • Impact: The destination cluster restarts.

  • Scenario: You want to test after migrating a portion of the data but do not want to stop writes to the self-managed cluster.

Stop Migration

  1. Log on to the ApsaraDB for ClickHouse console.

  2. On the Clusters page, select Clusters of Community-compatible Edition and click the ID of the target cluster.

  3. In the navigation pane on the left, click Data Migration and Synchronization > Migrate from ClickHouse.

  4. In the Actions column for the destination migration task, click Stop Migration.

  5. In the Stop Migration dialog box, click OK.

Cancel Migration

  1. Log on to the ApsaraDB for ClickHouse console.

  2. On the Clusters page, select Clusters of Community-compatible Edition and click the destination cluster ID.

  3. In the navigation pane on the left, click Data Migration and Synchronization > Migrate from ClickHouse.

  4. For the destination migration task, click Cancel Migration in the Actions column.

  5. In the Cancel Migration dialog box, click OK.

Manual migration

Method 1: Migrate using the BACKUP and RESTORE commands

For more information, see Use the BACKUP and RESTORE commands to back up and restore data.

Method 2: Migrate using the INSERT FROM SELECT statement

Step 1: Migrate metadata (DDL for table creation)

ClickHouse metadata migration mainly refers to migrating the DDL for table creation.

Install clickhouse-client. Make sure that the client version matches the version of the destination ApsaraDB for ClickHouse cluster. For download instructions, see clickhouse-client.

  1. View the list of databases in the self-managed cluster.

    clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW databases"  > database.list

    Parameters:

    • old host: The address of the self-managed cluster.

    • old port: The port of the self-managed cluster.

    • old user name: The account used to log on to the self-managed cluster. It must have DML read/write and settings permissions, and allow DDL permissions.

    • old password: The password for the account.

    Note

    The `system` database does not need to be migrated. You can filter it out during this process.

  2. View the list of tables in the self-managed cluster.

    clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW tables from <database_name>"  > table.list

    Parameters:

    • database_name: The database name.

    You can also directly query all database and table names from the system tables.

    SELECT DISTINCT database, name FROM system.tables WHERE database != 'system';
    Note

    If any of the queried table names start with `.inner.`, they are internal representations of materialized views and do not need to be migrated. You can filter them out.

  3. Export the table creation DDL for all tables in a specified database of the self-managed cluster.

    clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SELECT concat(create_table_query, ';') FROM system.tables WHERE database='<database_name>' FORMAT TabSeparatedRaw" > tables.sql
  4. Import the table creation DDL into the destination ApsaraDB for ClickHouse instance.

    Note

    Before you import the table creation DDL, you must create the corresponding database in ApsaraDB for ClickHouse.

    clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>"  -d '<database_name>'  --multiquery < tables.sql

    Parameters:

    • new host: The address of the destination ApsaraDB for ClickHouse instance.

    • new port: The port of the destination ApsaraDB for ClickHouse instance.

    • new user name: The account used to log on to the destination ApsaraDB for ClickHouse instance. It must have DML read/write and settings permissions, and allow DDL permissions.

    • new password: The password for the account.

Step 2: Migrate data

Migrate data using the remote function

  1. (Optional) When you migrate data to ApsaraDB for ClickHouse, you can adjust the `network_compression_method` parameter to select a suitable compression algorithm. This helps reduce network traffic usage.

    • To temporarily set the `network_compression_method` parameter in the destination ApsaraDB for ClickHouse instance, use the following example.

      SET network_compression_method = 'ZSTD';

    • To view the `network_compression_method` parameter value in the destination ApsaraDB for ClickHouse instance, use the following example.

      SELECT * FROM system.settings WHERE name = 'network_compression_method';
  2. In the destination ApsaraDB for ClickHouse instance, migrate data using the following SQL statement.

    INSERT INTO <new_database>.<new_table> 
    SELECT * 
    FROM remote('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>') 
    [WHERE _partition_id = '<partition_id>']
    SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0, max_result_rows = 0;
    Note

    For version 20.8, we recommend that you first try to use the `remoteRaw` function for data migration. If the migration fails, you can request a minor version upgrade.

    INSERT INTO <new_database>.<new_table> 
    SELECT * 
    FROM remoteRaw('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>')
    [WHERE _partition_id = '<partition_id>']
    SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0, max_result_rows = 0;

    Parameters:

    Important

    Filtering data by `partition_id` can reduce resource usage. We recommend that you use this parameter.

    (Optional) If you do not know how to obtain the `partition_id` and the number of parts, you can query them from the `system.parts` table using the following SQL statement.

    SELECT partition_id, count(*) AS part_count from clusterAllReplicas(default, system, parts) WHERE `database` = '<old_database>' AND `table` = '<old_table>' GROUP BY partition_id ;

    The parameters in the preceding statements are described as follows:

    • new_database: The name of the database in the destination ApsaraDB for ClickHouse instance.

    • new_table: The name of the table in the destination ApsaraDB for ClickHouse instance.

    • old_endpoint: The endpoint of the source instance.

      • Self-managed ClickHouse: the endpoint is in the format <IP address of a source node>:<port>. The port must be the TCP port.

      • ApsaraDB for ClickHouse: use the VPC internal endpoint of the source instance, not the public endpoint. Ports 3306 and 9000 are fixed values.

        • Community-Compatible Edition instance:

          • Endpoint format: <VPC internal address>:3306.

          • Example: cc-2zeqhh5v7y6q*****.clickhouse.ads.aliyuncs.com:3306

        • Enterprise Edition instance:

          • Endpoint format: <VPC internal address>:9000.

          • Example: cc-bp1anv7jo84ta*****clickhouse.clickhouseserver.rds.aliyuncs.com:9000

    • old_database: The name of the database in the source cluster.

    • old_table: The name of the table in the source cluster.

    • username: The account of the source cluster.

    • password: The password of the account.

    • max_execution_time: The maximum execution time of the query. A value of 0 indicates no limit.

    • max_bytes_to_read: The maximum number of bytes that the query can read from the source. A value of 0 indicates no limit.

    • log_query_threads: Specifies whether to log information about the threads that run the query. A value of 0 disables thread logging.

    • max_result_rows: The maximum number of rows in the query result. A value of 0 indicates no limit.

    • _partition_id: The ID of the data partition.

Migrate data by exporting and importing files

Export data from the self-managed cluster database and import it into the destination ApsaraDB for ClickHouse instance using files.

  • Export and import using a CSV file
    1. Export data from the self-managed cluster database to a CSV format file.

      clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>"  --query="select * from <database_name>.<table_name> FORMAT CSV"  > table.csv
    2. Import the CSV file into the destination ApsaraDB for ClickHouse instance.

      clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>"  --query="insert into <database_name>.<table_name> FORMAT CSV"  < table.csv
  • Stream export and import using a Linux pipe
    clickhouse-client --host="<old host>" --port="<old port>" --user="<user name>" --password="<password>"  --query="select * from <database_name>.<table_name> FORMAT CSV" | 
    clickhouse-client --host="<new host>" --port="<new port>" --user="<user name>" --password="<password>"   --query="INSERT INTO <database_name>.<table_name> FORMAT CSV"
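The shell pipeline above can also be driven from a script. The following Python sketch is illustrative only (hosts, ports, and credentials are placeholders, and `clickhouse-client` must be installed for `stream_migrate` to actually run); it builds the two command lines and connects them with a pipe, which is equivalent to the shell `|` in the example above:

```python
import subprocess

def client_argv(host, port, user, password, query):
    # Build the clickhouse-client argument list used in the shell
    # examples above. Passing arguments as a list avoids shell quoting
    # issues in the query string.
    return ["clickhouse-client",
            f"--host={host}", f"--port={port}",
            f"--user={user}", f"--password={password}",
            f"--query={query}"]

# Placeholder hosts and credentials.
export_cmd = client_argv("old-host", 9000, "reader", "secret",
                         "SELECT * FROM db.tbl FORMAT CSV")
import_cmd = client_argv("new-host", 9000, "writer", "secret",
                         "INSERT INTO db.tbl FORMAT CSV")

def stream_migrate(export_cmd, import_cmd):
    # Equivalent of `export | import`: connect the exporter's stdout
    # directly to the importer's stdin so data is streamed, not staged
    # in a local file.
    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, stdin=exporter.stdout)
    exporter.stdout.close()  # let the importer see EOF when the export ends
    return importer.wait()
```

Streaming avoids the intermediate CSV file, which matters when the table is larger than the local disk of the machine that runs the migration.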

Migration check error messages and solutions

| Error message | Description | Solution |
| --- | --- | --- |
| Missing unique distributed table or sharding_key not set. | The local table in the self-managed cluster does not have a corresponding distributed table. | Before migration, create a distributed table for the local table in the self-managed cluster. |
| The corresponding distributed table is not unique. | The local table in the self-managed cluster has multiple corresponding distributed tables. | In the self-managed cluster, delete the extra distributed tables and keep only one. |
| MergeTree table on multiple replica cluster. | The self-managed cluster is a multi-replica cluster that contains non-replicated tables. Migration is not supported because the data is inconsistent between replicas. | For more information, see Why are non-replicated tables not allowed when you scale or migrate multi-replica instances? |
| Data reserved table on destination cluster. | The corresponding table on the destination cluster already contains data. | Delete the corresponding table from the destination cluster. |
| Columns of distributed table and local table conflict. | The columns of the distributed table and the local table in the self-managed cluster do not match. | In the self-managed cluster, rebuild the distributed table so that its columns match the local table. |
| Storage is not enough. | The destination cluster has insufficient storage space. | Expand the storage of the destination cluster. The total storage of the destination cluster must exceed 1.2 times the used space of the self-managed cluster. For more information, see Upgrade/Downgrade and scale-out/scale-in of community-compatible clusters. |
| Missing system table. | A required system table is missing in the self-managed cluster. | Modify the config.xml file of the self-managed cluster to create the required system tables. For more information, see Step 1: Check the self-managed cluster and enable system tables. |
| The table is incomplete across different nodes. | The table is missing on some nodes. | Create tables with the same name on all shards. For an inner table of a materialized view, rename the inner table, and then rebuild the materialized view to point to the renamed inner table. For more information, see The inner table of a materialized view is inconsistent across shards. |

Calculate the merge time after migration

After migration, the destination cluster frequently performs merge operations. This increases I/O usage and the latency of service requests. If your business is sensitive to read and write data latency, you can upgrade the instance type and ESSD performance level to shorten the duration of high I/O usage caused by merges. For more information, see Vertical scaling, scale-out, and scale-in of community-compatible clusters.

The formula for the merge time after migration is as follows:

Note

This formula applies to both single-replica edition and master-replica clusters.

  • Total time for frequent merge operations = MAX(Hot data merge time, Cold data merge time)

    • Hot data merge time = Amount of hot data on a single node × 2 / MIN(Instance type bandwidth, Disk bandwidth × n)

    • Cold data merge time = (Amount of cold data / Number of nodes) / MIN(Instance type bandwidth, OSS read bandwidth) + (Amount of cold data / Number of nodes) / MIN(Instance type bandwidth, OSS write bandwidth)

The parameters are described as follows.

  • Amount of hot data on a single node: You can view the data in the Disk Usage - Single-Node Statistics row. For more information, see View cluster monitoring information.

  • Instance type bandwidth

    Note

    The instance type bandwidth values are not absolute. This parameter varies if the ApsaraDB for ClickHouse backend uses different machine types. The provided values are the minimum bandwidths and are for reference only.

    | Specification | Bandwidth (MB/s) |
    | --- | --- |
    | Standard 8-core 32 GB | 250 |
    | Standard 16-core 64 GB | 375 |
    | Standard 24-core 96 GB | 500 |
    | Standard 32-core 128 GB | 625 |
    | Standard 64-core 256 GB | 1250 |
    | Standard 80-core 384 GB | 2000 |
    | Standard 104-core 384 GB | 2000 |

  • Disk bandwidth: Refer to the Maximum throughput per disk (MB/s) row in the ESSD performance level table.

  • n: The number of disks on a single node. Run the following command to obtain this value: SELECT count() FROM system.disks WHERE type = 'local';

  • Cold data amount: You can view the data in the Cold storage usage row. For more information, see View cluster monitoring information.

  • Number of nodes: The number of nodes in the cluster. Run the following command to obtain this value: SELECT count() FROM system.clusters WHERE cluster = 'default' AND replica_num = 1;

  • OSS read bandwidth: Refer to the value in the Total Intranet and Internet Download Bandwidth column of the OSS bandwidth table.

  • OSS write bandwidth: Refer to the value in the Total Intranet and Internet Upload Bandwidth column of the OSS Bandwidth table.
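As a worked example of the formula above, the following Python sketch plugs in purely illustrative numbers (500 GB of hot data per node, 2,000 GB of cold data across 4 nodes, a Standard 16-core 64 GB instance, and assumed disk and OSS bandwidth values; none of these are measurements):

```python
# Worked example of the merge-time formula. All inputs are illustrative
# assumptions, not measured values.
hot_data_per_node_gb = 500    # hot data on a single node
cold_data_total_gb   = 2000   # total cold data in the cluster
node_count           = 4
instance_bandwidth   = 375    # MB/s, e.g. Standard 16-core 64 GB
disk_bandwidth       = 350    # MB/s per disk (assumed, from the ESSD PL table)
disks_per_node       = 2      # n, from system.disks
oss_read_bandwidth   = 300    # MB/s, assumed
oss_write_bandwidth  = 300    # MB/s, assumed

# Hot data merge time = hot data per node x 2 / MIN(instance bw, disk bw x n)
hot_seconds = hot_data_per_node_gb * 1024 * 2 / min(
    instance_bandwidth, disk_bandwidth * disks_per_node)

# Cold data merge time = read pass + write pass, on the per-node share
cold_per_node_mb = cold_data_total_gb * 1024 / node_count
cold_seconds = (cold_per_node_mb / min(instance_bandwidth, oss_read_bandwidth)
                + cold_per_node_mb / min(instance_bandwidth, oss_write_bandwidth))

# Total time = MAX(hot data merge time, cold data merge time)
total_seconds = max(hot_seconds, cold_seconds)
print(f"hot: {hot_seconds / 3600:.2f} h, cold: {cold_seconds / 3600:.2f} h, "
      f"total: {total_seconds / 3600:.2f} h")
```

With these assumed inputs, the cold data passes dominate, so the total duration of frequent merges equals the cold data merge time.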

FAQ

  • Q: How do I resolve the "Too many partitions for single INSERT block (more than 100)" error?

    A: This error occurs because a single INSERT operation exceeds the max_partitions_per_insert_block limit. The default value for this parameter is 100. In ClickHouse, each write operation creates a data part, and a partition can contain one or more data parts. If a single INSERT operation writes data to too many partitions, an excessive number of data parts are created. This can significantly impact merge and query performance. ClickHouse imposes this limit to prevent performance degradation.

    Solution: You can resolve this issue by adjusting the number of partitions or modifying the max_partitions_per_insert_block parameter.

    • Adjust the table schema and partitioning method, or avoid inserting data into many different partitions in a single operation.

    • If you cannot avoid inserting data into many partitions at once, you can increase the limit by modifying the max_partitions_per_insert_block parameter based on your data volume. The syntax is as follows:

      SET GLOBAL ON CLUSTER default max_partitions_per_insert_block = XXX;
      Note

      The ClickHouse community recommends using the default value of 100. Do not set this value too high, because it may negatively affect performance. After you finish the batch data import, change the value back to the default.
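    One way to stay under the limit without raising it is to group rows by partition on the client before inserting, so that each INSERT block touches a single partition. The following Python sketch is illustrative only (the row shape and the month-based partition key are assumptions, not your actual schema):

```python
from collections import defaultdict

def group_rows_by_partition(rows, partition_key):
    # Group rows so that each INSERT touches a single partition, keeping
    # the number of partitions per insert block at 1.
    batches = defaultdict(list)
    for row in rows:
        batches[partition_key(row)].append(row)
    return batches

# Example: rows partitioned by month (assumed row shape: (date, value)).
rows = [("2024-01-05", 1), ("2024-02-11", 2), ("2024-01-20", 3)]
batches = group_rows_by_partition(rows, lambda r: r[0][:7])
for month, batch in sorted(batches.items()):
    # One INSERT per batch would go here, e.g. via clickhouse-client.
    print(month, len(batch))
```

    Issuing one INSERT per batch keeps each insert block within a single partition, so the max_partitions_per_insert_block limit is never reached.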

  • Q: Why does the connection from the destination ApsaraDB for ClickHouse instance to my self-managed ClickHouse database fail?

    A: This issue can occur if the self-managed ClickHouse cluster is configured with a firewall or a whitelist. To resolve this, you must add the IPv4 CIDR of the VSwitch ID for the ApsaraDB for ClickHouse cluster to the whitelist of your self-managed ClickHouse cluster. For information about how to obtain the IPv4 CIDR of the VSwitch ID for the ApsaraDB for ClickHouse cluster, see View IPv4 CIDR block.

  • Q: When I scale or migrate a multi-replica instance, why are non-Replicated tables not allowed? If they exist, how can I resolve this issue?

    A: The reason for this restriction and the solution are as follows:

    • Reason: Multi-replica instances require Replicated tables to synchronize data between replicas. Without Replicated tables, the multi-replica setup is ineffective. The migration tool randomly selects one replica as the data source and migrates its data to the destination instance.

      If non-Replicated tables exist, data cannot be synchronized between replicas, which means the data exists only on a single replica. Because the migration tool migrates data from only one replica, this process causes data loss. For example, as shown in the following figure, the MergeTree table on replica 0 (r0) contains data 1, 2, and 3. The MergeTree table on replica 1 (r1) contains data 4 and 5. If the migration tool selects r0 as the source, only data 1, 2, and 3 are migrated to the destination instance.

    • Solution: If you can delete the non-Replicated tables in the source instance, delete them. Otherwise, you must replace the non-Replicated tables with Replicated tables by performing the following steps:

      1. Log on to the source instance.

      2. Create a Replicated table. The table schema must be identical to the non-Replicated table that you want to replace, except for the engine.

      3. Manually migrate data from the non-Replicated table to the new Replicated table. The migration statement is as follows.

        Important

        You must migrate each replica. This means you need to run the statement on both r0 and r1.

        You can obtain the node IP address for the statement by running SELECT * FROM system.clusters;.

        INSERT INTO <destination_database>.<new_replicated_table> 
        SELECT * 
        FROM remote('<node_IP>:3003', '<source_database>', '<non_replicated_table_to_replace>', '<username>', '<password>')
        [WHERE _partition_id = '<partition_id>']
        SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
      4. Swap the names of the non-Replicated table and the Replicated table.

      EXCHANGE TABLES <source_database>.<non_replicated_table_to_replace> AND <destination_database>.<new_replicated_table> ON CLUSTER default;
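    Because step 3 must run once per replica, the per-replica statements can be generated from the list of node IP addresses returned by `SELECT * FROM system.clusters;`. The following Python sketch is illustrative only (all database and table names, credentials, and IP addresses are placeholders; port 3003 matches the statement in step 3):

```python
# Illustrative only: expand the per-replica migration statement from
# step 3 for every replica IP address. Port 3003 matches the statement
# above; all other names are placeholders.
def per_replica_statements(replica_ips, src_db, src_table,
                           dst_db, dst_table, user, password):
    template = (
        "INSERT INTO {dd}.{dt} SELECT * "
        "FROM remote('{ip}:3003', '{sd}', '{st}', '{u}', '{p}') "
        "SETTINGS max_execution_time = 0, max_bytes_to_read = 0, "
        "log_query_threads = 0;"
    )
    return [template.format(dd=dst_db, dt=dst_table, ip=ip,
                            sd=src_db, st=src_table, u=user, p=password)
            for ip in replica_ips]

# One statement per replica (r0 and r1 in the example above).
stmts = per_replica_statements(["10.0.0.1", "10.0.0.2"],
                               "src_db", "mt_local",
                               "src_db", "mt_replicated",
                               "user", "pwd")
for sql in stmts:
    print(sql)
```

    Running the generated statement against each replica ensures that rows stored only on that replica are copied into the new Replicated table before the EXCHANGE TABLES swap.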