This solution combines HBase Snapshot and HBase Replication technologies to migrate both historical and incremental data online while the source HBase cluster remains in service, ensuring no data loss during the migration process.
Solution introduction
A Snapshot-only approach supports only offline migration and cannot keep the source cluster in service. If the source cluster continues to run during migration, the following problems may occur:
Incremental data loss: Write or update operations generated in the source cluster during Snapshot export and restoration may not be synchronized to the target cluster.
Data consistency issues: A Snapshot alone cannot capture the real-time incremental data generated during migration.
This solution addresses these problems as follows:
Establish replication relationship (Peer):
Use HBase Replication to establish table-level replication relationships between the source and target clusters, with automatic synchronization initially disabled.
The source cluster records real-time write and update operations for tables but does not immediately synchronize them to the target cluster.
Migrate historical data:
Use HBase Snapshot to export historical data from the source cluster.
If the storage systems (such as HDFS or OSS-HDFS) of the source and target clusters are interconnected, export the Snapshot to the target cluster. Otherwise, export it to an intermediate path in the source cluster first, then synchronize it to the target cluster.
Synchronize incremental data: After historical data restoration is complete, enable the automatic synchronization feature of the Replication Peer. It will replay incremental data generated during migration, ensuring data consistency between source and target.
Through these steps, this solution completely migrates both historical and incremental data while ensuring continuous operation of the source cluster, avoiding data loss.
Precautions
HBase version compatibility:
HBase 1.x: Set hbase.replication=true and restart both the primary and backup clusters to enable the Replication feature.
HBase 2.x: The hbase.replication parameter has been removed, and no additional settings are required.
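For HBase 1.x, the setting can be sketched as the following hbase-site.xml fragment, to be added on both the primary and backup clusters before the restart:

```xml
<!-- hbase-site.xml (HBase 1.x only): enable the Replication feature -->
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
```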
Replication configuration: HBase configures Replication at the column family level. If you need to migrate specific column families, ensure that the Replication property (REPLICATION_SCOPE => '1') is correctly configured for them.
Network connectivity:
Ensure network connectivity between the source HBase cluster and the target HBase cluster through CEN, leased line, or VPN.
Establish network connectivity between the source and target clusters in advance, and ensure that the following ports of the target cluster are open to the source cluster:
ZooKeeper service port: port 2181 on the ECS instances that host ZooKeeper in the target cluster.
HBase service ports: ports 16010, 16020, and 16030 on the ECS instances that host EMR HBase in the target cluster.
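Before starting, you can verify that these ports are reachable from the source cluster. A minimal sketch, assuming a bash shell with /dev/tcp support; the IP address in the commented loop is a placeholder for a target-cluster node, not a real address:

```shell
# Check whether a TCP port on a remote host is reachable (sketch).
check_port() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
  fi
}

# Hypothetical usage against one target-cluster node:
# for p in 2181 16010 16020 16030; do check_port 192.168.xx.xx "$p"; done
check_port 127.0.0.1 1   # port 1 is normally closed
```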
Procedure
Step 1: Create a Peer (replication relationship)
Establish a table-level replication relationship between the source and target clusters, and temporarily disable automatic synchronization until the historical Snapshot data synchronization is complete.
Log on to the master node of the source cluster. For more information, see Log on to a cluster.
Run the following command to enter HBase Shell.
hbase shell
Add a Peer (replication relationship).
Run the following command in HBase Shell to add a Peer to the target cluster. Specify the tables to be migrated.
add_peer '${peer_name}', CLUSTER_KEY => "${slave_zk}:${port}:/hbase", TABLE_CFS => { "${table_name}" => [] }
Parameters:
${peer_name}: The name of the replication relationship. You can customize it. In this example, it is peer1.
${slave_zk}: The ZooKeeper quorum of the target cluster: the comma-separated internal IP addresses or hostnames of the ZooKeeper nodes, in the format {slave_zk1},{slave_zk2},{slave_zk3}.
${port}: The ZooKeeper client port of the target cluster. The default port is 2181.
${table_name}: The name of the table to be migrated. In this example, it is t1.
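The CLUSTER_KEY string follows the pattern quorum:port:znode-parent and can be assembled as follows. A minimal sketch; the hostnames are placeholders for the target cluster's ZooKeeper nodes:

```shell
# Assemble the CLUSTER_KEY used by add_peer:
# <zk quorum (comma-separated hosts)>:<client port>:<znode parent>
slave_zk="zk1,zk2,zk3"   # placeholder hostnames of the target ZooKeeper nodes
port="2181"              # default ZooKeeper client port
cluster_key="${slave_zk}:${port}:/hbase"
echo "${cluster_key}"
```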
Enable table-level replication.
Enable table-level replication to ensure that the specified tables can synchronize written data to the target cluster.
enable_table_replication 't1'
Temporarily disable automatic synchronization.
This command pauses the data replication process for the specified Peer. After it is disabled, the source cluster no longer sends new data updates to the target cluster. Existing data is not deleted or affected.
disable_peer 'peer1'
Step 2: Create a Snapshot
Run the following command in the HBase Shell of the source cluster to create a Snapshot that captures the historical data of the table to be migrated.
snapshot '${table_name}', '${snapshot_name}'
Parameters:
${table_name}: The name of the table to be migrated. In this example, it is t1.
${snapshot_name}: The custom Snapshot name. In this example, it is t1-snapshot.
Example:
snapshot 't1', 't1-snapshot'
Step 3: Export the Snapshot to the target cluster
Scenario 1: The storage systems of the source and target clusters are interconnected
If the storage systems of the source and target clusters are interconnected, run the following command in the source cluster. It will export the Snapshot directly to the target cluster.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ${snapshot_name} -copy-to ${target_rootdir_path}
Parameters:
${snapshot_name}: The custom Snapshot name. In this example, it is t1-snapshot.
${target_rootdir_path}: The HBase root directory path of the target cluster. Replace it based on your actual environment.
OSS-HDFS: In the console, go to the HBase service of the target cluster and view the hbase.rootdir configuration item in the hbase-site.xml file to obtain the path.
HDFS: In the console, go to the Hadoop-Common service of the target cluster and view the fs.defaultFS configuration item in the core-site.xml file to obtain the path.
Example:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1-snapshot -copy-to oss://xxx.cn-hangzhou.oss-dls.aliyuncs.com/hbase/c-9d34bc8fxxx
Scenario 2: The storage systems of the source and target clusters are not interconnected
If the source cluster cannot directly access the storage path of the target cluster, first export the Snapshot to an intermediate path in the source cluster (such as HDFS or OSS), and then synchronize it to the target cluster. This example demonstrates migrating data from HDFS to OSS-HDFS.
Export the Snapshot to an intermediate path.
Run the following command in the source HBase cluster to export the Snapshot to an intermediate path in the source cluster.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ${snapshot_name} -copy-to ${src_temp_path}/${table_name}
Parameters:
${snapshot_name}: The custom Snapshot name. In this example, it is t1-snapshot.
${src_temp_path}: The intermediate path in the source cluster. For example, if the source cluster uses HDFS, you can choose an HDFS path as the intermediate path.
${table_name}: The name of the table to be migrated. In this example, it is t1.
Example:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1-snapshot -copy-to hdfs:///tmp/hbase-snapshot/t1
Migrate data to the target path.
Run the following command to use the JindoDistCp tool to migrate data from the intermediate path in the source cluster to the path in the target cluster. For more information about JindoDistCp, see JindoDistCp user guide.
Log on to the E-MapReduce (EMR) console and specify the AccessKey pair that is used to access OSS or OSS-HDFS.
Add the following configuration items on the core-site.xml tab of the Hadoop-Common service so that you do not need to specify them each time. For specific operations on adding configuration items, see Manage configuration items.
fs.oss.accessKeyId: The AccessKey ID used to access OSS/OSS-HDFS.
fs.oss.accessKeySecret: The AccessKey secret used to access OSS/OSS-HDFS.
In the source HBase cluster, navigate to the directory where jindo-distcp-tool-*.jar is located.
cd /opt/apps/JINDOSDK/jindosdk-current/tools
Note
EMR cluster: In EMR-5.6.0 and later and EMR-3.40.0 and later, JindoDistCp is deployed by default; jindo-distcp-tool-*.jar is in the /opt/apps/JINDOSDK/jindosdk-current/tools directory.
Non-EMR cluster: Download JindoSDK (which includes the JindoDistCp tool) yourself. For details, see Download, install, and upgrade JindoSDK.
Run the following command to migrate the Snapshot to the target HBase cluster.
hadoop jar jindo-distcp-tool-*.jar --src ${src_temp_path}/${table_name} --dest ${target_temp_path}/${table_name} --disableChecksum --parallelism 10
Parameters:
${src_temp_path}: The intermediate path in the source cluster.
${target_temp_path}: The intermediate path in the target cluster.
${table_name}: The name of the table to be migrated. In this example, it is t1.
Example:
hadoop jar jindo-distcp-tool-4.6.11.jar --src hdfs:///tmp/hbase-snapshot/t1 --dest oss://hbase-test.cn-hangzhou.oss-dls.aliyuncs.com/hbase/recv/t1 --disableChecksum --parallelism 10
Import the Snapshot into the target HBase cluster.
Run the following command in the target HBase cluster to import the Snapshot from the target path to the HBase root directory.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ${snapshot_name} -copy-from ${target_temp_path}/${table_name} -copy-to ${target_rootdir_path}
Parameters:
${snapshot_name}: The custom Snapshot name. In this example, it is t1-snapshot.
${target_temp_path}: The intermediate path in the target cluster.
${target_rootdir_path}: The HBase root directory path of the target cluster. In the console, go to the HBase service of the target cluster and view the hbase.rootdir configuration item in the hbase-site.xml file to obtain the path.
Example:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1-snapshot -copy-from oss://hbase-test.cn-hangzhou.oss-dls.aliyuncs.com/hbase/recv/t1 -copy-to oss://hbase-target.cn-hangzhou.oss-dls.aliyuncs.com/hbase/c-5418ce2a4xxx
Check the migration result.
After the migration, run the following command to check whether the data in the target path is complete.
hdfs dfs -ls ${target_rootdir_path}
Step 4: Restore historical data using Snapshot
Log on to the master node of the target cluster. For more information, see Log on to a cluster.
Run the following command to enter HBase Shell.
hbase shell
Run the following commands to restore the Snapshot and enable the table in the target cluster.
restore_snapshot '${snapshot_name}'
enable '${table_name}'
Parameters:
${snapshot_name}: The custom Snapshot name.
${table_name}: The name of the table to be migrated.
Example:
restore_snapshot 't1-snapshot'
enable 't1'
Step 5: Enable incremental data synchronization
Run the following command in the HBase Shell of the source cluster to enable Peer synchronization.
enable_peer '${peer_name}'
Example:
enable_peer 'peer1'
Step 6: Verify data
Verify whether the data is complete after migration.
Small data volume: Use Scan for verification.
scan '${table_name}'
Example:
scan 't1'
Medium data volume: Use count for verification.
count '${table_name}'
Large data volume: Use get for sample verification.
get '${table_name}', '${rowkey}'
${rowkey} is the unique identifier of each row in the HBase table.
Step 7: Delete the Snapshot
After verification is complete, run the following command to delete all Snapshots of the migrated table to free up storage space.
delete_table_snapshots '${table_name}'
Example:
delete_table_snapshots 't1'
Step 8: Clean up the Peer
After incremental data synchronization is complete, you need to perform dual-write or cutover operations for the application to ensure that all read and write requests are switched to the target HBase cluster. To avoid duplicate data synchronization, you need to delete the replication relationship (Peer) between the source and target clusters after the switch.
Migrate the client.
Switch the upstream and downstream applications involving the HBase cluster to the target HBase cluster, including applications that read and write HBase through API or command line. To do this, perform the following operations:
Update connection configuration: Modify the application's configuration files or code to switch the HBase connection information (such as Zookeeper address, port, etc.) from the source cluster to the target cluster.
Verify functionality: Ensure that the application can normally read and write to the target cluster. Run necessary functional testing and data validation.
Dual-run or cutover: Based on business requirements, choose to dual-run (read and write to both source and target clusters simultaneously) or directly cut over to the target cluster.
Disable the automatic synchronization feature of the Peer.
Disable the automatic synchronization feature of the specified Peer in the HBase Shell of the source cluster. This stops data synchronization immediately.
disable_peer '${peer_name}'
Example:
disable_peer 'peer1'
Delete the Peer.
After disabling the Peer, delete it to completely disconnect the replication relationship between the source and target clusters.
remove_peer '${peer_name}'
Example:
remove_peer 'peer1'
Verify whether the Peer has been deleted.
Run the following command to list all current Peers and confirm that peer1 in this example has been deleted.
list_peers