
E-MapReduce: Migrate data from an EMR HBase cluster online

Last Updated: Jun 09, 2025

This solution combines HBase Snapshot and HBase Replication technologies to migrate both historical and incremental data online while the source HBase cluster remains in service, ensuring no data loss during the migration process.

Solution introduction

A Snapshot-only approach supports only offline migration and cannot keep the source cluster in service: if the source cluster continues to run during the migration, the following problems may occur:

  • Incremental data loss: Write or update operations generated in the source cluster may not be synchronized to the target cluster during Snapshot export and restoration.

  • Data consistency problems: Snapshot alone cannot capture the real-time incremental data generated during the migration.

This solution solves these problems as follows:

  1. Establish replication relationship (Peer):

    • Use HBase Replication to establish a table-level replication relationship between the source and target clusters. Keep automatic synchronization disabled for now.

    • The source cluster records real-time write and update operations for tables but does not immediately synchronize them to the target cluster.

  2. Migrate historical data:

    • Use HBase Snapshot to export historical data from the source cluster.

    • If the storage systems (such as HDFS or OSS-HDFS) of the source and target clusters are interconnected, export the Snapshot to the target cluster. Otherwise, export it to an intermediate path in the source cluster first, then synchronize it to the target cluster.

  3. Synchronize incremental data: After historical data restoration is complete, enable the automatic synchronization feature of the Replication Peer. It will replay incremental data generated during migration, ensuring data consistency between source and target.

Through these steps, this solution completely migrates both historical and incremental data while ensuring continuous operation of the source cluster, avoiding data loss.

Precautions

  • HBase version compatibility:

    • HBase 1.x: Set hbase.replication=true and restart both the primary and backup clusters to enable the Replication feature.

    • HBase 2.x: The hbase.replication parameter has been removed, and replication is enabled by default. No additional settings are required.

  • Replication configuration: HBase configures Replication at the column family level. If you need to migrate data for specific column families, ensure that the replication attribute (REPLICATION_SCOPE) is correctly configured for them, as shown in the example after this list.

  • Network connectivity:

    • Ensure network connectivity between the source HBase cluster and the target HBase cluster through CEN, a leased line, or a VPN.

    • Establish network connectivity between the source and target clusters in advance, and make sure that the following ports of the target cluster are open to the source cluster:

      • ZooKeeper service port: port 2181 on the ECS instances that run ZooKeeper in the target cluster.

      • HBase service ports: ports 16010, 16020, and 16030 on the ECS instances that run HBase in the target EMR cluster.
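
For example, the following HBase Shell command enables replication for a column family. This is a minimal sketch: the table t1 is the example table used throughout this topic, and the column family name cf1 is a hypothetical placeholder.

alter 't1', { NAME => 'cf1', REPLICATION_SCOPE => '1' }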

Procedure

Step 1: Create a Peer (replication relationship)

Establish a table-level replication relationship between the source and target clusters, and temporarily disable automatic synchronization until the historical Snapshot data synchronization is complete.

  1. Log on to the master node of the source cluster. For more information, see Log on to a cluster.

  2. Run the following command to enter HBase Shell.

    hbase shell
  3. Add a Peer (replication relationship).

    Run the following command in HBase Shell to add a Peer to the target cluster. Specify the tables to be migrated.

    add_peer '${peer_name}', CLUSTER_KEY => "${slave_zk}:${port}:/hbase", TABLE_CFS => { "${table_name}" => [] }

    Parameters:

    • ${peer_name}: The name of the replication relationship. You can customize it. In this example, it is peer1.

    • ${slave_zk}: The ZooKeeper addresses of the target cluster, typically the internal IP addresses or hostnames of the ZooKeeper nodes. Its format is {slave_zk1},{slave_zk2},{slave_zk3}; the port is specified separately by ${port}.

    • ${port}: The port of the target cluster ZooKeeper. The default port is 2181.

    • ${table_name}: The name of the table to be migrated. In this example, it is t1.
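
    Example, assuming hypothetical internal hostnames zk1, zk2, and zk3 for the target cluster's ZooKeeper nodes:

    add_peer 'peer1', CLUSTER_KEY => "zk1,zk2,zk3:2181:/hbase", TABLE_CFS => { "t1" => [] }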

  4. Enable table-level replication.

    Enable table-level replication to ensure that the specified tables can synchronize written data to the target cluster.

    enable_table_replication 't1'
  5. Temporarily disable automatic synchronization.

    This command pauses data replication for the specified peer. After the peer is disabled, the source cluster no longer sends new data updates to the target cluster. Existing data is not deleted or affected.

    disable_peer 'peer1'
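
    You can then run the following command to confirm that the peer exists and is in the disabled state.

    list_peers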

Step 2: Create a Snapshot

Run the following command in the HBase Shell of the source cluster to create a Snapshot. The Snapshot captures the historical data of the table to be migrated.

snapshot '${table_name}', '${snapshot_name}'
  • Parameters:

    • ${table_name}: The name of the table to be migrated. In this example, it is t1.

    • ${snapshot_name}: The custom Snapshot name. In this example, it is t1-snapshot.

  • Example:

    snapshot 't1', 't1-snapshot'

Step 3: Export the Snapshot to the target cluster

Scenario 1: The storage systems of the source and target clusters are interconnected

If the storage systems of the source and target clusters are interconnected, run the following command in the source cluster. It will export the Snapshot directly to the target cluster.

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ${snapshot_name} -copy-to ${target_rootdir_path}
  • Parameters:

    • ${snapshot_name}: The custom Snapshot name. In this example, it is t1-snapshot.

    • ${target_rootdir_path}: The HBase root directory path of the target cluster. Replace it based on your actual environment.

      • OSS-HDFS: In the console, open the HBase service of the target cluster and view the hbase.rootdir configuration item on the hbase-site.xml tab to obtain the path.

      • HDFS: In the console, open the Hadoop-Common service of the target cluster and view the fs.defaultFS configuration item on the core-site.xml tab to obtain the path.

  • Example:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1-snapshot -copy-to oss://xxx.cn-hangzhou.oss-dls.aliyuncs.com/hbase/c-9d34bc8fxxx

Scenario 2: The storage systems of the source and target clusters are not interconnected

If the source cluster cannot directly access the storage path of the target cluster, you need to first export the Snapshot to an intermediate path in the source cluster (such as HDFS or OSS). Then synchronize it to the target cluster. This example demonstrates migrating data from HDFS to OSS-HDFS.

  1. Export the Snapshot to an intermediate path.

    Run the following command in the source HBase cluster to export the Snapshot to an intermediate path in the source cluster.

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ${snapshot_name} -copy-to ${src_temp_path}/${table_name}
    • Parameters:

      • ${snapshot_name}: The custom Snapshot name. In this example, it is t1-snapshot.

      • ${src_temp_path}: The intermediate path in the source cluster. For example, if the source cluster uses HDFS, you can choose an HDFS path as the intermediate path.

      • ${table_name}: The name of the table to be migrated. In this example, it is t1.

    • Example:

      hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1-snapshot -copy-to hdfs:///tmp/hbase-snapshot/t1
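
      Optionally, verify the export by listing the intermediate path:

      hdfs dfs -ls hdfs:///tmp/hbase-snapshot/t1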
  2. Migrate data to the target path.

    Run the following command to use the JindoDistCp tool to migrate data from the intermediate path in the source cluster to the path in the target cluster. For more information about JindoDistCp, see JindoDistCp user guide.

    1. Log on to the E-MapReduce (EMR) console and specify the AccessKey pair that is used to access OSS or OSS-HDFS.

      Add the following configuration items on the core-site.xml tab of the Hadoop-Common service so that you do not need to specify them each time. For information about how to add configuration items, see Manage configuration items.

      • fs.oss.accessKeyId: The AccessKey ID used to access OSS or OSS-HDFS.

      • fs.oss.accessKeySecret: The AccessKey secret used to access OSS or OSS-HDFS.

    2. In the source HBase cluster, navigate to the directory where jindo-distcp-tool-*.jar is located.

      cd /opt/apps/JINDOSDK/jindosdk-current/tools
      Note
      • EMR cluster: JindoDistCp is deployed in clusters of EMR-5.6.0 or later and EMR-3.40.0 or later. It is located in the /opt/apps/JINDOSDK/jindosdk-current/tools directory as jindo-distcp-tool-*.jar.

      • Non-EMR cluster: You can download JindoSDK (which includes the JindoDistCp tool) yourself. For details, see Download, install, and upgrade JindoSDK.

    3. Run the following command to migrate the Snapshot to the target HBase cluster.

      hadoop jar jindo-distcp-tool-*.jar --src ${src_temp_path}/${table_name} --dest ${target_temp_path}/${table_name} --disableChecksum --parallelism 10 
      • Parameters:

        • ${src_temp_path}: The intermediate path in the source cluster.

        • ${target_temp_path}: The intermediate path in the target cluster. When the target is OSS or OSS-HDFS, this path includes the target bucket name.

      • Example:

        hadoop jar jindo-distcp-tool-4.6.11.jar --src hdfs:///tmp/hbase-snapshot/t1 --dest oss://hbase-test.cn-hangzhou.oss-dls.aliyuncs.com/hbase/recv/t1 --disableChecksum --parallelism 10
    4. Import the Snapshot into the target HBase cluster.

      Run the following command in the target HBase cluster to import the Snapshot from the target path to the HBase root directory.

      hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot ${snapshot_name} -copy-from ${target_temp_path} -copy-to ${target_rootdir_path}
      • Parameters:

        • ${target_temp_path}: The intermediate path in the target cluster.

        • ${target_rootdir_path}: The root directory path of the target HBase cluster.

          In the console, open the HBase service of the target cluster and view the hbase.rootdir configuration item on the hbase-site.xml tab to obtain the path.

      • Example:

        hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot t1-snapshot -copy-from oss://hbase-test.cn-hangzhou.oss-dls.aliyuncs.com/hbase/recv/t1 -copy-to oss://hbase-target.cn-hangzhou.oss-dls.aliyuncs.com/hbase/c-5418ce2a4xxx
  3. Check the migration result.

    After the migration, run the following command to check whether the data in the target path is complete.

    hdfs dfs -ls ${target_rootdir_path}
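
    Example, using the target HBase root directory from the preceding step:

    hdfs dfs -ls oss://hbase-target.cn-hangzhou.oss-dls.aliyuncs.com/hbase/c-5418ce2a4xxx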

Step 4: Restore historical data using Snapshot

  1. Log on to the master node of the target cluster. For more information, see Log on to a cluster.

  2. Run the following command to enter HBase Shell.

    hbase shell
  3. Run the following commands to restore the Snapshot and enable the table in the target cluster.

    restore_snapshot '${snapshot_name}'
    enable '${table_name}'
    • Parameters:

      • ${snapshot_name}: The custom Snapshot name.

      • ${table_name}: The name of the table to be migrated.

    • Examples:

      restore_snapshot 't1-snapshot'
      enable 't1'

Step 5: Enable incremental data synchronization

Run the following command in the HBase Shell of the source cluster to enable Peer synchronization.

enable_peer '${peer_name}'

Example:

enable_peer 'peer1'
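
You can then run the following command in the same HBase Shell to confirm that the peer is in the enabled state.

list_peers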

Step 6: Verify data

Verify whether the data is complete after migration.

  • Small data volume: Use Scan for verification.

    scan '${table_name}'

    Example:

    scan 't1'
  • Medium data volume: Use count for verification.

    count '${table_name}'
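
    Example:

    count 't1'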
  • Large data volume: Use get for sample verification.

    get '${table_name}', '${rowkey}'

    ${rowkey} is the row key, which uniquely identifies a row in the HBase table.
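
    Example, assuming a hypothetical row key row1:

    get 't1', 'row1'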

Step 7: Delete the Snapshot

Run the following command to delete all snapshots related to the target table after the verification to free up storage space.

delete_table_snapshots '${table_name}'

Example:

delete_table_snapshots 't1'
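
To confirm the deletion, you can list the remaining Snapshots:

list_snapshots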

Step 8: Clean up the Peer

After incremental data synchronization is complete, you need to perform dual-write or cutover operations for the application to ensure that all read and write requests are switched to the target HBase cluster. To avoid duplicate data synchronization, you need to delete the replication relationship (Peer) between the source and target clusters after the switch.

  1. Migrate the client.

    Switch the upstream and downstream applications involving the HBase cluster to the target HBase cluster, including applications that read and write HBase through API or command line. To do this, perform the following operations:

    • Update connection configuration: Modify the application's configuration files or code to switch the HBase connection information (such as the ZooKeeper address and port) from the source cluster to the target cluster.

    • Verify functionality: Ensure that the application can normally read and write to the target cluster. Run necessary functional testing and data validation.

    • Dual-run or cutover: Based on business requirements, choose to dual-run (read and write to both source and target clusters simultaneously) or directly cut over to the target cluster.

  2. Disable the automatic synchronization feature of the Peer.

    Disable automatic synchronization for the specified Peer in the HBase Shell of the source cluster. This stops data synchronization immediately.

    disable_peer '${peer_name}'

    Example:

    disable_peer 'peer1'
  3. Delete the Peer.

    After disabling the Peer, delete the Peer to disconnect the replication relationship completely between the source and target clusters.

    remove_peer '${peer_name}'

    Example:

    remove_peer 'peer1'
  4. Verify whether the Peer has been successfully deleted.

    Run the following command to list all current Peers and confirm that peer1 in this example has been deleted.

    list_peers

FAQ

Why does ExportSnapshot report an error?

When you use the HBase ExportSnapshot tool to migrate Snapshots, you must set the -copy-from and -copy-to paths correctly. If the paths are misconfigured, path parsing may fail, interrupting the transfer or causing the task to fail.

Possible causes

  • Path format errors:

    • Missing protocol header (such as hdfs:// or oss://).

    • Path does not include complete address information (such as NameNode address or Bucket name).

  • Storage service differences: Different storage services (such as HDFS, OSS-HDFS, OSS) have different requirements for path formats.

  • Permission issues: Insufficient access permissions for source or target paths, resulting in inability to read or write data.

Common scenarios and solutions

  • OSS-HDFS → OSS-HDFS: Migrate Snapshots between Alibaba Cloud OSS-HDFS services.

    • Source path: oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path>

    • Target path: Same format as the source path.

    Note: Ensure that the path includes the complete protocol header and bucket name.

  • HDFS → HDFS: Migrate Snapshots between HDFS clusters.

    • Source path: hdfs://<namenode-host>:<port>/<path>

    • Target path: Same format as the source path.

  • HDFS → OSS-HDFS: Migrate Snapshots from HDFS to Alibaba Cloud OSS-HDFS.

    • Source path: hdfs://<namenode-host>:<port>/<path>

    • Target path: oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path>

  • OSS → OSS: Migrate Snapshots between standard Alibaba Cloud OSS buckets.

    • Source path: oss://<bucket>.oss-<region>.aliyuncs.com/<path>

    • Target path: Same format as the source path.