Use HBase snapshots to back up your E-MapReduce (EMR) HBase cluster data and restore it to another cluster via Object Storage Service (OSS).
How it works
HBase snapshots capture a point-in-time view of a table without copying data, so the operation completes almost instantly and has minimal impact on cluster performance. The snapshot references the underlying HFiles, and as long as the snapshot exists, those files are preserved even if the original data is later deleted.
To move a snapshot between clusters, you export it to an OSS bucket as an intermediate store. The destination cluster then imports the snapshot from OSS and restores the data.
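The steps in the rest of this topic can be summarized as a four-command pipeline. The sketch below only prints the commands so you can review them before running each one on the appropriate cluster; the OSS URI keeps the same `$accessKeyId`, `$accessKeySecret`, and `$bucket` placeholders used later in this topic, which you must replace with your own values.

```shell
# Placeholders -- substitute your own table, snapshot name, and OSS details.
TABLE=test
SNAPSHOT=test_snapshot
OSS_DIR='oss://$accessKeyId:$accessKeySecret@$bucket.oss-cn-hangzhou-internal.aliyuncs.com/hbase/snapshot/test'

# 1. Source cluster: snapshot the table (metadata only, near-instant).
echo "hbase snapshot create -n $SNAPSHOT -t $TABLE"

# 2. Source cluster: copy the snapshot metadata and HFiles to OSS.
echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT -copy-to $OSS_DIR"

# 3. Destination cluster: pull the snapshot from OSS into local HDFS.
echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT -copy-from $OSS_DIR -copy-to /hbase/"

# 4. Destination cluster, in HBase Shell: materialize the table.
echo "restore_snapshot '$SNAPSHOT'"
```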
Prerequisites
Before you begin, ensure that you have:
Two Hadoop clusters with the HBase and ZooKeeper services installed. For setup instructions, see Create a cluster.
SSH access to the master node of each cluster. For instructions, see Connect to the master node of an EMR cluster in SSH mode.
An OSS bucket accessible from both clusters via the internal endpoint.
Back up and restore an HBase cluster
Step 1: Prepare test data
Log on to the master node of the source cluster using SSH.
Open HBase Shell.
hbase shell
Create a table.
create 'test','cf'
Add data to the table.
put 'test','a','cf:c1',1
put 'test','a','cf:c2',2
put 'test','b','cf:c1',3
put 'test','b','cf:c2',4
put 'test','c','cf:c1',5
put 'test','c','cf:c2',6
Exit HBase Shell.
exit
Step 2: Create a snapshot
Create a snapshot of the table.
hbase snapshot create -n test_snapshot -t test
Open HBase Shell to verify that the snapshot was created.
hbase shell
List snapshots.
list_snapshots
The output is similar to the following:
SNAPSHOT                     TABLE + CREATION TIME
 test_snapshot               test (Tue Aug 18 14:35:28 +0800 2020)
1 row(s) in 0.2450 seconds

=> ["test_snapshot"]
Exit HBase Shell.
exit
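A completed snapshot is stored as a small set of manifest files under the `.hbase-snapshot` directory of the HBase root directory. Assuming the default root of `/hbase` (verify `hbase.rootdir` in `hbase-site.xml` if your cluster differs), the sketch below assembles and prints the `hdfs dfs -ls` command you can run on the source cluster to inspect it.

```shell
# Assumption: hbase.rootdir is /hbase (the default); check hbase-site.xml if unsure.
HBASE_ROOT=/hbase
SNAPSHOT=test_snapshot
MANIFEST_DIR="$HBASE_ROOT/.hbase-snapshot/$SNAPSHOT"

# Run the printed command on the source cluster to list the snapshot manifest files.
echo "hdfs dfs -ls $MANIFEST_DIR"
```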
Step 3: Export the snapshot to OSS
Export the snapshot to your OSS bucket using the internal endpoint.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot test_snapshot -copy-to oss://$accessKeyId:$accessKeySecret@$bucket.oss-cn-hangzhou-internal.aliyuncs.com/hbase/snapshot/test
Step 4: Import the snapshot to the destination cluster
Log on to the master node of the destination cluster using SSH.
Import the snapshot from OSS to the local HDFS.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot test_snapshot -copy-from oss://$accessKeyId:$accessKeySecret@$bucket.oss-cn-hangzhou-internal.aliyuncs.com/hbase/snapshot/test -copy-to /hbase/
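ExportSnapshot runs as a MapReduce job, so the copy in either direction can be tuned with its standard `-mappers` (number of parallel copy tasks) and `-bandwidth` (throughput cap per task, in MB/s) options. The values in the sketch below are illustrative only, and the command is printed rather than executed.

```shell
# Illustrative tuning values -- adjust to your cluster size and network capacity.
MAPPERS=8        # number of parallel copy tasks
BANDWIDTH=50     # per-task throughput cap, in MB/s
OSS_DIR='oss://$accessKeyId:$accessKeySecret@$bucket.oss-cn-hangzhou-internal.aliyuncs.com/hbase/snapshot/test'

echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot test_snapshot -copy-from $OSS_DIR -copy-to /hbase/ -mappers $MAPPERS -bandwidth $BANDWIDTH"
```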
Step 5: Restore data from the snapshot
Open HBase Shell on the destination cluster.
hbase shell
Restore the table from the snapshot. If a table named test already exists on the destination cluster, run disable 'test' first, because restore_snapshot requires the target table to be disabled.
restore_snapshot 'test_snapshot'
Verify the restored data.
scan 'test'
The output is similar to the following:
ROW                COLUMN+CELL
 a                 column=cf:c1, timestamp=1472992081375, value=1
 a                 column=cf:c2, timestamp=1472992090434, value=2
 b                 column=cf:c1, timestamp=1472992104339, value=3
 b                 column=cf:c2, timestamp=1472992099611, value=4
 c                 column=cf:c1, timestamp=1472992112657, value=5
 c                 column=cf:c2, timestamp=1472992118964, value=6
3 row(s) in 0.0540 seconds
Step 6: Clone a new table from the snapshot
Use clone_snapshot to create a new, independently writable table from the snapshot. Because the new table initially references the same underlying HFiles, no data is copied up front.
Clone the snapshot into a new table.
clone_snapshot 'test_snapshot','test_2'
Verify the data in the new table.
scan 'test_2'
The output is similar to the following:
ROW                COLUMN+CELL
 a                 column=cf:c1, timestamp=1472992081375, value=1
 a                 column=cf:c2, timestamp=1472992090434, value=2
 b                 column=cf:c1, timestamp=1472992104339, value=3
 b                 column=cf:c2, timestamp=1472992099611, value=4
 c                 column=cf:c1, timestamp=1472992112657, value=5
 c                 column=cf:c2, timestamp=1472992118964, value=6
3 row(s) in 0.0540 seconds
When you no longer need the snapshot, delete it in HBase Shell by running delete_snapshot 'test_snapshot', which releases the HFiles that the snapshot pins.