Restore an abnormal JournalNode from a healthy node

If exactly one JournalNode in your cluster is abnormal, you can restore it by syncing data from a healthy JournalNode on another node. This procedure does not apply when two or more JournalNodes are abnormal.

How it works

HDFS high availability relies on a quorum of JournalNodes to replicate edit logs between the active and standby NameNodes. Because each edit is written to a majority of JournalNodes, the cluster keeps working while a single JournalNode is abnormal, and a healthy node holds the storage metadata (the files in its current directory other than the edit logs) needed to restore the abnormal node.
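The quorum arithmetic behind this can be sketched in a few lines of shell. With N JournalNodes, a write must reach floor(N / 2) + 1 of them, so floor((N - 1) / 2) nodes can be unavailable at once. The function names below are made up for this illustration, and the three-node count is only an assumption about a typical deployment.

```shell
# Majority (quorum) size for n JournalNodes: floor(n / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of JournalNodes that can be down while a quorum is still reachable.
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Example for an assumed three-node deployment.
# prints: 3 JournalNodes: quorum=2, tolerated failures=1
echo "3 JournalNodes: quorum=$(quorum_size 3), tolerated failures=$(tolerated_failures 3)"
```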

Prerequisites

Before you begin, make sure you have:

At least one healthy JournalNode in the cluster
SSH access to both the healthy node and the abnormal node
Access to the E-MapReduce (EMR) console to stop and start HDFS components

Applicability

Condition	This procedure applies
Exactly one JournalNode is abnormal	Yes
Two or more JournalNodes are abnormal	No

Restore the abnormal JournalNode

Step 1: Identify a healthy JournalNode

Check the status of all JournalNodes on the web user interface (UI) of HDFS. For more information, see Web UIs of HDFS components.

Confirm that at least one JournalNode shows a healthy status before proceeding.
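To know which hosts to look for on the web UI, you can list the JournalNode hosts from the HDFS configuration. The following is a minimal sketch, assuming the cluster uses the standard dfs.namenode.shared.edits.dir property with a qjournal:// URI. The function name journal_hosts and the sample host names are hypothetical.

```shell
# Print one JournalNode host per line from a qjournal:// URI such as
#   qjournal://host1:8485;host2:8485;host3:8485/emr-cluster
journal_hosts() {
  printf '%s\n' "$1" \
    | sed -e 's|^qjournal://||' -e 's|/.*$||' \
    | tr ';' '\n' \
    | sed -e 's|:[0-9]*$||'
}

# Usage on a cluster node (requires the hdfs client on PATH):
#   journal_hosts "$(hdfs getconf -confKey dfs.namenode.shared.edits.dir)"
```

The function only parses the URI. It does not contact the JournalNodes, so you still need to check the status of each host on the web UI.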

Step 2: Package the restore data from the healthy node

Log on to the node where the healthy JournalNode resides. For steps, see Log on to a cluster. Select a header or master node when possible.

Switch to the hdfs user.
```
su hdfs
```
Go to the JournalNode data directory.
```
cd /mnt/disk1/hdfs/journal/emr-cluster/
```

Package the restore files, excluding edit logs.
```
tar --exclude='edits*' -zcvf /tmp/jn-current.tar.gz current
```

The expected output is:
```
current/
current/last-writer-epoch
current/VERSION
current/last-promised-epoch
current/paxos/
current/committed-txid
```
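Before you copy the package, you may want to confirm that it contains what the listing above shows. The following sketch checks that current/VERSION is present and that no edit log files were included. The function name check_jn_archive is hypothetical, and the check assumes that edit log files are the ones whose names start with edits, as in the tar exclude pattern.

```shell
# Succeeds only if the archive contains current/VERSION and no edit log files.
check_jn_archive() {
  listing=$(tar -tzf "$1") || return 1
  # Files named edits* should have been excluded when the archive was created.
  if printf '%s\n' "$listing" | grep -q '/edits'; then
    echo "unexpected edit log files in $1" >&2
    return 1
  fi
  printf '%s\n' "$listing" | grep -qx 'current/VERSION'
}

# Usage:
#   check_jn_archive /tmp/jn-current.tar.gz && echo "archive looks complete"
```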

Step 3: Copy the package to the abnormal node

Still on the healthy node, switch to the emr-user to run the copy.
```
su emr-user
```

Note
If emr-user does not exist (for example, in EMR V3.41.0, EMR V5.7.0, or earlier minor versions), switch to the hadoop user instead.
```
su hadoop
```
Copy the package to the abnormal node.
```
scp /tmp/jn-current.tar.gz $unhealthy-journal-node:/tmp/
```
Replace $unhealthy-journal-node with the hostname of the node where the abnormal JournalNode resides. It is a placeholder, not a shell variable, so substitute it before you run the command.
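To confirm that the package arrived intact, you can compare checksums on both nodes. This is an optional sketch, assuming sha256sum from GNU coreutils is available on both nodes. The function name file_sha256 is hypothetical.

```shell
# Print only the SHA-256 digest of a file.
file_sha256() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Usage: run on the healthy node, then on the abnormal node, and compare
# the two digests by eye or in a script:
#   file_sha256 /tmp/jn-current.tar.gz
```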

Step 4: Stop the abnormal JournalNode and restore its data

In the EMR console, stop the HDFS JournalNode component on the abnormal node.
Log on to the abnormal node. For steps, see Log on to a cluster.
Switch to the hdfs user.
```
su hdfs
```
Go to the JournalNode data directory and back up the existing data.

Important
Do not skip this backup. If the restore fails, you can recover from current.bak.
```
cd /mnt/disk1/hdfs/journal/emr-cluster/
mv current current.bak
```
Extract the package to restore the JournalNode data.
```
tar -xvf /tmp/jn-current.tar.gz
```
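The backup and extract commands above can also be wrapped in a small function that puts the original data back if extraction fails. This is a sketch rather than part of the official procedure. The function name restore_jn_current is hypothetical, and it assumes that you run it as the hdfs user after the JournalNode is stopped, and that you pass the archive as an absolute path.

```shell
# Back up current, extract the archive, and roll back on failure.
# The body runs in a subshell so the caller's working directory is unchanged.
restore_jn_current() (
  jn_dir=$1
  archive=$2   # must be an absolute path because of the cd below
  cd "$jn_dir" || exit 1
  # Refuse to overwrite a backup left by an earlier attempt.
  if [ -e current.bak ]; then
    echo "current.bak already exists in $jn_dir" >&2
    exit 1
  fi
  mv current current.bak || exit 1
  # Keep the restored data only if extraction worked and VERSION is present.
  if tar -xzf "$archive" && [ -f current/VERSION ]; then
    echo "restored $jn_dir/current"
  else
    rm -rf current
    mv current.bak current
    echo "restore failed, original data put back" >&2
    exit 1
  fi
)

# Usage:
#   restore_jn_current /mnt/disk1/hdfs/journal/emr-cluster /tmp/jn-current.tar.gz
```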

Step 5: Start the restored JournalNode

In the EMR console, start the HDFS JournalNode component on the node that you restored.

After it starts, check the logs for errors. For more information, see HDFS service logs.
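A quick way to scan a log file for problems is to filter on the log level. The sketch below assumes that log lines carry the level as a separate word, for example " ERROR " or " FATAL ", as in common log4j layouts. The function name no_errors_in_log is hypothetical, and the log file path is not given here, so take it from HDFS service logs.

```shell
# Print ERROR and FATAL lines from a log file; succeed only if there are none.
no_errors_in_log() {
  # Fail with a distinct status if the file cannot be read.
  [ -r "$1" ] || return 2
  if grep -E ' (ERROR|FATAL) ' "$1"; then
    return 1
  fi
  return 0
}

# Usage (replace the argument with the actual JournalNode log file):
#   no_errors_in_log /path/to/journalnode.log && echo "no errors found"
```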

Step 6: Verify the restoration

Open the HDFS web UI and check the JournalNode status. If data can be written to the JournalNode, the restoration is successful. For more information, see Web UIs of HDFS components.

Once you confirm the JournalNode is healthy, remove the backup directory.
```
rm -rf /mnt/disk1/hdfs/journal/emr-cluster/current.bak
```