If exactly one JournalNode in your cluster is abnormal, you can restore it by syncing data from a healthy JournalNode on another node. This procedure applies only when a single JournalNode is abnormal.
How it works
HDFS high availability relies on a quorum of JournalNodes to replicate edit logs between the active and standby NameNodes. Because edit logs are written to a majority of JournalNodes, a single healthy node retains a complete copy of the data needed to restore the abnormal node.
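The majority rule behind this can be illustrated with a little arithmetic. The sketch below assumes a three-node JournalNode deployment, which this page does not state, and the function names are illustrative only; they are not HDFS commands.

```shell
# Sketch: majority arithmetic for a JournalNode quorum.
# jn_majority and jn_tolerated_failures are hypothetical helper names.

# Smallest number of JournalNodes that forms a majority of $1 nodes.
jn_majority() {
  echo $(( $1 / 2 + 1 ))
}

# Largest number of JournalNodes that can be down while a majority remains.
jn_tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Assumed cluster size of 3; adjust to your own deployment.
echo "3 JournalNodes: majority=$(jn_majority 3), tolerated failures=$(jn_tolerated_failures 3)"
```

With three JournalNodes, a majority is two nodes, so only one node can be abnormal at a time, which matches the single-node scope of this procedure.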
Prerequisites
Before you begin, make sure you have:
- At least one healthy JournalNode in the cluster
- SSH access to both the healthy node and the abnormal node
- Access to the E-MapReduce (EMR) console to stop and start HDFS components
Applicability
| Condition | This procedure applies |
|---|---|
| Exactly one JournalNode is abnormal | Yes |
| Two or more JournalNodes are abnormal | No |
Restore the abnormal JournalNode
Step 1: Identify a healthy JournalNode
Check the status of all JournalNodes on the web user interface (UI) of HDFS. For more information, see Web UIs of HDFS components.
Confirm that at least one JournalNode shows a healthy status before proceeding.
Step 2: Package the restore data from the healthy node
Log on to the node where the healthy JournalNode resides. For steps, see Log on to a cluster. Select a header or master node when possible.
- Switch to the hdfs user.

      su hdfs

- Go to the JournalNode data directory.

      cd /mnt/disk1/hdfs/journal/emr-cluster/

- Package the restore files, excluding edit logs.

      tar --exclude='edits*' -zcvf /tmp/jn-current.tar.gz current

  The expected output is:

      current/
      current/last-writer-epoch
      current/VERSION
      current/last-promised-epoch
      current/paxos/
      current/committed-txid
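If you prefer to script the packaging step, the following is a minimal sketch that wraps the same tar invocation and then checks the archive listing. The helper name package_journal is hypothetical, and the check on current/VERSION is an assumption based on the expected output shown above.

```shell
# Sketch: package a JournalNode "current" directory without edit logs,
# then confirm the archive contents. package_journal is a hypothetical helper.
#
# $1: directory that contains "current"
# $2: path of the archive to create
package_journal() {
  local journal_dir="$1" archive="$2"
  # Same options as the manual command, minus -v to keep the output quiet.
  tar -C "$journal_dir" --exclude='edits*' -zcf "$archive" current || return 1
  # The archive is expected to contain the VERSION file ...
  tar -tzf "$archive" | grep -qx 'current/VERSION' || return 1
  # ... and no edit log segments.
  if tar -tzf "$archive" | grep -q '/edits'; then
    return 1
  fi
}

# Example usage on the healthy node, as the hdfs user:
#   package_journal /mnt/disk1/hdfs/journal/emr-cluster /tmp/jn-current.tar.gz
```

The function returns a non-zero status if packaging fails or if the listing does not look as expected, so it can be chained with `&&` before the copy step.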
Step 3: Copy the package to the abnormal node
Still on the healthy node, switch to the emr-user to run the copy.
- Switch to the emr-user.

      su emr-user

  Note: If emr-user does not exist (for example, in EMR V3.41.0, EMR V5.7.0, or earlier minor versions), switch to the hadoop user instead.

      su hadoop

- Copy the package to the abnormal node.

      scp /tmp/jn-current.tar.gz $unhealthy-journal-node:/tmp/

  Replace $unhealthy-journal-node with the hostname of the node where the abnormal JournalNode resides.
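As an optional extra that is not part of the original procedure, you can compare checksums to confirm that the package arrived intact. The sketch below assumes sha256sum is available on both nodes; the helper names and the hostname placeholder are illustrative.

```shell
# Sketch: verify a copied file by comparing SHA-256 digests.
# file_sha256 and matches_digest are hypothetical helper names.

# Print only the SHA-256 digest of the file at $1.
file_sha256() {
  sha256sum "$1" | awk '{print $1}'
}

# Succeed when the file at $1 has the digest given in $2.
matches_digest() {
  [ "$(file_sha256 "$1")" = "$2" ]
}

# Example usage on the healthy node (replace the placeholder hostname):
#   remote=$(ssh <abnormal-node-hostname> "sha256sum /tmp/jn-current.tar.gz" | awk '{print $1}')
#   matches_digest /tmp/jn-current.tar.gz "$remote" && echo "copy verified"
```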
Step 4: Stop the abnormal JournalNode and restore its data
- In the EMR console, stop the JournalNode on the HDFS node that is abnormal.
- Log on to the abnormal node. For steps, see Log on to a cluster.
- Switch to the hdfs user.

      su hdfs

- Go to the JournalNode data directory and back up the existing data.

  Important: Do not skip this backup. If the restore fails, you can recover from current.bak.

      cd /mnt/disk1/hdfs/journal/emr-cluster/
      mv current current.bak

- Extract the package to restore the JournalNode data.

      tar -xvf /tmp/jn-current.tar.gz
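The backup and extraction can also be combined into one guarded step. The following is a sketch only: restore_journal is a hypothetical helper, and the automatic rollback to current.bak on a failed extraction is an added safeguard, not something the procedure itself prescribes.

```shell
# Sketch: back up "current", extract the package, and roll back on failure.
# restore_journal is a hypothetical helper name.
#
# $1: JournalNode data directory that contains "current"
# $2: path of the package copied from the healthy node
restore_journal() {
  local journal_dir="$1" archive="$2"
  # Refuse to overwrite a backup left by an earlier attempt.
  if [ -e "$journal_dir/current.bak" ]; then
    echo "current.bak already exists in $journal_dir" >&2
    return 1
  fi
  mv "$journal_dir/current" "$journal_dir/current.bak" || return 1
  if ! tar -xzf "$archive" -C "$journal_dir"; then
    # Extraction failed: discard partial output and put the backup back.
    rm -rf "$journal_dir/current"
    mv "$journal_dir/current.bak" "$journal_dir/current"
    return 1
  fi
}

# Example usage on the abnormal node, as the hdfs user, with the JournalNode stopped:
#   restore_journal /mnt/disk1/hdfs/journal/emr-cluster /tmp/jn-current.tar.gz
```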
Step 5: Start the restored JournalNode
In the EMR console, start the JournalNode on the node that you restored.
After it starts, check the logs for errors. For more information, see HDFS service logs.
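One way to scan the log is to filter the most recent lines for ERROR and FATAL entries. This is a sketch under assumptions: recent_log_errors is a hypothetical helper, the log file location depends on your cluster and is not stated here, and the default of 500 lines is arbitrary.

```shell
# Sketch: show recent ERROR or FATAL lines from a JournalNode log file.
# recent_log_errors is a hypothetical helper name.
#
# $1: path to the JournalNode log file (location varies by cluster)
# $2: number of trailing lines to inspect (defaults to 500, an arbitrary choice)
recent_log_errors() {
  # -w matches ERROR and FATAL only as whole words.
  tail -n "${2:-500}" "$1" | grep -Ew 'ERROR|FATAL'
}

# Example usage (replace the placeholder with the real log path):
#   recent_log_errors <path-to-journalnode-log> || echo "no recent errors found"
```

The function exits with a non-zero status when no matching lines are found, which is why the usage example treats that case as the clean result.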
Step 6: Verify the restoration
Open the HDFS web UI and check the JournalNode status. If data can be written to the JournalNode, the restoration is successful. For more information, see Web UIs of HDFS components.
Once you confirm the JournalNode is healthy, remove the backup directory.
rm -rf /mnt/disk1/hdfs/journal/emr-cluster/current.bak
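If you script the cleanup, a small guard can reduce the risk of deleting the backup before the restored data is in place. The sketch below uses a hypothetical helper, remove_journal_backup; checking for current/VERSION is an assumed minimal sanity check and does not replace the verification in the web UI.

```shell
# Sketch: remove current.bak only when a restored current/VERSION exists.
# remove_journal_backup is a hypothetical helper name.
#
# $1: JournalNode data directory
remove_journal_backup() {
  local journal_dir="$1"
  # An empty argument would make the rm below point at the filesystem root.
  if [ -z "$journal_dir" ]; then
    echo "journal directory not given" >&2
    return 1
  fi
  if [ ! -f "$journal_dir/current/VERSION" ]; then
    echo "restored current/VERSION not found; keeping backup" >&2
    return 1
  fi
  rm -rf "$journal_dir/current.bak"
}

# Example usage:
#   remove_journal_backup /mnt/disk1/hdfs/journal/emr-cluster
```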