This topic describes the cause of and solution to NameNode startup failures that result from an edits log error on a JournalNode.
Error message
The following error appears in the NameNode logs:
FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, xxx:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
For information about how to access NameNode logs, see HDFS service logs.
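To confirm that a NameNode log contains this error, and to see which JournalNode addresses and timeout it reports, the log text can be searched with a short script. The following Python code is a minimal sketch that assumes the message has the same shape as the example above. The function name parse_quorum_timeout and the addresses in the sample are illustrative and are not part of HDFS.

```python
import re

# Patterns based on the error message shown above.
QJM_PATTERN = re.compile(r"QJM to \[([^\]]*)\]")
TIMEOUT_PATTERN = re.compile(r"Timed out waiting (\d+)ms for a quorum")


def parse_quorum_timeout(log_text):
    """Return (journalnode_addresses, timeout_ms) found in log_text.

    Either element is None if the corresponding pattern is absent.
    """
    qjm = QJM_PATTERN.search(log_text)
    timeout = TIMEOUT_PATTERN.search(log_text)
    addresses = [a.strip() for a in qjm.group(1).split(",")] if qjm else None
    timeout_ms = int(timeout.group(1)) if timeout else None
    return addresses, timeout_ms


if __name__ == "__main__":
    # Hypothetical sample; in practice, pass the contents of the NameNode log.
    sample = (
        "Error: recoverUnfinalizedSegments failed for required journal "
        "(JournalAndStream(mgr=QJM to [jn-1:8485, jn-2:8485, jn-3:8485], stream=null)) "
        "java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond."
    )
    print(parse_quorum_timeout(sample))
```

The returned addresses identify the JournalNodes to inspect in the steps that follow.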
Cause
HDFS uses the Quorum Journal Manager (QJM) to synchronize the edits log across multiple JournalNodes. When HDFS starts, the NameNode must recover unfinalized edit log segments, which requires responses from a quorum (a majority) of JournalNodes. If a JournalNode is unavailable or has a corrupted edits log and the quorum does not respond within the timeout (120000 ms in the error above), recoverUnfinalizedSegments fails and the NameNode does not start.
Solution
Log in to the JournalNode and check its startup status and logs. For information about how to locate JournalNode logs, see Deployment topology of HDFS and HDFS service logs.
If the JournalNode logs show no error
The JournalNode process is likely running normally, but the NameNode cannot reach it. Check the security group settings of the JournalNode, and test network connectivity from the NameNode to the JournalNode on port 8485.
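The connectivity test can be scripted and run on the NameNode host. The following Python code is a minimal sketch that only checks whether a TCP connection to port 8485 can be opened. The function name can_connect and the hostnames are placeholders, so replace them with the JournalNode addresses of your cluster.

```python
import socket


def can_connect(host, port=8485, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical hostnames; replace with your JournalNode addresses.
    for host in ["journalnode-1", "journalnode-2", "journalnode-3"]:
        status = "reachable" if can_connect(host) else "NOT reachable"
        print(f"{host}:8485 {status}")
```

A successful connection shows only that the port is open from the NameNode host. It does not confirm that the JournalNode is healthy. A failed connection points to a security group rule, a firewall, or a JournalNode process that is not listening.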
If the JournalNode logs show an edits log error
The following error in the JournalNode logs indicates a corrupted edits_inprogress file:
org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /current/edits_inprogress_0000000000000191004 while determining its valid length. Position was 1036288
java.io.IOException: Can't scan a pre-transactional edit log.
Switch to a healthy JournalNode and fix the abnormal JournalNode. For details, see Fix the abnormal JournalNode component on a node.
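To identify which edits_inprogress file a JournalNode reports as corrupted, the JournalNode log can be searched for the message shown above. The following Python code is a minimal sketch that assumes the log line has the same wording as the example. The function name find_corrupt_segments is illustrative and is not part of HDFS.

```python
import re

# Pattern based on the JournalNode error message shown above.
CORRUPT_SEGMENT = re.compile(
    r"from (\S*edits_inprogress_\d+) while determining its valid length\.\s*"
    r"Position was (\d+)"
)


def find_corrupt_segments(log_text):
    """Return a list of (segment_path, position) tuples reported in log_text."""
    return [(path, int(pos)) for path, pos in CORRUPT_SEGMENT.findall(log_text)]


if __name__ == "__main__":
    # Sample line copied from the error above; in practice, pass the JournalNode log contents.
    sample = (
        "Caught exception after scanning through 0 ops from "
        "/current/edits_inprogress_0000000000000191004 while determining "
        "its valid length. Position was 1036288"
    )
    for path, position in find_corrupt_segments(sample):
        print(f"{path} (position {position})")
```

The script only reports the file names found in the log. It does not modify or repair any edits file.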