Hello
In our HA setup, the active NameNode crashes about once a week. The cluster is mostly idle, with few jobs running and little user activity.
This is a very new cluster, installed just a couple of weeks ago, and the problem has been there since then.
Below are the logs from the JournalNodes. Can someone please help us with this?
2015-08-04 13:00:20,054 INFO server.Journal (Journal.java:updateLastPromisedEpoch(315)) – Updating lastPromisedEpoch from 9 to 10 for client /172.26.44.133
2015-08-04 13:00:20,175 INFO server.Journal (Journal.java:scanStorageForLatestEdits(188)) – Scanning storage FileJournalManager(root=/hadoop/hdfs/journal/HDPPROD)
2015-08-04 13:00:20,220 INFO server.Journal (Journal.java:scanStorageForLatestEdits(194)) – Latest log is EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false)
2015-08-04 13:00:20,891 INFO server.Journal (Journal.java:getSegmentInfo(687)) – getSegmentInfo(523903): EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false) -> startTxId: 523903 endTxId: 523925 isInProgress: true
2015-08-04 13:00:20,891 INFO server.Journal (Journal.java:prepareRecovery(731)) – Prepared recovery for segment 523903: segmentState { startTxId: 523903 endTxId: 523925 isInProgress: true } lastWriterEpoch: 9 lastCommittedTxId: 523924
2015-08-04 13:00:20,956 INFO server.Journal (Journal.java:getSegmentInfo(687)) – getSegmentInfo(523903): EditLogFile(file=/hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903,first=0000000000000523903,last=0000000000000523925,inProgress=true,hasCorruptHeader=false) -> startTxId: 523903 endTxId: 523925 isInProgress: true
2015-08-04 13:00:20,956 INFO server.Journal (Journal.java:acceptRecovery(817)) – Skipping download of log startTxId: 523903 endTxId: 523925 isInProgress: true: already have up-to-date logs
2015-08-04 13:00:20,989 INFO server.Journal (Journal.java:acceptRecovery(850)) – Accepted recovery for segment 523903: segmentState { startTxId: 523903 endTxId: 523925 isInProgress: true } acceptedInEpoch: 10
2015-08-04 13:00:21,791 INFO server.Journal (Journal.java:finalizeLogSegment(584)) – Validating log segment /hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903 about to be finalized
2015-08-04 13:00:21,805 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(133)) – Finalizing edits file /hadoop/hdfs/journal/HDPPROD/current/edits_inprogress_0000000000000523903 -> /hadoop/hdfs/journal/HDPPROD/current/edits_0000000000000523903-0000000000000523925
2015-08-04 13:00:22,257 INFO server.Journal (Journal.java:startLogSegment(532)) – Updating lastWriterEpoch from 9 to 10 for client /172.26.44.133
2015-08-04 13:00:23,699 INFO ipc.Server (Server.java:run(2060)) – IPC Server handler 4 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.journal from 172.26.44.135:43678 Call#304302 Retry#0
java.io.IOException: IPC’s epoch 9 is less than the last promised epoch 10
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:414)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:442)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:342)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:148)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2015-08-06 19:13:14,012 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) – I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server az-easthdpmnp02.metclouduseast.com failed to respond
Any help in solving this problem is greatly appreciated.
Thank you
Suresh.