Hadoop HA cluster: NameNode dies unexpectedly on startup (troubleshooting and fix)
NameNode error log
2025-05-21 16:14:12,218 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:12,219 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:12,251 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:13,144 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7009 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:13,220 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:13,223 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:13,257 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:14,149 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8014 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:14,227 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:14,238 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:14,260 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:15,151 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9016 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:15,238 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:15,250 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:15,268 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,164 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10029 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:16,239 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,285 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,285 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,302 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.191.111:8485, 192.168.191.112:8485, 192.168.191.113:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.191.112:8485: Call From node01/192.168.191.111 to node02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.113:8485: Call From node01/192.168.191.111 to node03:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.111:8485: Call From node01/192.168.191.111 to node01:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:286)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:485)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1672)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1705)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:297)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:449)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
2025-05-21 16:14:16,367 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2025-05-21 16:14:16,369 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 10180 ms via java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:225)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1614)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:337)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:449)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 10180.0
Total suppressed write-lock held time: 0.0
2025-05-21 16:14:16,414 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:469)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
2025-05-21 16:14:16,421 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2025-05-21 16:14:16,466 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2025-05-21 16:14:17,717 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:17,718 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:17,722 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:18,720 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:18,723 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:18,724 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:19,753 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:19,754 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:19,754 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:20,755 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:20,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:20,773 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:21,756 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:21,814 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:21,901 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:22,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:22,958 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:22,973 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:23,789 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:23,990 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:24,001 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:24,798 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:25,030 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:25,049 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:25,814 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:26,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:26,086 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:26,862 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:27,103 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:27,148 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:27,191 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.191.111:8485, 192.168.191.112:8485, 192.168.191.113:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.191.111:8485: Call From node01/192.168.191.111 to node01:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.113:8485: Call From node01/192.168.191.111 to node03:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.112:8485: Call From node01/192.168.191.111 to node02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:286)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:204)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:443)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:616)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:613)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1602)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1223)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1887)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
        at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1746)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1723)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
2025-05-21 16:14:27,196 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.191.111:8485, 192.168.191.112:8485, 192.168.191.113:8485], stream=null))
2025-05-21 16:14:27,426 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node01/192.168.191.111
************************************************************/
Cause of the error
When we run start-dfs.sh, the default startup order is namenode => datanode => journalnode => zkfc. The NameNode therefore comes up before the JournalNodes; if they run on different machines, network latency easily leaves the NN unable to connect to the JNs, the election cannot complete, and the freshly started active NameNode suddenly dies, leaving only a standby. The NN does retry while waiting for the JNs to come up, but the retry count is capped, so on a poor network the retries can be exhausted before the JNs are ready and startup still fails.
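"Connection refused" in the logs above means nothing was listening on the JournalNode RPC port (8485) yet. A quick way to confirm on any node (this sketch assumes the `ss` utility is available; the cluster script later in this post does the same kind of check with `netstat`):

```shell
# Is anything listening on the JournalNode RPC port (8485)?
# If not, the NameNode's connection attempts will be refused.
ss -tln | grep 8485 || echo "no JournalNode listening on 8485"
```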
- A: Manually restart the active NameNode. By then the JournalNodes are already up, so the connection no longer races against their startup delay; once both NameNodes are connected to the JournalNodes and the election completes, the failure does not recur.
- B: Start the JournalNodes first, then run start-dfs.sh.
- C: Raise the NN's retry count (or timeout) toward the JNs to a larger value, so that normal startup delay and network latency are tolerated.
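Options A and B amount to running the daemon starts by hand. A rough sketch, using the node names from the logs and the Hadoop 3.x `hdfs --daemon` syntax that the cluster script below also uses (printed as a dry run via an `echo` wrapper; drop the `run` prefix to execute for real):

```shell
# Dry-run wrapper: print each command instead of executing it.
run() { echo "$@"; }

# Option B: bring the JournalNodes up on all three nodes first...
for node in node01 node02 node03; do
  run ssh "$node" "hdfs --daemon start journalnode"
done

# ...then run the normal HDFS startup.
run start-dfs.sh

# Option A: if the active NameNode has already died, restart just that
# daemon on its host (node01 in the logs above).
run ssh node01 "hdfs --daemon start namenode"
```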
Solution
- Add the following to hdfs-site.xml to raise the NN's retry count when probing the JNs. The default is 10 retries at 1000 ms each, which is too few on a slow network, so increase it; here we set 30. (ipc.client.connect.max.retries is a generic IPC client setting, so some setups place it in core-site.xml instead.)

<property>
    <name>ipc.client.connect.max.retries</name>
    <value>30</value>
</property>
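To see why 30 helps: with the RetryUpToMaximumCountWithFixedSleep policy shown in the logs, the total window the NN waits per JournalNode is roughly maxRetries × sleepTime. A back-of-envelope sketch:

```shell
# Total wait per JournalNode ≈ maxRetries × sleepTime (ms)
default_window_ms=$((10 * 1000))   # defaults visible in the log lines above
tuned_window_ms=$((30 * 1000))     # with ipc.client.connect.max.retries=30
echo "default: ${default_window_ms} ms, tuned: ${tuned_window_ms} ms"
```

Roughly 30 seconds is usually enough to cover JournalNode startup plus ordinary network delay; tune to taste.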
- Start the JournalNodes first, then start DFS.
Wrapping it up in a simple cluster start/stop script
- Order: start ZooKeeper first, then Hadoop, and only after that services such as Spark.
Cluster start/stop script
#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit 1
fi

# Start the Zookeeper cluster
start_zookeeper() {
    echo " =================== starting zookeeper cluster ==================="
    # Start zookeeper on node01, node02, node03
    for node in node01 node02 node03
    do
        echo " --------------- starting zookeeper on $node ---------------"
        ssh $node "/opt/yjx/zookeeper-3.4.5/bin/zkServer.sh start"
    done
    # Check the zookeeper status on each node
    echo " --------------- checking zookeeper status ---------------"
    for node in node01 node02 node03
    do
        echo "checking zookeeper status on $node"
        ssh $node "/opt/yjx/zookeeper-3.4.5/bin/zkServer.sh status"
    done
}

# Stop the Zookeeper cluster
stop_zookeeper() {
    echo " =================== stopping zookeeper cluster ==================="
    # Stop zookeeper on node01, node02, node03
    for node in node01 node02 node03
    do
        echo " --------------- checking whether zookeeper is running on $node ---------------"
        # Check whether anything is listening on port 2181
        process_check=$(ssh $node "netstat -tuln | grep 2181")
        if [ -z "$process_check" ]; then
            echo "zookeeper is not running on $node, skipping stop"
        else
            echo "zookeeper is running on $node, stopping it"
            ssh $node "/opt/yjx/zookeeper-3.4.5/bin/zkServer.sh stop"
        fi
    done
}

case $1 in
"start")
    # Start Zookeeper first, then Hadoop
    start_zookeeper
    echo " =================== starting hadoop cluster ==================="
    #ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/hadoop-daemon.sh start journalnode"
    #ssh node02 "/opt/yjx/hadoop-3.1.2/sbin/hadoop-daemon.sh start journalnode"
    #ssh node03 "/opt/yjx/hadoop-3.1.2/sbin/hadoop-daemon.sh start journalnode"
    echo " --------------- starting journalnode ---------------"
    ssh node01 "hdfs --daemon start journalnode"
    ssh node02 "hdfs --daemon start journalnode"
    ssh node03 "hdfs --daemon start journalnode"
    echo " --------------- starting hdfs ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/bin/mapred --daemon start historyserver"
;;
"stop")
    # Stop Hadoop first, then Zookeeper
    echo " =================== stopping hadoop cluster ==================="
    echo " --------------- stopping historyserver ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/stop-dfs.sh"
    # Stop the Zookeeper cluster
    stop_zookeeper
;;
*)
    echo "Input Args Error..."
;;
esac