Channel: Hortonworks » All Replies

Ambari agent can't register after machine restart (heartbeat lost)

Hi all!

I’m setting up a test cluster of two machines. The installation completed successfully: I installed all the master services (hdfs, yarn, hbase, zookeeper) on the machine running ambari server, and all the worker services and clients on the other (worker) machine.

After restarting both machines, I could start all the master services except HBase, but the ambari-agent can’t seem to register.

This is part of the ambari-server log just after restarting the ambari-agent:

23 Apr 2015 12:06:03,272  WARN [qtp-ambari-agent-242] SecurityFilter:103 - Request https://ambariserver2:8440/ca doesn't match any pattern.
23 Apr 2015 12:06:03,272  WARN [qtp-ambari-agent-242] SecurityFilter:62 - This request is not allowed on this port: https://ambariserver2:8440/ca
23 Apr 2015 12:06:06,281  INFO [qtp-ambari-agent-242] HeartBeatHandler:877 - agentOsType = centos6
23 Apr 2015 12:06:06,306  INFO [qtp-ambari-agent-242] HostImpl:277 - Received host registration, host=[hostname=datanode04,fqdn=datanode04.bi.internal,domain=bi.internal,architecture=x86_64,processorcount=2,physicalprocessorcount=2,osname=centos,osversion=6.6,osfamily=redhat,memory=3914964,uptime_hours=1,mounts=(available=41691044,mountpoint=/,used=2883644,percent=7%,size=46967160,device=/dev/mapper/vg_datanode04-lv_root,type=ext4)(available=1957480,mountpoint=/dev/shm,used=0,percent=0%,size=1957480,device=tmpfs,type=tmpfs)(available=436586,mountpoint=/boot,used=25466,percent=6%,size=487652,device=/dev/vda1,type=ext4)], registrationTime=1429783566280, agentVersion=2.0.0
23 Apr 2015 12:06:48,709  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:146 - Heartbeat lost from host datanode04
23 Apr 2015 12:06:48,710  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component METRICS_MONITOR on datanode04
23 Apr 2015 12:06:48,712  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component HBASE_REGIONSERVER on datanode04
23 Apr 2015 12:06:48,714  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component DATANODE on datanode04
23 Apr 2015 12:06:48,715  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component NODEMANAGER on datanode04
23 Apr 2015 12:07:06,945  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert datanode_webui for an invalid service HDFS and component DATANODE on host datanode04.bi.internal
23 Apr 2015 12:07:06,949  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert datanode_process for an invalid service HDFS and component DATANODE on host datanode04.bi.internal
23 Apr 2015 12:07:06,953  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert hbase_regionserver_process for an invalid service HBASE and component HBASE_REGIONSERVER on host datanode04.bi.internal
23 Apr 2015 12:07:06,956  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert yarn_nodemanager_webui for an invalid service YARN and component NODEMANAGER on host datanode04.bi.internal
23 Apr 2015 12:07:06,961  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert ams_metrics_monitor_process for an invalid service AMBARI_METRICS and component METRICS_MONITOR on host datanode04.bi.internal
23 Apr 2015 12:07:06,964  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert yarn_nodemanager_health for an invalid service YARN and component NODEMANAGER on host datanode04.bi.internal
23 Apr 2015 12:07:48,772  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:146 - Heartbeat lost from host datanode04
23 Apr 2015 12:07:48,773  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component METRICS_MONITOR on datanode04
23 Apr 2015 12:07:48,775  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component HBASE_REGIONSERVER on datanode04
23 Apr 2015 12:07:48,777  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component DATANODE on datanode04
23 Apr 2015 12:07:48,779  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component NODEMANAGER on datanode04
23 Apr 2015 12:08:07,636  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert datanode_webui for an invalid service HDFS and component DATANODE on host datanode04.bi.internal
23 Apr 2015 12:08:07,639  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert datanode_process for an invalid service HDFS and component DATANODE on host datanode04.bi.internal
23 Apr 2015 12:08:07,643  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert hbase_regionserver_process for an invalid service HBASE and component HBASE_REGIONSERVER on host datanode04.bi.internal
23 Apr 2015 12:08:07,645  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert datanode_storage for an invalid service HDFS and component DATANODE on host datanode04.bi.internal
23 Apr 2015 12:08:07,653  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert yarn_nodemanager_webui for an invalid service YARN and component NODEMANAGER on host datanode04.bi.internal
23 Apr 2015 12:08:07,659  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert ams_metrics_monitor_process for an invalid service AMBARI_METRICS and component METRICS_MONITOR on host datanode04.bi.internal
23 Apr 2015 12:08:07,661  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert yarn_nodemanager_health for an invalid service YARN and component NODEMANAGER on host datanode04.bi.internal
23 Apr 2015 12:08:48,840  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:146 - Heartbeat lost from host datanode04
23 Apr 2015 12:08:48,841  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component METRICS_MONITOR on datanode04
23 Apr 2015 12:08:48,843  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component HBASE_REGIONSERVER on datanode04
23 Apr 2015 12:08:48,845  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component DATANODE on datanode04
23 Apr 2015 12:08:48,847  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:161 - Setting component state to UNKNOWN for component NODEMANAGER on datanode04
23 Apr 2015 12:09:08,301  WARN [alert-event-bus-1] AlertReceivedListener:302 - Unable to process alert datanode_process for an invalid service HDFS and component DATANODE on host datanode04.bi.internal
23 Apr 2015 12:09:08,304  WARN [alert-event-bus-1] AlertReceivedListener:302 - Unable to process alert hbase_regionserver_process for an invalid service HBASE and component HBASE_REGIONSERVER on host datanode04.bi.internal
23 Apr 2015 12:09:08,308  WARN [alert-event-bus-1] AlertReceivedListener:302 - Unable to process alert ams_metrics_monitor_process for an invalid service AMBARI_METRICS and component METRICS_MONITOR on host datanode04.bi.internal
23 Apr 2015 12:09:08,312  WARN [alert-event-bus-1] AlertReceivedListener:302 - Unable to process alert datanode_webui for an invalid service HDFS and component DATANODE on host datanode04.bi.internal
23 Apr 2015 12:09:08,317  WARN [alert-event-bus-1] AlertReceivedListener:302 - Unable to process alert yarn_nodemanager_webui for an invalid service YARN and component NODEMANAGER on host datanode04.bi.internal
23 Apr 2015 12:09:08,319  WARN [alert-event-bus-1] AlertReceivedListener:302 - Unable to process alert yarn_nodemanager_health for an invalid service YARN and component NODEMANAGER on host datanode04.bi.internal
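
In the excerpt above, the "Heartbeat lost" lines name the host as datanode04, while the registration and alert lines use datanode04.bi.internal. Whether that difference is related to the problem is not established here, but it is easy to overlook by eye. A minimal sketch for tallying which host name appears in which kind of warning, assuming each log entry sits on a single line; the function name `hosts_by_kind` and the regular expressions are illustrative only, written against the lines shown in this post and not taken from Ambari itself:

```python
import re
from collections import Counter

# Patterns written against the WARN lines in this post (assumption: one entry per line).
LOST = re.compile(r"HeartbeatMonitor:\d+ - Heartbeat lost from host (\S+)")
ALERT = re.compile(r"AlertReceivedListener:\d+ - Unable to process alert \S+ .* on host (\S+)")


def hosts_by_kind(lines):
    """Count (kind, host name) pairs for the two warning types above."""
    counts = Counter()
    for line in lines:
        for kind, pattern in (("heartbeat_lost", LOST), ("alert_rejected", ALERT)):
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
    return counts


# Two lines copied from the server log excerpt.
sample = [
    "23 Apr 2015 12:06:48,709  WARN [ambari-hearbeat-monitor] HeartbeatMonitor:146 - Heartbeat lost from host datanode04",
    "23 Apr 2015 12:07:06,945  WARN [alert-event-bus-2] AlertReceivedListener:302 - Unable to process alert datanode_webui for an invalid service HDFS and component DATANODE on host datanode04.bi.internal",
]

for (kind, host), n in sorted(hosts_by_kind(sample).items()):
    print(kind, host, n)
```

To run it over a whole file, pass an open file object (for example the ambari-server log, wherever it lives on your install) instead of `sample`.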

And this is the log on the ambari-agent. It seems to send heartbeats and receive acknowledgements:

INFO 2015-04-23 12:18:55,305 HostCheckReportFileHandler.py:91 - Host check report at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:18:55,305 HostCheckReportFileHandler.py:141 - Removing old host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:18:55,306 HostCheckReportFileHandler.py:146 - Creating host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:18:55,311 Controller.py:125 - Registering with datanode04.bi.internal (10.0.10.30) (agent='{"hardwareProfile": {"kernel": "Linux", "domain": "bi.internal", "physicalprocessorcount": 2, "kernelrelease": "2.6.32-504.el6.x86_64", "uptime_days": "0", "memorytotal": 3914964, "swapfree": "3.87 GB", "memorysize": 3914964, "osfamily": "redhat", "swapsize": "3.87 GB", "processorcount": 2, "netmask": "255.255.255.0", "timezone": "CET", "hardwareisa": "x86_64", "memoryfree": 3682172, "operatingsystem": "centos", "kernelmajversion": "2.6", "kernelversion": "2.6.32", "macaddress": "02:01:2B:0C:B7:3E", "operatingsystemrelease": "6.6", "ipaddress": "10.0.10.30", "hostname": "datanode04", "uptime_hours": "1", "fqdn": "datanode04.bi.internal", "id": "root", "architecture": "x86_64", "selinux": true, "mounts": [{"available": "41690904", "used": "2883784", "percent": "7%", "device": "/dev/mapper/vg_datanode04-lv_root", "mountpoint": "/", "type": "ext4", "size": "46967160"}, {"available": "1957480", "used": "0", "percent": "0%", "device": "tmpfs", "mountpoint": "/dev/shm", "type": "tmpfs", "size": "1957480"}, {"available": "436586", "used": "25466", "percent": "6%", "device": "/dev/vda1", "mountpoint": "/boot", "type": "ext4", "size": "487652"}], "hardwaremodel": "x86_64", "uptime_seconds": "4480", "interfaces": "eth0,lo"}, "currentPingPort": 8670, "prefix": "/var/lib/ambari-agent/data", "agentVersion": "2.0.0", "agentEnv": {"transparentHugePage": "always", "hostHealth": {"agentTimeStampAtReporting": 1429784335306, "activeJavaProcs": [], "liveServices": [{"status": "Unhealthy", "name": "ntpd", "desc": "ntpd is stopped\\n"}]}, "iptablesIsRunning": true, "reverseLookup": true, "alternatives": [], "umask": "18", "stackFoldersAndFiles": [{"type": "directory", "name": "/etc/hadoop"}, {"type": "directory", "name": "/etc/hbase"}, {"type": "directory", "name": "/etc/zookeeper"}, {"type": "directory", "name": "/var/run/hadoop"}, {"type": "directory", "name": "/var/run/hbase"}, {"type": "directory", "name": "/var/run/zookeeper"}, {"type": "directory", "name": "/var/run/hadoop-yarn"}, {"type": "directory", "name": "/var/run/hadoop-mapreduce"}, {"type": "directory", "name": "/var/log/hadoop"}, {"type": "directory", "name": "/var/log/hbase"}, {"type": "directory", "name": "/var/log/zookeeper"}, {"type": "directory", "name": "/var/log/hadoop-yarn"}, {"type": "directory", "name": "/var/log/hadoop-mapreduce"}, {"type": "directory", "name": "/usr/lib/hadoop"}, {"type": "directory", "name": "/usr/lib/flume"}, {"type": "directory", "name": "/usr/lib/storm"}, {"type": "directory", "name": "/var/lib/hadoop-hdfs"}, {"type": "directory", "name": "/var/lib/hadoop-yarn"}, {"type": "directory", "name": "/var/lib/hadoop-mapreduce"}, {"type": "directory", "name": "/tmp/hadoop-hdfs"}, {"type": "directory", "name": "/hadoop/hbase"}, {"type": "directory", "name": "/hadoop/zookeeper"}, {"type": "directory", "name": "/hadoop/hdfs"}, {"type": "directory", "name": "/hadoop/yarn"}], "existingUsers": [{"status": "Available", "name": "mapred", "homeDir": "/home/mapred"}, {"status": "Available", "name": "hbase", "homeDir": "/home/hbase"}, {"status": "Available", "name": "ambari-qa", "homeDir": "/home/ambari-qa"}, {"status": "Available", "name": "zookeeper", "homeDir": "/home/zookeeper"}, {"status": "Available", "name": "hdfs", "homeDir": "/home/hdfs"}, {"status": "Available", "name": "yarn", "homeDir": "/home/yarn"}]}, "timestamp": 1429784335182, "hostname": "datanode04.bi.internal", "responseId": -1, "publicHostname": "datanode04.bi.internal"}')
INFO 2015-04-23 12:18:55,312 NetUtil.py:60 - Connecting to https://ambariserver2:8440/connection_info
INFO 2015-04-23 12:18:55,704 security.py:93 - SSL Connect being called.. connecting to the server
INFO 2015-04-23 12:18:55,939 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
INFO 2015-04-23 12:18:55,974 Controller.py:149 - Registration Successful (response id = 0)
INFO 2015-04-23 12:18:55,974 Controller.py:153 - Got status commands on registration.
WARNING 2015-04-23 12:18:55,974 AlertSchedulerHandler.py:92 - There are no alert definition commands in the heartbeat; unable to update definitions
INFO 2015-04-23 12:18:55,974 Controller.py:350 - Registration response from ambariserver2 was OK
INFO 2015-04-23 12:18:55,975 Controller.py:355 - Resetting ActionQueue...
INFO 2015-04-23 12:19:05,986 Heartbeat.py:75 - Building Heartbeat: {responseId = 0, timestamp = 1429784345986, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:19:06,109 HostCheckReportFileHandler.py:91 - Host check report at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:19:06,110 HostCheckReportFileHandler.py:141 - Removing old host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:19:06,110 HostCheckReportFileHandler.py:146 - Creating host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:19:06,392 Controller.py:239 - Heartbeat response received (id = 1)
INFO 2015-04-23 12:19:06,393 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:19:16,394 Heartbeat.py:75 - Building Heartbeat: {responseId = 1, timestamp = 1429784356394, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:19:16,442 Controller.py:239 - Heartbeat response received (id = 2)
INFO 2015-04-23 12:19:16,443 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:19:26,443 Heartbeat.py:75 - Building Heartbeat: {responseId = 2, timestamp = 1429784366443, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:19:26,490 Controller.py:239 - Heartbeat response received (id = 3)
INFO 2015-04-23 12:19:26,490 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:19:36,491 Heartbeat.py:75 - Building Heartbeat: {responseId = 3, timestamp = 1429784376491, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:19:36,543 Controller.py:239 - Heartbeat response received (id = 4)
INFO 2015-04-23 12:19:36,543 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:19:46,544 Heartbeat.py:75 - Building Heartbeat: {responseId = 4, timestamp = 1429784386544, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:19:46,598 Controller.py:239 - Heartbeat response received (id = 5)
INFO 2015-04-23 12:19:46,599 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:19:52,816 scheduler.py:509 - Running job "14372191-75df-4c7a-8825-9c849a8f57e8 (trigger: interval[0:01:00], next run at: 2015-04-23 12:19:52.813771)" (scheduled at 2015-04-23 12:19:52.813771)
INFO 2015-04-23 12:19:52,821 scheduler.py:509 - Running job "03028a48-e699-43f5-b6c4-ff2ea805da03 (trigger: interval[0:01:00], next run at: 2015-04-23 12:19:52.815123)" (scheduled at 2015-04-23 12:19:52.815123)
INFO 2015-04-23 12:19:52,823 scheduler.py:527 - Job "14372191-75df-4c7a-8825-9c849a8f57e8 (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.813771)" executed successfully
INFO 2015-04-23 12:19:52,823 scheduler.py:509 - Running job "4c96efd5-3c3c-404b-bf7a-1b3ffaef276c (trigger: interval[0:01:00], next run at: 2015-04-23 12:19:52.815760)" (scheduled at 2015-04-23 12:19:52.815760)
INFO 2015-04-23 12:19:52,831 scheduler.py:509 - Running job "3de93e99-bedb-4de9-8039-96f713cff992 (trigger: interval[0:01:00], next run at: 2015-04-23 12:19:52.816300)" (scheduled at 2015-04-23 12:19:52.816300)
INFO 2015-04-23 12:19:52,842 scheduler.py:509 - Running job "3e2ce1a4-4b87-43d8-ab01-34be1c07e6e9 (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.816938)" (scheduled at 2015-04-23 12:19:52.816938)
INFO 2015-04-23 12:19:52,843 scheduler.py:509 - Running job "56ac6ab3-d457-49fa-b3b8-717a7a052458 (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.817534)" (scheduled at 2015-04-23 12:19:52.817534)
INFO 2015-04-23 12:19:52,845 scheduler.py:509 - Running job "59196fab-5960-4573-b51d-3553a0cf04ef (trigger: interval[0:01:00], next run at: 2015-04-23 12:19:52.818116)" (scheduled at 2015-04-23 12:19:52.818116)
INFO 2015-04-23 12:19:52,851 scheduler.py:527 - Job "3e2ce1a4-4b87-43d8-ab01-34be1c07e6e9 (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.816938)" executed successfully
INFO 2015-04-23 12:19:52,852 scheduler.py:527 - Job "4c96efd5-3c3c-404b-bf7a-1b3ffaef276c (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.815760)" executed successfully
INFO 2015-04-23 12:19:52,853 scheduler.py:527 - Job "3de93e99-bedb-4de9-8039-96f713cff992 (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.816300)" executed successfully
INFO 2015-04-23 12:19:52,858 scheduler.py:527 - Job "03028a48-e699-43f5-b6c4-ff2ea805da03 (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.815123)" executed successfully
INFO 2015-04-23 12:19:52,861 scheduler.py:527 - Job "56ac6ab3-d457-49fa-b3b8-717a7a052458 (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.817534)" executed successfully
INFO 2015-04-23 12:19:52,863 scheduler.py:527 - Job "59196fab-5960-4573-b51d-3553a0cf04ef (trigger: interval[0:01:00], next run at: 2015-04-23 12:20:52.818116)" executed successfully
INFO 2015-04-23 12:19:56,599 Heartbeat.py:75 - Building Heartbeat: {responseId = 5, timestamp = 1429784396599, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:19:56,617 Controller.py:239 - Heartbeat response received (id = 6)
INFO 2015-04-23 12:19:56,617 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:20:06,618 Heartbeat.py:75 - Building Heartbeat: {responseId = 6, timestamp = 1429784406618, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:20:06,771 HostCheckReportFileHandler.py:91 - Host check report at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:20:06,772 HostCheckReportFileHandler.py:141 - Removing old host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:20:06,773 HostCheckReportFileHandler.py:146 - Creating host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2015-04-23 12:20:07,149 Controller.py:239 - Heartbeat response received (id = 7)
INFO 2015-04-23 12:20:07,150 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:20:17,151 Heartbeat.py:75 - Building Heartbeat: {responseId = 7, timestamp = 1429784417150, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:20:17,198 Controller.py:239 - Heartbeat response received (id = 8)
INFO 2015-04-23 12:20:17,198 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:20:27,199 Heartbeat.py:75 - Building Heartbeat: {responseId = 8, timestamp = 1429784427199, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:20:27,246 Controller.py:239 - Heartbeat response received (id = 9)
INFO 2015-04-23 12:20:27,246 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:20:37,247 Heartbeat.py:75 - Building Heartbeat: {responseId = 9, timestamp = 1429784437247, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:20:37,295 Controller.py:239 - Heartbeat response received (id = 10)
INFO 2015-04-23 12:20:37,296 Controller.py:283 - No commands sent from ambariserver2
INFO 2015-04-23 12:20:47,297 Heartbeat.py:75 - Building Heartbeat: {responseId = 10, timestamp = 1429784447296, commandsInProgress = False, componentsMapped = False}
INFO 2015-04-23 12:20:47,345 Controller.py:239 - Heartbeat response received (id = 11)
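
The agent excerpt above shows heartbeat responses arriving with ids 1 through 11. A rough sketch for confirming from a longer agent log that no response was skipped, assuming the "Heartbeat response received (id = N)" wording shown in this post; the helper names `response_ids` and `is_consecutive` are made up for illustration and are not part of the agent:

```python
import re

# Pattern based on the Controller.py lines in this post.
RESPONSE = re.compile(r"Heartbeat response received \(id = (\d+)\)")


def response_ids(lines):
    """Return the heartbeat response ids in the order they appear."""
    ids = []
    for line in lines:
        match = RESPONSE.search(line)
        if match:
            ids.append(int(match.group(1)))
    return ids


def is_consecutive(ids):
    """True when every id is exactly one more than the previous id."""
    return all(later - earlier == 1 for earlier, later in zip(ids, ids[1:]))


# Lines copied from the agent log excerpt.
sample = [
    "INFO 2015-04-23 12:19:06,392 Controller.py:239 - Heartbeat response received (id = 1)",
    "INFO 2015-04-23 12:19:06,393 Controller.py:283 - No commands sent from ambariserver2",
    "INFO 2015-04-23 12:19:16,442 Controller.py:239 - Heartbeat response received (id = 2)",
    "INFO 2015-04-23 12:19:26,490 Controller.py:239 - Heartbeat response received (id = 3)",
]

ids = response_ids(sample)
print(ids, is_consecutive(ids))
```

This only checks what the agent logged; it says nothing about how the server's HeartbeatMonitor decides that a heartbeat was lost.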

