AES expects monitored nodes to send an HLTH0001 message every two minutes. The HLTH0001 messages are stored as PegaAES-Data-NodeStats objects and used to update the PegaAES-Data-NodeHealth object. An agent in turn checks node health and other data to assess whether a node is healthy. If the 'last health' message was received more than five minutes ago, AES assumes that the node has failed and triggers the health change notification.
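To make the timing rule concrete, here is a minimal Python sketch of the staleness check described above. It is not the actual AES agent logic; the function and constant names are hypothetical, and only the two-minute interval and five-minute threshold come from the description.

```python
from datetime import datetime, timedelta

# Values taken from the description above; all names here are hypothetical.
EXPECTED_INTERVAL = timedelta(minutes=2)  # how often a node should report
FAILURE_THRESHOLD = timedelta(minutes=5)  # silence after which a node is treated as failed


def is_node_considered_failed(last_health: datetime, now: datetime) -> bool:
    """Return True when the last health message is older than the threshold."""
    return now - last_health > FAILURE_THRESHOLD


def missed_intervals(last_health: datetime, now: datetime) -> int:
    """Count the whole reporting intervals elapsed since the last message."""
    return max(0, (now - last_health) // EXPECTED_INTERVAL)
```

With these numbers a node can miss two consecutive messages before it is flagged, so a single dropped HLTH0001 message should not by itself trigger the notification.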
There are three main failure scenarios to consider:
a- The monitored node is not sending health status messages to AES.
b- The monitored node sends health status messages to AES, but the messages never arrive at the AES server.
c- The monitored node sends health status messages, but the AES SOAP service fails and the messages do not get persisted.
a- Check the AES logs for the times in question. Do you see exceptions? Do you see an odd pattern of PEGA0011 (slow service) alerts?
a,b- On the monitored node(s) having issues, enable debugging for classes httpclient.wire.header and httpclient.wire.message.
(1) Look for HLTH0001 messages being POSTed to AES. Are health status messages being sent?
(2) Look for the HTTP response to the POST. Did the AES server acknowledge the messages? Is the status HTTP 200 (processed) or an error status?
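Once wire debugging is enabled, checks (1) and (2) can be done over a large log with a short script. The Python sketch below is only an illustration: it assumes the log lines follow the Commons HttpClient wire format (outgoing lines marked with `>>`, incoming lines with `<<`), and the function name and regular expressions are hypothetical. Adjust the patterns to whatever your log actually contains.

```python
import re
from collections import Counter

# Assumed line shapes (Commons HttpClient style wire logging):
#   ... httpclient.wire.header - >> "POST <path> HTTP/1.1[\r][\n]"
#   ... httpclient.wire.header - << "HTTP/1.1 200 OK[\r][\n]"
REQUEST_RE = re.compile(r'>> "POST (\S+) HTTP/\d\.\d')
RESPONSE_RE = re.compile(r'<< "HTTP/\d\.\d (\d{3})')


def summarize_wire_log(lines):
    """Count outgoing POSTs, HTTP response statuses, and HLTH0001 mentions."""
    posts = 0
    hlth_mentions = 0
    statuses = Counter()
    for line in lines:
        if "httpclient.wire" not in line:
            continue  # ignore everything that is not wire debugging output
        if "HLTH0001" in line:
            hlth_mentions += 1
        if REQUEST_RE.search(line):
            posts += 1
            continue
        match = RESPONSE_RE.search(line)
        if match:
            statuses[match.group(1)] += 1
    return posts, hlth_mentions, dict(statuses)
```

For example, `summarize_wire_log(open("PegaRULES.log", encoding="utf-8", errors="replace"))` (the file name is an assumption) returns the three counts. No POSTs at all points to scenario a, POSTs without any response point to scenario b, and non-200 statuses point to scenario c.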
Which Pega version is being monitored? Is your AES integration implemented through a customized prlogging.xml file or via the 'dynamic appender' landing page?