Following a discussion in the LabTechGeek channel, I was surprised to find that a number of people had systems that were not checking in to the Automate server properly. As some of you may know, there are two types of check-in done by your remote agents:
- A normal check-in, done over port 443, during which numerous things are sent. It occurs every 30 seconds for Master Agents and every 5 minutes for normal agents. The time until the next full check-in is shown in the bottom right-hand corner of the agent window.
This check-in is initiated by the normal Automate Service.
- An “I’m still here” check-in called Heartbeat. This is a “pulse” that the agent sends in every five seconds over UDP port 75. It is incredibly small, with almost no data in it. It is indicated by a value in the agent window that updates every five seconds.
If your Heartbeat is working properly, this value should update every five seconds. This check-in is actually sent from the LabTech Monitoring Watchdog Service (the second Automate service). If you find that only one or two machines are having an issue, look at this service on the affected agents.
You can get away with the Heartbeat check-in not working at all. Automate will still function, to a degree, when this second check-in is broken. So why is it so important? When Heartbeat is broken:
- Your offline server monitors are potentially delayed, not as accurate and can trigger erroneously. By default the offline server monitor takes into account both the last full check-in and the last heartbeat sent in.
- When you open an agent window, a special process occurs within Automate that sets a flag so that, in the next Heartbeat check-in, Automate gives a response to the remote agent that says “I want you to properly check in, we have stuff for you to do!” The agent then immediately does a full check-in, and the response to that tells the remote agent to enter a mode called FasTalk. Once this is on, the remote agent does a full check-in every five seconds.
- When point 2 is not working properly, your next check-in time will not go down even when you open an agent. This can delay all of the following by up to five minutes:
  - Commands/Scripts sent to the agent
  - Interactive prompts like the command prompt
  - Service restarts, process ending
  - File explorer, registry explorer and more
  - Basically anything that involves asking the remote agent to do anything
- Certain remediation scripts that first check whether the machine is online won’t work. For example, the Autofix to restart a service that is stopped will never trigger (thanks for pointing this out @Klaymore!).
- It also means that the performance data in the bottom right of the agent is not up to date (these are pretty much the only values sent in during the 5 second Heartbeat check-in)
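To make the handshake in the list above a little more concrete, here is a toy model of it in Python. It is only a sketch of the behaviour as described in this post, not how Automate is actually implemented: `ToyServer`, `ToyAgent` and `interval_after_opening_window` are made-up names, and the intervals are the ones quoted above for a normal (non-Master) agent.

```python
from dataclasses import dataclass

# Intervals quoted in this post, in seconds.
NORMAL_CHECKIN_INTERVAL = 300  # full check-in, normal agent
FASTALK_INTERVAL = 5           # full check-in once FasTalk is on


@dataclass
class ToyServer:
    """Hypothetical stand-in for the Automate server side."""
    agent_window_open: bool = False

    def answer_heartbeat(self) -> bool:
        # True means "please do a full check-in right now".
        return self.agent_window_open

    def answer_full_checkin(self) -> bool:
        # True means "enter FasTalk".
        return self.agent_window_open


@dataclass
class ToyAgent:
    """Hypothetical stand-in for a remote agent."""
    heartbeat_reaches_server: bool = True
    checkin_interval: int = NORMAL_CHECKIN_INTERVAL

    def heartbeat(self, server: ToyServer) -> None:
        if not self.heartbeat_reaches_server:
            return  # pulse never arrives (e.g. UDP 75 blocked), nothing changes
        if server.answer_heartbeat():
            self.full_checkin(server)

    def full_checkin(self, server: ToyServer) -> None:
        if server.answer_full_checkin():
            self.checkin_interval = FASTALK_INTERVAL
        else:
            self.checkin_interval = NORMAL_CHECKIN_INTERVAL


def interval_after_opening_window(heartbeat_reaches_server: bool) -> int:
    """Check-in interval after one Heartbeat pulse with an agent window open."""
    server = ToyServer(agent_window_open=True)
    agent = ToyAgent(heartbeat_reaches_server=heartbeat_reaches_server)
    agent.heartbeat(server)
    return agent.checkin_interval
```

With a working Heartbeat, `interval_after_opening_window(True)` drops to 5 seconds. With a broken one, `interval_after_opening_window(False)` stays at 300 seconds, which is the five minute wait your engineers end up feeling.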
When this is not working, it has a significant impact on your engineers. No-one wants to interact with a tool that takes five minutes to respond to whatever you tell it to do. Fortunately there are some simple steps to identify whether this is working or not, and to fix it if it is not.
How can I tell if this is broken or not enabled?
When you open an agent in Automate, you should see this appear in the top right:
Followed 10-20 seconds later in the top right by this:
You should then notice that in the bottom right-hand corner, the countdown (which was likely anywhere from 0 to 5 minutes) jumps down to a repeating value of 5 seconds. This means your machine is checking in every 5 seconds, is ready for your commands, and will process them within a maximum of 5 seconds. If your system is already doing that when you open agents, the rest of this does not apply to you: your Heartbeat is enabled and working properly.
Mine does not do that! What is this wizardry?
Now we need to do some troubleshooting! This is going to be one of three things: Heartbeat is turned off in the configuration, a firewall rule is wrong somewhere and the port is not open, or the URL in your redirector config is wrong. It could potentially be all three of them.
Using SQL to find out if Heartbeat is working
SELECT COUNT(*) AS `NumberOfRecentHeartbeats`,
  (SELECT COUNT(*) FROM v_xr_computers WHERE ComputerDateLastContact > DATE_SUB(NOW(), INTERVAL 24 HOUR)) AS `NumberOfComputerCheckins`
FROM HeartBeatComputers
WHERE LastHeartbeatTime > DATE_SUB(NOW(), INTERVAL 24 HOUR);
Run the above on your Automate server. This pulls results from the Heartbeat table (where the last heartbeat sent for each computer is stored) and the computers table. The results you get back count the number of recent heartbeats and the number of recent full computer check-ins. If you have 1500 agents that are online, I would expect both numbers to be relatively similar, at around 1500 (a small variation is not an issue). If your Heartbeat count is 0, or significantly lower than your number of computer check-ins, then you have a Heartbeat problem that is likely network related.
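If you want to turn those two numbers into a quick verdict, something like the sketch below would do. The function name and the labels are made up for illustration, and the 0.8 cut-off for "significantly lower" is purely my assumption, so adjust it to whatever variation you consider normal.

```python
def heartbeat_health(recent_heartbeats: int, recent_checkins: int,
                     min_ratio: float = 0.8) -> str:
    """Classify the two counts returned by the query above.

    min_ratio is an assumed threshold for "significantly lower";
    it is not a value that Automate itself defines.
    """
    if recent_checkins == 0:
        return "no data"    # nothing checked in, so there is nothing to compare
    if recent_heartbeats == 0:
        return "broken"     # full check-ins arrive but no heartbeats do
    if recent_heartbeats / recent_checkins < min_ratio:
        return "degraded"   # some agents cannot get a heartbeat through
    return "healthy"
```

For example, `heartbeat_health(1480, 1500)` returns `"healthy"`, while `heartbeat_health(0, 1500)` returns `"broken"`.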
Using Logs to find out if Heartbeat is working
On a remote agent (one that is not on the same physical network as your Automate server), browse to C:\Windows\LTSVC\LTSVCMon.txt. Heartbeat errors are logged in there.
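If the log is long, a few lines of Python can pull out the relevant entries. This is a minimal sketch with some assumptions: I am guessing that the relevant lines contain the word "heartbeat" (matched case-insensitively) and that the file can be read as UTF-8. The function name is made up, and I am not assuming anything else about the log format.

```python
from pathlib import Path


def find_heartbeat_lines(log_path, keyword="heartbeat", last_n=20):
    """Return up to last_n lines from the log that mention keyword.

    Assumes the file decodes as UTF-8; undecodable bytes are replaced
    rather than raising an error.
    """
    if last_n <= 0:
        return []
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    matches = [line for line in text.splitlines()
               if keyword.lower() in line.lower()]
    return matches[-last_n:]


if __name__ == "__main__":
    # Path taken from this post; run this on the affected agent.
    for line in find_heartbeat_lines(r"C:\Windows\LTSVC\LTSVCMon.txt"):
        print(line)
```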
Looks like my Heartbeat is broken
This is almost always one of three things:
- Windows Firewall on your Automate server does not have UDP port 75 unblocked
- Your router/firewall is not port forwarding UDP port 75
- Something is blocking the traffic on UDP port 75 (AV, security appliance, etc.)
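One way to narrow down which of these it is: send a test datagram from a remote machine and watch for it on the server side with a packet capture tool. The sketch below only does the sending part. UDP is connectionless, and I would not assume the server replies to an arbitrary payload, so judge success by whether the packet shows up in the capture, not by any response. The function name and the payload are made up, and the sketch assumes IPv4.

```python
import socket


def send_udp_probe(host: str, port: int = 75,
                   payload: bytes = b"heartbeat-path-test") -> int:
    """Send one UDP datagram to host:port and return the bytes handed to the OS.

    A successful return only means the packet left this machine. It says
    nothing about whether a firewall dropped it on the way.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # automate.joebloggs.com is the example hostname used in this post.
    send_udp_probe("automate.joebloggs.com")
```

If the datagram appears at your router but not on the Automate server, look at the port forward. If it reaches the server's network card but Heartbeat still fails, look at Windows Firewall or security software on the server.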
Heartbeat has to be turned on for this to work too!
Go to Automation > Templates > Agent Templates, open the Default Template and go to the Agent Settings tab. Ensure Heartbeat is turned on.
Go to System > Configuration > Dashboard, and open the Config then System Tab and make sure the following is ticked
Go to System > Configuration > Dashboard, and open the Config then Control Center Tab and make sure the following is ticked
Go to System > Configuration > Dashboard, and open the Config then System Tab. In the “Redirector Config” section, ensure your Hostname is not “Localhost”. It should be the URL of your Automate server, e.g. automate.joebloggs.com. Leave the Port and Passcode as they are. The port should be set to 70!
That’s it – if you’ve done all the above and it’s still not working come speak to us in the LabTechGeek Slack, and ping me (@gavsto). Hope this has been helpful!
It’s still not working!
One last step: the service that actually receives the Heartbeat packets is the LTRedirector service on your Automate server. Make sure it is started!
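If you would rather check this from a script than from the Services console, here is a rough sketch that shells out to the Windows `sc query` command and reads the STATE line. Two assumptions to be aware of: the name you pass in has to be the service's actual short name on your server (I am using "LTRedirector" only because that is what it is called above, so confirm it first), and the parsing expects the English `sc` output format. The function names are made up.

```python
import subprocess
from typing import Optional


def parse_sc_state(sc_output: str) -> Optional[str]:
    """Extract the state word (e.g. RUNNING, STOPPED) from `sc query` output."""
    for line in sc_output.splitlines():
        if line.strip().startswith("STATE"):
            # Expected shape: "STATE              : 4  RUNNING"
            parts = line.split(":", 1)[1].split()
            return parts[1] if len(parts) >= 2 else None
    return None


def service_state(name: str = "LTRedirector") -> Optional[str]:
    """Run `sc query <name>` (Windows only) and return the parsed state.

    Returns None when no STATE line is found, for example when the
    service name does not exist on this machine.
    """
    result = subprocess.run(["sc", "query", name],
                            capture_output=True, text=True)
    return parse_sc_state(result.stdout)


if __name__ == "__main__":
    print(service_state())
```

Anything other than `RUNNING` coming back means the service needs attention before Heartbeat can work.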