A physical host we are managing using Cloudmin Pro is repeatedly reporting the status as "Webmin down", and we can't figure out why.
Last status: Webmin down (Last changed at 14/Aug/2020 10:45)
Detailed status error: Error reading response length from fastrpc.cgi : Connection reset by peer
Change time | Old status | New status | Changed by
14/Aug/2020 10:45 | Webmin | Webmin down | Monitoring
14/Aug/2020 10:40 | Webmin down | Webmin | Monitoring
14/Aug/2020 10:35 | Webmin | Webmin down | Monitoring
14/Aug/2020 10:30 | Webmin down | Webmin | Monitoring
14/Aug/2020 10:25 | Webmin | Webmin down | Monitoring
14/Aug/2020 10:20 | Webmin down | Webmin | Monitoring
...
We can't find any errors logged on the physical machine and Webmin loads up fine when we access it in a browser. CPU load is low, there is free RAM, and no other hosts on the same network are exhibiting the same problem.
Can you please advise how we might debug what's going on here?
Thanks
Chris
Comments
Submitted by JamieCameron on Fri, 08/14/2020 - 21:22 Comment #1
Is there any firewall that could be blocking ports 10000 - 10100 between the Cloudmin master and the host system, or the master and a VM?
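One way to check this is to attempt TCP connections across that port range from the Cloudmin master. The following is a minimal sketch, not part of Cloudmin or Webmin: the function names and the two-second timeout are illustrative assumptions. It distinguishes a refused connection (the host answered, but nothing was listening on that port at that moment) from a timeout (more suggestive of packets being dropped along the way).

```python
import socket


def probe_port(host, port, timeout=2.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error:<ExceptionName>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but no listener on this port.
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout: possibly filtered by a firewall.
        return "timeout"
    except OSError as exc:
        # Anything else (no route, name resolution failure, and so on).
        return f"error:{exc.__class__.__name__}"


def probe_range(host, first=10000, last=10100, timeout=2.0):
    """Probe every port in [first, last] and return {port: outcome}."""
    return {port: probe_port(host, port, timeout) for port in range(first, last + 1)}
```

For example, `probe_range("redacted-hostname.localdomain")` run on the master returns one outcome per port. A "refused" result on most ports would not by itself indicate a firewall, since a port with no listener also refuses connections; a run of "timeout" results would be the more interesting finding.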
Submitted by chriswik on Mon, 08/17/2020 - 06:37 Pro Licensee Comment #2
From what I can see, there are no firewall rules that could be causing this.
This is an intermittent problem: if Webmin is restarted, Cloudmin reports a 'Webmin' status for a while before falling back into the same intermittent pattern.
Our Nagios monitoring system checks the status of Cloudmin-managed servers and VMs and reports any that are not in Webmin, SSH or Alive status. This is what our Nagios log looks like for the past few hours:
August 17, 2020 11:00
[2020-08-17 11:35:18] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;HARD;2;WARNING - redacted-hostname.localdomain (Webmin Down)
[2020-08-17 11:25:12] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;SOFT;1;WARNING - redacted-hostname.localdomain (Webmin Down)
[2020-08-17 11:05:06] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;OK;SOFT;2;OK - No systems down
August 17, 2020 10:00
[2020-08-17 10:54:59] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;SOFT;1;WARNING - redacted-hostname.localdomain (Webmin Down)
[2020-08-17 10:44:53] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;OK;HARD;2;OK - No systems down
[2020-08-17 10:34:47] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;HARD;2;WARNING - redacted-hostname.localdomain (Webmin Down)
[2020-08-17 10:24:41] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;SOFT;1;WARNING - redacted-hostname.localdomain (Webmin Down)
[2020-08-17 10:14:35] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;OK;HARD;2;OK - No systems down
[2020-08-17 10:04:29] SERVICE ALERT: cloudmin.anu.net;CLOUDMIN-DOWN;WARNING;HARD;2;WARNING - redacted-hostname.localdomain (Webmin Down)
And this is from Cloudmin's Status change history:
17/Aug/2020 13:15 | Webmin | Webmin down | Monitoring
17/Aug/2020 13:00 | Webmin down | Webmin | Monitoring
17/Aug/2020 12:25 | Webmin | Webmin down | Monitoring
17/Aug/2020 12:20 | Webmin down | Webmin | Monitoring
17/Aug/2020 12:15 | Webmin | Webmin down | Monitoring
17/Aug/2020 12:10 | Webmin down | Webmin | Monitoring
17/Aug/2020 12:05 | Webmin | Webmin down | Monitoring
17/Aug/2020 12:00 | Webmin down | Webmin | Monitoring
17/Aug/2020 11:45 | Webmin | Webmin down | Monitoring
17/Aug/2020 11:40 | Webmin down | Webmin | Monitoring
This is happening almost 24/7.
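To make the flapping pattern easier to see, the status change history above can be turned into a list of "down" periods. This is only a sketch: the parser and function names are hypothetical, it assumes the line layout shown in this ticket (timestamp, old status, new status, changed by, with or without `|` separators), and it relies on an English locale for the month abbreviation.

```python
import re
from datetime import datetime

# Timestamp, old status, new status, changed-by; "|" separators are optional.
LINE_RE = re.compile(
    r"^(\d{2}/\w{3}/\d{4} \d{2}:\d{2})\s*\|?\s*"
    r"(Webmin down|Webmin)\s*\|?\s*"
    r"(Webmin down|Webmin)\s*\|?\s*(\w+)$"
)


def parse_history(text):
    """Return (time, old_status, new_status) tuples, oldest first."""
    events = []
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers, "..." and anything else unrecognised
        when = datetime.strptime(match.group(1), "%d/%b/%Y %H:%M")
        events.append((when, match.group(2), match.group(3)))
    return sorted(events)


def down_periods(events):
    """Return (start, end, minutes) for each completed 'Webmin down' period."""
    periods = []
    for (start, _, new_status), (end, _, _) in zip(events, events[1:]):
        if new_status == "Webmin down":
            minutes = int((end - start).total_seconds() // 60)
            periods.append((start, end, minutes))
    return periods


# The ten transitions quoted in this comment.
SAMPLE = """
17/Aug/2020 13:15 Webmin Webmin down Monitoring
17/Aug/2020 13:00 Webmin down Webmin Monitoring
17/Aug/2020 12:25 Webmin Webmin down Monitoring
17/Aug/2020 12:20 Webmin down Webmin Monitoring
17/Aug/2020 12:15 Webmin Webmin down Monitoring
17/Aug/2020 12:10 Webmin down Webmin Monitoring
17/Aug/2020 12:05 Webmin Webmin down Monitoring
17/Aug/2020 12:00 Webmin down Webmin Monitoring
17/Aug/2020 11:45 Webmin Webmin down Monitoring
17/Aug/2020 11:40 Webmin down Webmin Monitoring
"""

for start, end, minutes in down_periods(parse_history(SAMPLE)):
    print(f"{start:%H:%M} -> {end:%H:%M}: down for {minutes} min")
```

Every transition in the sample falls on a five-minute boundary, so the durations only show how many consecutive checks failed, not exactly when the connection problem started or stopped.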
Submitted by JamieCameron on Sat, 08/22/2020 - 14:29 Comment #3
How loaded is the remote Webmin system when this happens? We got another report recently of a user seeing very high system load when Cloudmin was doing a status check.
Submitted by chriswik on Mon, 08/24/2020 - 06:32 Pro Licensee Comment #4
Load average is consistently below 1.
We also bumped RAM on the Xen dom0 from 1GB to 2GB recently to see if that made any difference, even though we weren't seeing any out-of-memory errors logged. It hasn't helped.
It's a very new server with NVMe SSD, fast CPUs and light load. We can safely rule out network issues as lots of other servers on the same network are working just fine. We're also not experiencing the same issue with Webmin running on Xen VMs on this server, so that rules out an issue with the NICs on the server.
Submitted by JamieCameron on Sat, 08/29/2020 - 19:06 Comment #5
Would it be possible to do a packet capture (with tcpdump) when this happens? I'd be interested to see what connections were being made, or at least attempted.
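A capture limited to the traffic in question keeps the file small enough to share. The sketch below only builds a tcpdump command line; the helper name, the output file name, the `any` interface, and the 10000-10100 port range (taken from the firewall question earlier in this thread) are assumptions to adjust for the actual setup. The address `192.0.2.10` is a placeholder for the other end of the connection.

```python
import shlex


def build_tcpdump_command(peer_host, pcap_path="webmin-rpc.pcap",
                          interface="any", first_port=10000, last_port=10100):
    """Build a tcpdump argument list for traffic to/from one peer on a port range."""
    capture_filter = (
        f"host {peer_host} and tcp portrange {first_port}-{last_port}"
    )
    return [
        "tcpdump",
        "-i", interface,   # capture interface ("any" is Linux-specific)
        "-nn",             # do not resolve host names or port names
        "-s", "0",         # capture full packets rather than truncating them
        "-w", pcap_path,   # write raw packets to a file for later analysis
        capture_filter,
    ]


# Placeholder peer address; on the managed host this would be the Cloudmin master.
command = build_tcpdump_command("192.0.2.10")
print(shlex.join(command))
```

The printed command can be run as root on the managed host (or on the master, with the peer address swapped) and left running until the next "Webmin down" transition, then stopped with Ctrl-C. The resulting file can be opened in Wireshark or read back with `tcpdump -nn -r webmin-rpc.pcap`.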