
I have a few servers running Ubuntu 22.04 for computational tasks. Some of the machines silently freeze at random points, perhaps once a week or so. When this happens, the machine still appears to be running from the outside, but ssh <machine> is not possible (the connection is refused).

Even after rebooting, there is no trace of what could have happened:

  • journalctl -b -1 does not show anything. The log just stops at the time of the failure. Cronjobs are not run anymore.
  • I have a cronjob saving various stats like CPU / GPU / memory utilization, temperatures, etc. Sometimes freezing occurs when the machines are completely idle. If a computational task is running during the crash, there are no suspicious stats (e.g., there is enough memory left, temperatures are fine, etc.).
  • /var/log/kern.log shows nothing at the time of the failure.

The only potential lead I have is that sometimes, after rebooting, the GPU is not detected anymore and has to be restarted (I use the drivers recommended by ubuntu-drivers, which otherwise work fine).

How can I figure out what happens to the machine in this case, or, at least, what I can try to mitigate the issue?

  • Is it possible a power interruption (or failure) has occurred? Commented 2 days ago
  • @PeterBill Only one of multiple machines freezes at a time, so I think not. Commented 2 days ago
  • Do you have a physical console? What is the behavior there? Commented yesterday
  • I think you have low chances of mitigating the issue if you don't know what's causing it, but this might be getting into OSdev territory. If it turns out that it's just the SSH daemon deadlocking or something, then getting a shell might be enough to debug it, and either a physical shell or a hypervisor layer might work there. Otherwise, you might have to take a memory snapshot and analyse that manually, trying to track down a bug in the OS somewhere. JTAG would allow that, or again a hypervisor layer... Commented yesterday
  • "the machine is still running from the outside, but ssh <machine> is not possible (the connection is refused)." - How is the connection refused exactly? Do you actually get "connection refused", or do you get something else? Also, how have you determined that the machine appears alive from the outside? Please update your question with the exact diagnostics you've performed. Details can matter a great deal. Commented yesterday

2 Answers

6

Sometimes logging to another machine over the network works when logging to the local disk doesn't (because UDP is fast, and the network stack sometimes dies last, or at least after the kernel's various filesystem and disk I/O layers).

So if you have one machine that either doesn't crash or crashes less frequently than the others (if you've set up a cluster with Slurm or similar, this would probably be the main cluster management node rather than a compute node), set it up to receive logs over the LAN from the other servers. Then configure at least some of the other servers (the ones that crash most often, if there's any difference between them) to send a copy of their kernel log entries to that logging server.

How to do this depends on whether you're using rsyslogd or systemd journal. See either man rsyslogd or man systemd-journal-remote.service.

For example, with rsyslogd:

My systems use journald but are also configured to log everything with rsyslogd.

I have the following in my logging server's rsyslog config:

$ModLoad imudp              # provides UDP syslog reception 
$UDPServerAddress x.x.x.x   # my logging server's IP address
$UDPServerRun 514

and on my other machines:

if $fromhost-ip == '127.0.0.1' and $syslogfacility-text == 'kern' then @logserver

(where "logserver" is the hostname of my log server)

Actually, I have it set up so that each machine sends kernel logs to at least one other machine on my LAN and receives log messages from at least one machine, i.e. they all log to each other. That is why the if $fromhost-ip == '127.0.0.1' test is required: without it, they would forward received messages back and forth in a never-ending loop.
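
If you'd rather stay with the systemd journal instead of rsyslogd, a rough equivalent uses systemd-journal-remote on the receiver and systemd-journal-upload on the senders. The sketch below is an assumption-laden outline, not a drop-in config: it assumes Ubuntu's packaging, a receiver reachable as logserver, and that plain HTTP on the default port 19532 is acceptable on your trusted LAN. Check man systemd-journal-remote and man systemd-journal-upload for the specifics.

# On the log server (receiver). The stock unit expects HTTPS; this override
# accepts plain HTTP on the socket-activated fd instead (run as root).
apt install systemd-journal-remote
install -d -o systemd-journal-remote /var/log/journal/remote   # if the package didn't create it
systemctl edit systemd-journal-remote.service
#   [Service]
#   ExecStart=
#   ExecStart=/lib/systemd/systemd-journal-remote --listen-http=-3 --output=/var/log/journal/remote/
systemctl enable --now systemd-journal-remote.socket

# On each machine that should send its journal:
apt install systemd-journal-remote        # also ships systemd-journal-upload
# /etc/systemd/journal-upload.conf:
#   [Upload]
#   URL=http://logserver:19532
systemctl enable --now systemd-journal-upload.service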

  • Thanks for the hint. I just set up systemd-journal-remote and will see whether it captures something useful. Commented 2 days ago
3

SSH is way too high-level to help with any kernel-level issues; the ability to log in via SSH is typically the first thing that fails on a machine that is otherwise still very much alive. The same goes for logging to disk: that also involves way too many moving parts to stay reliable once things start acting up.

However, if you're really getting "Connection refused" on SSH attempts instead of timeouts, that's a valuable data point because it means the network stack is entirely healthy. Connection refused is a response (TCP RST) actively generated by the kernel, confirming that it isn't dead.
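
If you want to pin that down precisely the next time it happens, a few quick checks from another box are enough (replace <machine> with the affected host; nc here is the OpenBSD netcat shipped with Ubuntu):

ping -c 3 <machine>            # is ICMP still being answered at all?
nc -vz -w 5 <machine> 22       # prints "Connection refused" on a TCP RST, times out otherwise
ssh -vvv <machine>             # verbose output shows exactly where the attempt stalls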

Step 1 is always to check the console. Given this is a server, there most likely is a BMC of some sort with remote console access, or there's a physical KVM with an actual video cable etc. If anything went wrong in the kernel, there will be messages on the screen.
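
If the machines have an IPMI-capable BMC, something along the lines of the following gets you that console remotely; the address and credentials are placeholders, and vendor tooling (iDRAC, iLO, etc.) may wrap this differently:

# serial-over-LAN console via the BMC (requires IPMI-over-LAN to be enabled there)
ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret sol activate
# quick sanity check of power state as seen by the BMC
ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret chassis status

Keep in mind that kernel messages only show up on the SOL console if the kernel's console= parameter points at the serial port the BMC exposes.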

If that doesn't yield anything, do the usual NumLock test: hit NumLock and see if the LED responds. If it does, the kernel is OK; if not, the kernel is locked up. Then try Magic SysRq (enable it first if you haven't yet; see the sketch below).
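
Enabling SysRq ahead of time is a one-liner (run as root); the value 1 enables all SysRq functions, whereas Ubuntu's default only allows a subset:

# enable all SysRq functions for the running kernel
sysctl kernel.sysrq=1
# and make it persistent across reboots
echo 'kernel.sysrq = 1' > /etc/sysctl.d/90-sysrq.conf

On a hung box, Alt+SysRq+s, Alt+SysRq+u, Alt+SysRq+b (sync, remount read-only, reboot) at the physical keyboard is the classic last resort; if even that does nothing, the kernel is thoroughly wedged.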

If this is a kernel-level issue but the logs don't have enough info to debug it, go set up kdump. That can be as trivial as installing one package and rebooting (at least on RHEL and friends).
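
On Ubuntu the rough equivalent, assuming the packaging still works the way it does on 22.04, is the linux-crashdump metapackage, which pulls in kdump-tools and adds a crashkernel= memory reservation to the boot parameters:

apt install linux-crashdump   # pulls in kdump-tools and kexec-tools
reboot                        # needed so the crashkernel= reservation takes effect
kdump-config show             # afterwards: verify that the crash kernel is loaded

After the next freeze, a forced crash (Alt+SysRq+c) or a genuine panic should leave a vmcore under /var/crash that can be analysed with the crash utility.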

  • It's good to have an answer specifically covering the ssh "connection refused" angle, as that stands out in the original description, +1. If you have any notions about what could commonly cause that to happen, that sort of speculation might be helpful to the OP and future readers - would (say) dropping to runlevel 1 result in that behaviour? Commented yesterday
