I have a few servers running Ubuntu 22.04 for computational tasks. Some of the machines silently freeze at random points, roughly once per week. When this happens, the machine still appears to be running from the outside, but ssh <machine> fails (the connection is refused).
Even after rebooting, there is no trace of what could have happened:
- journalctl -b -1 does not show anything. The log just stops at the time of the failure. Cronjobs are not run anymore.
- I have a cronjob saving various stats like CPU / GPU / memory utilization, temperatures, etc. Sometimes freezing occurs when the machines are completely idle. If a computational task is running during the crash, there are no suspicious stats (e.g., there is enough memory left, temperatures are fine, etc.).
- /var/log/kern.log shows nothing at the time of the failure.
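
A minimal sketch of such a stats logger, assuming a Python script run from cron. The log path, function names, and chosen fields are illustrative only, and GPU and temperature readings are omitted. Each sample is fsynced so that the last line written before a freeze has a better chance of surviving the reboot.

```python
import os
import time


def read_meminfo(path="/proc/meminfo"):
    """Return MemTotal and MemAvailable in kB, or an empty dict if unreadable."""
    wanted = {"MemTotal", "MemAvailable"}
    values = {}
    try:
        with open(path) as f:
            for line in f:
                key, _, rest = line.partition(":")
                if key in wanted:
                    values[key] = int(rest.split()[0])
    except OSError:
        pass
    return values


def format_sample(now, load, meminfo):
    """Build one log line from a timestamp, load averages, and memory values."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    fields = [stamp, "load=%.2f,%.2f,%.2f" % tuple(load)]
    for key in sorted(meminfo):
        fields.append("%s=%dkB" % (key, meminfo[key]))
    return " ".join(fields)


def append_durably(path, line):
    """Append a line and fsync it so it is pushed to disk right away."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


if __name__ == "__main__":
    # Hypothetical log location; any path on a local disk would do.
    append_durably(
        "/var/tmp/freeze-stats.log",
        format_sample(time.time(), os.getloadavg(), read_meminfo()),
    )
```

Run once per minute from cron, the timestamp of the last line gives an upper bound on when the machine stopped responding.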
The only potential lead I have is that sometimes after rebooting the GPU is not detected anymore and has to be restarted (I use the drivers recommended by ubuntu-drivers which work fine).
How can I figure out what happens to the machine in this case or, at least, what can I try to mitigate the issue?