I have been running compute jobs on our Fedora server, which has four K40c GPUs installed. The jobs are iterative. They ran without problems before, but today one caused all four GPUs to freeze completely. nvidia-smi also hangs, and top does not show any running GPU jobs. I have encountered this problem once before, and now exactly the same thing has happened again. The job I ran last time was similar to this one but not identical. Rebooting the server got everything working normally again, but the problem is like a hidden detonator. Can anyone offer some suggestions? I tried to attach the nvidia-bug-report log file but it’s too large…
nvidia-bug-report.log (1.25 MB)