Our GPU server crashes from time to time while running deep learning frameworks such as Caffe.
In these cases, only a hard reset of the server helps.
The crashes appear random and I cannot reproduce them at a specific point. Sometimes the whole training run completes, sometimes it crashes at different iterations. It is not the program that crashes: the whole server freezes and no longer reacts to any input, even after hours.
I have been logging nvidia-smi output as well as CPU load and memory usage, but nothing extraordinary happens around the time of the server freeze.
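A minimal sketch of this kind of logging is below, not the exact script used. It polls nvidia-smi and the 1-minute load average once per second and calls `os.fsync` after every write, so the last sample before a hard freeze is more likely to reach the disk. The file name `gpu_freeze.log`, the 1-second interval, and the chosen query fields are assumptions for illustration.

```python
import os
import shutil
import subprocess
import time
from datetime import datetime

# Fields passed to `nvidia-smi --query-gpu`; this selection is an assumption.
GPU_FIELDS = "index,temperature.gpu,power.draw,utilization.gpu,memory.used"


def sample_gpus():
    """Return one CSV line per GPU, or a short marker if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={GPU_FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        # A hanging nvidia-smi is itself worth recording.
        return "nvidia-smi timed out"
    return result.stdout.strip() or result.stderr.strip()


def append_sample(path, text):
    """Append timestamped lines and force them to disk before returning."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(path, "a") as log:
        for line in text.splitlines():
            log.write(f"{stamp} {line}\n")
        log.flush()
        os.fsync(log.fileno())  # push past the page cache before the next sample


def main(path="gpu_freeze.log", interval_s=1.0):
    # Hypothetical defaults; runs until interrupted or the machine freezes.
    while True:
        load1, _, _ = os.getloadavg()
        append_sample(path, f"load1={load1:.2f}")
        append_sample(path, sample_gpus())
        time.sleep(interval_s)


if __name__ == "__main__":
    main()
```

After a freeze and reset, the last lines of the log show the final temperature, power draw, and load that were recorded, which can then be compared against the values from runs that completed.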
The server is a Supermicro SYS-4028GR-TR with an Intel® C612 chipset, 2x Intel Xeon E5-2640 v4, 8x 16GB RAM, and 4x NVIDIA GeForce GTX 1080 Ti.
I currently have no idea how to find the cause of the freezes. Can you help me find the issue? Please let me know if I should provide more information.
nvidia-bug-report.log.gz (350 KB)