After 5-30 minutes of running CUDA-based code for training neural networks, I get a hard crash that reboots the system. I initially suspected heat, but I managed to reproduce the crash within seconds by running the nbody simulation with n=130k.
- Case/PSU/Motherboard: 875W Alienware Aurora R4 http://www.dell.com/us/dfh/p/alienware-aurora-r4/pd
- GPU: I tested both the EVGA GTX 1080 Ti and NVIDIA GTX 1080 Ti
- CPU: Intel i7-4930K @ 3.40GHz (lscpu reports it is running around 1250MHz)
- RAM: 16GB DDR3
- OS: Ubuntu 16.04
- NVIDIA Driver: 375.66
- CUDA: 8.0.61
- cuDNN: 5.1.10
During the longer tests, the CPU load average varies between 0.7 and 0.99. GPU power usage looks normal: it fluctuates between 100W and 280W, mostly staying below the 250W cap. RAM usage is low, about 3GB. VRAM usage varies between 4GB and 10GB depending on the code I’m running, but this doesn’t seem to affect the crash. Neither RAM nor VRAM appears to be leaking.
The machine can run indefinitely with the GPU rendering graphics to a screen; it only crashes when I run CUDA applications.
A bug report log is here:
But I don’t think it contains anything relevant, because it was captured after the reboot.
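Since anything captured after the reboot misses the moment of the crash, one thing I could do is log GPU readings continuously and force every line to disk, so the last samples before the reset survive. Below is a minimal sketch of that idea, not something I have validated on this machine. It assumes `nvidia-smi` is on the PATH and supports the listed `--query-gpu` fields; the file name `gpu_log.csv` and the helper names `parse_sample`, `read_sample`, and `log_forever` are made up for illustration.

```python
import os
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu (assumed to be
# supported by the installed driver).
FIELDS = ["timestamp", "power.draw", "temperature.gpu", "memory.used", "clocks.sm"]


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected nvidia-smi output: %r" % line)
    return dict(zip(FIELDS, values))


def read_sample(gpu_index=0):
    """Query one GPU once and return the raw CSV line as a string."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--id=%d" % gpu_index,
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ])
    return out.decode().strip()


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append one sample per interval and force each line to disk."""
    with open(path, "a") as f:
        while True:
            f.write(read_sample() + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before the next sample
            time.sleep(interval_s)
```

Calling `log_forever()` from a separate process while the CUDA job runs should leave the power draw, temperature, memory use, and SM clock from the final seconds before the crash in `gpu_log.csv`; `parse_sample` can then turn each line back into named fields for inspection.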
The next options I am considering:
- Try another GPU.
- Try a different version of Ubuntu, the NVIDIA drivers, CUDA, or cuDNN.
Any advice would be very much appreciated!