I recently got a new rig for deep learning, which has the following specs:
Intel Core i7-7800X CPU
MSI X299 Motherboard
16 GB of RAM
Corsair 750W power supply (80 Plus Bronze)
I installed Ubuntu on the rig, along with CUDA Toolkit 9.0 and the latest Nvidia driver, 390.25. While stress-testing and benchmarking the rig, I ran into a problem: midway through a stress test that maxes out both GPU and CPU load, the system crashes. Sometimes it is a hard crash and sometimes a soft crash. There doesn’t seem to be anything relevant in dmesg, kern.log, or syslog. Here is what I have tried so far.
- Maybe it is a power issue? First, I know the rig was tested under Windows with the same hardware setup without fault. Second, I can reproduce the crash when running only a single GPU, which should pull well under 750W. I know there might be spikes, but I can’t imagine 750W being too little to support a single GPU.
- I tried downgrading my drivers, but the results were the same.
- I tried stock Ubuntu 16.04 LTS, Ubuntu 17.10, and Ubuntu 16.04 with an updated kernel. None of these changed the result.
- I tried to induce the crash by stressing only the GPUs and, separately, only the CPU. In neither case could I get the rig to crash.
- I tried a different Linux distribution, Fedora, and the rig still crashed. The driver was the same one I installed on Ubuntu.
Some things I think it could be: the power supply, a problem with the PCI drivers, a problem with the Linux kernel (I have tried 4.09 through 4.13), or a problem with the Nvidia drivers.
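Since the system logs show nothing, one way to narrow down the power question might be to record GPU telemetry right up to the moment of the crash. The following is only a minimal sketch, not something I have run on this rig: it assumes nvidia-smi is on the PATH and Python 3.7 or newer, and the function names and the gpu_log.csv file name are hypothetical. Each sample is forced to disk with fsync so the last readings have a chance of surviving a hard crash.

```python
import csv
import io
import os
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["timestamp", "index", "power.draw", "temperature.gpu", "utilization.gpu"]


def query_gpus():
    """Return one CSV line per GPU (assumes nvidia-smi is on the PATH)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def parse_sample(line):
    """Split one CSV line into a dict keyed by field name (values stay strings)."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    return dict(zip(FIELDS, values))


def append_sample(path, line):
    """Append a line and fsync it so it is on disk before the next sample."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Hypothetical driver loop: sample every GPU once per interval until killed."""
    while True:
        for line in query_gpus():
            append_sample(path, line)
        time.sleep(interval_s)
```

The idea would be to call log_forever() in a second terminal before starting the stress test, then after rebooting read the last few lines of gpu_log.csv (for example with parse_sample) to see the power draw, temperature, and utilization just before the crash. A one-second interval will not catch short power spikes, so a clean log would not rule out the power supply.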
Any help would be greatly appreciated.