Failures on Ubuntu running with Nvidia 1070Ti

Hi all,

I recently got a new rig for deep learning, which has the following specs:
2x 1070Ti
7800X Intel CPU
MSI X299 Motherboard
16 GB of RAM
Corsair Bronze 750W Power supply

I installed Ubuntu on the rig, along with CUDA Toolkit 9.0 and the latest Nvidia driver, 390.25. While stress-testing/benchmarking the rig, I ran into problems. Midway through a stress test that maxes out both GPU and CPU load, the system crashes. The crash can be either a hard crash or a soft crash. There doesn’t seem to be anything in dmesg, kern.log, or syslog. Here is what I have tried so far.

  1. Maybe it was a power issue? First, I know the rig has been tested under Windows with the same setup without fault. Second, I can recreate the crash when running only a single GPU, which should pull well under 750W. I know there might be spikes, but I can’t imagine 750W being too little to support a single GPU.
  2. I tried downgrading my drivers, but the results were the same.
  3. I tried regular 16.04 LTS Ubuntu, 17.10 Ubuntu, and 16.04 Ubuntu with an updated kernel. None of these changed the results.
  4. I tried to induce the crash by stressing just the GPUs and, separately, just the CPUs. In neither case could I get the rig to crash.
  5. I tried a different distribution of Linux, Fedora, and the rig still crashed. The driver is the same as the one I installed with Ubuntu.
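
Since nothing shows up in dmesg, kern.log, or syslog, one way to see what the GPUs were doing right before a crash is to log nvidia-smi readings to a file and force each sample to disk. Below is a minimal sketch in Python, not a tested diagnostic tool. The file name gpu_log.csv, the one-second interval, and the helper names are assumptions made for illustration. The query fields are standard nvidia-smi ones (listed by `nvidia-smi --help-query-gpu`).

```python
import csv
import os
import subprocess
import time

# Fields requested from nvidia-smi (see `nvidia-smi --help-query-gpu`).
FIELDS = ["timestamp", "index", "power.draw", "temperature.gpu", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in next(csv.reader([line]))]
    return dict(zip(FIELDS, values))


def append_durably(path, line):
    """Append a line and fsync it so it has a chance to survive a sudden reset."""
    with open(path, "a") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()
        os.fsync(f.fileno())


def query_gpus():
    """Return one CSV line per GPU as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [l for l in out.splitlines() if l.strip()]


def main(path="gpu_log.csv", interval_s=1.0):
    # Hypothetical defaults; run this alongside the stress test.
    while True:
        for line in query_gpus():
            append_durably(path, line)
        time.sleep(interval_s)


if __name__ == "__main__":
    main()
```

After a crash and reboot, the last few lines of the log show the final power draw, temperature, and utilization that were recorded. That might help tell a thermal or power problem apart from a driver one.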

Some things I think it could be: the power supply, a problem with the PCI drivers, a problem with the Linux kernel (I have tried 4.09 - 4.13), or a problem with the Nvidia drivers.

Any help would be greatly appreciated.

I experienced something similar with the 390.25 and the 390.42 drivers when running a neural network training Python script. In the middle of the run, at more or less the same point in the script, the system does a complete reset and reboots, just as if someone had pushed the hardware reset button. :-( Just like you, I have not found anything in dmesg, kern.log, etc.

My system is an i7-6700K, 16 GB RAM, and a GTX 1070 graphics card, running Fedora 27:

Linux groovy 4.15.6-300.fc27.x86_64 #1 SMP Mon Feb 26 18:43:03 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux