Failures on Ubuntu running with Nvidia 1070Ti

graham.gobieski · February 21, 2018, 3:45am

Hi all,

I recently got a new rig for deep learning, which has the following specs:
2 1070Ti
7800X Intel CPU
MSI X299 Motherboard
16 GB of RAM
Corsair Bronze 750W Power supply

I installed Ubuntu on the rig, along with CUDA Toolkit 9.0 and the latest Nvidia driver, 390.25. While stress testing and benchmarking the rig, I ran into problems: midway through a stress test that maxes out both GPU and CPU load, the system crashes. The crash can be either a hard crash or a soft crash. There doesn’t seem to be anything in dmesg, kern.log, or syslog. Here is what I have tried so far:

Maybe it is a power issue? First, I know the rig was tested under Windows with the same setup without fault. Second, I can reproduce the crash when running only a single GPU, which should pull well under 750W. I know there might be spikes, but I can’t imagine 750W being too little to support a single GPU.
I tried downgrading my drivers, but the results were the same.
I tried regular 16.04 LTS Ubuntu, 17.10 Ubuntu, and 16.04 Ubuntu with an updated kernel. None of these changed the results.
I tried to induce the crash by stressing just the GPUs, and separately just the CPU. In neither case could I get the rig to crash.
I tried a different Linux distribution, Fedora, and the rig still crashed. The driver was the same one I installed on Ubuntu.

Some things I think it could be: the power supply, a problem with the PCI drivers, a problem with the Linux kernel (I have tried 4.09 - 4.13), or a problem with the Nvidia drivers.
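
Since nothing shows up in dmesg, kern.log, or syslog, one way to gather evidence for or against the power supply theory is to log GPU telemetry to disk during the stress test and look at the last samples written before the crash. The following is only a minimal sketch, not a tested diagnostic: it assumes nvidia-smi is on the PATH, and the helper names (parse_samples, query_gpus, log_forever) and the log file name are hypothetical. Each batch of samples is flushed and fsync'ed so it has a better chance of surviving a hard reset.

```python
import csv
import io
import os
import subprocess
import time

# Fields requested from nvidia-smi via --query-gpu.
FIELDS = ["timestamp", "index", "power.draw", "temperature.gpu", "utilization.gpu"]


def parse_samples(csv_text):
    """Parse nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in rec))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return one sample dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_samples(result.stdout)


def log_forever(path="gpu_telemetry.log", interval_s=1.0):
    """Append samples to `path` (hypothetical file name) until interrupted."""
    with open(path, "a") as log:
        while True:
            for sample in query_gpus():
                log.write(",".join(sample[key] for key in FIELDS) + "\n")
            # Push the data to disk so it is not lost in a hard crash.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

Calling log_forever() in a second terminal before starting the stress test leaves a record of power draw, temperature, and utilization up to roughly the last second before the crash. A one-second interval will likely miss short power spikes, so a clean log would not rule out the power supply.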

Any help would be greatly appreciated.

dov.grobgeld · March 12, 2018, 9:00pm

I experienced something similar with the 390.25 and the 390.42 drivers when running a neural network training Python script. In the middle of the run, at more or less the same point in the script, the system does a complete reset and reboots, just as if someone had pushed the hardware reset button. :-( Just like you, I have not found anything in dmesg, kern.log, etc.

My system is an i7-6700K, 16GB Ram, and a GTX1070 graphics card, running Fedora 27:

Linux groovy 4.15.6-300.fc27.x86_64 #1 SMP Mon Feb 26 18:43:03 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
