Hard crash using CUDA on GTX 1080 Ti on Ubuntu 16.04

After running CUDA-based code for training neural networks for 5-30 minutes, I get a hard crash that causes the system to reboot. I initially thought it was heat, but managed to reproduce the crash in seconds by running the nbody simulation where n=130k.

Setup:

  • Case/PSU/Motherboard: 875W Alienware Aurora R4 http://www.dell.com/us/dfh/p/alienware-aurora-r4/pd
  • GPU: I tested both the EVGA GTX 1080 Ti and NVIDIA GTX 1080 Ti
  • CPU: Intel i7-4930K @ 3.40GHz (lscpu reports it is running around 1250MHz)
  • RAM: 16GB DDR3
  • OS: Ubuntu 16.04
  • NVIDIA Driver: 375.66
  • CUDA: 8.0.61
  • cuDNN: 5.1.10

During the longer tests, CPU load average varies between 0.7 and 0.99. GPU power usage is normal: it fluctuates between 100W and 280W, mostly staying below the 250W cap. RAM usage is low, about 3GB. VRAM usage varies between 4GB and 10GB depending on the code I’m running, but this doesn’t seem to affect the crash. Neither RAM nor VRAM appears to be leaking.
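For anyone trying to capture the readings above right up to the moment of the crash, here is a minimal sketch of a sampler that appends each reading to a file and syncs it to disk, so a hard reset loses at most the last sample. The function name `log_samples` and the log path are made up for illustration; the `nvidia-smi` query fields are the ones listed by `nvidia-smi --help-query-gpu`.

```shell
# log_samples FILE COUNT INTERVAL CMD [ARGS...]
# Run CMD COUNT times, INTERVAL seconds apart, appending its output to FILE.
# sync after every sample so buffered data is not lost on a hard reset.
log_samples() {
    ls_file=$1
    ls_count=$2
    ls_interval=$3
    shift 3
    ls_i=0
    while [ "$ls_i" -lt "$ls_count" ]; do
        "$@" >> "$ls_file" 2>&1
        sync
        sleep "$ls_interval"
        ls_i=$((ls_i + 1))
    done
}

# Usage (assumes nvidia-smi is on PATH; the log path is hypothetical):
#   log_samples "$HOME/gpu-telemetry.csv" 3600 1 nvidia-smi \
#       --query-gpu=timestamp,power.draw,temperature.gpu,clocks.sm,memory.used \
#       --format=csv,noheader
```

Run it in a second terminal while the CUDA job is running, then read the tail of the file after the reboot.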

The machine can run indefinitely with the GPU rendering graphics to a screen; it only crashes when I run CUDA applications.

A bug report log is here:

https://www.dropbox.com/s/3hoy1u2xvgxz62g/nvidia-bug-report.log.gz?dl=0

But I don’t think it contains anything relevant, because it was captured after the reboot.
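Since that report only covers the boot after the crash, one thing that may be worth checking is the kernel log that Ubuntu keeps across reboots. A rough sketch, assuming the default rsyslog setup that writes kernel messages to /var/log/kern.log (the helper name `find_gpu_errors` is made up): the NVIDIA kernel module prefixes its messages with "NVRM:", including any "Xid" error reports. After a hard reset the final messages may never have reached the disk, so an empty result proves little.

```shell
# find_gpu_errors LOGFILE...
# Print log lines written by the NVIDIA kernel module (prefix "NVRM:").
# Returns non-zero when no such line is found, like grep itself.
find_gpu_errors() {
    grep -h 'NVRM:' "$@"
}

# Usage (current and previous rotated kernel log on Ubuntu):
#   find_gpu_errors /var/log/kern.log /var/log/kern.log.1
```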

The next options I am considering:

  • Try another GPU.
  • Try a different version of Ubuntu, the NVIDIA drivers, CUDA, or cuDNN.

Any advice would be very much appreciated!

Maybe you could try to upgrade the Driver to version 381.22.

Thanks williamzhong. I updated to 381.22:

sudo service lightdm stop
sudo apt purge 'nvidia-*'
sudo reboot now
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/381.22/NVIDIA-Linux-x86_64-381.22.run
chmod +x NVIDIA-Linux-x86_64-381.22.run
sudo ./NVIDIA-Linux-x86_64-381.22.run
sudo reboot now

Same problem. Runs for <5 seconds then reboots the machine.

I just tested with an NVIDIA GTX 980 using the 375.66 drivers and it works correctly (there is no crash). This suggests the problem is specific to the GTX 1080 Ti.

So I installed Ubuntu 14.04 hoping that this bug was specific to the new OS, but it’s not. Here’s my install process from a clean Ubuntu 14.04:

# from tty
sudo apt update
sudo apt upgrade -y 
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/381.22/NVIDIA-Linux-x86_64-381.22.run
chmod +x NVIDIA-Linux-x86_64-381.22.run
sudo service lightdm stop
sudo ./NVIDIA-Linux-x86_64-381.22.run
sudo reboot now
 
# in Terminal app
wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb -O "cuda.deb"
sudo dpkg -i cuda.deb
sudo apt update
sudo apt install -y cuda
sudo modprobe nvidia
sudo reboot now

# in Terminal app
cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
make
cd ../../bin/x86_64/linux/release
./nbody -numbodies=131072

Still the same crash.

I noticed that installing CUDA downgraded the NVIDIA drivers. So I reinstalled the latest NVIDIA drivers:

# in tty
sudo apt purge nvidia-*
sudo ./NVIDIA-Linux-x86_64-381.22.run
sudo reboot now

# in Terminal app
cd ~/NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
make clean
make
cd ../../bin/x86_64/linux/release
./nbody -numbodies=131072

Same crash.
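Because installing the CUDA package can silently replace the driver, as noted above, it may help to confirm which version is actually loaded before each test. A small sketch, assuming the loaded kernel module reports itself in /proc/driver/nvidia/version (the helper name `driver_version` is made up):

```shell
# driver_version
# Read text on stdin and print the first dotted version number in it,
# e.g. the module version from /proc/driver/nvidia/version.
driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Usage:
#   driver_version < /proc/driver/nvidia/version   # version of the loaded module
#   dpkg -l | grep -i nvidia                       # driver packages left by apt
```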

So I’ve now tested every configuration I have available: 375.66 or 381.22, Ubuntu 16.04 or 14.04, EVGA or NVIDIA GTX 1080 Ti. I don’t really have any other option except to use a different GPU.

For what it’s worth, I’ve tested on a 1070 and it works fine.

I experience similar issues whenever I run a CUDA-intensive model for more than 5 minutes on a similar build. Any suggestions on how best to troubleshoot would be much appreciated.

Why is NVIDIA NOT fixing these issues with 1080Ti?

For systems rebooting under heavy GPU load the first thing I would check is whether there is sufficient power supply.

(1) Total sum of nominal power ratings of all system components should be <= 60% of PSU’s rated power
(2) Do not use Y-splitters or 6-pin to 8-pin converters in the PCIe power cables
(3) Make sure all power cables are plugged in properly (tab on connector should engage)
(4) Make sure the GPU is firmly seated in the PCIe socket and secured at the bracket (screw, latch, etc.).

I would recommend use of 80 PLUS Platinum compliant PSUs, or at least 80 PLUS Gold if that’s not feasible.
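Rule (1) can be worked through as simple arithmetic. The sketch below uses the 875W PSU rating and the 250W GPU cap quoted earlier in the thread and Intel's 130W TDP for the i7-4930K; the 100W figure for everything else (drives, fans, RAM, motherboard) is only a placeholder assumption, and the function names are made up.

```shell
# power_budget PSU_WATTS: print 60% of the PSU rating in whole watts.
power_budget() {
    echo $(( $1 * 60 / 100 ))
}

# total_draw WATTS...: print the sum of the nominal component ratings.
total_draw() {
    td_total=0
    for td_w in "$@"; do
        td_total=$((td_total + td_w))
    done
    echo "$td_total"
}

psu=875    # PSU rating from the original post
gpu=250    # GTX 1080 Ti power cap quoted in the thread
cpu=130    # Intel's TDP for the i7-4930K
other=100  # ASSUMPTION: placeholder for drives, fans, RAM, motherboard

budget=$(power_budget "$psu")
draw=$(total_draw "$gpu" "$cpu" "$other")
if [ "$draw" -le "$budget" ]; then
    echo "within budget: ${draw}W <= ${budget}W"
else
    echo "over budget: ${draw}W > ${budget}W"
fi
```

With these numbers the budget is 525W and the estimated draw is 480W, so the check passes; replace the placeholder with the real ratings of the remaining components before drawing any conclusion.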

I just wanted to reply to njuffa briefly: whatever the problem was, it wasn’t related to PSU efficiency, wattage, GPU seating, cable seating, or pin converters. I never found the exact problem and instead had to use the GPU in another machine.

Too bad that you weren’t able to pinpoint the issue exactly. Your description seemed most consistent with an insufficient power supply scenario. It did not seem consistent at all with a software issue.

Hello guys,
I’d like to ask for your help.
How the hell did you get these beasts running under your control? I tried to tame them with the driver, but I can’t get them to run; they only work on X.Org X.

Would you help me run my GTX 1070 Windforce on any kind of Linux?

Thank you for any suggestions.