ASUSTek GTX 1080 Ti on Ubuntu 16.04 with X.Org Server version 11.0 crashes at random times even when idle

I am experiencing random GPU crashes with an ASUSTek GTX 1080 Ti on Ubuntu 16.04 with X.Org 11.0. There does not appear to be any trigger for these crashes; the computer can be sitting completely idle and still crash. The error in the syslog is “NVRM: Xid (PCI:0000:81:00): 79, GPU has fallen off the bus.” I have two reports generated by nvidia-bug-report.sh immediately after a crash. I had to ssh in from another computer to collect them, since the crashed computer is completely unresponsive, with black screens, and must be powered off to recover. I have two of these GPU cards and have tried swapping them, but the crash still occurs, so I suspect a driver or configuration problem, or possibly a motherboard problem.
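For anyone trying to check whether they are seeing the same error, here is a minimal Python sketch that scans log lines for NVRM Xid messages and counts them per PCI address and Xid code. The function names (parse_xid, summarize) are my own, and the pattern is based only on the single message format quoted above, so other Xid message layouts may need a looser pattern.

```python
import re
from collections import Counter

# Pattern for NVIDIA kernel driver Xid messages as quoted in this post, e.g.
#   NVRM: Xid (PCI:0000:81:00): 79, GPU has fallen off the bus.
# Assumption: other Xid messages follow the same "(PCI:<addr>): <code>, <text>" layout.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+), (.*)")


def parse_xid(line):
    """Return (pci_address, xid_code, message), or None if the line has no Xid."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3).strip()


def summarize(lines):
    """Count Xid events per (pci_address, xid_code) pair."""
    counts = Counter()
    for line in lines:
        event = parse_xid(line)
        if event is not None:
            counts[event[:2]] += 1
    return counts
```

To use it, pass an open log file to summarize(), for example open("/var/log/syslog", errors="replace"). That path is the usual Ubuntu location and is an assumption here; reading it may require root or membership in the adm group.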

I am trying to use TensorFlow 1.6 and have followed the installation instructions here:

https://www.tensorflow.org/install/install_linux

and so am using CUDA 9.0 with cuDNN 7.0 which I obtained from:

https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=debnetwork

and

https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.0_20171129/cudnn-9.0-linux-x64-v7

Here is some system info:

CUDA = cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
cuDNN = cudnn-9.0-linux-x64-v7.tgz

nvidia-settings info:

System Information
Operating System: Linux-x86_64
NVIDIA Driver Version: 390.30

X Server Information
Display Name: teeny:0
Server Version Number: 11.0
Server Vendor String: The X.Org Foundation
Server Vendor Version: 1.18.4 (11804000)
NV-CONTROL Version: 1.29
Screens: 1

2 Monitors:
Hewlett Packard 23" @ 1920x1080 (16:9)
Hewlett Packard 22" @ 1680x1050 (16:10)

uname -a:
Linux teeny 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I have attached 2 bug reports to this thread.

nvidia-bug-report.log.1.gz (274 KB)
nvidia-bug-report.log.2.gz (267 KB)

I think we’ve found the problem. Apparently the EVGA power supply in this machine has an “eco” mode that I was unaware of (I didn’t build this PC; it was built by our corporate IT team), which can be used for “low power” silent running. I believe this switch got inadvertently toggled on, probably when the box was moved (there is no label on the switch; it’s just an on/off switch on the back of the machine). The result was the random crashes described above. Turning the switch back off restored the stability of the machine, and it has now run for over 24 hours without crashing under GPU stress programs, machine learning, and games. I am tempted to add a “Random Failure Mode” label above the eco button. After learning of eco mode, I now see that others have had similar experiences with it:

http://www.tomshardware.com/answers/id-3046709/power-supply-eco-mode.html

I am surprised that this resulted in the “NVRM: Xid (PCI:0000:81:00): 79, GPU has fallen off the bus.” error. I would have expected some sort of power shutdown message, but maybe the ASUS motherboard just disables a PCIe slot if it pulls more power than is available, or something along those lines. Anyway, hopefully this experience and its resolution will help somebody else in the future.
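If you suspect a power problem like this one, it may help to log what the card reports drawing while you stress it. Below is a rough Python sketch that polls nvidia-smi for power draw, power limit, and temperature. It is not something I ran while diagnosing this machine; the helper names (parse_row, sample, log_forever) are made up for illustration, and I am assuming nvidia-smi exits with an error once the GPU is lost, which the loop simply reports.

```python
import subprocess
import time

# Fields requested from nvidia-smi; see `nvidia-smi --help-query-gpu` for the full list.
QUERY_FIELDS = ["index", "power.draw", "power.limit", "temperature.gpu"]


def parse_row(row):
    """Parse one CSV row of nvidia-smi output into a dict.

    Values that are not numeric (for example "[Not Supported]") become None.
    """
    values = [v.strip() for v in row.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError("unexpected nvidia-smi row: {!r}".format(row))
    parsed = {"index": values[0]}
    for name, value in zip(QUERY_FIELDS[1:], values[1:]):
        try:
            parsed[name] = float(value)
        except ValueError:
            parsed[name] = None
    return parsed


def sample():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return [parse_row(r) for r in result.stdout.splitlines() if r.strip()]


def log_forever(interval_s=5.0):
    """Print a timestamped sample every interval_s seconds until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            for gpu in sample():
                print(stamp, gpu, flush=True)
        except subprocess.CalledProcessError as err:
            # Assumption: a lost GPU makes nvidia-smi exit non-zero.
            print(stamp, "nvidia-smi failed:", err, flush=True)
        time.sleep(interval_s)
```

Calling log_forever() in an ssh session from another machine and redirecting the output to a file would leave a record of the last readings before a crash, since the crashed machine's own display is unusable.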