Hello,
I am using a GTX 1080 Ti on Ubuntu 16.04.3.
The GPU driver/subsystem sometimes crashes on its own (i.e. when I just leave the machine running), and sometimes while I am running TensorFlow code.
When this happens, all the GPU card fans start running at high speed, and nvidia-smi and the kernel log report something like this:
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU
...
[ 211.957262] NVRM: GPU at PCI:0000:01:00: GPU-f0a4ec3b-aa15-2398-6fe2-ea529751b19d
[ 211.957272] NVRM: GPU Board Serial Number:
[ 211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 211.957282] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 211.957285] NVRM: GPU is on Board .
[ 211.957298] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
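
For anyone triaging similar logs, here is a minimal sketch of how the Xid lines can be picked out of kernel log text. It is not an NVIDIA tool; the function name find_xid_events and the regular expression are my own assumptions, based only on the line format shown in the log above.

```python
import re

# Assumed line format, taken from the log excerpt above, e.g.:
# [ 211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<msg>.*)"
)


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, message) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("msg").strip())
            )
    return events


# Two lines copied from the log above; only the first is an Xid line.
sample_log = (
    "[ 211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.\n"
    "[ 211.957282] NVRM: GPU at 0000:01:00.0 has fallen off the bus.\n"
)

for pci, code, msg in find_xid_events(sample_log):
    print(f"{pci}: Xid {code}: {msg}")
```

The same function could be fed the full dmesg output saved to a file, which makes it easier to see whether Xid 79 is the only error code or whether other Xid codes appear before the GPU is lost.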
System: Ubuntu 16.04.3 LTS, x86_64 (4.4.0-87-generic)
Hardware: GTX 1080 Ti
Driver: 384.98
Please see the attached nvidia-bug-report.log.gz.
Thank you in advance for your help.
nvidia-bug-report.log.gz (275 KB)