GeForce GTX 1060 reliably falls off the bus

Hi folks,

I am running Ubuntu 18.04 with a GeForce GTX 1060 6GB, tensorflow-gpu 2.1, CUDA 10.1, and NVIDIA driver 440.82. Every time I start training, the following messages appear in the kernel log:

May 18 12:11:10 gelato kernel: [  287.660371] NVRM: GPU at PCI:0000:01:00: GPU-12298a78-5be3-f166-bdf4-1b9571596c31
May 18 12:11:10 gelato kernel: [  287.660373] NVRM: GPU Board Serial Number:
May 18 12:11:10 gelato kernel: [  287.660376] NVRM: Xid (PCI:0000:01:00): 79, pid=1831, GPU has fallen off the bus.
May 18 12:11:10 gelato kernel: [  287.660377] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
May 18 12:11:10 gelato kernel: [  287.660378] NVRM: GPU 0000:01:00.0: GPU is on Board .

This happens reliably every time, within a few minutes of starting a run. The GPU temperature does not go above 55 °C, and the power draw does not exceed 84 W (of 120 W).

I have also attached the log file: nvidia-bug-report.log (2.6 MB)

Thanks!!

Please check and reseat the power connectors, and check or replace the PSU. Looking for a BIOS update might help as well.
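
To see whether the problem comes back after a hardware change, it can help to pull the Xid events out of the kernel log. Below is a minimal sketch in Python that assumes the log lines look like the ones posted above; the names XidEvent and find_xid_events are made up for illustration and are not part of any NVIDIA tool.

```python
import re
from typing import Iterable, List, NamedTuple


class XidEvent(NamedTuple):
    """One NVRM Xid report found in a kernel log line (hypothetical helper type)."""
    pci_address: str
    code: int
    message: str


# Assumed line shape, taken from the log in this thread:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1831, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def find_xid_events(lines: Iterable[str]) -> List[XidEvent]:
    """Return every Xid event found in the given log lines, in order."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                XidEvent(
                    pci_address=match.group(1),
                    code=int(match.group(2)),
                    message=match.group(3).strip(),
                )
            )
    return events


def count_bus_drops(lines: Iterable[str]) -> int:
    """Count Xid 79 events, the code reported in this thread for a GPU falling off the bus."""
    return sum(1 for event in find_xid_events(lines) if event.code == 79)
```

One way to use it would be to feed it the output of `dmesg` or `journalctl -k` (for example by passing `sys.stdin` to `find_xid_events`) before and after reseating the connectors or swapping the PSU, and compare how many Xid 79 events show up during a training run.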