GPU is lost randomly and nvidia-smi asks for a reboot to recover it


We have an Ubuntu machine with two RTX 2080 GPUs, and for some reason the system stops detecting one of them.

root@SimulatorProc:~# nvidia-smi
Unable to determine the device handle for GPU 0000:17:00.0: GPU is lost. Reboot the system to recover this GPU

After a reboot it detects both again, but some time later one of them disappears…

Is it a HW-related issue? How can we verify that?
nvidia-bug-report.log.gz (1.41 MB)
nvidia-bug-report.log (3.17 MB)

Today, after 2 days of the computer doing nothing, nvidia-smi reports no error, but one of the 2 GPUs is missing.

Thu Dec 19 11:58:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:65:00.0  On |                  N/A |
| 28%   28C    P8    21W / 250W |    175MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1303      G   /usr/lib/xorg/Xorg                            74MiB |
|    0      1493      G   /usr/bin/gnome-shell                          99MiB |
+-----------------------------------------------------------------------------+

The second one is missing

[ 1117.291270] NVRM: Xid (PCI:0000:17:00): 79, GPU has fallen off the bus.
[ 1117.291275] NVRM: GPU at 00000000:17:00.0 has fallen off the bus.

Most often caused by insufficient/broken power supply or overheating.
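To check which cards are affected and how often, the Xid events can be pulled straight out of the kernel log. A minimal sketch; the sample line (copied from the log above) stands in for real `dmesg` output, which the affected machine would provide:

```shell
# Extract NVRM Xid events (code 79 = "GPU has fallen off the bus")
# together with the PCI bus ID of the failing card.
sample='[ 1117.291270] NVRM: Xid (PCI:0000:17:00): 79, GPU has fallen off the bus.'
printf '%s\n' "$sample" | grep -o 'Xid (PCI:[^)]*): [0-9]*'
# On the real system, run:  dmesg | grep -o 'Xid (PCI:[^)]*): [0-9]*'
```

If the same bus ID keeps showing up with Xid 79 across reboots, that points at that slot/card rather than the driver.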

I have a Quadro RTX 8000 that does this. When pushed (training DL models) it suddenly “gets lost” and I need to reboot to recover it.
Reading this thread, and seeing the same Xid 79, I tried it in another computer (with a powerful PSU) and I am getting the same error.
I also tried it in an eGPU enclosure (650 W); same issue.
The fan is working and the temperatures are fine, but as soon as I try to use more memory (more than 8 GB) it crashes. Maybe the memory is getting faulty?
I made a Twitter thread about this with more info:
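Since the crash seems tied to crossing the 8 GB mark, one way to narrow it down is to allocate VRAM in fixed steps and note where the card drops off the bus. A rough sketch, assuming PyTorch is available (this helper is hypothetical, not from the thread, and it degrades to a no-op without a CUDA device):

```python
# Hypothetical VRAM stress sketch: allocate 1 GiB chunks until failure,
# to see whether the card consistently dies past a fixed amount of memory.
try:
    import torch
except ImportError:  # PyTorch not installed
    torch = None

def stress_alloc(max_gib=10):
    """Allocate 1 GiB float32 tensors one at a time; return how many succeeded."""
    if torch is None or not torch.cuda.is_available():
        print("no CUDA device available, skipping")
        return 0
    chunks = []
    for i in range(max_gib):
        try:
            # 1 GiB of float32 = 256 * 1024 * 1024 elements * 4 bytes
            chunks.append(torch.empty(256 * 1024 * 1024, device="cuda"))
            print(f"allocated {i + 1} GiB")
        except RuntimeError as err:  # OOM, or the GPU falling off the bus
            print(f"failed after {i} GiB: {err}")
            break
    return len(chunks)
```

If the failure point is always the same (e.g. always just past 8 GiB), that is more consistent with a faulty memory region than with power or thermal issues; a dedicated tool such as a GPU memory tester would confirm it.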