GPU is lost randomly and nvidia-smi asks for a reboot to recover it

Hi,

We have an Ubuntu machine with two RTX 2080 GPUs, and for some reason the system does not detect one of them.

root@SimulatorProc:~# nvidia-smi
Unable to determine the device handle for GPU 0000:17:00.0: GPU is lost. Reboot the system to recover this GPU

After a reboot it seems to detect both, but some time later one of them disappears…

Is it a hardware-related issue? How can we verify that?
nvidia-bug-report.log.gz (1.41 MB)
nvidia-bug-report.log (3.17 MB)

Today, after two days of the computer sitting idle, nvidia-smi does not report any error, but one of the two GPUs is missing.

Thu Dec 19 11:58:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:65:00.0  On |                  N/A |
| 28%   28C    P8    21W / 250W |    175MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1303      G   /usr/lib/xorg/Xorg                            74MiB |
|    0      1493      G   /usr/bin/gnome-shell                          99MiB |
+-----------------------------------------------------------------------------+

The second GPU is missing.

[ 1117.291270] NVRM: Xid (PCI:0000:17:00): 79, GPU has fallen off the bus.
[ 1117.291275] NVRM: GPU at 00000000:17:00.0 has fallen off the bus.

Xid 79 ("GPU has fallen off the bus") is most often caused by an insufficient or failing power supply, or by overheating.
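To catch when this happens (rather than noticing days later), you can watch the kernel log for NVRM Xid events like the two lines above. Below is a minimal sketch in Python; the `SAMPLE_LOG` text, the `find_xid_events` helper, and the regex are illustrative assumptions, not part of any NVIDIA tool. In practice you would feed it real output from `dmesg` or `journalctl -k`.

```python
import re

# Hypothetical sample of kernel log text containing an NVRM Xid event,
# copied from the dmesg excerpt in this thread.
SAMPLE_LOG = """\
[ 1117.291270] NVRM: Xid (PCI:0000:17:00): 79, GPU has fallen off the bus.
[ 1117.291275] NVRM: GPU at 00000000:17:00.0 has fallen off the bus.
"""

# Matches lines like: NVRM: Xid (PCI:0000:17:00): 79, <message>
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")

def find_xid_events(log_text):
    """Return a (pci_id, xid_code, message) tuple for each Xid line found."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((m.group(1), int(m.group(2)), m.group(3)))
    return events

if __name__ == "__main__":
    for pci, xid, msg in find_xid_events(SAMPLE_LOG):
        print(f"{pci}: Xid {xid} ({msg})")
```

Running something like this periodically (or grepping `dmesg` for `NVRM: Xid`) gives you a timestamped record of exactly when the GPU drops, which helps correlate the failure with load, temperature, or power events.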