Hi,
We have an Ubuntu machine with two 2080 GPUs, and for some reason the system does not detect one of them.
root@SimulatorProc:~# nvidia-smi
Unable to determine the device handle for GPU 0000:17:00.0: GPU is lost. Reboot the system to recover this GPU
After a reboot it detects both GPUs, but some time later one of them disappears again.
Is this a hardware-related issue? How can we confirm it?
nvidia-bug-report.log.gz (1.41 MB)
nvidia-bug-report.log (3.17 MB)
Today, after two days of leaving the computer idle, nvidia-smi no longer prints an error, but one of the two GPUs is missing from its output.
Thu Dec 19 11:58:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208…    On   | 00000000:65:00.0  On |                  N/A |
| 28%   28C    P8    21W / 250W |    175MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1303      G   /usr/lib/xorg/Xorg                            74MiB |
|    0      1493      G   /usr/bin/gnome-shell                          99MiB |
+-----------------------------------------------------------------------------+
The second GPU is missing.
generix
December 19, 2019, 7:29pm
[ 1117.291270] NVRM: Xid (PCI:0000:17:00): 79, GPU has fallen off the bus.
[ 1117.291275] NVRM: GPU at 00000000:17:00.0 has fallen off the bus.
Xid 79 is most often caused by an insufficient or failing power supply, or by overheating.
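To see how often this happens and which card is affected, it can help to pull the Xid lines out of the kernel log. Below is a minimal sketch in Python that parses lines in the format quoted above. The function name `parse_xid_events` and the regular expression are my own, and the pattern assumes the bracketed-timestamp layout that `dmesg` printed here; other log sources may format the line differently.

```python
import re

# Matches kernel-log lines such as:
# [ 1117.291270] NVRM: Xid (PCI:0000:17:00): 79, GPU has fallen off the bus.
# Assumption: the line starts with a dmesg-style "[seconds.micros]" timestamp.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<msg>.*)"
)


def parse_xid_events(log_text):
    """Return one dict per NVRM Xid line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "uptime_s": float(match.group("ts")),
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "message": match.group("msg").strip(),
            })
    return events


# The two kernel-log lines quoted in this thread.
SAMPLE = """\
[ 1117.291270] NVRM: Xid (PCI:0000:17:00): 79, GPU has fallen off the bus.
[ 1117.291275] NVRM: GPU at 00000000:17:00.0 has fallen off the bus.
"""

for event in parse_xid_events(SAMPLE):
    print(event)
```

Only the first sample line is an Xid report, so a single event is extracted. Feeding the script the full `dmesg` output after each failure shows whether it is always the same PCI address and the same Xid code.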
I have a Quadro RTX 8000 that does this. When pushed (training DL models) it suddenly "gets lost" and I need to reboot to recover it.
After reading this, and seeing the same Xid 79, I tried the card in another computer (with a powerful PSU) and I get the same error.
I also tried it in an eGPU enclosure (650W), with the same result.
The fan is working and the temperatures are OK, but as soon as I try to use more than 8GB of memory it crashes. Could the memory be going bad?
I made a Twitter thread about this with more info: https://twitter.com/thecapeador/status/1443599230321532931
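One way to narrow down whether temperature, power draw, or memory use precedes the drop is to record those values at short intervals while the load ramps up, and then look at the last readings before the GPU is lost. The following is a rough Python sketch, not a definitive tool: it relies on `nvidia-smi --query-gpu` with CSV output, and the helper names `parse_gpu_csv` and `query_gpus` are hypothetical. Values are kept as strings because some cards report fields such as power draw as unsupported.

```python
import subprocess

# Fields requested from nvidia-smi; memory.used is reported in MiB,
# temperature.gpu in degrees C, and power.draw in watts.
FIELDS = ["index", "pci.bus_id", "temperature.gpu", "power.draw", "memory.used"]


def parse_gpu_csv(text):
    """Turn 'csv,noheader,nounits' output into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        # Skip anything that is not a data row, e.g. a "GPU is lost" error.
        if len(values) != len(FIELDS):
            continue
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed readings (may be empty)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,  # nvidia-smi can exit non-zero once a GPU is lost
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` in a loop with a short `time.sleep` and appending each result to a file gives a trace to inspect after the crash. If the card disappears at normal temperature and moderate power draw but always near the same memory usage, that points away from cooling and the power supply.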