I have been redirected here by NVIDIA Customer Care (incident 190227-000272).
We have a GeForce RTX 2080 Ti (s/n 0324118055234), bought from NVIDIA, installed in a machine running Ubuntu 18.04.
We intend to use the board for deep learning computations (NVIDIA driver 415, CUDA 10).
At times the board is properly reported by the nvidia-smi command. We launch our computations and
the board starts working, but after a while (say 30 min) the card stops (crashes?) and is no longer detected by nvidia-smi.
In an attempt to troubleshoot the problem, we installed the board in another Linux machine (also Ubuntu 18.04), but the behavior was exactly the same (detected at first, then lost after a while).
This may also occur right after a reboot of the machine (but I cannot say this for sure).
The power supplies of the two machines are rated at 800 W and 1200 W. No other boards were connected at the same time.
We wonder whether the card is defective.
Please find below the output of some commands:
$ lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GV102 (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
02:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
02:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
$ nvidia-smi
No devices were found   # after the board has stopped working; before that it is detected OK
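To pin down when the card drops out, the failure could be logged with a small polling script. The following is only a minimal sketch: the function names `gpu_state` and `watch_gpu` and the log file name `gpu-watch.log` are hypothetical, and it assumes that a lost card shows up either as the "No devices were found" message quoted above or as a non-zero exit status from nvidia-smi. It also assumes the kernel log is readable by the current user.

```bash
#!/usr/bin/env bash

# gpu_state STATUS OUTPUT
# Hypothetical helper: prints "present" or "missing" given the exit status
# and the captured output of an nvidia-smi call.
gpu_state() {
    local status="$1" output="$2"
    case "$output" in
        *"No devices were found"*) echo "missing"; return 0 ;;
    esac
    if [ "$status" -ne 0 ]; then
        echo "missing"
    else
        echo "present"
    fi
}

# watch_gpu [INTERVAL_SECONDS] [LOGFILE]
# Hypothetical helper: appends one CSV line per poll (timestamp, temperature,
# power draw, utilization) to LOGFILE and stops with status 1 as soon as the
# GPU is no longer reported.
watch_gpu() {
    local interval="${1:-10}" logfile="${2:-gpu-watch.log}"
    local output status
    while true; do
        output="$(nvidia-smi \
            --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu \
            --format=csv,noheader 2>&1)"
        status=$?
        if [ "$(gpu_state "$status" "$output")" = "missing" ]; then
            {
                echo "$(date '+%F %T') GPU no longer detected (exit status $status): $output"
                # Recent kernel messages from the NVIDIA driver (NVRM / Xid lines), if readable.
                dmesg 2>/dev/null | grep -i -E 'NVRM|Xid' | tail -n 20
            } >> "$logfile"
            return 1
        fi
        echo "$output" >> "$logfile"
        sleep "$interval"
    done
}
```

Saved as, say, gpu-watch.sh, it could be loaded with `source gpu-watch.sh` and started with `watch_gpu 10 gpu-watch.log &` just before launching the computation. The last lines of the log would then show the temperature and power draw shortly before the card disappeared, together with any driver messages from the kernel log.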
I also tried to attach a debug report (nvidia-bug-report.log.gz), but I am not sure whether the upload succeeded.
nvidia-bug-report.log.gz (569 KB)