I have a CentOS system with four RTX 2080s installed, used primarily for scientific computing. One of the GPUs is sporadically lost: when it happens, running nvidia-smi reports "Unable to determine the device handle for GPU 0000:19:00.0: GPU is lost." I have tried reseating this card, moving it to different PCI slots, and using different power cables, but it’s always this particular card that fails, regardless of slot or cable. Is there something I can do, or is this card faulty? I have run nvidia-bug-report.sh and attached the results here.
nvidia-bug-report.log.gz (1.97 MB)
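
In case it helps pin down when the card drops out, here is a rough sketch of a check that could be run periodically (for example from cron). It assumes nvidia-smi is on the PATH and that the error text matches the message quoted above; the function names find_lost_gpus and check_once are my own, not part of any NVIDIA tool.

```python
import re
import subprocess

# Matches the "GPU is lost" message quoted above and captures the PCI bus ID.
# Assumption: the wording is the same across driver versions on this machine.
LOST_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f:.]+?):\s*GPU is lost"
)


def find_lost_gpus(text):
    """Return the PCI bus IDs of GPUs reported as lost in nvidia-smi output."""
    return LOST_RE.findall(text)


def check_once():
    """Run nvidia-smi once and return the bus IDs of any lost GPUs."""
    proc = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # The message may appear on stdout or stderr, so scan both.
    return find_lost_gpus(proc.stdout + proc.stderr)
```

Calling check_once() from a loop or a cron job and logging a timestamp whenever it returns a non-empty list would show whether the failures line up with load, temperature, or uptime.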