Keep losing RTX 2080 GPU.

I have a CentOS system with four RTX 2080s installed, used primarily for scientific computing. One of the GPUs is sporadically lost: when that happens, nvidia-smi reports "Unable to determine the device handle for GPU 0000:19:00.0: GPU is lost." I have tried reseating this card, moving it to different PCIe slots, and using different power cables, but it’s always this particular card that fails. Is there something I can do, or is this card faulty? I have run nvidia-bug-report.sh and attached the results here.

Thanks!

nvidia-bug-report.log.gz (1.97 MB)

First of all, you’ll have to set up nvidia-persistenced to start on boot and keep it running continuously. Otherwise the GPUs will constantly be initialized and deinitialized, which leads to all kinds of faulty behaviour.
This alone might already be the reason the GPU is falling off the bus, since it is a touchy model. Another possible cause is overheating, so you should monitor temperatures; maybe that specific card has fan issues. If nothing comes of that, you’ll be left with testing the card in another system, using gpu-burn to check it.
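
For the temperature monitoring, a rough sketch like the one below could be left running in a second terminal. It assumes Python 3.7+ and that nvidia-smi is on the PATH; the query fields are the ones listed by `nvidia-smi --help-query-gpu`, while the function names, log file name, and polling interval are made up for this example. Besides temperature and fan speed, it records the performance state and PCIe link generation, in case the drop correlates with something other than heat. Any line that does not parse as a data row (for example a "GPU is lost" message) is logged verbatim.

```python
import subprocess
import time

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = [
    "timestamp",
    "pci.bus_id",
    "temperature.gpu",
    "fan.speed",
    "pstate",
    "pcie.link.gen.current",
]


def split_output(text):
    """Split nvidia-smi CSV output into (data rows, other lines).

    A data row has exactly one value per entry in FIELDS; everything else,
    such as an error message, is returned unchanged in the second list.
    """
    rows, other = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == len(FIELDS):
            rows.append(dict(zip(FIELDS, parts)))
        else:
            other.append(line)
    return rows, other


def sample():
    """Run nvidia-smi once and return (data rows, other lines)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # Errors may show up on either stream, so both are inspected.
    return split_output(proc.stdout + "\n" + proc.stderr)


def monitor(logfile="gpu_monitor.log", interval_s=5):
    """Append one sample per GPU to logfile every interval_s seconds (hypothetical defaults)."""
    with open(logfile, "a") as log:
        while True:
            rows, other = sample()
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for row in rows:
                log.write(stamp + " " + " ".join(f"{k}={v}" for k, v in row.items()) + "\n")
            for line in other:
                log.write(stamp + " MESSAGE " + line + "\n")
            log.flush()  # keep the log usable even if the machine hangs
            time.sleep(interval_s)
```

Calling `monitor()` starts the loop; stop it with Ctrl+C. The last data rows logged for the failing card before the first MESSAGE line show what state it was in when it dropped.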

Thank you for these suggestions. I have enabled nvidia-persistenced, but that didn’t seem to do the trick, as the card still falls off the bus. I have attached another bug report file, in case that helps.

I have also monitored the temperature, and the crashing behavior does not seem to be related to card temperature. For instance, I ran a 1h30min job yesterday where the card temp peaked at 83C, but it didn’t fall off the bus until about 5min after the job was done and the card had cooled to 34C. This was when I generated the attached bug report.

A similar thing occurred when I ran a 2min gpu-burn job. The card was fine during the job and the temp peaked at 86C. But then the card fell off the bus about 1min after the gpu-burn job was done and the temp had fallen to 54C.

Is there anything else I can do? At this point, this card is basically unusable.

Thanks!
nvidia-bug-report.log.gz (2.05 MB)

Sounds like the GPU always crashes when it throttles back to idle clocks and reduces the PCIe link speed to its minimum. Since your BIOS is very old, still on the initial release, you should upgrade it and then re-test. If nothing comes of that, all you can do is check the card in another system and then RMA it.
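
To confirm which BIOS release is running before and after the upgrade, a small sketch along these lines may help. It assumes a Linux system that exposes DMI data under /sys/class/dmi/id (the usual location); the function name and the `dmi_dir` parameter are only there for illustration.

```python
from pathlib import Path


def read_bios_info(dmi_dir="/sys/class/dmi/id"):
    """Return BIOS vendor, version and date from the kernel's DMI sysfs files.

    A value is None when the corresponding file is missing or unreadable.
    """
    info = {}
    for name in ("bios_vendor", "bios_version", "bios_date"):
        try:
            info[name] = (Path(dmi_dir) / name).read_text().strip()
        except OSError:
            info[name] = None
    return info
```

Running `print(read_bios_info())` once before flashing and once after makes it easy to verify that the update actually took effect before re-testing the card.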