One GPU disappears

Hello,
I am seeing quite strange behaviour:
I have an openSUSE 13.2 server with an M60 and a P100.
The P100 sometimes disappears from nvidia-smi, but the card is still visible in lspci.
Once, reloading the nvidia driver helped; the other times I had to reboot.

In /var/log/messages I see no clue as to why this happened.

Has anyone seen this before and can give me an idea of how to troubleshoot it?

Maybe it is overheating.

Is the P100 installed in a server that was qualified by the OEM for P100 use? Did it come from the OEM with the P100 installed?

nvidia-smi can report temperature, and it can be run in a loop, so you can monitor the temperature right up to the point of failure.
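A minimal sketch of such a monitoring loop, in Python. It assumes nvidia-smi is on the PATH; the log file name gpu_temp.log, the 5-second interval, and the function names are arbitrary choices for illustration. The last temperature logged before the P100 stops appearing (or before the query starts failing) is the number of interest.

```python
import subprocess
import time

# Fields requested from nvidia-smi's CSV query interface.
QUERY = "timestamp,index,name,temperature.gpu"


def parse_line(line):
    """Split one CSV row from nvidia-smi into (timestamp, index, name, temp_c)."""
    timestamp, index, name, temp = [field.strip() for field in line.split(",")]
    return timestamp, int(index), name, int(temp)


def read_temperatures():
    """Run nvidia-smi once and return one parsed row per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_line(row) for row in out.splitlines() if row.strip()]


def main(interval_s=5, logfile="gpu_temp.log"):
    """Append temperatures to a log file until interrupted (hypothetical defaults)."""
    with open(logfile, "a") as log:
        while True:
            try:
                for timestamp, index, name, temp in read_temperatures():
                    log.write(f"{timestamp} gpu{index} {name} {temp} C\n")
            except (subprocess.CalledProcessError, FileNotFoundError, ValueError) as exc:
                # A failed or unparsable query is itself worth recording.
                stamp = time.strftime("%Y/%m/%d %H:%M:%S")
                log.write(f"{stamp} query failed: {exc}\n")
            log.flush()  # keep the log current in case the machine has to be rebooted
            time.sleep(interval_s)


if __name__ == "__main__":
    main()
```

If the P100 drops out while the M60 stays visible, the log should simply stop showing lines for the P100's index, which also gives a timestamp to compare against the system logs.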

Servers often have a BIOS with an event log that records issues on the PCI Express bus as well as ECC memory warnings, failures, and the like. Does this log indicate any trouble?

Is switching to a different PCIe slot an option?

Depending on your application’s throughput requirements, you could attempt to reduce PCIe bus speed from Gen3 to Gen2 or even Gen1 to see if that makes the problem go away.
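After changing the link speed (typically in the system BIOS setup, though that is an assumption about this particular machine), it helps to confirm what the GPUs actually negotiated. The following is a minimal Python sketch that reads the current and maximum PCIe generation per GPU from nvidia-smi; the function names are made up for illustration. Note that the current generation can read lower than the configured one while a GPU is idle, so it is best checked under load.

```python
import subprocess

# Query fields for the negotiated and maximum PCIe link generation.
FIELDS = "index,name,pcie.link.gen.current,pcie.link.gen.max"


def parse_link_row(line):
    """Turn one CSV row from nvidia-smi into a small dict."""
    index, name, current, maximum = [field.strip() for field in line.split(",")]
    return {
        "index": int(index),
        "name": name,
        "gen_current": int(current),
        "gen_max": int(maximum),
    }


def query_links():
    """Run nvidia-smi once and return link information for each visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_link_row(row) for row in out.splitlines() if row.strip()]


if __name__ == "__main__":
    for gpu in query_links():
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"PCIe Gen{gpu['gen_current']} (max Gen{gpu['gen_max']})")
```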

Hello, the server is a Dell M620, and running a P100 in it is supported.
But sure, temperature might be an issue.

I have also swapped the two GPUs now to see if the PCIe slot is the issue.

Thanks a lot for the hints.

Since this happens quite irregularly, I now have to wait for it to occur again.

Dell M620 is a blade server that Dell doesn’t offer for sale anymore. P100 was never a supported option in that server.

Oh, sorry.
I was quite confused there (probably because I had been working on M620s a lot that day).
It is a PowerEdge R730.