Hello,
I am seeing some quite strange behaviour:
I have an openSUSE 13.2 server with an M60 and a P100.
The P100 sometimes disappears from nvidia-smi, but the card is still visible in lspci.
Once, reloading the nvidia driver helped; the other times I had to reboot.
In /var/log/messages I see no clue why this happened.
Has anyone seen this before, and can you give me an idea of how to troubleshoot it?
Servers often have a BIOS event log that records issues on the PCI Express bus as well as ECC memory warnings, failures, and the like. Does this log indicate any trouble?
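Alongside the BIOS event log, it may be worth searching the kernel log for NVIDIA driver "Xid" messages, which the driver usually writes when it detects a GPU fault. Below is a minimal Python sketch, not a definitive tool. It assumes the messages look roughly like "NVRM: Xid (<PCI address>): <code>, <details>"; the exact wording can differ between driver versions, and the names find_xid_events and scan_log are made up for this example.

```python
import re

# Assumed message shape: "NVRM: Xid (<pci address>): <code>, <details>".
# The exact format may vary by driver version; adjust the pattern if needed.
XID_RE = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xid_events(lines):
    """Return (pci_address, xid_code, full_line) for every line with an Xid."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.rstrip("\n")))
    return events


def scan_log(path="/var/log/messages"):
    """Print every Xid event found in the given log file (hypothetical helper)."""
    with open(path, errors="replace") as log:
        for address, code, line in find_xid_events(log):
            print(f"{address}: Xid {code}: {line}")
```

Calling scan_log() as root reads /var/log/messages; pass another path for rotated logs. If nothing turns up, the driver may not have logged the event at all, so an empty result does not rule out a hardware or bus problem.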
Is switching to a different PCIe slot an option?
Depending on your application’s throughput requirements, you could attempt to reduce PCIe bus speed from Gen3 to Gen2 or even Gen1 to see if that makes the problem go away.
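Before and after changing the bus speed, the negotiated link can be read from the LnkCap (what the device supports) and LnkSta (what is currently negotiated) lines of `lspci -vv` output. Here is a small Python sketch that parses those two lines. It assumes the usual "Speed 8GT/s, Width x16" layout, that the text comes from a single device (for example `lspci -vv -s <bus address>` run as root, since the capability lines are hidden from unprivileged users), and the helper names parse_link and describe are invented for this example.

```python
import re

# Matches "LnkCap:" / "LnkSta:" lines but not "LnkCap2:" / "LnkSta2:".
LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")

# Per-lane transfer rates of the first three PCIe generations.
GEN_BY_SPEED = {2.5: "Gen1", 5.0: "Gen2", 8.0: "Gen3"}


def parse_link(lspci_text):
    """Return {"Cap": (speed, width), "Sta": (speed, width)} for one device.

    Only the first LnkCap and the first LnkSta line are used, so pass the
    output for a single device.
    """
    result = {}
    for line in lspci_text.splitlines():
        match = LINK_RE.search(line)
        if match and match.group(1) not in result:
            result[match.group(1)] = (float(match.group(2)), int(match.group(3)))
    return result


def describe(link):
    """Format a (speed, width) pair, e.g. 'Gen3 (8 GT/s) x16'."""
    speed, width = link
    return f"{GEN_BY_SPEED.get(speed, 'unknown gen')} ({speed:g} GT/s) x{width}"
```

A LnkSta speed below LnkCap is not necessarily a fault: GPUs may drop the link speed while idle to save power, so it is best to check while the card is under load.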