RTX 8000 - Instability

Good afternoon,

We are having some issues with our RTX 8000.

Setup:
Linux host with a KVM-based guest running Ubuntu Linux. Three RTX 8000s are attached to the guest via PCI passthrough. The cards are primarily used for compute.

Problem:
Occasionally during a compute run, the RTX 8000 will stop working. We can reboot the Linux Host to get the GPU to work again.

Symptoms:
When the GPU stops working, the temperature will stop being reported to the system health monitor. The GPU no longer appears in nvidia-smi, and we get the following error: “Failed to initialize NVML: Driver/library version mismatch”.

Removing that specific RTX 8000 GPU from the system resolves the issue and we have no further issues with the other RTX 8000s.

Can you help us troubleshoot? Do you have tools we can run to check the stability of the card? We can place it in a Linux- or Windows-based machine.

Thanks!

“Failed to initialize NVML: Driver/library version mismatch”
hints towards a broken driver install: the loaded kernel module and the user-space NVML library are at different versions.
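One way to check for that kind of mismatch is to compare the version of the loaded kernel module with the version of the NVML library on disk. The following is only a rough sketch: the library path is an assumption that fits a typical Ubuntu x86_64 install and may differ on your system, and the helper names are made up for illustration.

```python
import glob
import re
from pathlib import Path
from typing import Optional, Set

# Matches the versioned library file, e.g. "libnvidia-ml.so.<major>.<minor>[.<patch>]".
# The plain "libnvidia-ml.so.1" symlink is deliberately not matched.
_LIB_RE = re.compile(r"libnvidia-ml\.so\.(\d+(?:\.\d+)+)$")


def loaded_module_version(path: str = "/sys/module/nvidia/version") -> Optional[str]:
    """Return the version of the currently loaded nvidia kernel module, or None."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        # Module not loaded, or sysfs not available.
        return None


def library_version_from_filename(filename: str) -> Optional[str]:
    """Extract the driver version from a versioned libnvidia-ml file name."""
    match = _LIB_RE.search(filename)
    return match.group(1) if match else None


def installed_library_versions(
    # Assumed location; adjust for your distribution.
    pattern: str = "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*",
) -> Set[str]:
    """Collect the versions of all NVML libraries matching the glob pattern."""
    versions = set()
    for name in glob.glob(pattern):
        version = library_version_from_filename(name)
        if version:
            versions.add(version)
    return versions
```

If `loaded_module_version()` returns something that is not in `installed_library_versions()`, the driver was probably updated on disk without reloading the module or rebooting.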
On a Linux system, you can check dmesg for NVRM error messages about the specific GPU.
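If the kernel log is long, something like the following can pull out just the NVRM lines. This is a minimal sketch, not an official tool: the function names are hypothetical, the optional PCI address filter assumes the address appears literally in the log line, and reading dmesg may require root on some systems.

```python
import subprocess
from typing import Iterable, List, Optional


def filter_nvrm(lines: Iterable[str], pci_address: Optional[str] = None) -> List[str]:
    """Keep only kernel log lines from the NVIDIA driver (tagged "NVRM").

    If pci_address is given, keep only lines that also mention that address
    (case-insensitive), to narrow the output to one GPU.
    """
    hits = [line for line in lines if "NVRM" in line]
    if pci_address:
        needle = pci_address.lower()
        hits = [line for line in hits if needle in line.lower()]
    return hits


def read_dmesg() -> List[str]:
    """Return the kernel ring buffer as a list of lines (may need root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    return result.stdout.splitlines()
```

Calling `filter_nvrm(read_dmesg())` lists every NVRM message. Passing the PCI address of the suspect card as the second argument narrows it down to that GPU.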

I agree with you, but we removed that RTX 8000 in April, and the other two have been stable since then. We haven’t made any other changes.

Yes, of course the card is likely broken. Check the NVRM messages once it is in another system. It's just that this particular error message has nothing to do with it.