Tesla P40 "GPU has fallen off the bus" running gpu-burn

  1. The device is finicky: I have tried different driver versions and PCI configurations, but it doesn’t stabilize.
  2. The typical symptom is that the device “crashes”; after that, nvidia-smi hangs for a long time and its report shows power consumption as “ERR!”. In this state the card is unusable until reboot.
  3. The machine is headless with no X11; nouveau and nvidiafb are blacklisted.
  4. There is also an old GTX 460 plugged in, solely to provide video output for installing the OS on the machine.
  5. gpu-burn ran for a while, then showed an error, and nvidia-smi began showing the symptoms described in (2).
  6. Stopping gpu-burn worked, but re-launching it failed with “no CUDA-capable device is detected”, and at this point dmesg showed “GPU has fallen off the bus”.
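
For anyone trying to confirm the same failure, here is a rough sketch of how the kernel log could be scanned for the message from (6). It assumes the driver logs lines of the form "NVRM: Xid (PCI:...): <code>, ..." and that Xid 79 is the code that accompanies "GPU has fallen off the bus". The function name `scan_kernel_log` and the sample lines are made up for illustration; they are not taken from the attached bug report.

```python
import re

# Assumed line format: "NVRM: Xid (PCI:0000:03:00): 79, pid=..., <message>"
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+),")
FALLEN_OFF_BUS = "GPU has fallen off the bus"


def scan_kernel_log(text):
    """Return (xid_events, bus_lines) found in kernel log text.

    xid_events is a list of (pci_address, xid_code) tuples;
    bus_lines is a list of lines containing the fallen-off-the-bus message.
    """
    xid_events = []
    bus_lines = []
    for line in text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            xid_events.append((match.group(1), int(match.group(2))))
        if FALLEN_OFF_BUS in line:
            bus_lines.append(line.strip())
    return xid_events, bus_lines


# Hypothetical sample lines for illustration only, not from the poster's log.
SAMPLE_LOG = """\
[ 1234.567890] NVRM: Xid (PCI:0000:03:00): 79, pid=2345, GPU has fallen off the bus.
[ 1234.567901] NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.
[ 1300.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

if __name__ == "__main__":
    events, bus = scan_kernel_log(SAMPLE_LOG)
    for address, code in events:
        print(f"Xid {code} reported for device {address}")
    print(f"{len(bus)} line(s) mention '{FALLEN_OFF_BUS}'")
```

In practice the text passed to `scan_kernel_log` would be the captured output of dmesg rather than the sample string. The Xid code and PCI address are usually the first things worth quoting when asking about a crash like this.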

I kind of suspect you’re going to tell me my card is junk, but I was hoping there would be some way to stabilize it, e.g. by identifying the faulty core and disabling it. Thanks.

nvidia-bug-report.log.gz (183.8 KB)

No bus errors and temperatures are fine, so I’d say the card is broken.

