I have been using a Titan Xp for about a year, both as the primary graphics card and in a dual-GPU setup alongside an AMD APU. I use the NVIDIA card for machine learning (PyTorch).
Yesterday, during high-intensity workloads (>100 W), the card started falling off the bus: PyTorch hangs, but the display keeps working. Note that in this case the monitor was plugged into the Titan, not the motherboard HDMI port.
Rebooting puts the card back in a good state, but several seconds into running a neural net, PyTorch hangs again.
The card can run the glmark2 benchmark indefinitely without hanging.
I am attaching the output of nvidia-bug-report.sh.
Edit: I forgot to mention that after the card falls off the bus, the final shutdown hangs at “The system is going down for shutdown now”. I assume this is close to the last step in the shutdown sequence.
Thank you. I tried the high-stress CUDA workload again and it failed almost immediately. I cannot get nvidia-smi to show anything after the card disconnects.
However, the graphics card fan went to maximum speed and stayed there until I rebooted the system.
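Since nvidia-smi shows nothing once the card has disconnected, one option is to record power draw, temperature, and fan speed continuously up to the moment of failure. Below is a minimal sketch of such a logger, not a tested diagnostic tool. It assumes nvidia-smi is on the PATH and that the Titan is the first (or only) GPU it lists; the function names and the gpu_log.csv file name are made up for illustration.

```python
import csv
import io
import os
import subprocess
import time

# Query fields passed to nvidia-smi (listed by `nvidia-smi --help-query-gpu`).
FIELDS = ["timestamp", "power.draw", "temperature.gpu", "fan.speed"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    if len(values) != len(FIELDS):
        raise ValueError(f"unexpected nvidia-smi output: {line!r}")
    return dict(zip(FIELDS, values))


def read_sample():
    """Run nvidia-smi once; return a parsed sample, or None if the query fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        # The timeout guards against nvidia-smi itself hanging on a dead GPU.
        out = subprocess.run(
            cmd, capture_output=True, text=True, timeout=10, check=True
        )
        lines = out.stdout.strip().splitlines()
        # Assumption: the first listed GPU is the one of interest.
        return parse_sample(lines[0]) if lines else None
    except (OSError, subprocess.SubprocessError, ValueError):
        return None


def log_until_failure(path="gpu_log.csv", interval=1.0):
    """Append one sample per interval until nvidia-smi stops answering."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        while True:
            sample = read_sample()
            if sample is None:
                print("nvidia-smi stopped responding; last samples are in", path)
                break
            writer.writerow(sample)
            # Flush and sync each row so the log survives a hard hang.
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval)
```

To use it, call log_until_failure() in a second terminal before starting the PyTorch workload. The last rows of the file then show the power draw, temperature, and fan speed just before the card dropped off the bus.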