I have been using a Titan Xp for about a year, both as the primary graphics card and in a dual-GPU setup alongside an AMD APU. I use the NVIDIA card for machine learning (PyTorch).
Yesterday, during high-intensity workloads (>100 W), the card started falling off the bus: PyTorch hangs, but the display still works. Note that in this case the monitor was plugged into the Titan, not the motherboard HDMI.
Restarting puts the card back in a good state, but several seconds into running a neural net, PyTorch hangs again.
The card can run the glmark2 test indefinitely without hanging.
I am attaching the contents of nvidia-bug-report.sh.
nvidia-bug-report.log.gz (108.7 KB)
Thank you for your help!
Edit: I forgot to mention that after the card falls off the bus, the final shutdown hangs at “The system is going down for shutdown now”. I assume this is close to the last step in the shutdown sequence.
Sep 09 02:48:57 AtlanticCity kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
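For anyone scanning their own kernel log for the same event, here is a minimal Python sketch that pulls the PCI bus ID and Xid code out of a line in the format shown above. The names `XID_RE` and `parse_xid` are made up for illustration, and the regex only assumes the `NVRM: Xid (PCI:...): <code>,` layout seen in this one line, so other driver versions may need adjustments.

```python
import re

# Assumed pattern, based only on the single log line quoted in this thread.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9A-Fa-f:.]+)\): (?P<code>\d+),")

def parse_xid(line):
    """Return (pci_bus_id, xid_code) for an NVRM Xid line, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group("bus"), int(match.group("code"))

line = (
    "Sep 09 02:48:57 AtlanticCity kernel: NVRM: Xid (PCI:0000:01:00): 79, "
    "pid='<unknown>', name=<unknown>, GPU has fallen off the bus."
)
print(parse_xid(line))  # ('0000:01:00', 79)
```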
Please check the PSU and GPU temperatures.
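One way to check temperatures around the failure is to log readings continuously, so that the last samples before the card drops off are already on disk. Below is a rough Python sketch that wraps `nvidia-smi --query-gpu` in loop mode. The function names and the output file name `gpu_samples.csv` are placeholders of mine, and some fields may come back as not available on a given card.

```python
import subprocess

# Fields requested from nvidia-smi; values are kept as raw strings because
# unsupported fields come back as text rather than numbers.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "fan.speed"]

def build_query_cmd(interval_s=1):
    """Build an nvidia-smi command that prints one CSV sample every interval_s seconds."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]

def parse_sample(csv_line):
    """Split one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [value.strip() for value in csv_line.split(",")]
    return dict(zip(FIELDS, values))

def log_until_failure(path="gpu_samples.csv", interval_s=1):
    """Append samples to a file until nvidia-smi stops producing output."""
    with open(path, "w") as out:
        proc = subprocess.Popen(
            build_query_cmd(interval_s), stdout=subprocess.PIPE, text=True
        )
        for sample in proc.stdout:
            out.write(sample)
            out.flush()  # push each sample out of Python's buffer right away
```

Calling `log_until_failure()` in one terminal and starting the CUDA workload in another should leave the final temperature, power draw, and fan speed readings in the file, which can then be inspected with `parse_sample`.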
Thank you, I tried the high stress CUDA workload again and it failed almost immediately. I cannot get nvidia-smi to show anything after the card disconnects.
However, the graphics card fan went to max speed and stayed there until I rebooted the system.
As far as I’m aware, since this is a desktop with a discrete power supply unit, I can’t get any info about the PSU from software (see “Find the power supply hardware information for a PC using Ubuntu's command-line” on Ask Ubuntu).
I bought an RTX 3060 and will try to set it up today, and see if the problem persists with the new card.
Just wondering if there are any settings I could tweak to turn off cores or limit the capacity.
Thank you so much!
You might check whether nvidia-smi -lgc works on the Titan, but I think that’s only available on Turing and newer.
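On the question of limiting the card, another setting that might be worth a try is the board power limit via `nvidia-smi -pl`. This is only a sketch: it assumes the driver lets you change the limit on this card, it needs root, and the wattage has to fall inside the min/max range reported by `nvidia-smi -q -d POWER`. The helper names below are hypothetical.

```python
import subprocess

def power_limit_cmd(watts, gpu_index=0):
    """Build the nvidia-smi command that sets a power limit (in watts) on one GPU."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def apply_power_limit(watts, gpu_index=0):
    """Run the command; raises CalledProcessError if nvidia-smi rejects it."""
    subprocess.run(power_limit_cmd(watts, gpu_index), check=True)

# Example only: 150 is an arbitrary value, not a recommendation for the Titan Xp.
print(" ".join(power_limit_cmd(150)))
```

If the hangs stop at a lower limit, that would point toward power delivery (PSU or cabling) rather than the GPU itself, though it would not be conclusive.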