GPU crashing randomly for nvidia version 390.48 on Quadro 4000

Hey guys,

I have an Arch Linux system with a Quadro 4000 GPU that is crashing. I have had this Arch installation for a number of years and don’t recall ever having this issue. In the past few months the random crashes seem to be happening more often. I was hoping that whatever was wrong would be fixed as new versions of the driver came out, but after a month or so of always updating to the latest available version, the issue is still there.

When the crash occurs the keyboard, mouse, and monitor all seem to go dead, however I can still SSH into the machine, and all my long running services like docker containers, network mounts, etc… all work fine. I ultimately have to issue a sudo reboot to get the machine back up and running where I can use the console again.

Ultimately I would like to figure out why it is crashing and get it fixed, but perhaps someone knows of a way to restart whatever is crashing so the console would work again without forcing me to reboot?

Here is the output from running sudo

Let me know if there is anything more I can provide or if anyone has any ideas what might be happening?
Thank you!

nvidia-bug-report.log.gz (124 KB)

You’re getting an Xid 79 error, “GPU has fallen off the bus”. That points to a hardware fault. Choose any of: overheating, defective PSU, defective mainboard/PCIe slot, defective memory, defective graphics card.
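For anyone who wants to check their own kernel log for the same error, here is a minimal Python sketch. It assumes the driver logs Xid events in the form `NVRM: Xid (PCI:0000:01:00): 79, ...`; the sample address in the comment is illustrative and not taken from the attached bug report, and `find_xid_events` is a hypothetical helper name.

```python
import re

# Assumed line shape (illustrative, not copied from the poster's log):
#   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            # group(1) is the bus address, group(2) the numeric Xid code
            events.append((match.group(1), int(match.group(2))))
    return events
```

You could feed it the text of `dmesg` or `journalctl -k`, for example via `subprocess.run(["journalctl", "-k"], capture_output=True, text=True).stdout`, and look for code 79 in the result.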

@generix, thanks for the help! I think temperature might have been (may still be) my issue. I ran:

nvidia-settings -q gpucoretemp

It was reading 99. I have in the past blown out the card and the exposed portions of the fan with compressed air, but this time I pulled the card and unscrewed the cover to better expose the heat sinks. They looked pretty clogged up, so I blew them all out and put it all back together. The temp is now reading 82. Better than where I was, but still a little high, you think?

My case is reporting 22 and my processor cores between 34 - 40. Do you think the GPU is ‘ok’ in the low 80’s or should I take some more action to get its temp down?
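In case it helps anyone keep an eye on this over time, below is a rough Python sketch that wraps the same `nvidia-settings` query. It assumes the `-t` (terse) flag is available so the command prints only the number, and that a usable X display is reachable from wherever it runs. The 80 °C limit is an arbitrary placeholder, not an NVIDIA specification, and the function names are hypothetical.

```python
import subprocess


def parse_gpu_temp(terse_output):
    """Parse the first line of terse nvidia-settings output as integer degrees C."""
    lines = terse_output.strip().splitlines()
    if not lines:
        raise ValueError("no temperature value in output")
    return int(lines[0])


def read_gpu_temp():
    """Query the GPU core temperature (needs nvidia-settings and an X display)."""
    result = subprocess.run(
        ["nvidia-settings", "-q", "gpucoretemp", "-t"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_temp(result.stdout)


def is_too_hot(temp_c, limit_c=80):
    """Return True when temp_c exceeds limit_c (the default limit is an assumption)."""
    return temp_c > limit_c
```

Calling `is_too_hot(read_gpu_temp())` from a cron job or a simple loop would flag readings like the 99 and 82 above, while leaving the threshold easy to adjust.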

82°C is still really hot if not under full load for some time. I think the thermal compound might have dried up due to running hot for a long time so you might have to remove the heatspreader and renew the thermal compound. Don’t know if that’s beyond your capabilities.

Are the fans still working, by the way?

Yeah, all the fans are still working. I’ve applied thermal paste when installing CPUs and their heat sinks in the past, so I imagine it is a similar process for GPUs? I’ll order some and give it a shot! Thanks again.

Yes, same procedure. Remove heat sink, carefully clean heat sink and chip(s) of old compound, apply new compound, assemble.