GPU fans go to max and graphics drivers hang

Hello. I’ve been trying to sort out an issue with my system for a while now. Hopefully I can find some answers here. I’ll try to give all the relevant information I can.

The main symptom is that sometimes (increasingly often) my GPU fan seems to suddenly max out and a few seconds later my graphical interface hangs (if I am in text-only mode I can still interact).

Most recently this happened while I was in text-only mode and I also spied an error being thrown that said something along the lines of “invalid power transition from D3cold to D3hot”.

The final, possibly related, issue is that I have not been able to update the drivers: whenever I run the installer, I get an error along the lines of some module being loaded in the kernel. This is the original reason I switched to text-only mode on my system, but even there I get this issue.

I don’t exactly know what information will be useful. I am running a system with two 2080 titan RTX graphics cards (one of the 2-GPU systems from Lambda Labs). This issue has been happening on all three of my environments: I have a Windows 10 partition and two Ubuntu 18 partitions (a personal and a work one).

Edit: attached the bug report I ran this morning. Also, I misspoke before: my second Linux partition is Ubuntu 20, and that is the one this bug report was created on.

Ah yes. I’ve attached it to my original post. Thanks!

Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
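For anyone sifting through a long kernel log for lines like the one above, here is a minimal Python sketch that pulls out the PCI bus id and Xid code and counts reports per card. The function names (parse_xid, count_by_bus) are made up for illustration, and the pattern is only modeled on the single log line quoted in this thread, so other Xid message layouts may not match.

```python
import re

# Pattern modeled on the one line quoted in this thread, e.g.:
#   NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9A-Fa-f:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)


def parse_xid(line):
    """Return bus id, Xid code, and detail text, or None if the line is not an Xid report."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return {
        "bus": match.group("bus"),
        "code": int(match.group("code")),
        "detail": match.group("detail").strip(),
    }


def count_by_bus(lines):
    """Count Xid reports per PCI bus id, to see whether one card or both are affected."""
    counts = {}
    for line in lines:
        event = parse_xid(line)
        if event is not None:
            counts[event["bus"]] = counts.get(event["bus"], 0) + 1
    return counts
```

Feeding it the lines of the kernel log (for example the saved output of journalctl -k) should show whether the reports cluster on one bus id or are spread across both cards.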

One of the GPUs is shutting down. Since it’s not always the same one, I guess they’re not damaged; more likely they are overheating or not getting enough power. Please monitor temperatures, check the PSU, and try limiting clocks using nvidia-smi -lgc.
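A rough sketch of how the temperature monitoring could be scripted in Python, assuming nvidia-smi is on the PATH. The helper names are hypothetical, and the queried fields (index, temperature.gpu, power.draw, clocks.sm) are just one reasonable selection. Values are kept as strings because some fields may not be reported as numbers on every card.

```python
import subprocess

# Fields requested from nvidia-smi; this selection is an assumption, adjust as needed.
FIELDS = ["index", "temperature.gpu", "power.draw", "clocks.sm"]


def query_command():
    """Build the nvidia-smi command that prints one CSV row per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_readings(csv_text):
    """Turn the CSV text into a list of dicts, one per GPU, skipping malformed rows."""
    readings = []
    for row in csv_text.strip().splitlines():
        values = [value.strip() for value in row.split(",")]
        if len(values) != len(FIELDS):
            continue
        readings.append(dict(zip(FIELDS, values)))
    return readings


def read_gpus():
    """Run nvidia-smi once and return the parsed readings (requires the NVIDIA driver)."""
    result = subprocess.run(query_command(), capture_output=True, text=True, check=True)
    return parse_readings(result.stdout)
```

Calling read_gpus() every few seconds while the machine is under load, and logging the result, would show whether temperature or power draw climbs just before a card drops out.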

Hmmmm. Ok. This issue just started occurring maybe four months ago. Is there a standard way to check the PSU?

Use nvidia-smi -lgc to limit/prevent boost. If it then runs stable, it’s most likely the psu.

Ah ok, do you have recommendations for what clock speeds I should try? And what the default is so I can reset it?

Try 300,1500. Remember to set it on both GPUs. The setting is not persistent, so a reboot will reset it to default.
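As a sketch of applying this to both cards from a script: the Python below builds one nvidia-smi call per GPU using the 300,1500 range suggested above. It assumes the two cards are GPU indices 0 and 1 and that the commands are run with sufficient privileges (typically root); the helper names are hypothetical. The -rgc flag resets the locked clocks without a reboot.

```python
import subprocess


def lock_clock_commands(gpu_indices, min_mhz=300, max_mhz=1500):
    """Build one 'nvidia-smi -i <gpu> -lgc <min>,<max>' command per GPU."""
    return [
        ["nvidia-smi", "-i", str(index), "-lgc", f"{min_mhz},{max_mhz}"]
        for index in gpu_indices
    ]


def reset_clock_commands(gpu_indices):
    """Build one 'nvidia-smi -i <gpu> -rgc' command per GPU to undo the lock."""
    return [["nvidia-smi", "-i", str(index), "-rgc"] for index in gpu_indices]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Example (assumes GPU indices 0 and 1; needs the NVIDIA driver and usually root):
# run_all(lock_clock_commands([0, 1]))
# run_all(reset_clock_commands([0, 1]))
```

Building the commands separately from running them makes it easy to print and check them before touching the hardware.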

Hey wanted to follow up on this because I think I have finally figured this one out - you’ll never guess what it is.

This problem occurs whenever I accidentally bump the computer with my foot LOL. Uhhh, any suggestions for how to secure the components to make them more shock-resistant? Perhaps I just need to go in there and re-seat things.