GPU fans go to max and graphics drivers hang

Hello. I’ve been trying to sort out an issue with my system for a while now. Hopefully I can find some answers here; I’ll try to give all the relevant information I can.

The main symptom is that sometimes (increasingly often) my GPU fan seems to suddenly max out and a few seconds later my graphical interface hangs (if I am in text-only mode I can still interact).

Most recently this happened while I was in text-only mode and I also spied an error being thrown that said something along the lines of “invalid power transition from D3cold to D3hot”.

The final, possibly related issue is that I have not been able to update the drivers: when I run the installer, I keep getting an error along the lines of some module being loaded in the kernel. This is the original reason I switched to text-only mode on my system, but even there I get this issue.

I don’t exactly know what information will be useful. I am running a system with two 2080 titan RTX graphics cards (one of the 2-GPU systems from Lambda Labs). This issue has been happening in all three of my environments: I have a Windows 10 partition and two Ubuntu 18 partitions (a personal and a work one).

Edit: attached the bug report I ran this morning. Also, I misspoke before: my second Linux partition is Ubuntu 20, and that is the one this bug report was created on.

nvidia-bug-report_ubuntu_20_7_18_2022.log (543.6 KB)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Ah yes. I’ve attached it to my original post. Thanks!

Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
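For anyone who wants to check their own logs for the same thing, here is a rough sketch that pulls the PCI bus ID and Xid code out of lines shaped like the one above and counts them per GPU. The regular expression and the function names are my own assumptions based on this single log line; other driver versions may format the message differently.

```python
import re
from collections import Counter

# Assumed pattern, derived from the one log line quoted in this thread.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>\d+))?"
)

def parse_xid(line):
    """Return (bus_id, xid_code, pid) for an NVRM Xid line, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    pid = int(match.group("pid")) if match.group("pid") else None
    return match.group("bus"), int(match.group("xid")), pid

def count_xids_by_bus(lines):
    """Count (bus_id, xid_code) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_xid(line)
        if parsed is not None:
            bus, xid, _pid = parsed
            counts[(bus, xid)] += 1
    return counts
```

Feeding the kernel log through count_xids_by_bus shows whether Xid 79 is always reported for the same bus ID or for different ones.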

One of the GPUs is shutting down. Since it’s not always the same one, I guess they’re not damaged; more likely they are overheating or not getting enough power. Please monitor temperatures, check the PSU, and try limiting clocks using nvidia-smi -lgc.
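To make the temperature monitoring concrete, here is a minimal sketch that reads temperature, power draw, and SM clock per GPU through nvidia-smi's CSV query mode. The helper names are hypothetical, and the choice of fields to log is my own assumption about what is useful here. Note that once a GPU has fallen off the bus, nvidia-smi itself may fail, so the last good reading before the hang is the interesting one.

```python
import subprocess

# Query fields passed to nvidia-smi; the selection is an assumption.
QUERY = "index,temperature.gpu,power.draw,clocks.sm"

def _to_float(field):
    """Convert a CSV field to float; return None for values like [N/A]."""
    try:
        return float(field)
    except ValueError:
        return None

def parse_gpu_stats(csv_text):
    """Parse output of --query-gpu=... --format=csv,noheader,nounits."""
    stats = []
    for row in csv_text.strip().splitlines():
        index, temp, power, sm_clock = [f.strip() for f in row.split(",")]
        stats.append({
            "index": int(index),
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
            "sm_mhz": _to_float(sm_clock),
        })
    return stats

def read_gpu_stats():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling read_gpu_stats() every few seconds and appending the result to a file gives a record to look at after the next hang.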

Hmmmm. Ok. This issue just started occurring maybe four months ago. Is there a standard way to check the PSU?

Use nvidia-smi -lgc to limit/prevent boost. If it then runs stable, it’s most likely the PSU.

Ah ok, do you have recommendations for what clock speeds I should try? And what the default is so I can reset it?

Try 300,1500. Remember to set it on both GPUs. The setting is not persistent, so a reboot will reset it to default.
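As a sketch of applying that to both GPUs, the snippet below builds the nvidia-smi command lines for locking clocks to 300,1500 and for undoing the lock without a reboot. The function names and the GPU indices 0 and 1 are assumptions for a two-GPU machine; -i, -lgc, and -rgc are standard nvidia-smi options, and the commands need root to take effect.

```python
import subprocess

def lock_clock_commands(gpu_indices, min_mhz=300, max_mhz=1500):
    """Build one 'nvidia-smi -i N -lgc MIN,MAX' command per GPU."""
    return [
        ["nvidia-smi", "-i", str(i), "-lgc", f"{min_mhz},{max_mhz}"]
        for i in gpu_indices
    ]

def reset_clock_commands(gpu_indices):
    """Build one 'nvidia-smi -i N -rgc' command per GPU to undo the lock."""
    return [["nvidia-smi", "-i", str(i), "-rgc"] for i in gpu_indices]

def run_commands(commands):
    """Run each command in order; raises if nvidia-smi reports an error."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, run_commands(lock_clock_commands([0, 1])) as root applies the limit to both cards, and run_commands(reset_clock_commands([0, 1])) restores the default behavior.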

Hey wanted to follow up on this because I think I have finally figured this one out - you’ll never guess what it is.

This problem occurs whenever I accidentally bump the computer with my foot LOL. Uhhh, any suggestions for how to secure the components to make them more shock-resistant? Perhaps I just need to go in there and re-seat things.