Hard freeze / GPU has fallen off the bus

Hi there,
I’ve been getting a hard freeze for the last couple of weeks that I can’t get rid of. I’ve tried Pop!_OS, Fedora and Ubuntu LTS; NVIDIA drivers 470, 515 and 525; Steam native, Steam beta and Flatpak; an older laptop BIOS; and an external screen. I still see the issue in every case.

Basically, I launch Steam on the dedicated NVIDIA GPU. The hard freeze mostly happens when launching a game, at “compiling vulkan shaders”, but it can sometimes happen in the Steam menu, about 10 minutes into a game, or after exiting the game.

The issue does not happen when using the AMD Vega iGPU. I also could not trigger it with a basic Lutris game or the Unigine benchmark.

My laptop is only 6 months old, so I highly doubt this is a physical hardware issue, which this error (fallen off the bus) can sometimes suggest.

ubuntu 22.04 LTS

nvidia-bug-report.log.gz (135.6 KB)

5.15 kernel / 5.17 kernel / 6.0.16 kernel
nvidia driver 470, 515, 525

yoga slim 7 pro
nvidia MX450 2gb
16gb ram

hoping you can assist! thanks

journal log.txt (707 Bytes)
Journal snippet from one of the occurrences. As it’s usually a hard freeze, it’s hard to capture log info.
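
For anyone trying to capture this after the fact, here is a minimal Python sketch. It assumes the journal is persistent (so the previous boot's kernel messages survive the reboot) and that the driver logs Xid events in its usual `NVRM: Xid (PCI:...): <number>, ...` form. The function names are my own for illustration, not part of any tool.

```python
import re
import subprocess

# Matches the usual shape of an NVIDIA Xid kernel message, e.g.
#   NVRM: Xid (PCI:0000:01:00): 79, pid=..., GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def find_xid_events(log_text):
    """Return a list of (device, xid_code, detail) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            detail = (match.group("detail") or "").strip()
            events.append((match.group("device"), int(match.group("code")), detail))
    return events


def read_previous_boot_kernel_log():
    """Kernel messages from the boot before the current one.

    Runs `journalctl -k -b -1`; raises CalledProcessError if the journal
    has no record of a previous boot (e.g. non-persistent journal).
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

After rebooting from a freeze, something like `find_xid_events(read_previous_boot_kernel_log())` should list any Xid codes that were written before the hang. If the freeze was hard enough that nothing reached disk, the list will simply be empty.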

You might set “Prefer max performance” in nvidia-settings while the NVIDIA GPU is still alive to check for a power management issue. If that doesn’t help, it’s most likely broken; to double-check, install Windows and inspect the event log.

Hi, thanks for the reply. I swapped in an old SSD and installed Windows, and I can’t replicate the issue there. Games and stress tests just work, so it’s not a hardware problem.

I will try the max performance setting, but I doubt it will help, as the freeze happens before a game has even launched. I am wondering if this issue only happens with Steam and Steam games, so I will try some titles via Lutris.

I have found that adding the kernel parameter pcie_aspm=off seems to help. It turns off PCIe power management (ASPM), so it will have an impact on battery life (better than a hard freeze, I guess).
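
In case it helps someone confirm the workaround actually took effect, here is a small Python sketch that checks whether `pcie_aspm=off` is on the running kernel's command line by reading `/proc/cmdline`. The helper names are made up for illustration.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline):
    """True if the kernel command line contains the exact token pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def current_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text().strip()
```

Calling `aspm_disabled_on_cmdline(current_cmdline())` after a reboot shows whether the parameter was picked up. On Ubuntu the parameter is typically added to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `sudo update-grub`; other distros may use a different bootloader setup.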

If turning off ASPM helps, did you already check for a BIOS update?

Yeah, I tried two different BIOS versions. I’m not sure this can be a hardware/firmware issue, as it doesn’t happen in Windows.

I read somewhere on the forum that the Xid 79 error indicates problems with power or communication with the card (or, in the worst case, a hardware issue), and someone suggested reseating the card. I skimmed past it at first, thinking it was a suggestion along the lines of “make sure you have plugged in the power cord”, but after I couldn’t find anything else I decided to take a look and de-dust the PC as well. Lo and behold, the card was not fully inserted into the PCIe slot. It looked like it was, but the slot’s flipper was not clipped into place. I reseated the card, pushed until I heard a click, and verified that the flipper (not sure what it’s called) didn’t move freely, and that was it. I’ve never had a problem since.

It’s weird that this was only happening on Ubuntu, and only when I was using a specific program (RubyMine). With the exact same setup and equivalent drivers on Windows, I could game and stress the GPU and no problem ever occurred. The issue was also not present when using the Ubuntu (Nouveau) drivers.
Setup was Ubuntu 22.04, RTX 3080, 535 driver (have also tried with 525 in the past).

Glad you got your issue sorted. I’m on a laptop, so I can’t just reseat the card. I switched to Windows and it’s been working perfectly, so it’s not a hardware issue.

Yeah, you are right. I ended up here after a Google search for that specific error, though, so I still thought it might help someone not to skip this check. Glad you could work around your case.