NVIDIA 515 - RTX 3060 - GPU has fallen off the bus

My RTX 3060 keeps freezing my desktop PC, roughly every 2 days.

The end of the kernel log is always similar:

[16852.358181] NVRM: GPU at PCI:0000:01:00: GPU-230b77a1-605f-1cf9-d9f9-f749c44bc2f8
[16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[16852.358187] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
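For anyone who wants to track how often this happens, here is a minimal Python sketch that pulls the Xid code out of kernel log lines shaped like the ones above. The regular expression is based only on this excerpt, and the names `XID_RE` and `parse_xid` are made up for illustration; other driver versions may format the line differently.

```python
import re

# Pattern modelled on the NVRM Xid line in the excerpt above (an assumption,
# not an official format description).
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+),\s*(?P<rest>.*)"
)


def parse_xid(line):
    """Return (timestamp, device, xid_code, message) for an Xid line, else None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return float(m.group("ts")), m.group("dev"), int(m.group("code")), m.group("rest")


sample = (
    "[16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', "
    "name=<unknown>, GPU has fallen off the bus."
)
print(parse_xid(sample))
```

Since SSH still works after the freeze, the lines from `dmesg` can be passed through `parse_xid` one at a time to collect timestamps and codes across crashes.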

Going off of similar threads to save some time:

  • The GPU is definitely not overheating (46 °C)
  • My power supply is a Corsair CX750M, and the system has no unusual or non-standard components that would draw excessive power
  • This happens at idle (no games or graphically demanding apps open)
  • PCH DMI ASPM and PCI Express Native Power Management were disabled when it first happened; I also tried enabling them and this did not change the situation.
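To back up the temperature and power points with data, one option is to log GPU readings periodically so the last values before a freeze are on record. The sketch below assumes `nvidia-smi` is on the PATH and that the `temperature.gpu`, `power.draw` and `fan.speed` query fields are supported on the card; `parse_sample` and `read_sample` are illustrative names.

```python
import subprocess

# nvidia-smi query for temperature (C), power draw (W) and fan speed (%).
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw,fan.speed",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Split one CSV line into (temp_c, power_w, fan_pct); unparsable fields become None."""
    def num(field):
        try:
            return float(field)
        except ValueError:
            return None  # e.g. "[N/A]" when a value is not reported

    return tuple(num(f.strip()) for f in line.split(","))


def read_sample():
    """Run nvidia-smi once and return the readings for the first GPU listed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])
```

Calling `read_sample()` in a loop with a sleep and appending the result to a file would show whether temperature or power draw changes just before the GPU drops off the bus. Once the GPU has fallen off the bus, `nvidia-smi` itself is expected to fail, so the loop should tolerate errors.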

I have 2 monitors, one connected with DisplayPort, another connected with DVI. When the freeze happens, the monitor connected with DisplayPort turns black instantly, while the one connected with DVI keeps the last frame until I turn off the machine.
After the freeze happens, I can still SSH in, which is how I got the attached nvidia-bug-report.
Also interestingly, right after the freeze happens, my GPU’s fans spin up really high and stay loud until I turn off the machine.
After a hard reset, everything is back to normal until it happens again (roughly every 2 days, as I said).

nvidia-bug-report.log.gz (243.4 KB)

Please check for a BIOS update, try reseating the GPU in its slot, and check whether it works in another system.

This looks similar to my problem, and I’m also using two monitors most of the time, sometimes three:

Additionally, when pressing the reset button, I can see the PC slowly drawing a black box from top to bottom before really resetting the machine.

So far, disabling all ASPM functions in the BIOS seems to help. The PC didn’t freeze while idle, didn’t crash during a light desktop workload such as watching a video, and didn’t freeze while gaming.

But let’s not count our chickens before they are hatched. I’ll give it a few more days.

Seems solved on my side after a BIOS update, thank you for suggesting it. For anyone interested, in my case I updated a ROG STRIX Z370-F GAMING from BIOS 2401 to 3004. Looks like it really does take a fresh BIOS to have stability with the card. I should have known better, and sorry for the noise.

Ignore the comment above, unfortunately this is still reproducing with the latest BIOS.

It’s stable for me at least… Maybe there’s still some PCIe power management going on in your system?

Well, all of the related options are disabled in the BIOS, and I also added pcie_aspm=off to my kernel command line. Unfortunately it still reproduces, even while idle.
I think the BIOS upgrade still had some benefit, as the reproduction rate is now down to once every 4-7 days, but it is definitely still happening.
Also interestingly, whenever my GPU drops off the bus, it starts to vibrate excessively until I shut the PC down. Really not sure what’s going on…
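To confirm the pcie_aspm=off setting actually took effect, it may help to check the running kernel rather than the bootloader config. A minimal sketch, assuming a Linux system that exposes `/proc/cmdline` and `/sys/module/pcie_aspm/parameters/policy` (the active policy is shown in square brackets in that file); the helper names are illustrative.

```python
from pathlib import Path


def cmdline_has(param, cmdline):
    """True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry, e.g. 'default [performance] powersave'."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if cmdline_file.exists():
        found = cmdline_has("pcie_aspm=off", cmdline_file.read_text())
        print("pcie_aspm=off on command line:", found)
    if policy_file.exists():
        print("active ASPM policy:", active_aspm_policy(policy_file.read_text()))
```

This only shows what the kernel was asked to do; per-device link state would still need to be checked separately, for example in the `lspci -vv` output.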

I’m having the same issue on a System76 laptop. RTX 3060 on 515 driver. I was asked to send in the machine for RMA and they replaced the motherboard, but still this keeps happening. I’ve now downgraded to the 470 driver and so far the problem hasn’t manifested itself again. It’s only been a day, so we’ll see. Couldn’t test the 510 driver since I’m running Pop OS and the 510 driver on the repos is not compatible with the latest kernel.

Hopefully this is definitively a driver issue and it gets fixed soon. Kind of annoying having a relatively new machine that won’t work with anything but legacy drivers.

Update: still happening with the 470 driver. The desktop locks up, the “GPU has fallen off the bus” message is printed to the system log, and the X server pegs the CPU at 100%.
nvidia-bug-report.log.gz (258.5 KB)

Still reproducing with driver 520.56.06 on the latest BIOS. I also tried reseating the card and using a different PCI slot, with no difference. It is also unstable in another Linux system I tried.

@lcatoni An Xid 79 is never a driver bug. Furthermore, on a notebook, this is almost always defective hardware. Please have it replaced by the vendor again.

Not sure what changed, but I haven’t been able to reproduce this for a good while now. The hardware is the same and the BIOS is the same; the only thing that might be different is the regular kernel/driver updates that I install. In case it was rooted in the driver and silently fixed… thank you?