My RTX 3060 keeps freezing my desktop PC, roughly every 2 days.
The end of the kernel log is always similar:
[16852.358181] NVRM: GPU at PCI:0000:01:00: GPU-230b77a1-605f-1cf9-d9f9-f749c44bc2f8
[16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[16852.358187] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
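For anyone comparing their own logs against the lines above, here is a minimal Python sketch that pulls Xid events out of kernel log text. It assumes the `dmesg`-style format shown above (a bracketed timestamp followed by `NVRM: Xid (PCI:...)`). The names `XidEvent`, `parse_xid` and `find_xids` are made up for this example and are not part of any NVIDIA tool.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines such as:
# [16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', ...
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*?NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)


class XidEvent(NamedTuple):
    timestamp: float  # seconds since boot, as printed by the kernel
    pci: str          # PCI address of the GPU that reported the event
    xid: int          # Xid code (79 in the log above)


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event described by one log line, or None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(float(m.group("ts")), m.group("pci"), int(m.group("xid")))


def find_xids(log_text: str) -> List[XidEvent]:
    """Collect every Xid event found in a block of kernel log text."""
    events = (parse_xid(line) for line in log_text.splitlines())
    return [e for e in events if e is not None]


# The three lines quoted in the post, used here as sample input.
SAMPLE = """\
[16852.358181] NVRM: GPU at PCI:0000:01:00: GPU-230b77a1-605f-1cf9-d9f9-f749c44bc2f8
[16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[16852.358187] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
"""
```

Calling `find_xids(SAMPLE)` yields a single event with Xid 79 for `0000:01:00`. To use it on a real machine, feed it the saved output of `dmesg` after SSHing in.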
Going off of similar threads to save some time:
The GPU is definitely not overheating (46 °C).
My power supply is a Corsair CX750M, and there is nothing unusual or non-standard in the build that would draw too much power.
This happens at idle (no games or graphically demanding apps open).
PCH DMI ASPM and PCI Express Native Power Management were disabled when it first happened; I also tried enabling them and this did not change the situation.
I have 2 monitors, one connected with DisplayPort, another connected with DVI. When the freeze happens, the monitor connected with DisplayPort turns black instantly, while the one connected with DVI keeps the last frame until I turn off the machine.
After the freeze happens, I can still SSH in; that is how I got the attached nvidia-bug-report.
Also interestingly, right after the freeze happens, my GPU’s fans spin up really high and get loud until I turn off the machine.
After a hard reset, everything is back to normal until it happens again (roughly every 2 days, as I said).
So far, disabling all ASPM functions in the BIOS seems to help. The PC didn’t freeze while idle, it didn’t crash under a light desktop workload while watching a video, and it didn’t freeze while gaming.
But let’s not count our chickens before they are hatched. I’ll give it a few more days.
Seems solved on my side after a BIOS update, thank you for suggesting it. For anyone interested, in my case I updated a ROG STRIX Z370-F GAMING from BIOS 2401 to 3004. Looks like it really does take a fresh BIOS to have stability with the card. I should have known better, and sorry for the noise.
Well, all of the related options are disabled in the BIOS, and I also added pcie_aspm=off to my kernel command line. Unfortunately it still reproduces, even while idle.
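To double-check that `pcie_aspm=off` really reached the kernel, something like the Python sketch below might help. It reads `/proc/cmdline` and, if present, `/sys/module/pcie_aspm/parameters/policy`, where the active policy is shown in square brackets. The helper names `cmdline_aspm_setting`, `active_policy` and `report` are hypothetical, and this only shows the kernel's view, not what the firmware configured.

```python
from pathlib import Path
from typing import Optional


def cmdline_aspm_setting(cmdline: str) -> Optional[str]:
    """Return the value given to pcie_aspm= on a kernel command line, or None.

    If the parameter is repeated, this sketch reports the last value it sees.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("pcie_aspm="):
            value = token.split("=", 1)[1]
    return value


def active_policy(policy_text: str) -> Optional[str]:
    """Extract the bracketed entry from e.g. '[default] performance powersave'."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def report() -> str:
    """Summarise the ASPM-related settings of the running system (Linux only)."""
    cmdline = Path("/proc/cmdline").read_text()
    policy_path = Path("/sys/module/pcie_aspm/parameters/policy")
    policy = active_policy(policy_path.read_text()) if policy_path.exists() else None
    return f"pcie_aspm={cmdline_aspm_setting(cmdline)!r}, policy={policy!r}"
```

Running `print(report())` on the affected machine should show `pcie_aspm='off'` if the parameter was picked up at boot. As far as I understand the kernel documentation, `pcie_aspm=off` tells the kernel to leave the firmware's ASPM configuration untouched, so the BIOS settings still matter.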
I think the BIOS upgrade still had some benefit, as the reproduction rate is now down to once every 4-7 days, but it is definitely still happening.
Also interestingly, whenever my GPU drops off the bus, it starts to vibrate excessively until I shut the PC down. Really not sure what’s going on…
I’m having the same issue on a System76 laptop: RTX 3060 on the 515 driver. I was asked to send in the machine for RMA and they replaced the motherboard, but it still keeps happening. I’ve now downgraded to the 470 driver and so far the problem hasn’t shown up again. It’s only been a day, so we’ll see. I couldn’t test the 510 driver since I’m running Pop OS and the 510 driver in the repos is not compatible with the latest kernel.
Hopefully this is definitively a driver issue and it gets fixed soon. Kind of annoying having a relatively new machine that won’t work with anything but legacy drivers.
Update: still happening with the 470 driver. The desktop locks up, the “GPU has fallen off the bus” message is printed to the system log, and the X server pegs the CPU at 100%. nvidia-bug-report.log.gz (258.5 KB)