My RTX 3060 keeps freezing my desktop PC, roughly every 2 days.
The end of the kernel log is always similar:
[16852.358181] NVRM: GPU at PCI:0000:01:00: GPU-230b77a1-605f-1cf9-d9f9-f749c44bc2f8
[16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[16852.358187] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
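For anyone who wants to check their own logs for the same signature, here is a minimal Python sketch that pulls Xid events out of kernel log text. The regular expression is inferred only from the log lines quoted in this thread, and `find_xid_events` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Pattern inferred from lines such as:
# [16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', ...
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<bus>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),"
)


def find_xid_events(log_text):
    """Return (seconds_since_boot, pci_address, xid_code) for each Xid line."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("bus"), int(match.group("code")))
            )
    return events
```

Running it over saved `dmesg` or `journalctl -k` output and filtering for code 79 should list each time the GPU fell off the bus, along with the uptime at which it happened.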
Going off of similar threads to save some time:
The GPU is definitely not overheating (46 °C).
My power supply is a Corsair CX750M, and there is nothing unusual or non-standard in the build that would draw excessive power.
This happens at idle (no games or graphically demanding apps open).
PCH DMI ASPM and PCI Express Native Power Management were disabled when it first happened; I also tried enabling them and this did not change the situation.
I have 2 monitors, one connected with DisplayPort, another connected with DVI. When the freeze happens, the monitor connected with DisplayPort turns black instantly, while the one connected with DVI keeps the last frame until I turn off the machine.
After the freeze happens, I can SSH in, that is how I got the nvidia-bug-report, which is also attached.
Also interestingly, right after the freeze happens, my GPU’s fans spin up really high and get loud until I turn off the machine.
After a hard reset, everything is back to normal until it happens again. (Roughly every 2 days, as I said.)
So far, disabling all ASPM functions in the BIOS seems to help. The PC didn’t freeze while idle, it didn’t crash during a light desktop workload (watching a video), and it didn’t freeze while gaming.
But let’s not count our chickens before they are hatched. I’ll give it a few more days.
Seems solved on my side after a BIOS update, thank you for suggesting it. For anyone interested, in my case I updated a ROG STRIX Z370-F GAMING from BIOS 2401 to 3004. Looks like it really does take a fresh BIOS to get stability with this card. I should have known better, and sorry for the noise.
Well, all of the related options are disabled in the BIOS, and I also added pcie_aspm=off to my kernel command line. Unfortunately it still reproduces, even while idle.
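To confirm that a parameter such as pcie_aspm=off actually reached the running kernel, the contents of /proc/cmdline can be checked. A minimal sketch follows; `kernel_param` and `read_cmdline` are hypothetical helper names, and keeping the last occurrence of a duplicated parameter is an assumption of this sketch rather than documented kernel behavior.

```python
def kernel_param(cmdline, name):
    """Look up `name` in a kernel command line string.

    Returns the value after '=', an empty string for a bare flag,
    or None if the parameter is absent. If the parameter appears more
    than once, the last occurrence is returned (assumption of this sketch).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

On a live system, `kernel_param(read_cmdline(), "pcie_aspm")` returning `"off"` would mean the option took effect at boot; `None` would suggest the bootloader configuration was not regenerated after the edit.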
I think the BIOS upgrade still had some benefit, as the reproduction rate is now down to once every 4-7 days, but it is definitely still happening.
Also interestingly, whenever my GPU drops off the bus, it starts to vibrate excessively until I shut the PC down. Really not sure what’s going on…
I’m having the same issue on a System76 laptop. RTX 3060 on 515 driver. I was asked to send in the machine for RMA and they replaced the motherboard, but still this keeps happening. I’ve now downgraded to the 470 driver and so far the problem hasn’t manifested itself again. It’s only been a day, so we’ll see. Couldn’t test the 510 driver since I’m running Pop OS and the 510 driver on the repos is not compatible with the latest kernel.
Hopefully this is definitively a driver issue and it gets fixed soon. Kind of annoying having a relatively new machine that won’t work with anything but legacy drivers.
Update: still happening with the 470 driver. Desktop locks up, GPU has fallen off the bus message printed to the system log, and X server pegs CPU to 100%. nvidia-bug-report.log.gz (258.5 KB)
Not sure what changed, but I haven’t been able to reproduce this for a good while now. The hardware is the same, the BIOS is the same; the only thing that might be different is the regular kernel/driver updates that I install. In case it was rooted in the driver and silently fixed… thank you?
I have exactly the same problem you describe. I’m running Ubuntu 22.04 on a Gigabyte X570 I Aorus Pro with an RTX 3060. I believe I have tried all 5.* kernels and all 500-series video drivers. First-rate hardware but a third-rate experience. I sent the card back to the store, but they returned it to me, saying they had tested it and it worked properly in Windows.
The fans start revving up after the graphical freeze. Both SSH and SysRq still work. Can you verify the problem hasn’t returned for you? What distro/kernel/driver are you running? Are you still using any leftover custom boot parameters? Can you share the output of cat /etc/default/grub | grep LINUX_DEFAULT?
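For comparing boot parameters between machines, the same information the grep returns can be split into individual parameters. This is a rough sketch that assumes the usual GRUB_CMDLINE_LINUX_DEFAULT="..." layout of /etc/default/grub; `grub_default_params` is a hypothetical helper name.

```python
import shlex


def grub_default_params(grub_text):
    """Return the parameters listed in GRUB_CMDLINE_LINUX_DEFAULT.

    Commented-out lines are ignored. If the variable is assigned more
    than once, the last assignment is used.
    """
    params = []
    for raw_line in grub_text.splitlines():
        line = raw_line.strip()
        if line.startswith("GRUB_CMDLINE_LINUX_DEFAULT="):
            value = line.split("=", 1)[1]
            # shlex.split strips the shell quoting; the quoted value comes
            # back as one token, so it is split again on whitespace.
            params = " ".join(shlex.split(value)).split()
    return params
```

Reading the file with `open("/etc/default/grub").read()` and passing the text in gives a list that is easy to diff against someone else's setup.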
Update: I have discovered that this depends heavily on the nvidia-driver minor version (0.x.x). After a minor update, I get these freezes very often, sometimes within 3 minutes of booting:
[ 133.278291] NVRM: Xid (PCI:0000:09:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 133.278294] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
[ 138.367502] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67e:6 2:0:4048:4040
Then I restore an image with the previous driver version using Timeshift, and I have zero freezes.
For over a year now, it seems that NVIDIA keeps fixing the issue and then introducing a regression later. Our hardware combinations are probably too uncommon for NVIDIA to test. It happens on at least 530 and 535, and it happened on 525 in the past.
Last night, an automated update from 535.98-0ubuntu0~gpu22.04.1 to 535.104.05-0ubuntu0.22.04.1 caused a super stable system to become super freezy due to the issue described in this thread. This specific update was informative because it contained only nvidia updates, and no other packages. I have been navigating between driver versions and using Timeshift a lot. At the moment 525.125.06 is safe to use with Linux 5.15.0-79 in my case.
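Since the behavior flips between point releases, it may help to keep a small record of which versions were stable. The sketch below extracts the upstream driver version from an Ubuntu package version string and checks it against the versions reported in this post. The good/bad sets reflect only this one system's experience, and `upstream_version` and `classify` are hypothetical helper names.

```python
import re

# Observations from this thread only (one system, Ubuntu 22.04).
KNOWN_GOOD = {(535, 98), (525, 125, 6)}
KNOWN_BAD = {(535, 104, 5)}


def upstream_version(pkg_version):
    """Turn '535.104.05-0ubuntu0.22.04.1' into the tuple (535, 104, 5)."""
    match = re.match(r"(\d+(?:\.\d+)+)", pkg_version)
    if match is None:
        raise ValueError(f"unrecognized driver version: {pkg_version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def classify(pkg_version):
    """Return 'good', 'bad', or 'unknown' for a driver package version."""
    version = upstream_version(pkg_version)
    if version in KNOWN_BAD:
        return "bad"
    if version in KNOWN_GOOD:
        return "good"
    return "unknown"
```

A check like this could run before accepting an automated update, so that a version already known to freeze is not installed again by accident.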
Contrary to what @generix said, this is most definitely a driver bug.
I hope someone is still reading (@TomNVIDIA) because I don’t consider the RTX 3060 too ancient to support.
I’m stuck with the GPU falling off the bus again. The pattern is the same. You’re doing something insignificant like browsing the internet. It utilizes the GPU between 0 and 2%. Then, suddenly, scrolling becomes very laggy. Dragging windows around is very laggy. The UI runs at 5 FPS. GPU utilization is at 100%. No obvious reason. No game. No media playing.
Then one of two things happens, seemingly at random:
After a few dozen seconds, the GPU goes back to 0%. If you have “nvidia settings” open on the PowerMizer page, you can see the Performance Level switch from 4 to 3 to 2 to 1 to 0. Or:
The GPU has fallen off the bus and the computer freezes.
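To capture what the card is doing in the seconds before it drops, utilization and performance state can be sampled periodically with nvidia-smi and logged. A minimal sketch, assuming nvidia-smi is on the PATH; `parse_sample` and `sample_gpu` are hypothetical helper names, and the P-state reported by nvidia-smi is not necessarily numbered the same way as the PowerMizer level shown in nvidia-settings.

```python
import subprocess

QUERY_FIELDS = ["timestamp", "utilization.gpu", "pstate", "temperature.gpu"]


def _to_int(text):
    """Convert a numeric field, or return None for values like '[N/A]'."""
    return int(text) if text.isdigit() else None


def parse_sample(csv_line):
    """Parse one line of nvidia-smi CSV output (noheader, nounits)."""
    parts = [part.strip() for part in csv_line.split(",")]
    if len(parts) != len(QUERY_FIELDS):
        raise ValueError(f"unexpected nvidia-smi output: {csv_line!r}")
    timestamp, utilization, pstate, temperature = parts
    return {
        "timestamp": timestamp,
        "utilization_pct": _to_int(utilization),
        "pstate": pstate,
        "temperature_c": _to_int(temperature),
    }


def sample_gpu():
    """Query every visible GPU once; raises if nvidia-smi fails."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
        timeout=10,
    )
    return [parse_sample(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling `sample_gpu()` every few seconds from an SSH session and appending the results to a file should leave a trace of the 100% utilization spike. Once the GPU has fallen off the bus, the query itself is likely to fail, and that failure is worth logging too.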
Something is broken. Perhaps it’s the firmware. I cannot find firmware upgrades for the PNY GeForce RTX 3060 12GB XLR8 Gaming REVEL EPIC-X RGB Single Fan Edition. It runs VBIOS 94.06.25.00.7E.