My RTX 3060 keeps freezing my desktop PC, roughly every 2 days.
The end of the kernel log is always similar:
[16852.358181] NVRM: GPU at PCI:0000:01:00: GPU-230b77a1-605f-1cf9-d9f9-f749c44bc2f8
[16852.358184] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[16852.358187] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
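For anyone who wants to check their own logs for the same event, here is a minimal Python sketch that pulls Xid events out of kernel log text. The regular expression is based only on the lines quoted above, so it assumes the log format matches them; the names `XidEvent`, `parse_xid` and `find_xid_events` are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional

# Pattern modelled on the NVRM lines quoted in this thread (an assumption:
# other driver versions may format the message differently).
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)


class XidEvent(NamedTuple):
    timestamp: float  # seconds since boot, as printed by the kernel
    pci: str          # PCI address of the GPU
    xid: int          # Xid error number (79 in this thread)


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line is an NVRM Xid report, else None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(float(match["ts"]), match["pci"], int(match["xid"]))


def find_xid_events(log_text: str) -> List[XidEvent]:
    """Collect every Xid event found in a block of kernel log text."""
    events = (parse_xid(line) for line in log_text.splitlines())
    return [event for event in events if event is not None]
```

The text can come from `dmesg` or `journalctl -k` output saved to a file. Filtering the result for `xid == 79` shows how often the GPU dropped off the bus.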
Going off of similar threads to save some time:
The GPU is definitely not overheating (46 °C).
My power supply is a Corsair CX750M, and the system has no unusual or non-standard components that would draw excessive power.
This happens at idle (no games or graphically demanding apps open).
PCH DMI ASPM and PCI Express Native Power Management were disabled when it first happened; I also tried enabling them and this did not change the situation.
I have 2 monitors, one connected with DisplayPort, another connected with DVI. When the freeze happens, the monitor connected with DisplayPort turns black instantly, while the one connected with DVI keeps the last frame until I turn off the machine.
After the freeze happens, I can still SSH in. That is how I got the nvidia-bug-report, which is also attached.
Also interestingly, right after the freeze happens, my GPU's fans spin up really high and stay loud until I turn off the machine.
After a hard reset, everything is back to normal until it happens again (roughly every 2 days, as I said).
So far, disabling all ASPM functions in the BIOS seems to help. The PC didn't freeze while idle, didn't crash during a light desktop workload (watching a video), and didn't freeze while gaming.
But let's not count our chickens before they are hatched. I'll give it a few more days.
Seems solved on my side after a BIOS update; thank you for suggesting it. For anyone interested: in my case, I updated a ROG STRIX Z370-F GAMING from BIOS 2401 to 3004. It looks like it really does take a fresh BIOS to get stability with this card. I should have known better, and sorry for the noise.
Well, all of the related options are disabled in the BIOS, and I also added pcie_aspm=off to my kernel command line. Unfortunately it still reproduces, even while idle.
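To confirm that pcie_aspm=off actually reached the running kernel (and not just the GRUB config), something like the following Python sketch could be used. It assumes a Linux system where /proc/cmdline is readable; the function names are hypothetical and the sample command line in the usage note is made up.

```python
import shlex
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value or None}."""
    params: Dict[str, Optional[str]] = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        # Flags such as "quiet" have no "=value" part.
        params[key] = value if sep else None
    return params


def aspm_disabled(cmdline: str) -> bool:
    """True if the command line contains pcie_aspm=off."""
    return parse_cmdline(cmdline).get("pcie_aspm") == "off"


def read_running_cmdline() -> str:
    """Read the command line the current kernel was booted with."""
    return Path("/proc/cmdline").read_text()
```

For example, `aspm_disabled("root=/dev/sda2 ro quiet pcie_aspm=off")` returns True, and `aspm_disabled(read_running_cmdline())` checks the live system.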
I think the BIOS upgrade still had some benefit, as the reproduction rate is now down to once every 4-7 days, but it is definitely still happening.
Also interestingly, whenever my GPU drops off the bus, it starts to vibrate excessively until I shut the PC down. Really not sure what's going on...
I'm having the same issue on a System76 laptop with an RTX 3060 on the 515 driver. I was asked to send the machine in for RMA and they replaced the motherboard, but it still keeps happening. I've now downgraded to the 470 driver and so far the problem hasn't manifested again. It's only been a day, so we'll see. I couldn't test the 510 driver since I'm running Pop!_OS and the 510 driver in the repos is not compatible with the latest kernel.
Hopefully this is definitively a driver issue and it gets fixed soon. It's kind of annoying having a relatively new machine that won't work with anything but legacy drivers.
Update: still happening with the 470 driver. The desktop locks up, the "GPU has fallen off the bus" message is printed to the system log, and the X server pegs the CPU at 100%. Attached: nvidia-bug-report.log.gz (258.5 KB)
Still reproducing with driver 520.56.06 on the latest BIOS. I also tried reseating the card and using a different PCI slot, with no difference. The card is also unstable in another Linux system I tried.
@lcatoni An Xid 79 is never a driver bug. Furthermore, on a notebook, this is almost always defective hardware. Please have it replaced by the vendor again.
Not sure what changed, but I haven't been able to reproduce this for a good while now. The hardware is the same, the BIOS is the same, and the only thing that might be different is the regular kernel/driver updates that I install. In case it was rooted in the driver and silently fixed... thank you?
I have exactly the same problem as you describe. I'm running Ubuntu 22.04 on a Gigabyte X570 I Aorus Pro with an RTX 3060. I believe I have tried all 5.* kernels and all 500-series video drivers. First-rate hardware but a third-rate experience. I sent the card back to the store, but they returned it to me saying they had tested it and it worked properly in Windows.
The fans start revving up after the graphical freeze. Both SSH and SysRq still work. Can you verify the problem hasn't returned for you? What distro/kernel/driver are you running? Do you still use any residual custom boot parameters? Can you share the output of cat /etc/default/grub | grep LINUX_DEFAULT?
Update: I have discovered that this is highly dependent on the NVIDIA driver minor version (0.x.x). After a minor update, I get these freezes very often, sometimes within 3 minutes of booting:
[ 133.278291] NVRM: Xid (PCI:0000:09:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 133.278294] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
[ 138.367502] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67e:6 2:0:4048:4040
Then I restore an image with the previous driver version using Timeshift, and I have zero freezes.
For over a year now, it seems that NVIDIA keeps fixing the issue and then introducing a regression later. Our hardware combinations are probably too uncommon for NVIDIA to test. It happens on at least 530 and 535, and it happened on 525 in the past.
Last night, an automated update from 535.98-0ubuntu0~gpu22.04.1 to 535.104.05-0ubuntu0.22.04.1 caused a super stable system to become super freezy due to the issue described in this thread. This specific update was informative because it contained only nvidia updates, and no other packages. I have been navigating between driver versions and using Timeshift a lot. At the moment 525.125.06 is safe to use with Linux 5.15.0-79 in my case.
Contrary to what @generix said, this is most definitely a driver bug.
I hope someone is still reading (@TomNVIDIA) because I don't consider the RTX 3060 too ancient to support.
I think I've hit a similar issue. My dual-boot Windows install doesn't have any GPU issue, but my Arch install has been halting frequently for the last few months, since a driver/kernel upgrade.
I'm stuck with the GPU falling off the bus again. The pattern is the same. You're doing something insignificant like browsing the internet, which utilizes the GPU between 0 and 2%. Then, suddenly, scrolling becomes very laggy. Dragging windows around is very laggy. The UI runs at 5 FPS. GPU utilization is at 100%. No obvious reason. No game. No media playing.
Then one of two things happens, seemingly at random:
After a few dozen seconds, the GPU goes back to 0%. If you have "nvidia settings" open on the PowerMizer page, you can see the Performance Level switch from 4 to 3 to 2 to 1 to 0. Or:
The GPU falls off the bus and the computer freezes.
Something is broken. Perhaps it's the firmware. I cannot find firmware upgrades for the PNY GeForce RTX 3060 12GB XLR8 Gaming REVEL EPIC-X RGB Single Fan Edition. It runs VBIOS 94.06.25.00.7E.
Maybe it's too early to celebrate, but after applying this: "The NVIDIA GPU remains 'on the bus' if the NVIDIA Settings PowerMizer mode is set to 'Maximum Performance'", there has been no "GPU has fallen off the bus" so far.
Jyka, is your change to the PowerMizer mode still working? Where exactly did you find this setting? Can you provide some instructions? I'd like to try this too.
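Not Jyka, but as a rough pointer: the PowerMizer mode can be changed from the command line with nvidia-settings as well as from its PowerMizer page. Below is a minimal Python sketch that builds (and optionally runs) that command. It assumes the GPUPowerMizerMode attribute with 0 = Adaptive, 1 = Prefer Maximum Performance and 2 = Auto, which is my understanding of the NV-CONTROL values; please verify against your driver's documentation. The function names and the mode labels in the dictionary are made up for illustration.

```python
import subprocess
from typing import List

# Assumed attribute values (check your driver README before relying on them).
POWERMIZER_MODES = {"adaptive": 0, "max_performance": 1, "auto": 2}


def powermizer_command(mode: str = "max_performance", gpu_index: int = 0) -> List[str]:
    """Build the nvidia-settings argv that assigns the PowerMizer mode."""
    value = POWERMIZER_MODES[mode]  # raises KeyError for an unknown mode
    return ["nvidia-settings", "-a", f"[gpu:{gpu_index}]/GPUPowerMizerMode={value}"]


def apply_powermizer(mode: str = "max_performance", gpu_index: int = 0) -> None:
    """Run the command; needs a running X session and nvidia-settings installed."""
    subprocess.run(powermizer_command(mode, gpu_index), check=True)
```

Calling `apply_powermizer()` from a desktop session should be equivalent to picking the maximum performance entry on the PowerMizer page. As far as I know the setting does not survive a restart of the X server, so it would need to be reapplied at login.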