Yeah, definitely sounds like there’s some deeper stuff broken. Maybe some other bad hacks and workarounds you implemented and forgot about or a bad OC? I’m running a Ryzen 5 1600 and don’t need weird kernel lines and frankly nobody should for this processor in 2020. Might have helped in the first few months after the release when the kernel wasn’t fully supporting the processor, but that’s not the case anymore.
Try a live system and see if the issues still arise or just reinstall your OS.
You should investigate further, collect some logs, and post them in another thread (probably in your OS’s forums). This thread is about the Nvidia page allocation issue. Your issue seems like there’s a whole lot more in play than just this one awful bug.
Since I poured my heart out on here I feel I should update everyone.
For those of you who don’t know, there are issues with AMD Zen 1 which show up much more frequently on Linux. One bug was fatal and my original R5 1600 had it. AMD replaced it. The second processor has other issues. I won’t bother detailing them here; I will simply post a link to where I describe the various fixes I have in place. It is a lot more stable now: up for three days without a problem. I’ll leave it running until next week’s system update and see if it crashes. Installing new packages from source is one of the stressors that caused it to crash before, so we’ll see how it does. For now I’m hopeful that I can continue to use the system until Zen 4 comes out in 2022.
I’m on Fedora 32. I had downgraded to 450.57 and everything was running smoothly again until the latest kernel-5.8.18-200.fc32.x86_64 was installed (i.e. it was OK on 5.8.17-200).
It crashes every two days or so. Once the screen switches off, I will sometimes come back and find that the keyboard and mouse do not respond and the screen doesn’t come back to life, although I can SSH into the PC.
The error I’d get in the ABRT reports was:
BUG: unable to handle page fault for address: 0000000000007980 [nvidia_modeset]
crash function: _nv002760kms
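If you want to check whether your own logs contain the same kind of report, a small script can pull out the matching lines. This is only a rough sketch: the function name `find_page_faults` is made up for illustration, and the pattern assumes the line format quoted above, including the trailing `[module]` tag as ABRT printed it here. Other kernels or tools may format the line differently.

```python
import re

# Pattern modelled on the line quoted in this post. The "[module]" suffix
# is treated as optional, since plain kernel logs may not include it.
PAGE_FAULT_RE = re.compile(
    r"BUG: unable to handle page fault for address: (?P<addr>[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def find_page_faults(log_text):
    """Return a list of (address, module_or_None), one per matching line."""
    hits = []
    for line in log_text.splitlines():
        match = PAGE_FAULT_RE.search(line)
        if match:
            hits.append((match.group("addr"), match.group("module")))
    return hits


if __name__ == "__main__":
    sample = (
        "BUG: unable to handle page fault for address: "
        "0000000000007980 [nvidia_modeset]"
    )
    for addr, module in find_page_faults(sample):
        print(addr, module)
```

You could feed it the text of a saved kernel log (for example, output captured from `journalctl -k`) and filter the results for modules whose name starts with `nvidia`.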
@aplattner I’ve been running 455.46.01 with the patch applied for about 4 days and haven’t observed any page allocation errors. I’ve also tried enabling HardDPMS, which used to cause page allocation failures even with 450 series drivers, and it also doesn’t cause errors. So I can cautiously say that the patch helps.
Does anyone know how to install this patch on Arch Linux? When I run NVIDIA-Linux-x86_64-455.38-custom.run, I get:
ERROR: An NVIDIA kernel module ‘nvidia-drm’ appears to already be loaded in
your kernel. This may be because it is in use (for example, by an X
server, a CUDA program, or the NVIDIA Persistence Daemon), but this
may also happen if your kernel was configured without support for
module unloading. Please be sure to exit any programs that may be
using the GPU(s) before attempting to upgrade your driver. If no
GPU-based programs are running, you know that your kernel supports
module unloading, and you still receive this message, then an error
may have occured that has corrupted an NVIDIA kernel module’s usage
count, for which the simplest remedy is to reboot your computer.
So I removed the nvidia driver package and rebooted. It rebooted into a text console and I ran NVIDIA-Linux-x86_64-455.38-custom.run again, but it tells me to unload Nouveau. I tried modprobe -r nouveau, but this returned “modprobe: FATAL: Module nouveau is in use.”. I also tried rmmod -f nouveau, but this made my screen go dark.
@volker.weissmann, First of all, disable any systemd services related to login managers if you use one, reboot into console mode, log in, and run sudo NVIDIA-Linux-x86_64-455.38-custom.run. The installer should offer to disable nouveau automatically and reboot; after that you will be able to install the driver without problems. If it doesn’t work for any reason, try to manually create a /usr/lib/modprobe.d/disable-nouveau.conf file with the following content:
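The file content from the post above did not survive here, so what follows is an assumption rather than the poster’s exact text. The two directives `blacklist nouveau` and `options nouveau modeset=0` are what nvidia-installer normally writes when it disables nouveau itself, and a minimal sketch that creates such a file could look like this. The function name `write_nouveau_blacklist` is made up for illustration.

```python
import os

# Assumption: these two lines mirror what nvidia-installer usually generates;
# they are not necessarily what the original post contained.
NOUVEAU_BLACKLIST = "blacklist nouveau\noptions nouveau modeset=0\n"


def write_nouveau_blacklist(directory="/usr/lib/modprobe.d",
                            filename="disable-nouveau.conf"):
    """Write the blacklist file into `directory` and return its full path."""
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(NOUVEAU_BLACKLIST)
    return path
```

Writing to the default directory needs root, and the change only takes effect after a reboot. If nouveau is loaded from the initramfs on your system, the image may also have to be regenerated (on Arch, typically with `mkinitcpio -P`) before the blacklist applies.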
To be clear, does that mean the fix isn’t in the new Linux 5.9-compatible 455 series release from yesterday either (since it’s the release branch)? Does Nvidia plan to release a new version of the 450 driver with Linux 5.9 compatibility? Those of us stuck on 5.8 need to upgrade since there have been important Intel security fixes since 5.8 went EOL, but the 455 series isn’t stable enough to stay online for a day if the fix didn’t land yet. If the fix wasn’t in this release, is there a rough timeline for a working or fixed 5.9-compatible release you can share (e.g. weeks, months), so we can decide if it’s worth it to move our machines back to Linux 5.4 to get the security fixes?
Regarding the modesetting changes being “too risky for the release branch”, is there a breakdown in communication about the existing state of modesetting, or are those of us experiencing constant modesetting crashes a minority of the modesetting userbase? If it’s the latter do you know if there’s something we’re doing that we can change to work around the issue? Right now we need at least 2-3 days of system uptime in order to complete our network training, and the last 455 series we tested wasn’t making it past a day.
This was something they had given a release timeline on already (“mid November”), but I hadn’t considered “5.9 compatible” would include this bug. Since it’s now a security issue too, I figured they might give an updated schedule on e.g. 450 series 5.9 compatibility or clarify their position on the severity.
How did you accomplish the downgrade? I’m on Fedora 32 also and I can’t figure out how to downgrade to 450. There isn’t a package for it in the rpmfusion repository. Did you use the official version from nvidia instead?