455.23.04: Page allocation failure in kernel module at random points

Yeah, definitely sounds like there’s some deeper stuff broken. Maybe some other bad hacks and workarounds you implemented and forgot about or a bad OC? I’m running a Ryzen 5 1600 and don’t need weird kernel lines and frankly nobody should for this processor in 2020. Might have helped in the first few months after the release when the kernel wasn’t fully supporting the processor, but that’s not the case anymore.

Try a live system and see if the issues still arise or just reinstall your OS.
You should investigate further, collect some logs and post them in a separate thread (probably in your OS’s forums). This thread is about the Nvidia page allocation issue. Your issue seems to involve a whole lot more than just this one awful bug.

Since I poured my heart out here, I feel I should update everyone.

For those of you who don’t know, there are issues with AMD Zen 1 which show up much more frequently on Linux. One bug was fatal and my original R5 1600 had it. AMD replaced it. The second processor has other issues. I won’t bother detailing them here; I will simply post a link to where I describe the various fixes I have in place. It is a lot more stable now. Up for three days without a problem. I’ll leave it running until next week’s system update and see if it crashes. Installing new packages from source is one of the stressors that caused it to crash before, so we’ll see how it does. For now I’m hopeful that I can continue to use the system until Zen 4 comes out in 2022.

Here’s the link: Gentoo Forums :: View topic - Method to test crashing system?

Summary: Going back to 450.80.02 does seem to stop the crashes, although I’m now getting occasional video stuttering in YouTube and other video applications.

One of my coworkers added a fix to avoid memory allocations in critical code paths, but it was fairly invasive and considered too risky for the release branch.

However, part of the change can be applied as a patch to existing drivers. While it’s not considered a complete fix, it might be worth trying.

  1. Download http://people.freedesktop.org/~aplattner/reduce-kmalloc-limit-455.38.patch
  2. Apply it to the .run package with
    bash NVIDIA-Linux-x86_64-455.38.run --apply-patch reduce-kmalloc-limit-455.38.patch
  3. Install the resulting .run package
    bash NVIDIA-Linux-x86_64-455.38-custom.run
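
For anyone who wants to script the three steps above, here is a minimal sketch. The version number, patch URL and installer options are taken from the post; the `run` helper, the `DRY_RUN` switch and the use of `curl` for the download are assumptions added for illustration. By default it only prints the commands, and the final install step needs root when run for real.

```shell
#!/usr/bin/env bash
# Sketch of the download / patch / install steps from the post.
# DRY_RUN and the helper names are illustrative assumptions, not part of the
# NVIDIA installer.

VERSION="455.38"
RUN_FILE="NVIDIA-Linux-x86_64-${VERSION}.run"
PATCH_FILE="reduce-kmalloc-limit-${VERSION}.patch"
PATCH_URL="http://people.freedesktop.org/~aplattner/${PATCH_FILE}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

apply_patch_steps() {
    # 1. Download the patch next to the original .run package.
    run curl -fLO "$PATCH_URL"
    # 2. Build NVIDIA-Linux-x86_64-<version>-custom.run with the patch applied.
    run bash "$RUN_FILE" --apply-patch "$PATCH_FILE"
    # 3. Install the patched package (must be run as root when DRY_RUN=0).
    run bash "NVIDIA-Linux-x86_64-${VERSION}-custom.run"
}

apply_patch_steps
```

Review the printed commands first, then rerun with `DRY_RUN=0` from the directory that contains the original .run package.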

@aplattner, I have just installed the patched driver, will report back later about the results.

Thank you for the patch. I am trying it too.

I’m on Fedora 32. I had downgraded to 450.57 and everything was running smoothly again until the latest kernel-5.8.18-200.fc32.x86_64 was installed (i.e. it was OK on 5.8.17-200).

It crashes every two days or so. Once the screen switches off, I will sometimes come back and find that the keyboard and mouse do not respond and the screen doesn’t come back to life, although I can still SSH into the PC.

The error I’d get in the ABRT reports was:
BUG: unable to handle page fault for address: 0000000000007980 [nvidia_modeset]
crash function: _nv002760kms

I will test for a few days and feed back.

@aplattner I’ve been running 455.46.01 with the patch applied for about 4 days and haven’t observed any page allocation errors. I’ve also tried enabling HardDPMS, which used to cause page allocation failures even with 450 series drivers, and it also doesn’t cause errors. So I can cautiously say that the patch helps.

It seems so. I too have not yet gotten any crashes for 4 days after applying the patch.

The same here. Without the patch, I have the issue when copying huge files. I did tests and all works fine with it.
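
If you want something more concrete than "no crashes so far" when reporting back, a small helper like the one below can count page allocation failures in the kernel log. This is only a sketch: the function name is made up for illustration, and it assumes the kernel message contains the phrase "page allocation failure", as in the title of this thread.

```shell
# Count kernel log lines that report a page allocation failure.
# Reads the log text from stdin, so it works with dmesg or journalctl output.
# The function name is an illustrative assumption.
count_alloc_failures() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c 'page allocation failure' || true
}

# Typical use (may need root, depending on the distribution):
#   journalctl -k -b | count_alloc_failures
#   dmesg | count_alloc_failures
```

Running it before and after installing the patched driver, over comparable uptimes, gives a rough comparison rather than proof that the bug is gone.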

Does anyone know how to install this patch on Arch Linux? When I run NVIDIA-Linux-x86_64-455.38-custom.run
I get:

ERROR: An NVIDIA kernel module ‘nvidia-drm’ appears to already be loaded in
your kernel. This may be because it is in use (for example, by an X
server, a CUDA program, or the NVIDIA Persistence Daemon), but this
may also happen if your kernel was configured without support for
module unloading. Please be sure to exit any programs that may be
using the GPU(s) before attempting to upgrade your driver. If no
GPU-based programs are running, you know that your kernel supports
module unloading, and you still receive this message, then an error
may have occured that has corrupted an NVIDIA kernel module’s usage
count, for which the simplest remedy is to reboot your computer.

So I removed the nvidia driver package and rebooted. It rebooted into a text console and I ran NVIDIA-Linux-x86_64-455.38-custom.run again, but it told me to unload Nouveau. I tried modprobe -r nouveau, but this returned “modprobe: FATAL: Module nouveau is in use.”. I also tried rmmod -f nouveau, but this made my screen go dark.

@volker.weissmann, First of all, disable any systemd services related to login managers if you use one, reboot into console mode, log in, and run sudo bash NVIDIA-Linux-x86_64-455.38-custom.run. The installer should offer to disable nouveau automatically and reboot; after that you will be able to install the driver without problems. If that doesn’t work for any reason, try manually creating the file /usr/lib/modprobe.d/disable-nouveau.conf with the following content:

blacklist nouveau
options nouveau modeset=0

Reboot and restart the installer.
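
The manual fallback can be sketched as two small shell functions. The file path and the two config lines come from the post above; the function names and the optional path argument are assumptions for illustration. Writing to the default path requires root.

```shell
# Write the two-line blacklist file. The target defaults to the path named in
# the post; pass another path (e.g. a temp file) to preview the content.
# Function names here are illustrative assumptions.
write_nouveau_blacklist() {
    local target="${1:-/usr/lib/modprobe.d/disable-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}

# Succeeds if nouveau is currently loaded, based on a /proc/modules listing.
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}"
}

# Typical use after rebooting:
#   if nouveau_loaded; then echo "nouveau is still loaded"; fi
```

If `nouveau_loaded` still succeeds after the reboot, the blacklist has not taken effect and the installer will complain about Nouveau again.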

nvidia-455xx-dkms-patched-PKGBUILD.txt (1.6 KB)

I made this pkgbuild based on the AUR dkms pkg for arch by adding the patch. Fingers crossed.

EDIT: I am not the maintainer

EDIT2: Seems to have applied properly on Manjaro.

Before reboot, I removed the linux{kernel-ver}-nvidia-455xx from the manjaro official repo without running into any missing dependency errors and it booted up no problem.

To be clear, does that mean the fix isn’t in the new Linux 5.9-compatible 455 series release from yesterday either (since it’s the release branch)? Does Nvidia plan to release a new version of the 450 driver with Linux 5.9 compatibility? Those of us stuck on 5.8 need to upgrade since there have been important Intel security fixes since 5.8 went EOL, but the 455 series isn’t stable enough to stay online for a day if the fix didn’t land yet. If the fix wasn’t in this release, is there a rough timeline (i.e. weeks, months, etc.) for a working or fixed 5.9-compatible release that you can share, so we can decide if it’s worth moving our machines back to Linux 5.4 to get the security fixes?

Regarding the modesetting changes being “too risky for the release branch”, is there a breakdown in communication about the existing state of modesetting, or are those of us experiencing constant modesetting crashes a minority of the modesetting userbase? If it’s the latter do you know if there’s something we’re doing that we can change to work around the issue? Right now we need at least 2-3 days of system uptime in order to complete our network training, and the last 455 series we tested wasn’t making it past a day.

I doubt it’s in there. No mention in the changelog and seeing that they gave us this patch (which I haven’t tried yet) it’ll probably be a long time until they fix it in a future release.

On your question about release timeline, see their previous answer here:

This was something they had given a release timeline on already (“mid November”), but I hadn’t considered “5.9 compatible” would include this bug. Since it’s now a security issue too, I figured they might give an updated schedule on e.g. 450 series 5.9 compatibility or clarify their position on the severity.

How did you accomplish the downgrade? I’m on Fedora 32 also and I can’t figure out how to downgrade to 450. There isn’t a package for it in the rpmfusion repository. Did you use the official version from nvidia instead?

Yes, I downloaded the official version from NVIDIA. After having had issues with the various repositories from time to time, I’ve used this excellent blog article as reference:
https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/
and been using the official version from NVIDIA directly for years now.

BTW, the patch process to 455 mentioned by @aplattner on 11 November has worked marvellously!! I have not had a crash since 12 November when I last rebooted.

@aplattner - my promised feedback: I applied the above patch to 455.38 on 12 Nov, and have not had one crash since then. 7 days uptime! :-]

Thank you for your help, this worked.
I originally wanted to try exactly this, but I thought that blacklisting nouveau would result in a black screen if no nvidia driver is installed.
I followed this guide for blacklisting
https://wiki.archlinux.org/index.php/Kernel_module#Blacklisting

The patch seemed to fail on Fedora 32 at first. nvidia-modeset.ko.xz is included in the initramfs image, so I needed to rebuild that manually:

dracut --force /boot/initramfs-$(uname -r).img $(uname -r)

Examination of the nvkms_alloc function in nvidia_modeset module with gdb disassemble in /proc/kcore now shows the expected change.
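
A lighter check than gdb is to confirm which NVIDIA modules the initramfs actually carries. The sketch below filters a file listing for them; it assumes the listing comes from dracut’s `lsinitrd` tool, and the function name is made up for illustration. It only shows that the modules are present in the image, not that they are the patched build.

```shell
# Filter an initramfs file listing (stdin) down to the NVIDIA kernel modules
# it contains, one unique file name per line.
# The function name is an illustrative assumption.
list_nvidia_modules() {
    grep -oE 'nvidia[A-Za-z0-9_-]*\.ko(\.xz)?' | sort -u
}

# Typical use (may need root to read the image):
#   lsinitrd /boot/initramfs-$(uname -r).img | list_nvidia_modules
```

If nvidia-modeset.ko.xz shows up in the output, the image has to be rebuilt after every driver reinstall, otherwise the old module is loaded at boot.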