455.23.04: Page allocation failure in kernel module at random points

I went back to nvidia driver version 450.80.02 and still had a crash:

[68876.027615] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[68876.027624] CPU: 11 PID: 3521 Comm: Xorg Tainted: P O T 5.9.3-gentoo #1
[68876.027626] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[68876.027627] Call Trace:
[68876.027636] dump_stack+0x6d/0x90
[68876.027641] warn_alloc.cold+0x74/0xdb
[68876.027645] ? __alloc_pages_direct_compact+0x11d/0x140
[68876.027649] __alloc_pages_slowpath.constprop.0+0xb16/0xb50
[68876.027652] ? prep_new_page+0xbd/0xc0
[68876.027656] ? skb_copy_datagram_from_iter+0x53/0x1c0
[68876.027659] __alloc_pages_nodemask+0x205/0x250
[68876.027663] kmalloc_order+0x27/0x70
[68876.027681] nvkms_alloc+0x1b/0xd0 [nvidia_modeset]
[68876.027702] _nv002653kms+0x16/0x30 [nvidia_modeset]
[68876.027720] ? _nv002759kms+0x66/0x1470 [nvidia_modeset]
[68876.027736] ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[68876.027739] ? __alloc_pages_nodemask+0x125/0x250
[68876.027754] ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[68876.027757] ? kmalloc_order+0x61/0x70
[68876.027772] ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[68876.027787] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[68876.027802] ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[68876.027817] ? nvkms_ioctl_common+0x127/0x160 [nvidia_modeset]
[68876.028012] ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]
[68876.028015] ? __x64_sys_ioctl+0x7b/0xb0
[68876.028019] ? do_syscall_64+0x2d/0x70
[68876.028022] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
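
A note on reading the first line of the trace: order:5 means the kernel was asked for 2^5 = 32 physically contiguous pages, which is 128 KiB with 4 KiB pages. A request that size can fail on a fragmented system even when plenty of memory is free overall. The sketch below is one way to see how many free blocks of at least a given order each zone currently has, by summing columns of /proc/buddyinfo. The function name free_blocks_at_least is made up for illustration, and it assumes the usual "Node N, zone NAME count0 count1 ..." layout of that file.

```shell
# Print, per memory zone, how many free blocks of at least ORDER exist.
# Usage: free_blocks_at_least ORDER [FILE]   (FILE defaults to /proc/buddyinfo)
free_blocks_at_least() {
    order="$1"
    file="${2:-/proc/buddyinfo}"
    awk -v order="$order" '
        # Fields: "Node", "0,", "zone", NAME, then one count per order 0..N,
        # so the count for order k sits in field 5 + k.
        $1 == "Node" && $3 == "zone" {
            total = 0
            for (i = 5 + order; i <= NF; i++) total += $i
            print $4, total
        }
    ' "$file"
}

# Example: free_blocks_at_least 5
```

A zone that reports 0 here has no free block large enough for an order:5 request without compaction or reclaim.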

This computer has been really flaky for months, so I suspect I have a CPU or motherboard issue. I can’t be sure, though, since the failure is so nvidia-related.

Are you positive you back-leveled cleanly? I was having this problem multiple times a day with 455 (especially during high i/o), but I’ve been running back on 450 for the last week or two and haven’t had a single crash.

If your system really does have a good 450 install, I think the cause might be different - as this doesn’t seem to be something others experienced before 455.

Thanks for replying. I’m really frustrated.

My system has been unstable for months. I’m getting a crash a day at this point. I tested it last night for a couple of hours in Windows to see if I could get it to crash. I ran prime95 and 3dmark burn. It did not crash.

If I run Windows and then boot Linux, the computer has audio issues, and I have to power it off and back on to get audio working properly. The symptom is that the speaker output jack thinks things are being plugged in when nothing is there, and it randomly switches the audio output on that port on and off. Audio doesn’t work.

Gentoo is pretty good about installing packages cleanly:

server ~ # equery list nvidia-drivers
 * Searching for nvidia-drivers ...
[IP-] [  ] x11-drivers/nvidia-drivers-450.80.02:0/450

server ~ # dmesg | grep -i nvidia
[ 6.110204] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input2
[ 6.110289] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input3
[ 6.110371] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input4
[ 6.110456] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input5
[ 6.110538] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input6
[ 6.110624] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input7
[ 10.630273] nvidia: loading out-of-tree module taints kernel.
[ 10.630281] nvidia: module license 'NVIDIA' taints kernel.
[ 10.646730] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 10.647189] nvidia 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 10.850882] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 11.102774] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 11.655834] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 11.660950] [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
[ 11.660953] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 0
[ 15.977342] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
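
One way to double-check that the running kernel module really is the version the package manager reports is to pull the version string out of the NVRM banner. This is a minimal sketch, assuming the banner keeps the format shown in the dmesg output above; the function name nvidia_module_version is hypothetical.

```shell
# Read kernel log text on stdin and print the NVIDIA kernel module
# version found in each banner line that contains it.
nvidia_module_version() {
    sed -n 's/.*NVIDIA UNIX x86_64 Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Example usage (reading the kernel log may need root):
#   dmesg | nvidia_module_version
# The loaded module also exposes the same banner here:
#   nvidia_module_version < /proc/driver/nvidia/version
```

If this prints anything other than the version equery shows, an older module is still being loaded.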

There is one thing that is unique to this system: I have an old Marvell PCIe SAS RAID controller running to give me three extra SATA connectors. There are six hard drives for a RAID6 array. The hard drives (Device Model: TOSHIBA MG06ACA600EY) are inexpensive, I think because they have issues. There have been many firmware updates since I got them. There is a long lag (usually about a second) when you first access them, as if they are asleep. I used hdparm to tune down their power savings, but there is still a lag.

I’m also running three security cameras (one of them is 4K the other two are 2K) which grab all their data through Ethernet and are streamed through SHM to the boot NVMe.

There is a lot of I/O happening on this hardware all the time. I can’t reproduce that in Windows. So that could be a big difference between the testing setups.

The hardware also hangs when running MemTest86+ on the second pass, but I have not found a system in my house that will run memtest86+. The laptops I tried all hang with a blank screen before the test even begins. So I can’t be sure that memtest86+ works and is reporting actual issues.

I told AMD about the issue with memtest86+ and they said it was likely a defective CPU. But because the CPU is out of warranty, I’m SOL. Bummer.

Also, when it crashes, it’s always related to video. Video is always the thing that dies. Sometimes I can recover it: hitting CTRL-ALT-F1 brings up a console, and switching back to X with CTRL-ALT-F7 produces a black screen and cursor. If I repeat the sequence (CTRL-ALT-F1, then CTRL-ALT-F7) over and over, X usually restores on the third iteration. Some of the apps have to be minimized and restored to fix their graphics, but everything starts working again. So the thought keeps haunting me: could this be a bad GTX1050gt?

The problems I’m having happen even when DPMS is not active. I can be watching a video using mpv (an mplayer derivative) and video will lock while audio is still working. The same is true of youtube videos. Often CTRL-ALT-F1 to F7 three times will recover those as well.

Then there’s the other problem. If I wake up DPMS at exactly the wrong time (just after it activates) video stays black. CTRL-ALT-F1 works, but switching back with CTRL-ALT-F7 results in a black screen. However, if I unplug the LG TV from the Yamaha Audio Receiver, and plug it back in, video is restored.

I’ve been studying these issues, trying things, for months. And it’s just getting worse. I’m thinking of going back to an old kernel to see if it makes a difference.

Which reminds me, I’m also booting the kernel with 'idle=nomwait rcu_nocbs=0-11 pci=msi' because that was how you got the Ryzen 5 1600 to be stable back in the day. I wonder if I should remove those now. I haven’t tried that yet.

It is far more likely to crash when the CPUs are busy, specifically CPU 0, which does all the I/O. I have seen that. When the system is doing a compile (Gentoo is a source-based Linux distro), everything crawls in X even though the builds are supposed to run at lower priority. It seems like the combination of I/O and compiling in the background makes X really laggy. If I kill the security cameras (zoneminder), the lag is noticeably less.

I’m looking at a hardware upgrade as the only choice to fix this and I’m still not convinced it’s a hardware issue. I want to wait for Zen 4 or at least until Zen3 has a cheaper six core option.

Honestly, I don’t want to replace things when there may not actually be a hardware issue. It could be software, right?

Edit: I also told the linux kernel to not use CPU0 and CPU1 (core 0) to see if that would help. I can see the I/O on core 0 as blips in htop. But even that didn’t help stability.

I think there’s more underlying issues than just this bug. Downgrading to 450 should get rid of the page allocation thing due to I/O or whatever the hell causes it.

I downgraded to 450 today and I finally can use my PC normally again. No more crashes when I download or move 2 GB of data or something. What a hell it has been.

You are right. I started digging into what it takes to run a Ryzen 5 1600 stably in Linux. I had done this in 2018 and found the workarounds to keep the system from crashing. They include kernel command line options and a little script called zenstates.py which disables power state C6.

Well, I’m getting messages in dmesg about a python program not accessing the arb interface correctly. It turns out the kernel is complaining about zenstates.py.

I assume it’s not working anymore and removed it from my boot sequence. That’s when things got really bad.

It may be working. I put it back. It says it is working, but there are error messages in dmesg, so I don’t know for sure. I’m running all the command line stuff I could gather: idle=nomwait rcu_nocbs=0-11 amd_iommu=on video=efifb:off kvm.ignore_msrs=1 pci=msi. zenstates.py is running again, and I’m running the old nvidia driver.
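
With this many boot options in play, it may help to confirm which ones the running kernel actually received. The sketch below checks /proc/cmdline for each parameter listed above; the function names has_param and report_params are hypothetical, and the parameter list is simply the one from this post.

```shell
# Succeed (exit 0) if PARAM appears as a whole token on the kernel command line.
# Usage: has_param PARAM [FILE]   (FILE defaults to /proc/cmdline)
has_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Put one token per line, then look for an exact fixed-string match
    tr -s ' \t' '\n\n' < "$file" | grep -Fxq -e "$param"
}

# Report which of the workaround parameters from this thread are active.
# Usage: report_params [FILE]
report_params() {
    for p in idle=nomwait rcu_nocbs=0-11 amd_iommu=on video=efifb:off \
             kvm.ignore_msrs=1 pci=msi; do
        if has_param "$p" "$@"; then
            echo "active:  $p"
        else
            echo "missing: $p"
        fi
    done
}

# Example: report_params
```

Matching whole tokens avoids false positives, for example "idle" matching inside "idle=nomwait".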

Hopefully this will make things better? I dunno. I just wish nvidia would fix this bug so I would have half a chance to fix the rest of my issues.

Yeah, definitely sounds like there’s some deeper stuff broken. Maybe some other bad hacks and workarounds you implemented and forgot about or a bad OC? I’m running a Ryzen 5 1600 and don’t need weird kernel lines and frankly nobody should for this processor in 2020. Might have helped in the first few months after the release when the kernel wasn’t fully supporting the processor, but that’s not the case anymore.

Try a live system and see if the issues still arise or just reinstall your OS.
You should investigate further, collect some logs, and put them in another thread (probably in your OS’s forums). This thread is about the Nvidia page allocation issue. Your issue seems like there’s a whole lot more in play than just this one awful bug.

Since I poured my heart on here I feel I should update everyone.

For those of you who don’t know, there are issues with AMD Zen1 which show up much more frequently in Linux. One bug was fatal and my original R5 1600 had it. AMD replaced it. The second processor has other issues. I won’t bother detailing them here; I will simply post a link to where I describe the various fixes I have in place. It is a lot more stable now: up for three days without a problem. I’ll leave it running until next week’s system update and see if it crashes. Installing new packages from source is one of the stressors that caused it to crash before, so we’ll see how it does. For now I’m hopeful that I can continue to use the system until Zen 4 comes out in 2022.

Here’s the link: https://forums.gentoo.org/viewtopic-t-1118722-start-55.html

Summary: Going back to 450.80.02 does seem to stop the crashes, although I’m getting video stuttering now once in a while in youtube and other video applications.

One of my coworkers added a fix to avoid memory allocations in critical code paths, but it was fairly invasive and considered too risky for the release branch.

However, part of the change can be applied as a patch to existing drivers. While it’s not considered a complete fix, it might be worth trying.

  1. Download http://people.freedesktop.org/~aplattner/reduce-kmalloc-limit-455.38.patch
  2. Apply it to the .run package with
    bash NVIDIA-Linux-x86_64-455.38.run --apply-patch reduce-kmalloc-limit-455.38.patch
  3. Install the resulting .run package
    bash NVIDIA-Linux-x86_64-455.38-custom.run
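
For anyone who wants to script the three steps above, here is a minimal sketch. It assumes wget is installed and that the stock NVIDIA-Linux-x86_64-455.38.run is already in the current directory. The function names are made up, and the -custom.run output name is taken from step 3.

```shell
PATCH_URL=http://people.freedesktop.org/~aplattner/reduce-kmalloc-limit-455.38.patch
RUN_FILE=NVIDIA-Linux-x86_64-455.38.run

# Derive the name of the patched package, e.g. foo.run -> foo-custom.run
custom_run_name() {
    echo "${1%.run}-custom.run"
}

# Steps 1 and 2: download the patch and build the patched .run package.
build_patched_run() {
    patch_file="${PATCH_URL##*/}"   # file name part of the URL
    wget -O "$patch_file" "$PATCH_URL" || return 1
    bash "$RUN_FILE" --apply-patch "$patch_file" || return 1
    # Fail if the expected output file was not produced
    [ -f "$(custom_run_name "$RUN_FILE")" ]
}

# Step 3, as root from a text console with X stopped:
#   build_patched_run && bash "$(custom_run_name "$RUN_FILE")"
```

Nothing runs until build_patched_run is called, so the file can be sourced and inspected first.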

@aplattner, I have just installed the patched driver, will report back later about the results.

Thank you for the patch. I am trying it too.

I’m on Fedora 32. I had downgraded to 450.57 and everything was running smoothly again until the latest kernel-5.8.18-200.fc32.x86_64 was installed (i.e. it was OK on 5.8.17-200).

It crashes every two days or so. Once the screen switches off, I will sometimes come back and find that the keyboard and mouse do not respond and the screen doesn’t come back to life, although I can SSH into the PC.

The error I’d get in the ABRT reports was:
BUG: unable to handle page fault for address: 0000000000007980 [nvidia_modeset]
crash function: _nv002760kms

I will test for a few days and feed back.

@aplattner I’ve been running 455.46.01 with the patch applied for about 4 days and haven’t observed any page allocation errors. I’ve also tried enabling HardDPMS, which used to cause page allocation failures even with 450 series drivers, and it also doesn’t cause errors. So I can cautiously say that the patch helps.

It seems so. I too have not yet gotten any crashes for 4 days after applying the patch.

The same here. Without the patch, I have the issue when copying huge files. I did tests and all works fine with it.
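
For those still testing the patch, a small filter can be more systematic than eyeballing dmesg: it lists which process hit a failure and at what order. This is only a sketch. The function name alloc_failures is hypothetical, and it assumes the "COMM: page allocation failure: order:N" wording seen in the trace at the top of the thread, with a bracketed timestamp in front.

```shell
# Read kernel log text on stdin and print "PROCESS ORDER" for every
# page allocation failure line; all other lines are dropped.
alloc_failures() {
    sed -n 's/.*\] \([^:]*\): page allocation failure: order:\([0-9]*\).*/\1 \2/p'
}

# Example: count failures per process and order since boot
#   dmesg | alloc_failures | sort | uniq -c
```

An empty result after a few days of heavy I/O is consistent with the reports above, though it does not prove the bug is gone.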

Does anyone know how to install this patch on Arch Linux? When I run NVIDIA-Linux-x86_64-455.38-custom.run
I get:

ERROR: An NVIDIA kernel module ‘nvidia-drm’ appears to already be loaded in
your kernel. This may be because it is in use (for example, by an X
server, a CUDA program, or the NVIDIA Persistence Daemon), but this
may also happen if your kernel was configured without support for
module unloading. Please be sure to exit any programs that may be
using the GPU(s) before attempting to upgrade your driver. If no
GPU-based programs are running, you know that your kernel supports
module unloading, and you still receive this message, then an error
may have occured that has corrupted an NVIDIA kernel module’s usage
count, for which the simplest remedy is to reboot your computer.

So I removed the nvidia driver package and rebooted. It rebooted into a text console and I ran NVIDIA-Linux-x86_64-455.38-custom.run again, but it tells me to unload Nouveau. I tried modprobe -r nouveau, but this returned “modprobe: FATAL: Module nouveau is in use.”. I also tried rmmod -f nouveau, but this made my screen go dark.

@volker.weissmann, first of all, disable any systemd services related to login managers if you use one, reboot into console mode, log in, and run sudo bash NVIDIA-Linux-x86_64-455.38-custom.run. The installer should offer to automatically disable nouveau and reboot; after that you will be able to install the driver without problems. If that doesn’t work for any reason, try manually creating the file /usr/lib/modprobe.d/disable-nouveau.conf with the following content:

blacklist nouveau
options nouveau modeset=0

Reboot and restart the installer.
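
The blacklist file can also be written from the console in one go. This is a minimal sketch: the function name write_nouveau_blacklist is hypothetical, the default path is the one given above, and writing there needs root.

```shell
# Write the two-line nouveau blacklist to FILE.
# Usage: write_nouveau_blacklist [FILE]
write_nouveau_blacklist() {
    target="${1:-/usr/lib/modprobe.d/disable-nouveau.conf}"
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}

# Example (as root):
#   write_nouveau_blacklist
# After the reboot, this should print nothing if nouveau stayed unloaded:
#   lsmod | grep nouveau
```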

nvidia-455xx-dkms-patched-PKGBUILD.txt (1.6 KB)

I made this pkgbuild based on the AUR dkms pkg for arch by adding the patch. Fingers crossed.

EDIT: I am not the maintainer

EDIT2: Seems to have applied properly on Manjaro.

Before reboot, I removed the linux{kernel-ver}-nvidia-455xx from the manjaro official repo without running into any missing dependency errors and it booted up no problem.

To be clear, does that mean the fix isn’t in the new Linux 5.9-compatible 455 series release from yesterday either (since it’s the release branch)? Does Nvidia plan to release a new version of the 450 driver with Linux 5.9 compatibility? Those of us stuck on 5.8 need to upgrade since there have been important Intel security fixes since 5.8 went EOL, but the 455 series isn’t stable enough to stay online for a day if the fix didn’t land yet. If the fix wasn’t in this release, is there a rough timeline on a working or fixed 5.9-compatible release you can share so we can decide if it’s worth it to move our machines back to Linux 5.4 to get the security fixes? (i.e. weeks, months, etc)

Regarding the modesetting changes being “too risky for the release branch”, is there a breakdown in communication about the existing state of modesetting, or are those of us experiencing constant modesetting crashes a minority of the modesetting userbase? If it’s the latter do you know if there’s something we’re doing that we can change to work around the issue? Right now we need at least 2-3 days of system uptime in order to complete our network training, and the last 455 series we tested wasn’t making it past a day.

I doubt it’s in there. No mention in the changelog and seeing that they gave us this patch (which I haven’t tried yet) it’ll probably be a long time until they fix it in a future release.

On your question about release timeline, see their previous answer here:

This was something they had given a release timeline on already (“mid November”), but I hadn’t considered “5.9 compatible” would include this bug. Since it’s now a security issue too, I figured they might give an updated schedule on e.g. 450 series 5.9 compatibility or clarify their position on the severity.