455.23.04: Page allocation failure in kernel module at random points

I doubt it’s in there. There’s no mention of it in the changelog, and seeing that they gave us this patch (which I haven’t tried yet), it’ll probably be a long time until they fix it in a future release.

On your question about release timeline, see their previous answer here:

This was something they had already given a release timeline on (“mid November”), but I hadn’t considered that “5.9 compatible” would also cover this bug. Since it’s now a security issue too, I figured they might give an updated schedule for e.g. 450-series 5.9 compatibility, or clarify their position on the severity.

How did you accomplish the downgrade? I’m on Fedora 32 also and I can’t figure out how to downgrade to 450. There isn’t a package for it in the rpmfusion repository. Did you use the official version from nvidia instead?

Yes, I downloaded the official version from NVIDIA. After having had issues with the various repositories from time to time, I’ve used this excellent blog article as a reference:
https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/
and have been using the official version from NVIDIA directly for years now.
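For anyone wanting to take the same route, here is a rough sketch of what the official-installer approach boils down to (the version number and URL below are only examples of the usual download pattern - grab the exact link from nvidia.com, and blacklist nouveau first, as discussed further down):

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/455.38/NVIDIA-Linux-x86_64-455.38.run
sudo systemctl isolate multi-user.target     # stop the graphical session before installing
sudo sh NVIDIA-Linux-x86_64-455.38.run       # builds and installs the kernel modules interactively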

BTW, the patch for 455 that @aplattner posted on 11 November has worked marvellously!! I have not had a crash since 12 November, when I last rebooted.

@aplattner - my promised feedback: I applied the above patch to 455.38 on 12 Nov, and have not had one crash since then. 7 days uptime! :-]

Thank you for your help, this worked.
I originally wanted to try exactly this, but I thought that blacklisting nouveau would result in a black screen if no NVIDIA driver was installed.
I followed this guide for blacklisting:
https://wiki.archlinux.org/index.php/Kernel_module#Blacklisting
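For reference, a minimal sketch of what the blacklist step amounts to (the file name under /etc/modprobe.d/ is my own choice; the initramfs rebuild assumes a dracut-based distro such as Fedora):

printf 'blacklist nouveau\noptions nouveau modeset=0\n' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo dracut --force     # regenerate the initramfs so the blacklist applies at boot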

At first the patch seemed to fail on Fedora 32: nvidia-modeset.ko.xz is included in the initramfs image, so I needed to rebuild that manually:

dracut --force /boot/initramfs-$(uname -r).img $(uname -r)

Disassembling the nvkms_alloc function of the nvidia-modeset module with gdb against /proc/kcore now shows the expected change.
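A quicker sanity check (a sketch, assuming dracut’s lsinitrd tool is available) is to confirm that the rebuilt image actually contains the nvidia-modeset module and that the module on disk is the patched build:

lsinitrd /boot/initramfs-$(uname -r).img | grep nvidia-modeset     # module should be listed in the new image
modinfo nvidia-modeset | grep -E '^(filename|version)'             # confirm which build is installed on disk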

The posted patch seems to be working great for me on Arch Linux; I have been using it without any problems for a week or so. I haven’t tested whether the patch works on newer driver versions, but if you don’t want your system to undo it on every driver update, you should blacklist the NVIDIA packages from being updated. On Arch/Manjaro you can do this by uncommenting the IgnorePkg line in the pacman config file (/etc/pacman.conf) and adding nvidia (or nvidia-dkms, depending on which one you installed), nvidia-utils, nvidia-settings and lib32-nvidia-utils to it, as sketched below.
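A minimal sketch of that pacman.conf change (the package names depend on which driver variant is installed):

# In /etc/pacman.conf, under [options], uncomment and extend the IgnorePkg line, for example:
#   IgnorePkg = nvidia nvidia-utils nvidia-settings lib32-nvidia-utils
# then verify the change took effect:
grep '^IgnorePkg' /etc/pacman.conf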

Thanks. I’ve been using rpmfusion until now but with this bug in the latest release it might be time to switch.

I recently upgraded to 455.45 and am seeing this problem there too. Will try downgrading to 455.38 and applying the patch described above.

I am attaching a patch here which I think is the right way to handle the BUG. If you need my Signed-off-by, please reach out or just add it.

--- nvidia-modeset/nvidia-modeset-linux.c.org	2020-11-23 20:46:12.817979880 +1100
+++ nvidia-modeset/nvidia-modeset-linux.c	2020-11-24 10:50:31.474395155 +1100
@@ -21,6 +21,7 @@
 #include <linux/file.h>
 #include <linux/list.h>
 #include <linux/rwsem.h>
+#include <linux/mm.h>
 
 #include "nvstatus.h"
 
@@ -169,33 +170,19 @@ static inline void nvkms_write_unlock_pm
  * are called while nvkms_lock is held.
  *************************************************************************/
 
-/* Don't use kmalloc for allocations larger than 128k */
-#define KMALLOC_LIMIT (128 * 1024)
-
+/*
+ * Let the system decide when to switch between kmalloc and vmalloc
+ */
 void* NVKMS_API_CALL nvkms_alloc(size_t size, NvBool zero)
 {
-    void *p;
-
-    if (size <= KMALLOC_LIMIT) {
-        p = kmalloc(size, GFP_KERNEL);
-    } else {
-        p = vmalloc(size);
-    }
-
-    if (zero && (p != NULL)) {
-        memset(p, 0, size);
-    }
-
-    return p;
+    if (zero)
+        return kvzalloc(size, GFP_KERNEL);
+    return kvmalloc(size, GFP_KERNEL);
 }
 
 void NVKMS_API_CALL nvkms_free(void *ptr, size_t size)
 {
-    if (size <= KMALLOC_LIMIT) {
-        kfree(ptr);
-    } else {
-        vfree(ptr);
-    }
+    return kvfree(ptr);
 }
 
 void* NVKMS_API_CALL nvkms_memset(void *ptr, NvU8 c, size_t size)
root@host:/usr/local/src/nvidia# bash NVIDIA-Linux-x86_64-455.38.run --apply-patch bsingharora.patch
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 455.38..........
can't find file to patch at input line 3
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|--- nvidia-modeset/nvidia-modeset-linux.c.org  2020-11-23 20:46:12.817979880 +1100
|+++ nvidia-modeset/nvidia-modeset-linux.c      2020-11-24 10:50:31.474395155 +1100
--------------------------
File to patch:

Your patch header does not look like the one provided by @aplattner.

See the difference below and adjust yours so that it applies the same way the NVIDIA patch does:

root@host:/usr/local/src/nvidia# head -3 reduce-kmalloc-limit-455.38.patch bsingharora.patch
==> reduce-kmalloc-limit-455.38.patch <==
diff -Naur kernel.orig/nvidia-modeset/nvidia-modeset-linux.c kernel/nvidia-modeset/nvidia-modeset-linux.c
--- kernel.orig/nvidia-modeset/nvidia-modeset-linux.c   2020-10-21 23:17:41.000000000 -0700
+++ kernel/nvidia-modeset/nvidia-modeset-linux.c        2020-11-04 10:35:44.113986369 -0800

==> bsingharora.patch <==
--- nvidia-modeset/nvidia-modeset-linux.c.org   2020-11-23 20:46:12.817979880 +1100
+++ nvidia-modeset/nvidia-modeset-linux.c       2020-11-24 10:50:31.474395155 +1100
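Until the header is adjusted, one workaround (a sketch, assuming the installer supports --extract-only and that the paths in bsingharora.patch are relative to the kernel/ directory) is to unpack the installer and apply the patch by hand:

sh NVIDIA-Linux-x86_64-455.38.run --extract-only     # unpacks into NVIDIA-Linux-x86_64-455.38/
cd NVIDIA-Linux-x86_64-455.38/kernel
patch -p0 < ../../bsingharora.patch                  # the patch's paths already match files under kernel/
cd .. && sudo ./nvidia-installer                     # install from the patched tree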

OK, let’s try once more (this time via git):

diff --git a/nvidia-modeset/nvidia-modeset-linux.c b/nvidia-modeset/nvidia-modeset-linux.c
index ffbbeb9..2302541 100644
--- a/nvidia-modeset/nvidia-modeset-linux.c
+++ b/nvidia-modeset/nvidia-modeset-linux.c
@@ -21,6 +21,8 @@
 #include <linux/file.h>
 #include <linux/list.h>
 #include <linux/rwsem.h>
+#include <linux/mm.h>
+#include <linux/version.h>
 
 #include "nvstatus.h"
 
@@ -169,8 +171,9 @@ static inline void nvkms_write_unlock_pm_lock(void)
  * are called while nvkms_lock is held.
  *************************************************************************/
 
-/* Don't use kmalloc for allocations larger than 128k */
-#define KMALLOC_LIMIT (128 * 1024)
+#if LINUX_VERSION_CODE < KERNEL_VERSION(4, 12, 0)
+/* Don't use kmalloc for allocations larger than PAGE_SIZE */
+#define KMALLOC_LIMIT (PAGE_SIZE)
 
 void* NVKMS_API_CALL nvkms_alloc(size_t size, NvBool zero)
 {
@@ -197,6 +200,19 @@ void NVKMS_API_CALL nvkms_free(void *ptr, size_t size)
         vfree(ptr);
     }
 }
+#else
+void* NVKMS_API_CALL nvkms_alloc(size_t size, NvBool zero)
+{
+    if (zero)
+        return kvzalloc(size, GFP_KERNEL);
+    return kvmalloc(size, GFP_KERNEL);
+}
+
+void NVKMS_API_CALL nvkms_free(void *ptr, size_t size)
+{
+    kvfree(ptr);
+}
+#endif
 
 void* NVKMS_API_CALL nvkms_memset(void *ptr, NvU8 c, size_t size)
 {
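For anyone who wants to try this git-format version before it is repackaged, here is a sketch of applying it to an extracted source tree (the patch file name is just a placeholder):

cd NVIDIA-Linux-x86_64-455.38/kernel                 # extracted with --extract-only as above
git apply --check /path/to/nvkms-kvmalloc.patch      # dry run: report problems without touching files
git apply /path/to/nvkms-kvmalloc.patch              # a/ and b/ prefixes are stripped by default (-p1)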

I am pleased to report that I successfully applied the patch from @aplattner to version 455.45.01 and have not encountered a random display failure after almost 4 days of uptime (was previously getting failures every 1-2 days).

Not sure if the issue I’ve been seeing on my computer (GTX 1070ti) is related, but I can reproduce a video lockup by starting a VR session on my computer via Steam, exiting, and starting another one right after. Video just locks up at that point.

Or by starting a VR session after the computer has been on for a while.

But that first one is the usual repro case for me.

Some good news for Arch Linux users: thanks to the efforts of the Frogging-Family/nvidia-all project, some of the patches from this thread have been included in the package-building PKGBUILD config. You can find all the info you need on the linked GitHub page.

Circling back:

  • Apparently, on occasion, the hard locks happen on the first attempt at launching SteamVR, meaning my repro case is “reliable” only when it doesn’t crash on that first launch.
  • The hard hangs seem to be mitigated by disabling KDE/KWin compositing altogether. That suggests either the driver misbehaves at context setup when a certain “type” of context already exists, or KWin puts the driver in an odd state when certain compositing options are active.

Are there any updates?

Yeah seriously. Where is the final fix? This is a MAJOR problem for me. How has this NOT been addressed officially yet? WTF? Thank the good lord there are some folks here nice enough to offer a patch. But why no NVidia folks? Really disappointed in their lack of response here.

Another random user chiming in here to report that this seems to have finally fixed the problem for me as well: 48+ hours since I purged the Ubuntu nvidia-driver-455 packages and reinstalled from the NVIDIA-Linux-x86_64-455.45.01.run package with @aplattner’s patch applied, and no recurrence of the fault.
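For other Ubuntu users, roughly what that amounted to (a sketch only: the exact package names, the patch file name and the name of the repackaged installer may differ on your system):

sudo apt purge nvidia-driver-455                     # remove the distro packaging of the driver
sh NVIDIA-Linux-x86_64-455.45.01.run --apply-patch reduce-kmalloc-limit-455.45.01.patch
sudo sh NVIDIA-Linux-x86_64-455.45.01-custom.run     # --apply-patch writes out a repackaged -custom.run installer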

In case the info is helpful to anyone: I’ve been experiencing exactly the same (GFP_KERNEL|__GFP_COMP) kernel fault, but with noticeably different results than most report in this thread. I’ve also been having the error for much longer - 2 to 3 months, I’d guess. Initially it was only mildly annoying, as it would restart my DM after about 5 seconds of being frozen out, with no further consequences. About 3 weeks ago the symptoms suddenly got worse: the desktop would freeze for 60 seconds or so before restarting, and trigger further segmentation faults in the DM. Once that had happened it wouldn’t be long before the fault recurred, so a lengthy reboot became mandatory after every single crash. Like everyone else, I was getting the crashes entirely randomly: anywhere from during login itself up to 24 hours or so later.

The fault isn’t really random - as far as I can tell it comes down to memory pressure. With the system under heavy load the fault was definitely triggered more easily; on my ZFS-based system it was most easily triggered when the ARC was full and large amounts of files were being staged into L2ARC. Actual GPU load is normally close to zero on my system.
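That fits the nature of the fault: allocations up to the old 128 KiB KMALLOC_LIMIT go through kmalloc, and a 128 KiB kmalloc needs physically contiguous order-5 pages, which become scarce under fragmentation even when plenty of memory is technically free. A quick way to check whether higher-order blocks are scarce when the crashes tend to hit:

cat /proc/buddyinfo     # free blocks per zone, grouped by order; near-zero counts at order 5
                        # and above mean a 128 KiB contiguous kmalloc is likely to fail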

Many thanks for the helpful thread contributions that let me piece together the fix.

Ubuntu 20.10
kernel (various 5.8 - 5.10 realtime)
Nvidia 970
root on ZFS
Gnome 3.38
