455.23.04: Page allocation failure in kernel module at random points

At seemingly random points in time kwin_x11 freezes the desktop for several seconds, and after that it unfreezes with a notification that compositing was restarted. In the system logs I can see that there was a page allocation failure in the kernel module:

[87398.720448] kwin_x11: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[87398.720453] CPU: 4 PID: 1988 Comm: kwin_x11 Tainted: P OE 5.4.0-48-lowlatency #52-Ubuntu
[87398.720453] Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
[87398.720454] Call Trace:
[87398.720460] dump_stack+0x6d/0x9a
[87398.720462] warn_alloc.cold+0x7b/0xdf
[87398.720464] __alloc_pages_slowpath+0xe34/0xe80
[87398.720467] ? compact_zone_order+0xbb/0xf0
[87398.720468] ? get_page_from_freelist+0x233/0x390
[87398.720470] __alloc_pages_nodemask+0x2d2/0x320
[87398.720471] alloc_pages_current+0x87/0xe0
[87398.720472] kmalloc_order+0x1f/0x80
[87398.720473] kmalloc_order_trace+0x24/0xc0
[87398.720474] __kmalloc+0x228/0x280
[87398.720490] nvkms_alloc+0x24/0x60 [nvidia_modeset]
[87398.720499] _nv002714kms+0x16/0x30 [nvidia_modeset]
[87398.720501] WARNING: kernel stack frame pointer at 00000000ffcb012a in kwin_x11:1988 has bad value 0000000000000000

I have seen this happen at different times - under CPU load and mostly idle, with a lot of free host memory available.
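
A note on the "lot of free memory" observation: `order:4` in the log means the kernel was asked for 2^4 = 16 physically contiguous pages (64 KiB with 4 KiB pages). Such a request can fail even with plenty of free memory if few large contiguous blocks remain. Whether fragmentation is the cause here is an assumption, not something confirmed in this thread. The following is a minimal Python sketch, not an NVIDIA tool, for checking this on an affected machine by reading `/proc/buddyinfo`. The helper names are made up for illustration.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zones[(node, fields[3])] = [int(c) for c in fields[4:]]
    return zones


def free_blocks_at_least(counts, order):
    """Count free blocks whose order is >= the requested order."""
    return sum(counts[order:])


def report(path="/proc/buddyinfo", order=4):
    """Print, per zone, how many free blocks could satisfy an `order` allocation."""
    with open(path) as f:
        zones = parse_buddyinfo(f.read())
    for (node, zone), counts in sorted(zones.items()):
        usable = free_blocks_at_least(counts, order)
        print(f"node {node} zone {zone}: {usable} free blocks of order >= {order}")
```

Calling `report()` shortly after a freeze and comparing it with a run on a freshly booted system may show whether the higher-order columns have drained.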

Note that unlike the problem described in the other topic (440.48.02: Random X.org lock ups due to kernel module crash), in this case it doesn’t happen on waking up or going to sleep (DPMS), and it happens with HardDPMS=False. It also affects KWin instead of Xorg.

This started happening after the upgrade to 455.23.04. I didn’t have this problem with 450 series.

Kubuntu 20.04, x86_64.

nvidia-bug-report.log.gz (280.9 KB)

Here is another report.
nvidia-bug-report.log.gz (293.7 KB)

The same happens with the 455.22.04 beta Vulkan driver.
nvidia-bug-report.log.gz (294.1 KB)

One observation I made is that this problem is more likely to reproduce when a lot of filesystem IO is happening. For example, when a large (multi-gigabyte) directory with lots of files (tens or hundreds of thousands) is being copied between partitions on an SSD.

I’ve got another one!

And yes, the problem probably occurs faster when there is lots of disk IO. I run ZFS on two SSDs and three 2TB drives so when I try to launch any game… it just freezes right off the bat!

Even staying idle at the desktop will make it freeze in under 15 minutes, usually much less!

We’re looking into this type of allocation failure and for future reference, it’s being tracked internally in bug number 3032665.

The same happens with 455.26.01.
nvidia-bug-report.log.gz (298.9 KB)
kern.log (177.5 KB)

I’m having these, too, but with nvidia-modeset/: page allocation failure rather than kwin or Xorg as others experienced. I was moving files from one external HDD to another when suddenly everything visual froze; the music stopped too. Pressing Ctrl+Alt+F2 and then F1 helped a couple of times, but after the third or fourth time it didn’t. My monitors reported no signal and I had to reset the PC yet again.

Here’s (yet another) kernel log of the problem.
The second one is from after I was able to recover a few times through ctrl+alt+f1/f2 switching. It happened the moment I was trying to start an application (Discord). Looks like x11 completely broke down then.

kernel1.log (12.6 KB) kernel2.log (27.6 KB)

Glad to see you’re aware of this issue, I hope it gets fixed quickly, it’s quite severe.

Wanted to chip in here. I believe I am experiencing the same issue, and I am attaching the log info here. In addition, my system is a 9900K with 32 GB of RAM and an RTX 2080, with driver 450.80.02-0ubuntu0.20.04.2.
X11pagefault.txt (26.0 KB)

I also just experienced what seems to be the same issue while running chromium. I have RTX 2070 SUPER on Arch Linux, with driver 455.28-7 and GNOME DE.
I experienced a complete freeze on my desktop, and the cursor disappeared. Like bogus12, repeatedly trying to switch ttys eventually fixed the issue, and I was able to resume work in chromium. This has so far happened only once for me, though.

Kernel logs:
chromium_crash_journalctl.txt (24.0 KB)

I have had this error for maybe 2-3 weeks now too. During this time I have been using the latest Arch Linux 5.8.x kernels and the newest nvidia drivers (at the moment 455.28-1). In most cases the freeze happens while using web browsers (chromium, firefox, and vivaldi are all affected), and in some other cases during mouse interactions.

I can confirm that switching to the console with CTRL-ALT-F2 and back sometimes unfreezes my desktop (using i3wm). In some cases I have to “kill -9” a process that seems to be affected by the freeze. Sometimes it’s the browser, sometimes it’s polybar or emacs-daemon.

I tried it with and without profile-sync-daemon to see if that helps with the browser problem, since someone above mentioned a connection to high IO load, but that changed nothing.

Even though my system only has 8 GB at the moment, the bug happens even when most of that memory is free.

Is there any way to follow bug “3032665”, or are there no options to get updates directly?

There’s no way to get updates directly, sorry.

We’ve made a change that should avoid this problem in the future. It’ll be available in a future release.

@aplattner Thanks for the update.

Is this change specific to this one allocation failure? Is it possible that it will also fix “440.48.02: Random X.org lock ups due to kernel module crash”?

It should apply to all memory allocation failures that happen during mode setting operations. I’m not 100% sure it applies to the one in that other thread, but I think so.

Seeing the exact same thing with 455.28 and an RTX 2070 Super on Ubuntu 20.10 (kernel 5.8.0-23-generic) running plasma desktop. I wasn’t seeing this issue at all with the 450 train, but it happens frequently after moving to 455.28.

Oct 21 19:27:22 H510 kernel: [441965.951840] warn_alloc: 3 callbacks suppressed
Oct 21 19:27:22 H510 kernel: [441965.951842] kwin_x11: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
Oct 21 19:27:22 H510 kernel: [441965.951847] CPU: 14 PID: 332304 Comm: kwin_x11 Tainted: P W OE 5.8.0-23-generic #24-Ubuntu
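
Since reports in this thread name different processes in the failure line (kwin_x11, nvidia-modeset), it may help to tally them from a saved kernel log. The following is a rough Python sketch, assuming 4 KiB pages and the log line format shown above. The function names are hypothetical and the regular expression only covers the fields visible in these excerpts.

```python
import re
from collections import Counter

# Matches e.g. "kwin_x11: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), ..."
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure: order:(?P<order>\d+), "
    r"mode:(?P<mode>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB base pages, the usual x86_64 configuration


def parse_failure(line):
    """Extract process name, order, and size from one log line, or return None."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    return {
        "comm": m.group("comm"),
        "order": order,
        "bytes": PAGE_SIZE << order,  # contiguous bytes requested
        "flags": m.group("flags").split("|"),
    }


def tally_failures(lines):
    """Count failures per (process name, order) over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        info = parse_failure(line)
        if info is not None:
            counts[(info["comm"], info["order"])] += 1
    return counts
```

For example, `tally_failures(open("kern.log"))` on a saved log (the file name is a placeholder) returns a `Counter` keyed by process name and order.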

I’ve been going nuts tracing this bug for a few weeks and finally tracked it back here. Is it recommended to downgrade from 455.28 to the 450 series?

This bug brings down our systems in one way or another daily (requiring force shutoff) along with service disruptions more often; can you confirm what driver series is appropriate to downgrade to?

@nvidia212 It doesn’t reproduce with 450 series for me. In particular, I’m currently running 450.56.11.

Our distro only packages 450.80.02. Can NVIDIA let us know whether the bug made it back to older series and whether we need to dig farther back?