440.48.02: Random X.org lock ups due to kernel module crash

At some random points X.org locks up and stops responding to keyboard and mouse input. This usually happens after a considerable period of uptime (a few days), during which the system runs various desktop apps (web browser, email client, text editor, etc.), ffmpeg for video transcoding, and occasional games via Wine/DXVK. The lock-up may happen when the display is off for power saving or when it is active. In any case, no particular user activity leads to the problem; it just happens by itself at seemingly random points.

When the lock-up happens, I can see in the kernel log a backtrace involving the Nvidia kernel module and an error saying:

Xorg: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
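
For readers unfamiliar with this message: `order:4` means the kernel was asked for 2^4 = 16 physically contiguous pages in a single allocation. The following is a minimal sketch that pulls those numbers out of such a log line. It assumes the 4 KiB base page size that is usual on x86_64, and the helper name `parse_alloc_failure` is made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages, the usual x86_64 setting

# Matches the header of a kernel "page allocation failure" message.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure: "
    r"order:(?P<order>\d+),\s*mode:(?P<mode>0x[0-9a-fA-F]+)"
)

def parse_alloc_failure(line):
    """Return details of a page allocation failure line, or None if no match."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    pages = 1 << order  # an order-N request asks for 2**N contiguous pages
    return {
        "comm": m.group("comm"),    # process that triggered the allocation
        "order": order,
        "mode": m.group("mode"),    # raw GFP flag mask as printed
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
    }

line = ("Xorg: page allocation failure: order:4, "
        "mode:0x40cc0(GFP_KERNEL|__GFP_COMP), "
        "nodemask=(null),cpuset=/,mems_allowed=0")
info = parse_alloc_failure(line)
print(info["comm"], info["pages"], "pages,", info["bytes"] // 1024, "KiB")
```

Under that page-size assumption, the failed request is 64 KiB of contiguous memory. That is small in absolute terms, but it can still fail on a fragmented system.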

Kubuntu 19.10, x86_64, Nvidia driver 440.48.02.
nvidia-bug-report.log.gz (116 KB)
kern.log.gz (7.7 KB)

Looks like this long standing bug:
https://devtalk.nvidia.com/default/topic/1063809/linux/rhel-7-7-430-52-random-kernel-crashes/post/5387057/#5387057
https://devtalk.nvidia.com/default/topic/1068614/linux/still-experiencing-random-crashdumps-rhel7-7-x-page-allocation-failure-order-4-mode-0x40d0-in-nvkms_ioctl-nvidia_modeset-/post/5412883/#5412883
https://devtalk.nvidia.com/default/topic/1066227/linux/nvidia-gpu-kvm-switch-x11-crash/post/5399854/#5399854
Put simply: in a low-memory situation, the nvidia driver needs to do something, tries to allocate memory, does not get it fast enough, and crashes.

Well, there was plenty of memory available (both host RAM and presumably VRAM) when the error took place. The system was not running any GPU-intensive tasks aside from an idle KDE desktop. It was running CPU-intensive ffmpeg (without hwaccel), which utilized all CPU cores.

When you say “fast enough”, does it mean there’s some sort of timeout for the allocation operation? If so, that might explain the crash even if there is enough free memory.

The question is how much contiguous memory was available at the time of the allocation attempt and how much the nvidia driver wanted. Since it’s a black box, I can only speculate from observation. The linked threads (there are more) all describe the same situation: no CPU- or GPU-intensive load, but tasks that consume a lot of memory, caches that have grown over time, and then the driver needing to do something.
There’s enough memory available but the kernel needs some time to free it up from caches and the driver doesn’t seem to ask twice.
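
One way to look at the "contiguous memory" question from user space is `/proc/buddyinfo`, which lists, per memory zone, how many free blocks exist at each order (column 0 is order 0, and so on). The sketch below counts free blocks large enough for an order-4 request. The function name and the sample numbers are hypothetical, and the file only shows the state at the moment it is read, not at the moment of the failure.

```python
def free_blocks_at_or_above(buddyinfo_text, min_order):
    """Count free blocks of order >= min_order for each (node, zone)."""
    result = {}
    for raw in buddyinfo_text.splitlines():
        parts = raw.split()
        # Expected shape: "Node 0, zone Normal c0 c1 ... c10"
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]  # counts[i] = free blocks of order i
        # Any block of order >= min_order can satisfy the request,
        # because larger blocks are split on demand.
        result[(node, zone)] = sum(counts[min_order:])
    return result

# Hypothetical sample of a fragmented system (not taken from the attached logs).
SAMPLE = """\
Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   3120   1804    512     96      4      0      0      0      0      0      0
Node 0, zone   Normal  48211  20917   3310    122      0      0      0      0      0      0      0
"""

usable = free_blocks_at_or_above(SAMPLE, 4)
for (node, zone), blocks in usable.items():
    print(f"node {node} zone {zone}: {blocks} free blocks of order >= 4")

# On a live system, read the real file instead of SAMPLE:
#   with open("/proc/buddyinfo") as f:
#       usable = free_blocks_at_or_above(f.read(), 4)
```

In the made-up sample the Normal zone has plenty of free small blocks but none of order 4 or higher, which is the "enough memory, but not contiguous" situation described above. A zero here does not guarantee a failure, since the kernel can still reclaim or compact memory before giving up.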

Still happens with 440.59, which had a release note about fixing crashes on wake up. This time the crash happened when I tried to switch the primary display from DisplayPort to HDMI.
kern.log.gz (30.3 KB)

Still happens on 440.66.08. I tried enabling KMS by setting “options nvidia-drm modeset=1” in modprobe.d/nvidia.conf, but it doesn’t help. Please fix.
kern.log (165.0 KB)

Still happens with 440.66.11.

I may have found a workaround. If I disable HardDPMS in xorg.conf (Option “HardDPMS” “False”), the crashes don’t seem to happen. At least they have not happened in a few days for me.

The problem with HardDPMS off for me is that my display turns its backlight on twice, about 5 minutes apart, after it has been turned off. I suspect this is because the GPU is switching between DPMS modes (Display turns on twice after being turned off for powersaving).

This happened to me for the first time just now.
It is weird, since I use the Intel IGP as the primary GPU and Nvidia via PRIME render offload!

In the time since my last post I have been running with HardDPMS off and have not had a single crash. So it looks like HardDPMS is the culprit.

I would really like HardDPMS to work though, since, as I described earlier, my NEC MultiSync EA272UHD display doesn’t work well with DPMS. I also remember my older Samsung display didn’t work with DPMS when connected via DisplayPort (the backlight wouldn’t turn off), so I imagine this is not a one-off problematic display model.

Nvidia, is there any update on this problem?

Here’s a fresh report log from a crash, Kubuntu 20.04, Nvidia driver 455.23.04.

nvidia-bug-report.log.gz (106.2 KB)

I’m having this problem too. It just started a few months ago.

I noticed that it may be happening when the screen blanks and the Yamaha receiver and LG TV connected to the GTX1050 have been turned off.

I’m hoping the fix posted in here will work around the issue for me as well.

Thanks Lastique.

I still had another failure after setting HardDPMS to False… Bummer.

I had the same lock up today (after locking my screen) with 440.100

Oct  8 12:23:25 five kernel: [1149767.359221] Xorg: page allocation failure: order:4,     mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
Oct  8 12:23:25 five kernel: [1149767.359226] CPU: 18 PID: 2585978 Comm: Xorg Tainted: P           OE     5.4.0-47-generic #51-Ubuntu
Oct  8 12:23:25 five kernel: [1149767.359227] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 2407 07/01/2020
Oct  8 12:23:25 five kernel: [1149767.359227] Call Trace:
Oct  8 12:23:25 five kernel: [1149767.359234]  dump_stack+0x6d/0x9a
Oct  8 12:23:25 five kernel: [1149767.359237]  warn_alloc.cold+0x7b/0xdf
Oct  8 12:23:25 five kernel: [1149767.359238]  __alloc_pages_slowpath+0xe07/0xe50
Oct  8 12:23:25 five kernel: [1149767.359240]  ? get_page_from_freelist+0x6b/0x390
Oct  8 12:23:25 five kernel: [1149767.359242]  __alloc_pages_nodemask+0x2d0/0x320
Oct  8 12:23:25 five kernel: [1149767.359243]  alloc_pages_current+0x87/0xe0
Oct  8 12:23:25 five kernel: [1149767.359245]  kmalloc_order+0x1f/0x80
Oct  8 12:23:25 five kernel: [1149767.359246]  kmalloc_order_trace+0x24/0xa0
Oct  8 12:23:25 five kernel: [1149767.359257]  ? _nv000491kms+0x50/0x50 [nvidia_modeset]
Oct  8 12:23:25 five kernel: [1149767.359258]  __kmalloc+0x220/0x280
Oct  8 12:23:25 five kernel: [1149767.359266]  ? _nv000491kms+0x50/0x50 [nvidia_modeset]
Oct  8 12:23:25 five kernel: [1149767.359273]  nvkms_alloc+0x24/0x60 [nvidia_modeset]
Oct  8 12:23:25 five kernel: [1149767.359284]  _nv002521kms+0x16/0x30 [nvidia_modeset]
Oct  8 12:23:25 five kernel: [1149767.359285] WARNING: kernel stack frame pointer at 000000003bcd4f23 in Xorg:2585978 has bad value 00000000b371c8d7
Oct  8 12:23:25 five kernel: [1149767.359287] unwind stack type:0 next_sp:0000000000000000 mask:0x2 graph_idx:0
Oct  8 12:23:25 five kernel: [1149767.359288] 00000000cf9c8690: ffffbd8e8252f848 (0xffffbd8e8252f848)

@btmckee9 Did you put the HardDPMS option in the Device section? Here is how my section looks, and it doesn’t crash:

Section "Device"
    Identifier "Default nvidia Device"
    Driver "nvidia"
    Option "NoLogo" "True"
    Option "CoolBits" "28"
    Option "TripleBuffer" "True"
    Option "HardDPMS" "False"
EndSection
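
To double-check that the option really sits in a Device section rather than, say, a Screen section, something like the following sketch can be run over the xorg.conf text. The function name `device_option` is made up for illustration, and the parsing is deliberately simplified: it handles `#` comments and quoted values, and it compares option names case-insensitively only.

```python
import re

SECTION_RE = re.compile(r'^Section\s+"([^"]*)"', re.IGNORECASE)
END_RE = re.compile(r'^EndSection\b', re.IGNORECASE)
IDENT_RE = re.compile(r'^Identifier\s+"([^"]*)"', re.IGNORECASE)
OPTION_RE = re.compile(r'^Option\s+"([^"]*)"(?:\s+"([^"]*)")?', re.IGNORECASE)

def device_option(xorg_conf_text, option_name):
    """Map each Device section's Identifier to the option's value (None if unset)."""
    found = {}
    section = None      # lower-cased name of the section being read, if any
    identifier = None
    value = None
    for raw in xorg_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        m = SECTION_RE.match(line)
        if m:
            section, identifier, value = m.group(1).lower(), None, None
            continue
        if END_RE.match(line):
            if section == "device":
                found[identifier] = value
            section = None
            continue
        if section != "device":
            continue  # options in other sections are ignored on purpose
        m = IDENT_RE.match(line)
        if m:
            identifier = m.group(1)
            continue
        m = OPTION_RE.match(line)
        if m and m.group(1).lower() == option_name.lower():
            value = m.group(2)
    return found

# The Device section quoted above.
CONF = '''
Section "Device"
    Identifier "Default nvidia Device"
    Driver "nvidia"
    Option "NoLogo" "True"
    Option "CoolBits" "28"
    Option "TripleBuffer" "True"
    Option "HardDPMS" "False"
EndSection
'''

print(device_option(CONF, "HardDPMS"))
```

A value of `None` for a Device entry would mean the option is missing from that section, even if it appears elsewhere in the file.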

nvidia-bug-report.log.gz (409.5 KB)

The system is mostly headless and I only turn on the connected external TV to consume media.
When I don’t use the GPU, it either sits idle at the lightdm login window or I shut down lightdm and unload the driver completely.

It started about a year ago and happens randomly. IIRC it never occurred when I was actively using the GPU, only when idling, except for the most recent hang, which occurred when I opened the Vivaldi browser.

Has my card become physically defective or is this a software issue?

What I also observed is that when using the open-source nouveau module, I see a thread sitting constantly at 100% on one of the CPUs.

I don’t remember seeing any hangs on kernel 4.x series.