Wayland: applications freezing sporadically, suspected vram issues

nvidia-bug-report.log.gz (491.6 KB)

summary: applications will sometimes freeze and cease rendering new frames (“render halt”, as a hang implies it resumes at some point, which this never does). Application audio will continue, and inputs are still processed under the hood.

Nvidia-open vs nvidia (proprietary) does not make a difference. Proprietary drivers w/ GSP disabled did not prevent this either.

Graphics cards with more vram (e.g. my 3090 with 24 GiB of vram) are not immune to this; it just happens to them less frequently. It affects both wayland and xwayland windows (e.g. it has happened once or twice to an alacritty / ghostty window, both native-wayland gpu-accelerated terminals).

Details:

  • I do not think that explicit sync is actually related either.

  • This has been a perpetual issue for my friend with a 3080 and an otherwise identical setup to mine. It does happen to me too, though rarely, and it hit my Steam window this morning at about 6am while I was still asleep, according to logs:

sudo dmesg -e | tail -n2
[Apr 8 06:35] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[  +0.000035] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object

From the looks of things, Fossilize started segfaulting a lot after that; my best guess is that fossilize was.. doing something? with shader pre-caching or something in the background, and that chewed up enough vram to make my Steam window explode (& then released the vram, because only ~3.5/24gb were used when I poked at my desktop about this).

This has been happening very consistently for some friends (who I will direct to post bug reports in this thread ☺️) - we’ve done a lot of poking between window managers, driver versions, proprietary drivers w/ GSP disabled, etc - and I generally struggle to reproduce this with my 3090, but it can still happen OVERNIGHT to a random window on my system.

sidenote: this thread looks somewhat related, but it’s being observed with any vulkan/gl applications in wayland, regardless of xwayland - a native wayland gl-accelerated terminal freezing during runtime is weird. These issues we’re observing are separate from the vram issues surrounding suspend/resume (me & my friends are “turn displays off but leave system running overnight” people, because uptime is all about those 9’s)

Ideally, vram could be paged out to system ram when full (as degraded performance is preferable to render halts). Also ideally, applications wouldn’t balloon their vram usage infinitely, but we can’t all be winners, unfortunately.

Hi,

I’m the friend in question. Here’s a sudo nvidia-bug-report.sh --extra-system-data immediately after Discord froze and subsequently crash-looped 3 times – I can reproduce this any time with discord-canary with hardware acceleration on. (Discord stable, with hardware acceleration off, does not exhibit the same issue). When this issue happens, the following error is observed:

[Tue Apr  8 20:38:06 2025] [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to map NvKmsKapiMemory 0x0000000041e744b9
[Tue Apr  8 20:38:22 2025] [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to map NvKmsKapiMemory 0x000000006ddb51d5
[Tue Apr  8 20:38:38 2025] [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to map NvKmsKapiMemory 0x00000000d7f128b0
[Tue Apr  8 20:38:52 2025] [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to map NvKmsKapiMemory 0x000000002b03a0af
[Tue Apr  8 21:04:56 2025] [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to map NvKmsKapiMemory 0x00000000af0cc908
[Tue Apr  8 21:05:12 2025] [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to map NvKmsKapiMemory 0x000000005507ab12

nvidia-bug-report.log.old.gz (1.4 MB)

And here is a more “normal” exhibition of the bug, playing DIRT Rally 2, and the game froze with the last drawn state at some point (presumably when the game tried to allocate some VRAM and failed to because of the VRAM being nearly full at the time, per nvtop)
This one is preceded by this message in dmesg:

[Tue Apr  8 21:33:56 2025] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to allocate NVKMS memory for GEM object
[Tue Apr  8 21:33:56 2025] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to allocate NVKMS memory for GEM object
[Tue Apr  8 21:33:56 2025] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to allocate NVKMS memory for GEM object
[Tue Apr  8 21:33:56 2025] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to allocate NVKMS memory for GEM object
[Tue Apr  8 21:33:56 2025] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to allocate NVKMS memory for GEM object
[Tue Apr  8 21:33:56 2025] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to allocate NVKMS memory for GEM object

nvidia-bug-report.log.gz (1.3 MB)

I was watching nvtop the entire time while gaming, trying to trigger the bug the second time, and can definitely confirm this bug seems most likely to occur at the very high end of memory usage – which is especially brutal on my 3080 with only 10 GiB of vram.
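
For anyone comparing logs across machines, here is a rough Python sketch that tallies the two nvidia-drm errors quoted in this thread per GPU ID. It only matches the exact message strings shown above; the function name and the sample text are made up for illustration, and this is not an official tool.

```python
import re
from collections import Counter

# Matches the two nvidia-drm error messages quoted in this thread.
NVIDIA_DRM_ERROR = re.compile(
    r"\[nvidia-drm\] \[GPU ID (0x[0-9a-fA-F]+)\] "
    r"(Failed to allocate NVKMS memory for GEM object"
    r"|Failed to map NvKmsKapiMemory)"
)


def count_vram_errors(dmesg_text):
    """Count nvidia-drm allocation/mapping failures per (GPU ID, message)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = NVIDIA_DRM_ERROR.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Illustrative input: two lines copied from the posts above plus one unrelated line.
SAMPLE = """\
[Tue Apr  8 20:38:06 2025] [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to map NvKmsKapiMemory 0x0000000041e744b9
[Tue Apr  8 21:33:56 2025] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000b00] Failed to allocate NVKMS memory for GEM object
[Tue Apr  8 21:34:00 2025] some unrelated kernel message
"""

for (gpu_id, message), n in count_vram_errors(SAMPLE).items():
    print(f"{n:4d}  GPU {gpu_id}  {message}")
```

To use it on a real system, feed it the text of `sudo dmesg` instead of SAMPLE.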

[Apr12 17:33] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[  +0.000026] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object

happened again on my host an hour ago; this time it happened while I was away (the screen was locked & display was off) and I was ssh’d in to the host. 4gb vram in use according to nvtop. Kinda baffling.

Upon my return, the game I had left open was exhibiting the same behavior (no new frames, but still gave audio & would respond to inputs, judging by audio)

Alright - so some games (Deadlock in particular, amusingly enough) love to freeze when left open with my screen locked and displays off.

Same nv_drm_gem_alloc_nvkms_memory_ioctl error as above. I have yet to be able to ‘quickly’ repro this, as it only happens when something is left open for several hours while I’m away.

You might be inclined to say “it’s a bit silly to leave games open for hours on end while afk, don’t do that” but: they simply have this happen the most consistently. This occasionally also hits one of my Ghostty windows overnight. Hopefully soon I can manage to catch this relatively quickly and get a (maybe more useful?) bug-report.sh that isn’t hours after the fact.

[Apr21 12:20] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[  +0.000044] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[  +3.407522] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[  +0.000033] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[ +14.887412] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[  +0.000044] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object

I can consistently reproduce this by leaving my desktop idle (→ swaylock → displays off after 10m) with Deadlock open. After about an hour or so of idle time, I’ll return to the game being stuck frozen, with many of the above loglines. Whatever it is that causes the vram to get gobbled up releases it all by the time I unlock my computer.

This also happens overnight with the basic Steam x11 (via xwayland-satellite) window. Usually after about ~6 hours, while I’m asleep.

It looks like there’s a few issues here:

  • vram use climbs while desktop is idle and locked. Unrelated to the issues about suspend/wake I’ve seen
  • when out of vram, things freeze catastrophically due to not paging vram ↔ system ram

If vram would at least page to system ram, it’d be easier to diagnose the first issue here…
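
Since the vram is released again before anyone can look at it, one way to catch the climb is to log usage while the machine sits idle (over ssh or in the background). This is only a sketch: it assumes `nvidia-smi` is on PATH and reads the first GPU only, and the function names and log path are made up for illustration.

```python
import subprocess
import time
from datetime import datetime


def used_vram_mib():
    """Read used memory (MiB) of the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0].strip())


def largest_jump(samples):
    """Return (timestamp, delta_mib) for the biggest rise between
    consecutive (timestamp, used_mib) samples, or None if there are
    fewer than two samples."""
    best = None
    for (_, prev), (stamp, cur) in zip(samples, samples[1:]):
        delta = cur - prev
        if best is None or delta > best[1]:
            best = (stamp, delta)
    return best


def log_idle_usage(path, interval_s=60):
    """Append 'ISO timestamp,used MiB' lines to path until interrupted."""
    with open(path, "a") as log:
        while True:
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp},{used_vram_mib()}\n")
            log.flush()  # keep the file current even if the session dies
            time.sleep(interval_s)
```

Calling `log_idle_usage("vram.csv")` before locking the screen and running `largest_jump` over the logged samples afterwards should show roughly when the usage spiked, which can then be lined up with the dmesg timestamps.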

nvidia-bug-report.log.gz (520.0 KB)

can reproduce on 5600x 4070ti on nobara41 wayland plasma, and endeavour wayland plasma, and endeavour x11 plasma.

easiest way to reproduce it is to slam the gpu with a game that is behaving badly lately; DCS World Standalone on a moderately busy server does it pretty fast. if that still doesn’t trigger it, enable a targeting pod, which should get you there no problem. other apps freeze up and won’t recover until you terminate the game to free the vram, and the whole desktop gets stuttery. some applications need to be restarted as they will never unfreeze.

I am having this same issue as well. Last Epoch seems to be the game.

The other programs just kind of hang (Discord, Steam, etc.) but only visually. They fully function otherwise.

I also occasionally get an alert that plasma had dropped back to software rendering if it happens long enough.

Something is screwy with the new 570 driver I think, because I did not have this previously.

Also getting this error. For me it’s Halo 2 Anniversary after 2+ hours of gameplay and Gmod after a few minutes with heavy addons. They seem to be working fine under the hood if I attempt to play; it’s just the display that gets killed.

My steam client window also freezes, but it starts working again if I open and close it.

I’ve been having the same issue with Last Epoch since the latest patch where they upgraded Unity. Usually start having issues within an hour on a 3080 10GB. Nothing locks up, but fps will drop from ~120 to below 30 and I’ll get the Failed to allocate NVKMS memory for GEM object error.
nvidia-bug-report.log.gz (1.3 MB)

This is still an issue a month later.

See also Non-existent shared VRAM on NVIDIA Linux drivers - apparently also entirely without acknowledgement or comment.

This is a solved engineering problem space, man. This is necessary functionality for end user systems.

Still happening, and I’ve had to start coping by always having nvtop up and dodging maxing out my vram ever ever lest all of my apps freeze.

wish i could edit the title to ‘Wayland: applications freeze when vram alloc fails, paging to system ram needed’ or something along those lines

i’ve frankensteined the GPU memory stats bits out of nvtop and rolled it into a waybar module here so that at 90% vram util I can take steps to prevent my system from freezing.
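
For anyone who wants something similar without pulling in nvtop code, below is a minimal sketch of the same idea. It is not the module linked above: it assumes `nvidia-smi` is on PATH, reads only the first GPU, and emits the JSON shape that a Waybar custom module with `return-type` set to `json` reads. The function names and CSS class names are made up.

```python
import json
import subprocess

WARN_PERCENT = 90  # threshold mentioned in the post above


def parse_memory(csv_line):
    """Parse one 'used, total' line (MiB) from nvidia-smi's CSV output."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def waybar_payload(used_mib, total_mib, warn_percent=WARN_PERCENT):
    """Build the JSON object for a Waybar custom module."""
    percent = round(100 * used_mib / total_mib)
    return {
        "text": f"VRAM {percent}%",
        "tooltip": f"{used_mib} / {total_mib} MiB",
        # The class names are arbitrary; style them in Waybar's CSS.
        "class": "critical" if percent >= warn_percent else "normal",
        "percentage": percent,
    }


def query_first_gpu():
    """Ask nvidia-smi for used/total memory of the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory(out.splitlines()[0])


def main():
    """Print one JSON line; Waybar re-runs the script on its interval."""
    print(json.dumps(waybar_payload(*query_first_gpu())))
```

Call `main()` from the script that the module's `exec` points at; the `critical` class can then be styled to flash or change colour once usage passes the threshold.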

userland should not have to resort to measures like this to maintain system operability

This is happening to me consistently when playing Stellar Blade: my background windows freeze and stop repainting, most commonly Discord and my browser. It’s reproducible every time I play the game.

Video Card:
Driver: NVIDIA Corporation NVIDIA GeForce RTX 4070 Ti SUPER/PCIe/SSE2
Driver Version: 4.6.0 NVIDIA 575.57.08

Related issue:

Also here’s my log, but it’s basically the same as the above errors.
nvidia-bug-report.log.gz (598.8 KB)

Good lord. I have a fun partial solution here.

Turns out that an application profile that sets GLVidHeapReuseRatio=1 against my compositor’s process name can reduce its idle memory usage from 2668MiB to 168MiB, and saving 2.5GiB of vram (or more?) has dropped my vram footprint significantly.

Dunno if applying this against other processes will be meaningful or needed, though.
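
For reference, a sketch of what such a profile could look like, written as a small Python script that prints the JSON. The rules/profiles layout follows the driver's application-profile format as I understand it; the process name `my-compositor` and the profile name are placeholders, and the key and value are the ones from the post above. Treat it as an assumption to check against your driver's README, not a verified config.

```python
import json


def build_profile(process_name, reuse_ratio=1):
    """Build an application-profile document that applies
    GLVidHeapReuseRatio to processes named process_name."""
    profile_name = f"Limit vidheap reuse for {process_name}"  # placeholder name
    return {
        "rules": [
            {
                # Match on the process name, as described in the post.
                "pattern": {"feature": "procname", "matches": process_name},
                "profile": profile_name,
            }
        ],
        "profiles": [
            {
                "name": profile_name,
                "settings": [
                    {"key": "GLVidHeapReuseRatio", "value": reuse_ratio}
                ],
            }
        ],
    }


# "my-compositor" is hypothetical; substitute your compositor's process name.
print(json.dumps(build_profile("my-compositor"), indent=4))
```

The output would be saved as a .json file in the application-profile directory under /etc (I believe /etc/nvidia/nvidia-application-profiles-rc.d/, but check your driver's documentation), and the compositor restarted so the setting applies.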

That sounds like an “it’s broken” difference lol.

absolutely. it’s kinda dumbfounding how the solution here is:

  • a random json file in etc
  • that does a string match on a process name
  • to change allocator behavior based on a knob
  • and this was sorta noted in release notes briefly with a mention of default profiles for “some” Wayland compositors

Admittedly a control knob for a heap allocator is incredibly normal, but that being a magic bullet to the tune of 2.5GiB of vram is rather mind boggling.

I’m sure there’s some reason why this couldn’t be done more sanely, but it’s a complete mystery to me as a consumer.

Freezes from the lack of vram paging (things jamming up when vram nears full) are still possible, I think, but vastly less likely to occur from now on with this increased headroom (and decreased growth during use - I don’t think I’ll be seeing my compositor spiral up to 4 or more GiB of used vram anymore now that it’s down below 200MiB constantly…)

Sounds like Nvidia was listening to the geniuses in the Linux community claiming that unused RAM is wasted RAM.

In all fairness this would be perfectly fine if shared memory was working.

Reading more of this thread, it seems these profiles have been bundled in the driver since 565.77. Are you on an older driver version? If not, this shouldn’t change anything.