nvidia-bug-report.log.gz (491.6 KB)
Summary: applications will sometimes freeze and stop rendering new frames. I call this a “render halt” rather than a hang, since a hang implies it resumes at some point, which this never does. Application audio continues, and input is still processed under the hood.
Using nvidia-open versus the proprietary nvidia driver makes no difference. The proprietary driver with GSP disabled did not prevent this either.
Graphics cards with more VRAM (e.g. my 3090 with 24 GiB of VRAM) are not immune to this; it just happens less frequently on them. It affects both Wayland and XWayland windows (e.g. it has happened once or twice to an Alacritty / Ghostty window, both native-Wayland GPU-accelerated terminals).
Details:
- I do not think that explicit sync is related either.
- This has been a persistent issue for my friend, who has a 3080 and an otherwise identical setup to mine. It happens to me only rarely, but according to the logs it happened to my Steam window this morning at about 6:35 AM while I was still asleep:
sudo dmesg -e | tail -n2
[Apr 8 06:35] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
[ +0.000035] [drm:nv_drm_gem_alloc_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NVKMS memory for GEM object
From the looks of things, Fossilize started segfaulting a lot after those errors. My best guess is that Fossilize was doing something in the background (shader pre-caching, perhaps) that chewed up enough VRAM to make my Steam window explode, and then released that VRAM, because only ~3.5 of 24 GB was in use when I checked my desktop afterwards.
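To check whether the same allocation failures line up with freezes on other machines, here is a minimal sketch, assuming Python 3, that scans saved `dmesg -e` output for the nvidia-drm error quoted above and prints the leading timestamp of each hit. Only the error string comes from the log; the script name `scan_dmesg.py` and the function name are hypothetical.

```python
import re
import sys
from typing import Iterable, List

# Error string taken verbatim from the nvidia-drm dmesg lines above.
GEM_ALLOC_ERROR = "Failed to allocate NVKMS memory for GEM object"

# `dmesg -e` lines start with a bracketed stamp: either a local time such as
# "[Apr 8 06:35]" or a delta such as "[ +0.000035]".
STAMP_RE = re.compile(r"^\[([^\]]+)\]")


def find_gem_alloc_failures(lines: Iterable[str]) -> List[str]:
    """Return the leading dmesg stamp of every GEM allocation failure line."""
    stamps = []
    for line in lines:
        if GEM_ALLOC_ERROR in line:
            match = STAMP_RE.match(line.strip())
            # Fall back to "?" if the line has no recognizable stamp.
            stamps.append(match.group(1).strip() if match else "?")
    return stamps


# Only runs when a file path is given, e.g.:
#   sudo dmesg -e > dmesg.txt && python3 scan_dmesg.py dmesg.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        failures = find_gem_alloc_failures(f)
    print(f"{len(failures)} GEM allocation failure(s)")
    for stamp in failures:
        print("  " + stamp)
```

As in the excerpt above, follow-up lines in a burst may carry only a delta such as `+0.000035` rather than a wall-clock time; the sketch reports those stamps as-is.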
This has been happening very consistently for some friends (I will ask them to post their bug reports in this thread ☺️). We have done a lot of testing across window managers, driver versions, the proprietary driver with GSP disabled, etc. I generally struggle to reproduce this on my 3090, but it can still happen OVERNIGHT to a random window on my system.
Side note: this thread looks somewhat related, but what we observe affects any Vulkan/GL application under Wayland, whether or not it goes through XWayland; a native-Wayland GL-accelerated terminal freezing mid-session is odd. These issues are also separate from the VRAM issues surrounding suspend/resume (my friends and I are “turn the displays off but leave the system running overnight” people, because uptime is all about those 9’s).
Ideally, VRAM could be paged out to system RAM when full, since degraded performance is preferable to a render halt. Also ideally, applications wouldn't balloon their VRAM usage without limit, but we can't all be winners, unfortunately.
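To catch an application ballooning its VRAM usage overnight (or a background job briefly filling VRAM), a rough sketch, assuming Python 3 and `nvidia-smi` on the PATH, can poll `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` and print one timestamped line per GPU. The script name `vram_log.py` and the helper names are hypothetical.

```python
import subprocess
import sys
import time
from datetime import datetime
from typing import List, Tuple

# nvidia-smi prints one "used, total" line per GPU, in MiB, with these flags.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(output: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' lines (MiB) into (used, total) tuples, one per GPU."""
    readings = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def sample_vram() -> List[Tuple[int, int]]:
    """Run nvidia-smi once and return the current per-GPU memory readings."""
    result = subprocess.run(NVIDIA_SMI_CMD, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Only runs when a polling interval (seconds) is given, e.g.:
#   python3 vram_log.py 60 >> vram.log
if __name__ == "__main__" and len(sys.argv) > 1:
    interval_s = float(sys.argv[1])
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        for gpu, (used, total) in enumerate(sample_vram()):
            pct = 100 * used / total
            print(f"{stamp} gpu{gpu} {used}/{total} MiB ({pct:.1f}%)", flush=True)
        time.sleep(interval_s)
```

Left running before turning the displays off, the log can be compared against the dmesg timestamps the next morning. A fixed polling interval can miss short spikes between samples, and the figures are device-wide totals rather than per-process usage; if no parsing is needed, `nvidia-smi -l <seconds>` with the same query flags is a simpler alternative.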