nvkms_close deadlock on semaphore held by Xorg (580.126.09, RTX 3050)
Summary
Kernel hang in nvidia_modeset: a process closing its /dev/nvidia-modeset file descriptor blocks indefinitely in nvkms_close → down() on a semaphore the kernel hung-task watchdog identifies as “likely last held by task Xorg”. Xorg itself remains in R state, apparently still inside nvidia_modeset code paths (nvkms_get_usec, nvkms_call_rm, multiple _nv*kms / _nv*rm frames). Every subsequent process that attempts to close its /dev/nvidia-modeset fd (e.g. during exit cleanup) joins the same wait, cascading the hang across the session. VT switching is blocked, nvidia-smi hangs, and SIGTERM/SIGKILL are ineffective because the target processes can no longer complete __fput.
No prior GPU error (no XID, no reset, no OOM) preceded this. The first symptom was an application hang (ghostty terminal emulator) that would not exit; the kernel hung-task watchdog fired 1228s later.
Environment
- GPU: NVIDIA GeForce RTX 3050 6GB (GA107)
- Driver: NVRM 580.126.09 (proprietary, built 2026-01-07)
- Kernel: 6.17.0-20-generic #20~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC
- Distro: Ubuntu 24.04.4 LTS (noble)
- Desktop: GNOME Shell on Xorg (org.gnome.Shell@x11.service)
- Loaded modules: nvidia 104144896 (refcount 596), nvidia_modeset 1638400 (34), nvidia_drm 135168 (26), nvidia_uvm 2076672 (2)
Reproduction
No deterministic reproducer is known. The machine had been up 6+ days with a normal desktop workload (Firefox, GNOME terminals, Electron apps, Steam, and a local 3-worker R PSOCK cluster doing pure CPU work, with no GPU compute). The hang appeared after this extended session and was triggered by a ghostty terminal exiting.
Key stack (kernel hung-task watchdog)
The victim, ghostty, D state for 1228s, wchan nvkms_close:
INFO: task ghostty:7328 blocked for more than 1228 seconds.
Tainted: P W OE 6.17.0-20-generic #20~24.04.1-Ubuntu
Blocked by coredump.
task:ghostty state:D stack:0 pid:7328 tgid:7328 ppid:5799
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_timeout+0x104/0x110
___down_common+0x107/0x180
__down_common+0x58/0x130
__down+0x1d/0x30
down+0x60/0x80
nvkms_close+0x35/0xc0 [nvidia_modeset]
__fput+0xed/0x2d0
____fput+0x15/0x20
task_work_run+0x60/0xa0
do_exit+0x1fa/0x480
do_group_exit+0x34/0x90
get_signal+0x832/0x840
arch_do_signal_or_restart+0x41/0x200
exit_to_user_mode_loop+0x91/0x170
do_syscall_64+0x198/0xa40
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
The suspected holder, Xorg, still R, deep inside nvidia_modeset:
INFO: task ghostty:7328 blocked on a semaphore likely last held by task Xorg:5963
task:Xorg state:R running task stack:0 pid:5963
Call Trace:
<TASK>
__schedule+0x515/0x7a0
? nvkms_get_usec+0x34/0xb0 [nvidia_modeset]
? _nv000172kms+0xc5/0x160 [nvidia_modeset]
? os_release_spinlock+0x1a/0x30 [nvidia]
? _nv000386kms+0xf0/0x180 [nvidia_modeset]
? _nv014193rm+0x86/0xa0 [nvidia]
? _nv000652rm+0x5e/0x70 [nvidia]
? nvidia_modeset_rm_ops_free_stack+0x45/0x50 [nvidia]
? nvkms_call_rm+0x5e/0x90 [nvidia_modeset]
? _nv003169kms+0x42/0x50 [nvidia_modeset]
? _nv002902kms+0xcd/0x180 [nvidia_modeset]
...
The hung-task watchdog repeated identical traces at +2min and +4min intervals; the trace did not change, indicating a durable deadlock rather than slow progress.
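For anyone checking their own logs for the same pattern, the two watchdog message formats quoted above can be pulled out mechanically. The following is a minimal sketch, not a vetted tool: the function name `summarize_hung_tasks` and the shape of its result are my own choices for illustration, and the regular expressions assume the message wording shown in this report.

```python
import re

# Patterns assume the message wording quoted in this report.
BLOCKED_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
HOLDER_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked on a semaphore "
    r"likely last held by task (?P<hcomm>.+?):(?P<hpid>\d+)"
)


def summarize_hung_tasks(lines):
    """Group hung-task reports by (comm, pid) of the blocked task.

    Returns a dict mapping (comm, pid) to a dict with:
      reports  - number of "blocked for more than" lines seen
      max_secs - largest blocked duration reported
      holder   - (comm, pid) of the suspected semaphore holder, or None
    """
    result = {}
    for line in lines:
        m = BLOCKED_RE.search(line)
        if m:
            key = (m["comm"], int(m["pid"]))
            entry = result.setdefault(
                key, {"reports": 0, "max_secs": 0, "holder": None}
            )
            entry["reports"] += 1
            entry["max_secs"] = max(entry["max_secs"], int(m["secs"]))
            continue
        m = HOLDER_RE.search(line)
        if m:
            key = (m["comm"], int(m["pid"]))
            entry = result.setdefault(
                key, {"reports": 0, "max_secs": 0, "holder": None}
            )
            entry["holder"] = (m["hcomm"], int(m["hpid"]))
    return result
```

Feeding it the lines of a saved kernel log (for example `open("kern_excerpt.log")`) gives one entry per blocked task; several tasks sharing the same `holder` is the cascading pattern described in the summary.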
A live /proc/7328/stack captured ~10 hours after the initial dump is byte-identical to the watchdog trace — ghostty has made zero kernel-side progress in 10+ hours. A live /proc/5963/stack for Xorg at the same point reduces to a single frame:
[<0>] nvkms_yield+0xe/0x20 [nvidia_modeset]
So Xorg is parked inside nvkms_yield, voluntarily yielding the CPU in what appears to be a wait loop within the module, while holding the semaphore that every other process’s nvkms_close is blocked on. This localizes the issue: whatever nvkms_yield-loop Xorg entered has no exit condition that will ever be satisfied in the current driver state, and it holds the close-path semaphore for the duration.
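The per-process state behind stuck_procs.txt can be gathered with a short script. This is a rough sketch under stated assumptions: it reads only `/proc/<pid>/stat` and `/proc/<pid>/wchan`, looks at thread-group leaders only (not individual threads under `/proc/<pid>/task`), and `wchan` may show `0` rather than a symbol without sufficient privileges. The helper names `parse_stat` and `blocked_tasks` are hypothetical.

```python
import os


def parse_stat(line):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first "(" and the last ")".
    """
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar].strip())
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm, wchan) for every process in D state."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if state != "D":
                continue
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except (OSError, ValueError):
            # The process exited mid-scan or the file was unreadable.
            continue
        found.append((pid, comm, wchan))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm, wchan in blocked_tasks():
        print(f"{pid}\t{comm}\t{wchan}")
```

On a machine in the state described here, every process stuck in the close path would be expected to show `nvkms_close` as its wait channel.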
Observed downstream effects
- SIGTERM to gnome-shell (PID 6284) accepted but process never exited: it blocks in its own __fput path once it reaches a /dev/nvidia-modeset fd
- systemctl --user restart org.gnome.Shell@x11.service refused (service has RefuseManualStart/Stop); systemctl --user kill ineffective for the same kernel-level reason
- Ctrl+Alt+F<n> VT switch does not change active VT (/sys/class/tty/tty0/active remained tty2): Xorg cannot release DRM master while stuck inside nvidia_modeset
- nvidia-smi hangs indefinitely: the driver cannot service new ioctls
- “Blocked by coredump” annotation on the ghostty trace suggests ghostty was being core-dumped when the deadlock was triggered; the coredump handler’s fd cleanup walked through /dev/nvidia-modeset and never returned
Only a hard reboot clears the state. A graceful reboot is likely to hang on shutting down user services that hold nvidia_modeset fds.
Attached
- system.txt: uname, driver version, lsmod
- kern_excerpt.log: kernel log from 23:43:18 through morning (includes three successive hung-task dumps of the same stack)
- stuck_procs.txt: ps/wchan for ghostty, Xorg, gnome-shell
- live_stacks.txt: /proc/<pid>/stack for ghostty (7328) and Xorg (5963) captured ~10 hours after the initial hung-task dump. ghostty’s stack is identical to the watchdog dump, confirming no progress. Xorg’s live stack reduces to a single frame: nvkms_yield+0xe/0x20 [nvidia_modeset]
- ps_auxf.txt: full process tree at the time of diagnosis
Hypothesis
Three candidates: (a) a lock-ordering bug between nvkms_close and whatever Xorg path is holding the semaphore; (b) an error path in _nv*kms/_nv*rm that leaks a down() without a matching up(); or (c) Xorg is itself blocked waiting on GPU state (firmware, DPC, etc.) and will never release the semaphore. In case (c), the down() in nvkms_close probably ought to be interruptible or have a timeout so that SIGKILL can at least free user-space.
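To make the suggestion in (c) concrete, here is a user-space analogy only, not driver code. It models the close path with Python's `threading.Semaphore`, where `acquire(timeout=...)` plays the role that the kernel's `down_timeout()` or `down_killable()` would play in place of `down()`. The function name `close_with_timeout` and the timeout value are illustrative assumptions, and whether a timed or killable wait is actually safe inside nvkms_close is something only the driver authors can judge.

```python
import threading


def close_with_timeout(sem, timeout_s):
    """Model of a close path that gives up instead of waiting forever.

    Returns True if the semaphore was obtained (and released again),
    False if the wait timed out because the holder never released it.
    """
    if not sem.acquire(timeout=timeout_s):
        return False  # caller can report an error instead of hanging
    try:
        # Per-fd cleanup would happen here while the semaphore is held.
        return True
    finally:
        sem.release()


def demo():
    sem = threading.Semaphore(1)
    sem.acquire()  # stands in for a holder that never calls up()
    stuck = close_with_timeout(sem, 0.05)
    sem.release()  # holder finally lets go
    freed = close_with_timeout(sem, 0.05)
    return stuck, freed
```

In the model, the first close attempt returns after the timeout rather than blocking forever, and the second succeeds once the holder releases. With a plain blocking acquire, the first call would never return, which is the behaviour seen in the ghostty trace.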
live_stacks.txt (415 Bytes)
stuck_procs.txt (559 Bytes)
kern_excerpt.log (76.4 KB)
system.txt (725 Bytes)