nvkms_close deadlock on semaphore held by Xorg parked in nvkms_yield (580.126.09, RTX 3050, 6.17 kernel)

Summary

Kernel hang in nvidia_modeset: a process closing its /dev/nvidia-modeset file descriptor blocks indefinitely in nvkms_close() on a semaphore the kernel hung-task watchdog identifies as “likely last held by task Xorg”. Xorg itself remains in R state, apparently still inside nvidia_modeset code paths (nvkms_get_usec, nvkms_call_rm, multiple _nv*kms / _nv*rm frames). Every subsequent process that attempts to close its /dev/nvidia-modeset fd (e.g. during exit cleanup) joins the same wait, cascading the hang across the session. VT switching is blocked, nvidia-smi hangs, and SIGTERM/SIGKILL are ineffective because the target processes can no longer complete __fput.

No prior GPU error (no XID, no reset, no OOM) preceded this. The first symptom was an application hang (ghostty terminal emulator) that would not exit; the kernel hung-task watchdog fired 1228s later.

Environment

  • GPU: NVIDIA GeForce RTX 3050 6GB (GA107)
  • Driver: NVRM 580.126.09 (proprietary, built 2026-01-07)
  • Kernel: 6.17.0-20-generic #20~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC
  • Distro: Ubuntu 24.04.4 LTS (noble)
  • Desktop: GNOME Shell on Xorg (org.gnome.Shell@x11.service)
  • Loaded modules: nvidia 104144896 (refcount 596), nvidia_modeset 1638400 (34), nvidia_drm 135168 (26), nvidia_uvm 2076672 (2)

Reproduction

Not a deterministic reproducer. The machine had been up 6+ days with a normal desktop workload (Firefox, GNOME terminals, Electron apps, Steam, a local 3-worker R PSOCK cluster doing pure CPU work — no GPU compute). The hang appeared after an extended session, triggered by a ghostty terminal exit.

Key stack (kernel hung-task watchdog)

The victim, ghostty, D state for 1228s, wchan nvkms_close:

INFO: task ghostty:7328 blocked for more than 1228 seconds.
      Tainted: P        W  OE       6.17.0-20-generic #20~24.04.1-Ubuntu
      Blocked by coredump.
task:ghostty         state:D stack:0     pid:7328  tgid:7328  ppid:5799
Call Trace:
 <TASK>
 __schedule+0x30d/0x7a0
 schedule+0x27/0x90
 schedule_timeout+0x104/0x110
 ___down_common+0x107/0x180
 __down_common+0x58/0x130
 __down+0x1d/0x30
 down+0x60/0x80
 nvkms_close+0x35/0xc0 [nvidia_modeset]
 __fput+0xed/0x2d0
 ____fput+0x15/0x20
 task_work_run+0x60/0xa0
 do_exit+0x1fa/0x480
 do_group_exit+0x34/0x90
 get_signal+0x832/0x840
 arch_do_signal_or_restart+0x41/0x200
 exit_to_user_mode_loop+0x91/0x170
 do_syscall_64+0x198/0xa40
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
 </TASK>

The suspected holder, Xorg, still R, deep inside nvidia_modeset:

INFO: task ghostty:7328 blocked on a semaphore likely last held by task Xorg:5963
task:Xorg            state:R  running task     stack:0     pid:5963
Call Trace:
 <TASK>
 __schedule+0x515/0x7a0
 ? nvkms_get_usec+0x34/0xb0 [nvidia_modeset]
 ? _nv000172kms+0xc5/0x160 [nvidia_modeset]
 ? os_release_spinlock+0x1a/0x30 [nvidia]
 ? _nv000386kms+0xf0/0x180 [nvidia_modeset]
 ? _nv014193rm+0x86/0xa0 [nvidia]
 ? _nv000652rm+0x5e/0x70 [nvidia]
 ? nvidia_modeset_rm_ops_free_stack+0x45/0x50 [nvidia]
 ? nvkms_call_rm+0x5e/0x90 [nvidia_modeset]
 ? _nv003169kms+0x42/0x50 [nvidia_modeset]
 ? _nv002902kms+0xcd/0x180 [nvidia_modeset]
 ...

The hung-task watchdog repeated identical traces roughly 2 and 4 minutes after the first report; the trace did not change, indicating a durable deadlock rather than slow progress.
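For anyone checking their own kernel log for the same pattern, here is a minimal sketch that pulls out the two hung-task header lines quoted above and groups them by blocked task. The function name and the regular expressions are my own and are matched only against the message wording shown in this report; other kernel versions may word these lines differently.

```python
import re
from collections import defaultdict

# Patterns written against the two header lines quoted in this report.
BLOCKED_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
HOLDER_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked on a semaphore "
    r"likely last held by task (?P<hcomm>.+?):(?P<hpid>\d+)"
)


def summarize_hung_tasks(log_text):
    """Map (comm, pid) -> reported blocked durations and suspected holders."""
    summary = defaultdict(lambda: {"reports": [], "holders": set()})
    for line in log_text.splitlines():
        m = HOLDER_RE.search(line)
        if m:
            key = (m["comm"], int(m["pid"]))
            summary[key]["holders"].add((m["hcomm"], int(m["hpid"])))
            continue
        m = BLOCKED_RE.search(line)
        if m:
            key = (m["comm"], int(m["pid"]))
            summary[key]["reports"].append(int(m["secs"]))
    return dict(summary)


if __name__ == "__main__":
    # The two lines below are copied from the traces in this report.
    sample = (
        "INFO: task ghostty:7328 blocked for more than 1228 seconds.\n"
        "INFO: task ghostty:7328 blocked on a semaphore likely last held by task Xorg:5963\n"
    )
    for task, info in summarize_hung_tasks(sample).items():
        print(task, info["reports"], sorted(info["holders"]))
```

Feeding it the full kern_excerpt.log should show whether every report names the same blocked task and the same suspected holder, which is what a durable deadlock would look like.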

A live /proc/7328/stack captured ~10 hours after the initial dump is byte-identical to the watchdog trace — ghostty has made zero kernel-side progress in 10+ hours. A live /proc/5963/stack for Xorg at the same point reduces to a single frame:

[<0>] nvkms_yield+0xe/0x20 [nvidia_modeset]

So Xorg is parked inside nvkms_yield, voluntarily yielding the CPU in what appears to be a wait loop within the module, while apparently holding the semaphore that every other process’s nvkms_close is blocked on. This narrows the problem down: whatever nvkms_yield loop Xorg entered seems to have no exit condition that can be satisfied in the current driver state, and the close-path semaphore stays held for the duration.
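A rough way to collect the same evidence on another affected machine is to list tasks in D state together with their wait channel. The sketch below is only an illustration: the helper names are hypothetical, it looks at thread-group leaders only (not the per-thread entries under /proc/<pid>/task), and /proc/<pid>/wchan may read as 0 or be empty depending on kernel configuration and permissions. Reading /proc/<pid>/stack, as done for this report, normally needs root.

```python
import os


def parse_stat(stat_line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so the split is done on the last ')' in the line.
    """
    left, right = stat_line.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    return int(pid_str), comm, right.split()[0]


def read_text(path):
    """Read a small procfs file, returning '' if it is gone or unreadable."""
    try:
        with open(path, errors="replace") as f:
            return f.read().strip()
    except OSError:
        return ""


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm, wchan)] for thread-group leaders in D state."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found
    for entry in entries:
        if not entry.isdigit():
            continue
        stat = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat:
            continue
        pid, comm, state = parse_stat(stat)
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            found.append((pid, comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in uninterruptible_tasks():
        print(pid, comm, wchan)
```

Running it twice some minutes apart and comparing the output gives a cheap version of the "no progress" check described above.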

Observed downstream effects

  • SIGTERM to gnome-shell (PID 6284) accepted but process never exited — blocks in its own __fput path once it reaches a /dev/nvidia-modeset fd
  • systemctl --user restart org.gnome.Shell@x11.service refused (service has RefuseManualStart/Stop); systemctl --user kill ineffective for the same kernel-level reason
  • Ctrl+Alt+F<n> VT switch does not change active VT (/sys/class/tty/tty0/active remained tty2) — Xorg cannot release DRM master while stuck inside nvidia_modeset
  • nvidia-smi hangs indefinitely — the driver cannot service new ioctls
  • “Blocked by coredump” annotation on the ghostty trace suggests ghostty was being core-dumped when the deadlock was triggered; the coredump handler’s fd cleanup walked through /dev/nvidia-modeset and never returned

Only a hard reboot clears the state. A graceful reboot is likely to hang on shutting down user services that hold nvidia_modeset fds.
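To estimate in advance which processes are likely to get stuck on exit, one can look for open /dev/nvidia-modeset descriptors through the /proc/<pid>/fd symlinks. This is a sketch under assumptions: the function name is my own, it sees only processes the caller is allowed to inspect (run as root for a full picture), and I have not verified that walking procfs stays responsive on a machine already in this state.

```python
import os


def holders_of(device="/dev/nvidia-modeset", proc_root="/proc"):
    """Return sorted PIDs with `device` open, judged by /proc/<pid>/fd symlinks."""
    pids = set()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []
    for entry in entries:
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            # Process already exited, or we may not inspect it.
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target == device:
                pids.add(int(entry))
                break
    return sorted(pids)


if __name__ == "__main__":
    print(holders_of())
```

Any PID in the result would have to pass through nvkms_close when it exits, so a graceful shutdown has to get each of them past the same semaphore.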

Attached

  • system.txt — uname, driver version, lsmod
  • kern_excerpt.log — kernel log from 23:43:18 through morning (includes three successive hung-task dumps of the same stack)
  • stuck_procs.txt — ps/wchan for ghostty, Xorg, gnome-shell
  • live_stacks.txt — /proc/<pid>/stack for ghostty (7328) and Xorg (5963) captured ~10 hours after the initial hung-task dump. ghostty’s stack is identical to the watchdog dump, confirming no progress. Xorg’s live stack reduces to a single frame: nvkms_yield+0xe/0x20 [nvidia_modeset]
  • ps_auxf.txt — full process tree at the time of diagnosis

Hypothesis

Either (a) a lock-ordering bug between nvkms_close and whatever Xorg path is holding the semaphore, (b) an error path in _nv*kms/_nv*rm that leaks a down() without a matching up(), or (c) Xorg is itself blocked waiting on GPU state (firmware, DPC, etc.) and will never release the semaphore — in which case the down() in nvkms_close probably ought to be interruptible or have a timeout so SIGKILL can at least free user-space.
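To make the suggestion in (c) concrete: the kernel semaphore API offers down_interruptible(), down_killable() and down_timeout() next to the plain down() seen in the ghostty trace. The snippet below is only a user-space analogy in Python, not driver code; try_close and the 0.1 s timeout are made up for illustration. It shows the behavioural difference: a bounded wait returns control to the caller when the holder never releases, where an unbounded acquire would block forever.

```python
import threading


def try_close(sem, timeout_s):
    """Stand-in for a close path that gives up instead of waiting forever."""
    if not sem.acquire(timeout=timeout_s):
        return False  # holder never released; caller regains control
    try:
        return True  # teardown work would run here, under the semaphore
    finally:
        sem.release()


if __name__ == "__main__":
    stuck = threading.Semaphore(1)
    stuck.acquire()  # plays the holder that never releases
    print(try_close(stuck, 0.1))  # prints False after about 0.1 s
```

One caveat for the real driver: the VFS ignores the return value of a file's release handler, so a close path that times out cannot simply report failure. It would presumably have to defer the per-fd cleanup to some later point, which is a design decision only the driver authors can make.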

live_stacks.txt (415 Bytes)

stuck_procs.txt (559 Bytes)

kern_excerpt.log (76.4 KB)

system.txt (725 Bytes)

d_state_processes.txt (1.7 KB)

ps_auxf.txt (75.2 KB)

Looks like I have a similar issue while using KDE + Wayland (595.58.03).

journalctl.txt (10.6 KB)