Need suggestions on diagnosing a KMS-related kernel hang

As another controlled experiment, I installed Debian 12.9 on an iSCSI disk and booted my machine from it. This is a fresh install without any customization.

Unfortunately, the result of this experiment is once again negative. The combination of the stock 6.1.0 kernel and the 570 driver leads to an immediate hang after boot. I can’t even see the login screen this time. The panel goes into power-saving mode and toggling Caps Lock no longer works. The SSH server still accepts connections but a new session cannot spawn, likely because the kernel is unresponsive. This is definitely a kernel hang. Unfortunately, I can’t capture a call stack this time because the journal only recorded ‘Xorg hang’ before the system fully locked up.

The next experiment I attempted was upgrading to Debian testing with the 6.12.12 kernel. The same symptom persists. I also switched between the two kernel module flavors (the proprietary and the GPL version) and nothing changed.

@morgwai666, you mentioned a crash with the 6.1 kernel. I suppose we have different symptoms, as upgrading the kernel solves your problem but not mine.

It’s ironic that a fresh install is even worse in my case. With my old system, the stock 6.12.17 kernel (from Debian unstable) actually works with the 570 driver (in the sense that it does not hang before login, only after logout). I think the new OS release probably has different default settings that make the situation worse. Could it be Wayland related? I heard many distros have switched to Wayland by default nowadays. Is that also the case for Debian?

The bottom line: the issue is not specific to my old OS but also affects a fresh install. Likely there is something special about the hardware I use…

Here is my hardware spec for reference:

  • E3-1231 v3 CPU
  • ASUS H97M-PLUS motherboard
  • GeForce RTX 3090 Founders Edition

I’m glad to come back and report some good progress I made just now. This is the only good news I’ve had among all the frustrations in this thread.

I’ve captured a lot of call stacks, but the mysterious mangled symbols from the binary blob gave no information at all. Fortunately, as a side product of my experiment with the GPL kernel module, I finally got a readable call stack. (It’s ironic to see a binary blob in the GPL module source, but at least this time it’s no longer mangled.)

As the stack suggests, the kernel is stuck in the DisplayPort detach code path for an unknown reason. Fortunately, I currently have a dual-panel setup (1 DP + 1 HDMI), so I could run another experiment by unplugging the DP panel and rebooting. The result is very promising: all the hangs are gone without the DP panel, on both the new and the old systems.

Unfortunately, my GPU only comes with a single HDMI port, so a DP port is something I cannot avoid with a two-panel setup. Even though I know a little more about this issue, the problem is not solved for me. @aplattner, do you have any information about such a DP-related hang? It does not look KMS related anymore…

One funny aspect of this issue is that the symptom differs between the two OS setups I played with.
My daily system can deal with the DP panel just fine as long as I don’t attempt to log out of my account, while the experimental system can’t even log in. By ‘can deal with’, I mean everything normal users would expect, including unplug (power-off), power saving, wake-up and normal display. The system can also read the panel information just fine (it’s a Dell U2410 on DP-4, BTW).

During the DP experiment, I also tried moving the panel to a different port. It did not help at all. To my surprise, my daily system began to hang at login when the panel was moved to DP-3. This frightened me to death. Luckily, moving it back to DP-4 rescued it. I’m not sure why DP-4 is special (for my daily system only). Maybe some old configuration in my daily system helped? I hope such details make it easier for you to reason about this issue, @aplattner.

Mar 07 19:43:39 Hostname kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 1642s! [Xorg:1038]                                                                                              
Mar 07 19:43:39 Hostname kernel: Modules linked in: nvidia_uvm(OE) snd_seq_dummy snd_hrtimer snd_seq qrtr rfcomm bnep binfmt_misc nls_ascii nls_cp437 vfat fat intel_rapl_msr intel_rapl_common >
Mar 07 19:43:39 Hostname kernel:  soundcore apple_mfi_fastcharge acpi_pad evdev sg isci libsas scsi_transport_sas drm msr parport_pc ppdev lp parport configfs efi_pstore nfnetlink efivarfs ip_>
Mar 07 19:43:39 Hostname kernel: CPU: 0 UID: 0 PID: 1038 Comm: Xorg Tainted: G           OEL     6.12.12-amd64 #1  Debian 6.12.12-1
Mar 07 19:43:39 Hostname kernel: Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
Mar 07 19:43:39 Hostname kernel: Hardware name: ASUS All Series/H97M-PLUS, BIOS 3602 04/08/2018
Mar 07 19:43:39 Hostname kernel: RIP: 0010:_ZN11DisplayPort13ConnectorImpl15notifyDetachEndEb+0xda/0x320 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel: Code: eb 1b 66 0f 1f 44 00 00 48 8b 03 48 89 ee 48 89 df 48 8b 40 20 e8 76 c4 0a ed 48 89 c5 48 8b 03 48 89 ee 48 89 df 48 8b 40 20 <e8> 61 c4 0a ed 48 85 c0 7>
Mar 07 19:43:39 Hostname kernel: RSP: 0018:ffffb932cbf77778 EFLAGS: 00000216
Mar 07 19:43:39 Hostname kernel: RAX: ffffffffc2068510 RBX: ffff9bf5a29e5408 RCX: ffff9bf58d1bdd08
Mar 07 19:43:39 Hostname kernel: RDX: ffff9bf5a2b69808 RSI: ffff9bf5a2b69808 RDI: ffff9bf5a29e5408
Mar 07 19:43:39 Hostname kernel: RBP: ffff9bf5a2b69808 R08: 0000000000000000 R09: 0000000000000000
Mar 07 19:43:39 Hostname kernel: R10: 0000000000000001 R11: ffffffffffffffff R12: ffff9bf5a2b6e008
Mar 07 19:43:39 Hostname kernel: R13: ffff9bf5a2b6e140 R14: 0000000000000000 R15: ffffb932cbea96e0
Mar 07 19:43:39 Hostname kernel: FS:  00007f205fa3ab00(0000) GS:ffff9bfc3ea00000(0000) knlGS:0000000000000000
Mar 07 19:43:39 Hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 07 19:43:39 Hostname kernel: CR2: 000055bf41d80528 CR3: 00000001075fa003 CR4: 00000000001726f0
Mar 07 19:43:39 Hostname kernel: Call Trace:
Mar 07 19:43:39 Hostname kernel:  <IRQ>
Mar 07 19:43:39 Hostname kernel:  ? watchdog_timer_fn.cold+0x3d/0xa1
Mar 07 19:43:39 Hostname kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Mar 07 19:43:39 Hostname kernel:  ? __hrtimer_run_queues+0x132/0x2a0
Mar 07 19:43:39 Hostname kernel:  ? hrtimer_interrupt+0xfa/0x210
Mar 07 19:43:39 Hostname kernel:  ? __sysvec_apic_timer_interrupt+0x55/0x100
Mar 07 19:43:39 Hostname kernel:  ? sysvec_apic_timer_interrupt+0x6c/0x90
Mar 07 19:43:39 Hostname kernel:  </IRQ>
Mar 07 19:43:39 Hostname kernel:  <TASK>
Mar 07 19:43:39 Hostname kernel:  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
Mar 07 19:43:39 Hostname kernel:  ? _ZThn32_N11DisplayPort9GroupImpl7expiredEPKv+0x10/0x10 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  ? _ZN11DisplayPort13ConnectorImpl15notifyDetachEndEb+0xda/0x320 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  ? _ZN11DisplayPort13ConnectorImpl15notifyDetachEndEb+0xca/0x320 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  _ZN11DisplayPort13ConnectorImpl13dpPostModesetEv+0x81/0x90 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  nvDPPostSetMode+0x73/0x200 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  KickoffModesetUpdateState+0x1a8/0x270 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  nvSetDispModeEvo+0x3b9e/0x4150 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  ? Flip+0xf0/0xf0 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  nvKmsIoctl+0xf2/0x240 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  nvkms_unlocked_ioctl+0x10c/0x180 [nvidia_modeset]
Mar 07 19:43:39 Hostname kernel:  __x64_sys_ioctl+0x94/0xd0
Mar 07 19:43:39 Hostname kernel:  do_syscall_64+0x82/0x190
Mar 07 19:43:39 Hostname kernel:  ? syscall_exit_to_user_mode+0x4d/0x210
Mar 07 19:43:39 Hostname kernel:  ? do_syscall_64+0x8e/0x190
Mar 07 19:43:39 Hostname kernel:  ? nvidia_unlocked_ioctl+0x160/0x8c0 [nvidia]
Mar 07 19:43:39 Hostname kernel:  ? syscall_exit_to_user_mode+0x4d/0x210
Mar 07 19:43:39 Hostname kernel:  ? do_syscall_64+0x8e/0x190
Mar 07 19:43:39 Hostname kernel:  ? os_acquire_spinlock+0x12/0x30 [nvidia]
Mar 07 19:43:39 Hostname kernel:  ? portSyncSpinlockAcquire+0x1d/0x50 [nvidia]
Mar 07 19:43:39 Hostname kernel:  ? threadStateFree+0xde/0x1f0 [nvidia]
Mar 07 19:43:39 Hostname kernel:  ? rm_ioctl+0x7a/0x4f0 [nvidia]
Mar 07 19:43:39 Hostname kernel:  ? nvidia_unlocked_ioctl+0x160/0x8c0 [nvidia]
Mar 07 19:43:39 Hostname kernel:  ? syscall_exit_to_user_mode+0x4d/0x210
Mar 07 19:43:39 Hostname kernel:  ? do_syscall_64+0x8e/0x190
Mar 07 19:43:39 Hostname kernel:  ? exc_page_fault+0x7e/0x180
Mar 07 19:43:39 Hostname kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Mar 07 19:43:39 Hostname kernel: RIP: 0033:0x7f205ff1637b
Mar 07 19:43:39 Hostname kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1>
Mar 07 19:43:39 Hostname kernel: RSP: 002b:00007ffc3f10b910 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Mar 07 19:43:39 Hostname kernel: RAX: ffffffffffffffda RBX: 0000000000000015 RCX: 00007f205ff1637b
Mar 07 19:43:39 Hostname kernel: RDX: 00007ffc3f10b970 RSI: 00000000c0106d00 RDI: 0000000000000015
Mar 07 19:43:39 Hostname kernel: RBP: 00000000c0106d00 R08: 0000000000000000 R09: 000055bf41920c70
Mar 07 19:43:39 Hostname kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc3f10b970
Mar 07 19:43:39 Hostname kernel: R13: 000055bf4192beb8 R14: 00007ffc3f110768 R15: 0000000000000040
Mar 07 19:43:39 Hostname kernel:  </TASK>
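
For anyone trying to read the trace above: entries prefixed with `?` are addresses the unwinder found on the stack but could not confirm as part of the call chain, so the unmarked frames are the ones to trust. The C++ names are also still in their mangled form. Below is a minimal Python sketch that keeps only the unmarked frames and demangles the simple `_ZN...E` nested-name form. The helper names (`demangle_nested`, `reliable_frames`) are my own invention, the demangler deliberately ignores argument types, thunks such as `_ZThn32_...`, templates and so on, and `c++filt` is the proper tool for anything beyond this.

```python
import re

# A few frames copied from the trace above (journal prefix stripped for brevity).
TRACE = """\
 ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
 ? _ZN11DisplayPort13ConnectorImpl15notifyDetachEndEb+0xda/0x320 [nvidia_modeset]
 _ZN11DisplayPort13ConnectorImpl13dpPostModesetEv+0x81/0x90 [nvidia_modeset]
 nvDPPostSetMode+0x73/0x200 [nvidia_modeset]
 KickoffModesetUpdateState+0x1a8/0x270 [nvidia_modeset]
 nvSetDispModeEvo+0x3b9e/0x4150 [nvidia_modeset]
 ? Flip+0xf0/0xf0 [nvidia_modeset]
 nvKmsIoctl+0xf2/0x240 [nvidia_modeset]
 nvkms_unlocked_ioctl+0x10c/0x180 [nvidia_modeset]
 __x64_sys_ioctl+0x94/0xd0
"""

# Optional "?" marker, symbol, +offset/size, optional [module].
FRAME = re.compile(
    r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)


def demangle_nested(symbol):
    """Demangle only the simple `_ZN<len><name>...E` form; return anything else unchanged."""
    if not symbol.startswith("_ZN"):
        return symbol
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos] != "E":
        m = re.match(r"\d+", symbol[pos:])
        if not m:
            return symbol  # a construct this sketch does not understand
        length = int(m.group())
        pos += len(m.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    return "::".join(parts)


def reliable_frames(trace):
    """Return (function, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.match(line)
        if m and not m.group(1):
            frames.append((demangle_nested(m.group(2)), m.group(3)))
    return frames


frames = reliable_frames(TRACE)
for function, module in frames:
    print(f"{function}  [{module or 'kernel'}]")
```

Read together with the RIP line, this gives the chain ioctl → `nvKmsIoctl` → `nvSetDispModeEvo` → `nvDPPostSetMode` → `dpPostModeset`, with the CPU spinning inside `notifyDetachEnd`.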

I have to report some bad news. My daily system has begun to get stuck with the login hang. The rescue I mentioned appears to be only a temporary workaround. I now need to power off the DP panel before turning on the computer to avoid the login hang. Turning the DP panel back on after the GPU has selected HDMI as the boot output appears to be harmless.

This is much more annoying than before… I really hope I will not need to put up with this kind of weird workaround from now on…

The recent progress reminds me of one thing I hadn’t connected before. I have a Windows system on the same hardware, installed on an iSCSI disk. This Windows system also suffers from a similar symptom with the DP panel. The DP panel shows the boot progress just fine until the system reaches the Windows login screen. Once Windows reaches the login screen, the DP panel is put into power-saving mode. I need to toggle the power state of the DP panel to force the system into and out of single-head mode. The system functions normally from then on. It looks very similar to the new symptom I’m now seeing on my daily Linux system. The only difference is that Windows never hangs. I think the symptom on Windows only shows up with recent driver versions, but I can’t remember exactly when the regression appeared…

I guess this may be due to outdated user-space components (libraries etc.) on Debian 12: I’m ready to bet that if you try a fresh install of Debian 13 it will work just fine, at least on X11 (no idea about Wayland).

Regarding the X11/Wayland default on Debian: AFAIK Wayland is the default for GNOME. I use MATE, for which the default is fortunately still X11.

Debian 13 is not a frozen release yet, so there is no guarantee that I can reproduce your environment.

That said, my problem has lasted a long time, across different kernel and driver versions. Even the Windows system suffered from a similar symptom. I really have no confidence in user-land experiments anymore. I believe it’s just a corner-case driver bug…

I just realized that this is not the same underlying symptom as before, even though both cases appear as a freeze.

This time there is no watchdog call stack, only the following:

2025-03-08T22:26:17.587259+08:00 Hostname kernel: [   50.753839] NVRM: Xid (PCI:0000:01:00): 16, Head 00000003 Count 00000222
2025-03-08T22:26:24.471039+08:00 Hostname kernel: [   58.945390] NVRM: Xid (PCI:0000:01:00): 16, Head 00000003 Count 00000223
2025-03-08T22:26:32.666232+08:00 Hostname kernel: [   67.136987] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000224
2025-03-08T22:26:35.030230+08:00 Hostname kernel: [   69.504047] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1230
2025-03-08T22:26:37.034225+08:00 Hostname kernel: [   71.505342] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1230
2025-03-08T22:26:40.858222+08:00 Hostname kernel: [   75.329052] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000225
2025-03-08T22:27:04.410219+08:00 Hostname kernel: [   98.880581] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000226
2025-03-08T22:27:12.602202+08:00 Hostname kernel: [  107.070970] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000227
2025-03-08T22:27:20.794231+08:00 Hostname kernel: [  115.262009] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000228
2025-03-08T22:27:28.986223+08:00 Hostname kernel: [  123.453724] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000229
2025-03-08T22:27:37.178222+08:00 Hostname kernel: [  131.645477] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 0000022a
2025-03-08T22:27:40.860054+08:00 Hostname kernel: [  135.331054] NVRM: Going over RM unhandled interrupt threshold for irq 50
2025-03-08T22:27:45.370218+08:00 Hostname kernel: [  139.837460] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 0000022b
2025-03-08T22:27:50.306221+08:00 Hostname kernel: [  144.773133] NVRM: Going over RM unhandled interrupt threshold for irq 50
2025-03-08T22:27:53.562223+08:00 Hostname kernel: [  148.029281] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 0000022c
2025-03-08T22:28:01.751545+08:00 Hostname kernel: [  156.222510] NVRM: Xid (PCI:0000:01:00): 16, Head 00000003 Count 0000022d
2025-03-08T22:28:09.946227+08:00 Hostname kernel: [  164.413362] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 0000022e
2025-03-08T22:28:18.138228+08:00 Hostname kernel: [  172.605275] NVRM: Xid (PCI:0000:01:00): 16, Head 00000003 Count 0000022f
2025-03-08T22:28:26.330224+08:00 Hostname kernel: [  180.794864] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000230
2025-03-08T22:28:34.522204+08:00 Hostname kernel: [  188.986427] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000231
2025-03-08T22:28:34.978196+08:00 Hostname kernel: [  189.442832] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1221
2025-03-08T22:28:36.978197+08:00 Hostname kernel: [  191.443512] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:4:0:1230
2025-03-08T22:28:38.978199+08:00 Hostname kernel: [  193.443934] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1230
2025-03-08T22:28:40.978197+08:00 Hostname kernel: [  195.444370] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1221
2025-03-08T22:28:42.978198+08:00 Hostname kernel: [  197.445008] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:4:0:1230
2025-03-08T22:28:44.078197+08:00 Hostname kernel: [  198.543779] NVRM: Going over RM unhandled interrupt threshold for irq 50
2025-03-08T22:28:44.978196+08:00 Hostname kernel: [  199.445239] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1230
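
To make the pattern in this log easier to see, here is a rough Python sketch that tallies the three kinds of messages and measures the spacing between Xid events. The sample lines are copied from the log above with the journal timestamp stripped; the function names and category labels are my own. As far as I recall, NVIDIA's Xid documentation lists Xid 16 as a display engine hang, but please treat that as an assumption rather than something this log proves.

```python
import re
from collections import Counter

# Sample lines copied from the log above (journal timestamp prefix removed).
LOG = """\
[   50.753839] NVRM: Xid (PCI:0000:01:00): 16, Head 00000003 Count 00000222
[   58.945390] NVRM: Xid (PCI:0000:01:00): 16, Head 00000003 Count 00000223
[   67.136987] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000224
[   69.504047] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:6:0:1230
[   75.329052] NVRM: Xid (PCI:0000:01:00): 16, pid=2807, name=Xorg, Head 00000003 Count 00000225
[  135.331054] NVRM: Going over RM unhandled interrupt threshold for irq 50
"""

LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")           # [seconds since boot] message
XID = re.compile(r"NVRM: Xid \(PCI:[0-9a-f:.]+\): (\d+),")  # Xid number after the PCI address


def classify(message):
    """Map a kernel message to one of a few hand-picked categories."""
    m = XID.search(message)
    if m:
        return f"Xid {m.group(1)}"
    if "Idling display engine timed out" in message:
        return "display idle timeout"
    if "unhandled interrupt threshold" in message:
        return "unhandled IRQ threshold"
    return "other"


def summarize(log):
    """Count messages per category and return the gaps (seconds) between Xid events."""
    counts = Counter()
    xid_times = []
    for line in log.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        stamp, message = float(m.group(1)), m.group(2)
        kind = classify(message)
        counts[kind] += 1
        if kind.startswith("Xid"):
            xid_times.append(stamp)
    gaps = [round(b - a, 2) for a, b in zip(xid_times, xid_times[1:])]
    return counts, gaps


counts, gaps = summarize(LOG)
print(dict(counts))
print(gaps)
```

On the sample lines the Xid events arrive at a nearly constant interval of a little over 8 seconds, which looks more like a periodic retry or timeout than random noise. Because `LINE` uses `search`, the same function should also accept the full journal lines with their timestamp prefix.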

I dist-upgraded my test box to Debian unstable and switched to SLiM + MATE.

I was not able to do a clean install because net-booting the installer only works with a stable release (due to a kernel module version mismatch, I think). During the upgrade, whenever the installer asked, I chose to migrate the configuration to minimize leftovers from the old release. I hope this does not defeat the purpose of this experiment.

The overall result of this experiment is still negative. With SLiM + Debian unstable, I was able to see the login screen, but the system still hangs before the MATE session can load. With GNOME + Debian stable (actually stable + testing mixed so as to pull in the 6.12 kernel), the hang happens before the login screen. That’s the only difference I can observe.

I checked the Windows driver version in this case. It was one of the 560 series. I tried the latest Windows driver release, version 572, and the problem appears to be gone, hopefully. I’m not fully confident, since the symptom on my daily Linux system has now become non-deterministic. The new hang associated with ‘NVRM: Going over RM unhandled interrupt threshold’ only shows up by chance at boot time. Hopefully the 572 version gives me a deterministic fix for my Windows system.

I’m glad to report that I finally resolved my issue. To my surprise, the symptoms went away after I carefully cleaned and re-plugged my DisplayPort cable.

I don’t understand how a cable connectivity issue can lead to such subtle symptoms (normal usage in Windows, at least with some driver versions, and normal daily Linux usage except for the hang on logout), but that turns out to be the fact I’m facing.

So I think this long thread can finally be closed. Out of curiosity, though, I would like to invite @aplattner to chime in on the theory under the hood, if an educated guess can be made without too much effort. I hope this can help other people who run into this issue in the future…