Need suggestions on diagnosing a KMS-related kernel hang

I’m trying to get KMS working on my machine but unfortunately ran into problems that I need some suggestions to debug.

I’m on Debian 12.4 with Linux kernel 6.1.0, GNOME 43 and NVIDIA driver 535.43.02. The GPU is a single RTX 3090 driving dual panels. The machine works just fine without the nvidia_drm.modeset=1 parameter.
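
(For reference, I’m enabling KMS the usual way via the kernel command line; a minimal sketch assuming GRUB as the bootloader:)

# /etc/default/grub (assuming GRUB; adjust accordingly for other boot loaders)
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvidia_drm.modeset=1"
sudo update-grub    # regenerate grub.cfg, then reboot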

However, once I add that parameter to the kernel command line, I can no longer log in to my X session. The GDM greeter still shows up as usual, but once I log in to my account, the machine appears to hit a kernel hang: the main panel enters power-saving mode, the other panel still shows the GDM wallpaper, the mouse cursor is gone, and the keyboard does not respond (including the Caps Lock light). The system is not fully dead at that point, as it still accepts SSH connections, but that doesn’t last very long: the connection suddenly gets lost and never comes back, even though the system still responds to ping. The only thing I could do was force a dirty power-off.

Maybe I didn’t check the right log files, but I wasn’t able to find anything special during or after the hang. The only thing I can find is a ‘Failed to grab modeset ownership’ line, which has been explained as harmless in other threads.

Is there any debug log control I can enable to diagnose this issue?

EDIT: I should mention that this driver version also shows a similar issue with KMS disabled, but without KMS the problem only happens after I log off my session, i.e. a system hang with a dirty power-off required.

Since I rarely log off my session rather than just powering off, this issue is more of an annoyance than a blocker in my daily usage. But it really leads me to suspect that this specific driver version is bad, since I don’t remember any similar behavior before… I was on 525.147.05 as provided by Debian stable and didn’t notice any issue with that version. I did a blind upgrade to 535.43.02 from Debian experimental as I thought a newer version might provide better Wayland compatibility. Maybe I should try a different driver version first…

I reproduced the kernel hang symptom with KMS disabled using the stock Debian kernel with the hang detector enabled. The complete dmesg log from this boot is captured here:
kernel_lockup.log.gz (31.0 KB)
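
(The “hang detector” above is just the kernel’s lockup detection; roughly the knobs involved, as a sketch – exact availability depends on the kernel config:)

# soft-lockup / hung-task detection (sketch; sysctl names from the kernel admin-guide docs)
sudo sysctl kernel.watchdog=1
sudo sysctl kernel.softlockup_all_cpu_backtrace=1
sudo sysctl kernel.hung_task_timeout_secs=30
# or persistently on the kernel command line: nmi_watchdog=1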

This reproduction was on NVIDIA driver version 525.147.05: I had rolled back the version but unfortunately observed the exact same symptom. All I did was log in to the GNOME session and log out.

Using the same environment I was able to capture a dmesg log with KMS enabled too. This time it happens right after I log in to my GNOME session. The log can be found here:
kernel_lockup_kms.log.gz (29.8 KB)

At a quick glance, the two cases appear to share a very similar call stack when the hang happens.

2024-01-20T23:38:52.246708+08:00 Hostname kernel: [  224.174284] Call Trace:
2024-01-20T23:38:52.246709+08:00 Hostname kernel: [  224.174285]  <IRQ>
2024-01-20T23:38:52.246710+08:00 Hostname kernel: [  224.174286]  ? watchdog_timer_fn+0x1a4/0x200
2024-01-20T23:38:52.246710+08:00 Hostname kernel: [  224.174290]  ? lockup_detector_update_enable+0x50/0x50
2024-01-20T23:38:52.246711+08:00 Hostname kernel: [  224.174291]  ? __hrtimer_run_queues+0x112/0x2b0
2024-01-20T23:38:52.246712+08:00 Hostname kernel: [  224.174294]  ? hrtimer_interrupt+0xf4/0x210
2024-01-20T23:38:52.246713+08:00 Hostname kernel: [  224.174296]  ? __sysvec_apic_timer_interrupt+0x5d/0x110
2024-01-20T23:38:52.246714+08:00 Hostname kernel: [  224.174299]  ? sysvec_apic_timer_interrupt+0x69/0x90
2024-01-20T23:38:52.246715+08:00 Hostname kernel: [  224.174302]  </IRQ>
2024-01-20T23:38:52.246716+08:00 Hostname kernel: [  224.174302]  <TASK>
2024-01-20T23:38:52.246717+08:00 Hostname kernel: [  224.174303]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
2024-01-20T23:38:52.246718+08:00 Hostname kernel: [  224.174306]  ? _nv001555kms+0x4d0/0x4d0 [nvidia_modeset]
2024-01-20T23:38:52.246718+08:00 Hostname kernel: [  224.174335]  ? _nv001548kms+0x4/0x80 [nvidia_modeset]
2024-01-20T23:38:52.246719+08:00 Hostname kernel: [  224.174363]  ? _nv001117kms+0x97/0x3a0 [nvidia_modeset]
2024-01-20T23:38:52.246720+08:00 Hostname kernel: [  224.174392]  ? _nv002265kms+0x149/0x210 [nvidia_modeset]
2024-01-20T23:38:52.246721+08:00 Hostname kernel: [  224.174415]  ? _nv000497kms+0xe5/0x140 [nvidia_modeset]
2024-01-20T23:38:52.246722+08:00 Hostname kernel: [  224.174454]  ? _nv002667kms+0x31d0/0x32c0 [nvidia_modeset]
2024-01-20T23:38:52.246723+08:00 Hostname kernel: [  224.174482]  ? _copy_from_user+0x46/0x60
2024-01-20T23:38:52.246724+08:00 Hostname kernel: [  224.174486]  ? _nv000332kms+0x50/0x50 [nvidia_modeset]
2024-01-20T23:38:52.246725+08:00 Hostname kernel: [  224.174499]  ? nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
2024-01-20T23:38:52.246726+08:00 Hostname kernel: [  224.174513]  ? nvkms_ioctl+0x117/0x180 [nvidia_modeset]
2024-01-20T23:38:52.246727+08:00 Hostname kernel: [  224.174526]  ? nvidia_frontend_unlocked_ioctl+0x38/0x50 [nvidia]
2024-01-20T23:38:52.246728+08:00 Hostname kernel: [  224.174740]  ? __x64_sys_ioctl+0x90/0xd0
2024-01-20T23:38:52.246728+08:00 Hostname kernel: [  224.174743]  ? do_syscall_64+0x5b/0xc0
2024-01-20T23:38:52.246729+08:00 Hostname kernel: [  224.174746]  ? handle_mm_fault+0xdb/0x2d0
2024-01-20T23:38:52.246730+08:00 Hostname kernel: [  224.174748]  ? do_user_addr_fault+0x1b0/0x580
2024-01-20T23:38:52.246731+08:00 Hostname kernel: [  224.174751]  ? exit_to_user_mode_prepare+0x40/0x1e0
2024-01-20T23:38:52.246732+08:00 Hostname kernel: [  224.174753]  ? entry_SYSCALL_64_after_hwframe+0x64/0xce
2024-01-20T23:38:52.246733+08:00 Hostname kernel: [  224.174756]  </TASK>

Any suggestions on how to diagnose this further?

Why are you forcibly enabling ASPM? That’s rather dangerous. Furthermore, the BIOS is 9 years old; is there no update available?

That was a leftover from my experiments with a Mellanox ConnectX-3 card. I forgot to turn it off, but it turns out it does not make any noticeable difference for my daily usage. As a diagnostic attempt, I did try rebooting with that parameter dropped; unfortunately it didn’t make a difference…
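
(For the record, the setting in question was, I assume, the standard knob for forcing ASPM on, since removed:)

# leftover kernel parameter, now dropped (assuming this is the flag meant)
# GRUB_CMDLINE_LINUX_DEFAULT="... pcie_aspm=force ..."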

Would it be possible to try this with the recently-released 550.40.07 beta driver?

Thanks for the suggestion, @aplattner! Sure, I can give that a try. For now I’m on the distro-managed package, which is why my version is a little bit dated. I’ll need to figure out a clean way to transition back and forth between the vendor release and the distro release. I haven’t done such a thing in a long time and can’t remember if there is anything that needs special care…

Will report back later…

PS: any suggestions on transitioning to (and from) the vendor release would be appreciated…

Good news and bad news. The transition to the vendor package was smooth, but unfortunately the issue still exists.
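
In case it helps anyone else hopping between the two, here is a rough sketch of one way to do the switch-over (assuming the Debian-packaged driver and the official .run installer; not necessarily exactly what I did, and package names may differ):

# distro -> vendor: remove the Debian packages, then run the .run installer from a console
sudo apt purge '*nvidia*' && sudo apt autoremove
sudo systemctl stop gdm
sudo sh ./NVIDIA-Linux-x86_64-550.40.07.run
# vendor -> distro: uninstall the .run driver, then reinstall the packaged one
sudo nvidia-uninstall
sudo apt install nvidia-driver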

Here is the log for the hang with KMS disabled + GDM logout:
kernel_lockup_kmsoff_linux6.1.0_nv550.40.07.log.gz (26.6 KB)

2024-01-27T00:11:05.768704+08:00 Hostname kernel: [  196.069872] CPU: 0 PID: 4551 Comm: Xorg Tainted: P           OE      6.1.0-17-amd64 #1  Debian 6.1.69-1
2024-01-27T00:11:05.768715+08:00 Hostname kernel: [  196.069874] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H97M Pro4, BIOS C2.00 07/27/2015
2024-01-27T00:11:05.768715+08:00 Hostname kernel: [  196.069875] RIP: 0010:_nv001626kms+0x0/0x80 [nvidia_modeset]
2024-01-27T00:11:05.768716+08:00 Hostname kernel: [  196.069912] Code: 48 48 8b 53 28 e9 e5 fd ff ff 45 31 c0 e9 e6 fc ff ff 49 c7 44 24 48 00 00 00 00 48 8b 53 28 e9 96 fd ff ff 66 0f 1f 44 00 00 <f3> 0f 1e fa 55 48 89 e5 41 55 49 89 fd 41 54 49 89 f4 53 48 8d 5f
2024-01-27T00:11:05.768716+08:00 Hostname kernel: [  196.069913] RSP: 0018:ffffac1786e8bac8 EFLAGS: 00000282
2024-01-27T00:11:05.768717+08:00 Hostname kernel: [  196.069914] RAX: ffffffffc1a96c40 RBX: ffff925215346c08 RCX: ffff925281cf2b48
2024-01-27T00:11:05.768717+08:00 Hostname kernel: [  196.069915] RDX: ffff92529f43b408 RSI: ffff92529f43b408 RDI: ffff925215346c08
2024-01-27T00:11:05.768718+08:00 Hostname kernel: [  196.069916] RBP: ffffac1786e8bb10 R08: ffffac1786e8b990 R09: 0000000000000001
2024-01-27T00:11:05.768718+08:00 Hostname kernel: [  196.069917] R10: 0000000000000001 R11: 0000000000000001 R12: ffff92529f43b408
2024-01-27T00:11:05.768719+08:00 Hostname kernel: [  196.069918] R13: ffff92520fc4a008 R14: ffff92520fc4a178 R15: 0000000000000000
2024-01-27T00:11:05.768719+08:00 Hostname kernel: [  196.069919] FS:  00007f8a9a9e7ac0(0000) GS:ffff92551dc00000(0000) knlGS:0000000000000000
2024-01-27T00:11:05.768720+08:00 Hostname kernel: [  196.069920] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-01-27T00:11:05.768720+08:00 Hostname kernel: [  196.069921] CR2: 00005619507ea000 CR3: 00000001494c0003 CR4: 00000000001706f0
2024-01-27T00:11:05.768721+08:00 Hostname kernel: [  196.069922] Call Trace:
2024-01-27T00:11:05.768721+08:00 Hostname kernel: [  196.069924]  <IRQ>
2024-01-27T00:11:05.768722+08:00 Hostname kernel: [  196.069925]  ? watchdog_timer_fn+0x1a4/0x200
2024-01-27T00:11:05.768722+08:00 Hostname kernel: [  196.069928]  ? lockup_detector_update_enable+0x50/0x50
2024-01-27T00:11:05.768723+08:00 Hostname kernel: [  196.069930]  ? __hrtimer_run_queues+0x112/0x2b0
2024-01-27T00:11:05.768723+08:00 Hostname kernel: [  196.069933]  ? hrtimer_interrupt+0xf4/0x210
2024-01-27T00:11:05.768724+08:00 Hostname kernel: [  196.069935]  ? __sysvec_apic_timer_interrupt+0x5d/0x110
2024-01-27T00:11:05.768724+08:00 Hostname kernel: [  196.069938]  ? sysvec_apic_timer_interrupt+0x69/0x90                                                                                     
2024-01-27T00:11:05.768725+08:00 Hostname kernel: [  196.069940]  </IRQ>
2024-01-27T00:11:05.768725+08:00 Hostname kernel: [  196.069941]  <TASK>
2024-01-27T00:11:05.768726+08:00 Hostname kernel: [  196.069941]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
2024-01-27T00:11:05.768758+08:00 Hostname kernel: [  196.069945]  ? _nv001633kms+0x4d0/0x4d0 [nvidia_modeset]
2024-01-27T00:11:05.768761+08:00 Hostname kernel: [  196.069974]  ? _nv001633kms+0x4d0/0x4d0 [nvidia_modeset]
2024-01-27T00:11:05.768762+08:00 Hostname kernel: [  196.070015]  ? _nv001633kms+0x4d0/0x4d0 [nvidia_modeset]
2024-01-27T00:11:05.768762+08:00 Hostname kernel: [  196.070052]  ? _nv001185kms+0x97/0x3c0 [nvidia_modeset]
2024-01-27T00:11:05.768763+08:00 Hostname kernel: [  196.070081]  ? _nv001166kms+0xed/0x100 [nvidia_modeset]
2024-01-27T00:11:05.768763+08:00 Hostname kernel: [  196.070111]  ? _nv002365kms+0x74/0x200 [nvidia_modeset]
2024-01-27T00:11:05.768764+08:00 Hostname kernel: [  196.070133]  ? _nv000524kms+0x160/0x1b0 [nvidia_modeset]
2024-01-27T00:11:05.768764+08:00 Hostname kernel: [  196.070153]  ? _nv002852kms+0x42d6/0x4a40 [nvidia_modeset]
2024-01-27T00:11:05.768765+08:00 Hostname kernel: [  196.070174]  ? _copy_from_user+0x46/0x60
2024-01-27T00:11:05.768765+08:00 Hostname kernel: [  196.070177]  ? _nv000348kms+0xf0/0xf0 [nvidia_modeset]
2024-01-27T00:11:05.768766+08:00 Hostname kernel: [  196.070192]  ? nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
2024-01-27T00:11:05.768766+08:00 Hostname kernel: [  196.070206]  ? nvkms_unlocked_ioctl+0x11b/0x190 [nvidia_modeset]
2024-01-27T00:11:05.768767+08:00 Hostname kernel: [  196.070220]  ? __x64_sys_ioctl+0x90/0xd0
2024-01-27T00:11:05.768768+08:00 Hostname kernel: [  196.070222]  ? do_syscall_64+0x5b/0xc0
2024-01-27T00:11:05.768768+08:00 Hostname kernel: [  196.070225]  ? do_syscall_64+0x67/0xc0
2024-01-27T00:11:05.768769+08:00 Hostname kernel: [  196.070227]  ? do_syscall_64+0x67/0xc0
2024-01-27T00:11:05.768769+08:00 Hostname kernel: [  196.070229]  ? do_syscall_64+0x67/0xc0
2024-01-27T00:11:05.768770+08:00 Hostname kernel: [  196.070230]  ? exit_to_user_mode_prepare+0x40/0x1e0
2024-01-27T00:11:05.768770+08:00 Hostname kernel: [  196.070232]  ? entry_SYSCALL_64_after_hwframe+0x64/0xce
2024-01-27T00:11:05.768771+08:00 Hostname kernel: [  196.070235]  </TASK>

I did some more experiments to rule out impact from my user-land environment, such as user profiles; it’s an aged system that has gone through many rolling upgrades and config customizations anyway.

I created a brand-new user and observed the same issue under that user account, so it looks like the issue has nothing to do with the config files in my home directory. During this experiment, I found that switching between users works properly, despite the mode-switch blank screen in between. Logging out also works as long as there are still other user sessions active; in other words, only logging out the last user session triggers the issue.
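
(The experiment itself was nothing fancy; roughly, with a hypothetical throwaway account:)

sudo adduser kmstest        # hypothetical test account
loginctl list-sessions      # confirm which sessions remain active
# logging out kmstest while my own session stays active: fine
# logging out the last remaining session: hang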

During the driver package migration, I switched to a console and successfully executed systemctl stop gdm without triggering the issue. Note this was on the old driver version, since I was about to perform the switch-over…

EDIT: Found another way of triggering the issue, by doing the following on the console:

systemctl stop gdm
systemctl start gdm

Basically, GDM can only start properly the first time?

Interesting. Starting and stopping gdm repeatedly does work for me so there must be something about your system that’s triggering the issue. Can you please run sudo nvidia-bug-report.sh and attach the bug report log here?

Sure, here comes the requested log:
nvidia-bug-report.log.gz (574.1 KB)

This issue must have a subtle triggering condition, or it would have already been reported and fixed instead of spanning so many driver versions…

Hi @aplattner, are there any findings on this issue?

Hi @aplattner, did you find anything interesting in the bug report log?

I’m still not able to reproduce the problem but I have a theory about the hang. Unfortunately, I haven’t been able to figure out how it could get into that state just from looking at the code. I’ll keep puzzling over it.

Coming back to poke this thread from driver version 570.124.04. Yes, unfortunately the issue still persists.

I quote the kernel’s hang-detection log below for reference. This time it was generated on a custom-built 5.10.223 Debian kernel with KMS disabled.

Hi @aplattner, I wish you could get back to this issue and spare some more time on it. I’m totally running out of ideas…

This is becoming increasingly annoying to me. Recently I attempted to run a VM with virtio-GPU acceleration, which appears to rely on GBM – something only available with KMS enabled?

[ 8546.015022] rcu:     1-....: (5249 ticks this GP) idle=a2e/1/0x4000000000000000 softirq=128535/128535 fqs=2624 
[ 8546.015023]  (t=5250 jiffies g=273061 q=1404)
[ 8546.015024] NMI backtrace for cpu 1
[ 8546.015025] CPU: 1 PID: 25338 Comm: Xorg Tainted: P         C O 5.10.223.xxxx.87 #3
[ 8546.015026] Hardware name: ASUS All Series/H97M-PLUS, BIOS 3602 04/08/2018
[ 8546.015026] Call Trace:
[ 8546.015029]  <IRQ>
[ 8546.015033]  dump_stack+0x57/0x6e
[ 8546.015035]  nmi_cpu_backtrace.cold+0x30/0x65
[ 8546.015037]  ? lapic_can_unplug_cpu+0x70/0x70
[ 8546.015040]  nmi_trigger_cpumask_backtrace+0x80/0x90
[ 8546.015052]  rcu_dump_cpu_stacks+0x9a/0xcc
[ 8546.015053]  rcu_sched_clock_irq.cold+0x202/0x3d5
[ 8546.015056]  ? tick_sched_do_timer+0x90/0x90
[ 8546.015057]  update_process_times+0x8c/0xc0
[ 8546.015058]  tick_sched_handle+0x34/0x50
[ 8546.015059]  tick_sched_timer+0x63/0x80
[ 8546.015069]  __hrtimer_run_queues+0x129/0x270
[ 8546.015070]  hrtimer_interrupt+0xf5/0x290
[ 8546.015073]  __sysvec_apic_timer_interrupt+0x5d/0xd0
[ 8546.015075]  asm_call_irq_on_stack+0x12/0x20
[ 8546.015075]  </IRQ>
[ 8546.015077]  sysvec_apic_timer_interrupt+0x6e/0x80
[ 8546.015078]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 8546.015097] RIP: 0010:_nv001859kms+0x5/0x80 [nvidia_modeset]                                                                                                                               
[ 8546.015098] Code: e9 ed fd ff ff 49 c7 44 24 48 00 00 00 00 48 8b 53 28 e9 a6 fd ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 55 <48> 89 e5 41 55 49 89 fd 41 54 49 89 f4 53 48 8d 5f 38 48 89 df 48
[ 8546.015099] RSP: 0018:ffffb39292923ad0 EFLAGS: 00000282
[ 8546.015100] RAX: ffffffffc7467180 RBX: ffff8e4984572808 RCX: ffff8e4955f50ec8
[ 8546.015101] RDX: ffff8e4953ae2808 RSI: ffff8e4953ae2808 RDI: ffff8e4984572808
[ 8546.015101] RBP: ffffb39292923b20 R08: 0000000000000000 R09: 0000000000000001
[ 8546.015102] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8e4953ae2808
[ 8546.015102] R13: ffff8e4a0c1a2808 R14: ffff8e4a0c1a2980 R15: 0000000000000000
[ 8546.015115]  ? _nv001866kms+0x550/0x550 [nvidia_modeset]
[ 8546.015128]  ? _nv001291kms+0xca/0x3c0 [nvidia_modeset]
[ 8546.015141]  ? _nv001273kms+0x81/0x90 [nvidia_modeset]
[ 8546.015150]  ? _nv002660kms+0x74/0x200 [nvidia_modeset] 
[ 8546.015159]  ? _nv000572kms+0x1d4/0x2b8 [nvidia_modeset]
[ 8546.015167]  ? _nv003176kms+0x4115/0x4730 [nvidia_modeset]
[ 8546.015175]  ? _nv000390kms+0xe0/0xe0 [nvidia_modeset]
[ 8546.015181]  ? nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
[ 8546.015187]  ? nvkms_unlocked_ioctl+0x10a/0x180 [nvidia_modeset]
[ 8546.015188]  ? __x64_sys_ioctl+0x90/0xd0
[ 8546.015190]  ? do_syscall_64+0x33/0x80
[ 8546.015192]  ? entry_SYSCALL_64_after_hwframe+0x67/0xd1
[ 8609.026865] rcu: INFO: rcu_sched self-detected stall on CPU
[ 8609.026881] rcu:     1-....: (21002 ticks this GP) idle=a2e/1/0x4000000000000000 softirq=128535/128535 fqs=10501
[ 8609.026882]  (t=21003 jiffies g=273061 q=5733)
[ 8609.026883] NMI backtrace for cpu 1

Have you tried a newer kernel, for example the one from bookworm-backports? I’m on Debian-13 with kernel 6.12, driver 570 with modeset=1, and generally it works pretty stably for me (except when hot-unplugging my eGPU).
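
Pulling it in is just something like the following (assuming bookworm-backports is already enabled in sources.list):

sudo apt update
sudo apt install -t bookworm-backports linux-image-amd64 linux-headers-amd64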

The newest kernel version I tried was the 6.1 stock kernel one year ago. Sure, I can give 6.12 another try when I get some spare time, but to be honest I don’t have high expectations for the result…

I have the feeling that I’m stuck in some corner case that only very few people run into…

FYI, on Debian-12 with 6.1 I tried several versions of the driver and all of them crashed for me as well (not sure if it was the same crash as yours, as I didn’t bother to investigate and switched to Debian-13, where I already knew it worked fine).

Just came back from a reboot after a failed experiment on the 6.12.17-1 Debian stock kernel.
The issue persists with the combination of 6.12.17 and driver 570, as I expected.

There must be something special in my system that triggers such a corner-case bug; I just cannot figure out what it is. If user-land is to blame, it should be the system-level config and packages, as the same issue persists with a new user account. My system has gone through several rolling upgrades over 15 years of service, so there could be many customizations left over. Maybe I can set up a clean system as another controlled experiment.

Quoting the kernel hang log as usual:

[  148.148424] watchdog: BUG: soft lockup - CPU#0 stuck for 78s! [Xorg:4544]
[  148.148428] Modules linked in: nvidia_uvm(POE) rfcomm nfsv3 nf_conntrack_netlink rpcsec_gss_krb5 xfrm_user xfrm_algo xt_addrtype br_netfilter nfsv4 dns_resolver nfs netfs squashfs xt_conntrack vboxnetadp(OE) ipt_REJECT nf_reject_ipv4 vboxnetflt(OE) xt_CHECKSUM vboxdrv(OE) nft_chain_nat xt_comment xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_tcpudp nft_compat bridge stp llc scsi_transport_iscsi nf_tables libcrc32c overlay binder_linux qrtr cmac algif_hash algif_skcipher af_alg bnep mlx5_core binfmt_misc mlxfw pci_hyperv_intf mlx4_ib ib_uverbs ib_core mlx4_en nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) joydev snd_ctl_led intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp gspca_zc3xx gspca_main kvm_intel snd_usb_audio nls_ascii videobuf2_vmalloc nls_cp437 vfat videobuf2_memops fat videobuf2_v4l2 snd_usbmidi_lib videodev btusb hid_generic snd_rawmidi kvm btrtl snd_seq_device zfs(POE) btintel videobuf2_common btbcm crct10dif_pclmul btmtk usbhid mc crc32_pclmul
[  148.148461]  bluetooth ghash_clmulni_intel hid apple_mfi_fastcharge snd_hda_codec_realtek sha512_ssse3 snd_hda_codec_generic snd_hda_codec_hdmi sha256_ssse3 snd_hda_scodec_component sha1_ssse3 eeepc_wmi snd_hda_intel asus_wmi snd_intel_dspcfg snd_intel_sdw_acpi aesni_intel snd_hda_codec iTCO_wdt sparse_keymap intel_pmc_bxt uas gf128mul platform_profile drm_ttm_helper crypto_simd snd_hda_core usb_storage nvme ttm at24 mei_pxp battery cryptd iTCO_vendor_support snd_hwdep spl(OE) snd_pcm wmi_bmof mei_hdcp evdev mxm_wmi nvme_core watchdog drm_kms_helper i2c_i801 rfkill rapl mlx4_core intel_cstate snd_timer nvme_auth i2c_smbus video intel_uncore pcspkr snd mei_me ehci_pci wmi mei acpi_pad ehci_hcd button soundcore lpc_ich sg nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm nct6775 nct6775_core hwmon_vid uinput loop efi_pstore configfs nfnetlink efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod ahci libahci xhci_pci libata xhci_hcd usbcore scsi_mod crc32c_intel scsi_common usb_common fan
[  148.148498] CPU: 0 UID: 113 PID: 4544 Comm: Xorg Tainted: P           OEL     6.12.17-amd64 #1  Debian 6.12.17-1
[  148.148501] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
[  148.148501] Hardware name: ASUS All Series/H97M-PLUS, BIOS 3602 04/08/2018
[  148.148502] RIP: 0010:__x86_indirect_thunk_array+0x6/0x20
[  148.148507] Code: cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 31 ff e9 e0 e1 3c ff e8 01 00 00 00 cc <48> 89 04 24 c3 cc cc cc cc 90 66 66 2e 0f 1f 84 00 00 00 00 00 0f
[  148.148508] RSP: 0018:ffffa26b5676b938 EFLAGS: 00000292
[  148.148509] RAX: ffffffffc7b508f0 RBX: ffff91c15d53b208 RCX: ffff91c11ba19b48
[  148.148510] RDX: ffff91c16d41e008 RSI: ffff91c16d41e008 RDI: ffff91c15d53b208
[  148.148510] RBP: ffffa26b5676b988 R08: 0000000000000000 R09: 0000000000000001
[  148.148511] R10: 0000000000000001 R11: 0000000000000000 R12: ffff91c16d41e008
[  148.148512] R13: ffff91c13f33c008 R14: ffff91c13f33c180 R15: 0000000000000000
[  148.148513] FS:  00007fd6e6f99b00(0000) GS:ffff91c40ea00000(0000) knlGS:0000000000000000
[  148.148514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  148.148514] CR2: 000056362ace3000 CR3: 0000000170c64002 CR4: 00000000001726f0
[  148.148515] Call Trace:
[  148.148517]  <IRQ>
[  148.148519]  ? watchdog_timer_fn.cold+0x3d/0xa1
[  148.148522]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  148.148525]  ? __hrtimer_run_queues+0x132/0x2a0
[  148.148528]  ? hrtimer_interrupt+0xfa/0x210
[  148.148530]  ? __sysvec_apic_timer_interrupt+0x55/0x100
[  148.148532]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[  148.148533]  </IRQ>
[  148.148534]  <TASK>
[  148.148534]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  148.148537]  ? _nv001866kms+0x550/0x550 [nvidia_modeset]
[  148.148573]  ? __x86_indirect_thunk_array+0x6/0x20
[  148.148575]  ? __x86_indirect_thunk_array+0x5/0x20
[  148.148576]  ? _nv001291kms+0xdf/0x3c0 [nvidia_modeset]
[  148.148604]  ? _nv001273kms+0x81/0x90 [nvidia_modeset]
[  148.148631]  ? _nv002660kms+0x74/0x200 [nvidia_modeset]
[  148.148656]  ? _nv000572kms+0x1d4/0x2b8 [nvidia_modeset]
[  148.148679]  ? _nv003176kms+0x4115/0x4730 [nvidia_modeset]
[  148.148702]  ? _nv000390kms+0xe0/0xe0 [nvidia_modeset]
[  148.148720]  ? nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
[  148.148738]  ? nvkms_unlocked_ioctl+0x10c/0x180 [nvidia_modeset]
[  148.148755]  ? __x64_sys_ioctl+0x94/0xd0
[  148.148757]  ? do_syscall_64+0x82/0x190
[  148.148760]  ? nvidia_unlocked_ioctl+0x160/0x8c0 [nvidia]
[  148.149047]  ? syscall_exit_to_user_mode_prepare+0x149/0x170
[  148.149050]  ? syscall_exit_to_user_mode+0x4d/0x210
[  148.149052]  ? do_syscall_64+0x8e/0x190
[  148.149053]  ? __irq_exit_rcu+0x37/0xb0
[  148.149055]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  148.149057]  </TASK>