X hangs, blocked on nvidia_modeset in Ubuntu 22.04/nvidia drivers 535

System info:

Ubuntu: 22.04
kernel: 6.2.0-26
video driver: nvidia-driver-535
Window system: X11 - no wayland for any of the experiments.
Alienware 15 R3 laptop,
Nvidia Geforce GTX 1060 + Intel HD Graphics 630

Full specs @ https://dl.dell.com/content/manual38481186-alienware-15-r3-setup-and-specifications.pdf?language=en-us

Current Behavior:

I dual boot to Win10, which has no graphics issues. So I don’t think some video hardware has failed/is in the process of failing.

The behavior is different depending on whether nvidia-drm.modeset=0 or nvidia-drm.modeset=1 is set on boot.

  • In all cases, the splash kernel argument is supplied but the plymouth splash screen does not display.
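
For anyone trying to reproduce this, it may help to confirm which arguments actually reached the kernel on a given boot. Below is a minimal sketch in Python, assuming the options are passed on the kernel command line (readable from /proc/cmdline) rather than through a modprobe options file. The helper name `parse_cmdline` and the sample line are made up for illustration.

```python
import shlex


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Hypothetical example line; on a live system, read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz-6.2.0-26-generic ro quiet splash nvidia-drm.modeset=1"
params = parse_cmdline(sample)
print("nvidia-drm.modeset:", params.get("nvidia-drm.modeset"))
print("splash requested:", "splash" in params)
```

The lookup is by the exact spelling used on the command line, so an entry written as nvidia_drm.modeset would need its own check.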

The behavior with nvidia-drm.modeset=1 is quite erratic, and I’ve identified 3 different behaviors so far - just by rebooting a bunch without changing anything between boots.

Booting with nvidia-drm.modeset=1 Behavior #0:

(I’ll attach a nvidia-bug-report.sh output for this scenario as drm-full-login.nvidia-bug-report.log.gz)

  • The login screen does not display, and the backlight is off.
  • After a few moments the fans go crazy. Something seems to be working hard, but I can’t see it :(.
  • I can ssh to the laptop at this point though,
    • top shows that plymouthd is taking up 100% cpu.
    • dmesg shows that plymouthd is blocked waiting on nvidia_modeset. This message repeats forever:
[   60.327875] watchdog: BUG: soft lockup - CPU#7 stuck for 53s! [plymouthd:329]
[   60.327880] Modules linked in: rfcomm ccm snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic nvidia_uvm(POE) cmac algif_hash algif_skcipher af_alg bnep intel_tcc_cooling nvidia_drm(POE) x86_pkg_temp_thermal intel_powerclamp nvidia_modeset(POE) coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd snd_soc_avs nvidia(POE) snd_soc_hda_codec snd_hda_ext_core snd_soc_core snd_hda_codec_hdmi binfmt_misc snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec rapl hid_generic mei_hdcp mei_pxp intel_rapl_msr nls_iso8859_1 snd_hda_core snd_hwdep ath10k_pci i915 ath10k_core snd_pcm ath snd_seq_midi snd_seq_midi_event drm_buddy mac80211 ttm uvcvideo snd_rawmidi dell_wmi videobuf2_vmalloc drm_display_helper videobuf2_memops snd_seq btusb processor_thermal_device_pci_legacy cec btrtl intel_cstate videobuf2_v4l2 joydev input_leds btbcm snd_seq_device
[   60.327967]  processor_thermal_device rc_core processor_thermal_rfim cfg80211 snd_timer btintel videodev dell_smbios dcdbas btmtk drm_kms_helper usbhid videobuf2_common processor_thermal_mbox i2c_algo_bit dell_wmi_descriptor ledtrig_audio intel_wmi_thunderbolt wmi_bmof hid serio_raw mc bluetooth mxm_wmi snd processor_thermal_rapl mei_me intel_rapl_common syscopyarea sysfillrect libarc4 soundcore ecdh_generic ee1004 mei sysimgblt intel_soc_dts_iosf ecc intel_pch_thermal int3403_thermal int340x_thermal_zone mac_hid intel_hid int3400_thermal acpi_pad acpi_thermal_rel sparse_keymap sch_fq_codel msr parport_pc ppdev lp ramoops parport reed_solomon pstore_blk pstore_zone drm efi_pstore ip_tables x_tables autofs4 nvme ahci nvme_core xhci_pci i2c_i801 alx crc32_pclmul psmouse i2c_smbus nvme_common libahci mdio xhci_pci_renesas video wmi
[   60.328027] CPU: 7 PID: 329 Comm: plymouthd Tainted: P           OEL     6.2.0-26-generic #26~22.04.1-Ubuntu
[   60.328029] Hardware name: Alienware Alienware 15 R3/Alienware 15 R3, BIOS 1.10.0 07/21/2020
[   60.328030] RIP: 0010:_nv001596kms+0x0/0x80 [nvidia_modeset]
[   60.328065] Code: 48 48 8b 53 28 e9 e5 fd ff ff 45 31 c0 e9 e6 fc ff ff 49 c7 44 24 48 00 00 00 00 48 8b 53 28 e9 96 fd ff ff 66 0f 1f 44 00 00 <f3> 0f 1e fa 55 48 89 e5 41 55 49 89 fd 41 54 49 89 f4 53 48 8d 5f
[   60.328067] RSP: 0018:ffffa7b040433558 EFLAGS: 00000282
[   60.328069] RAX: ffffffffc56efce0 RBX: ffff92cccf488208 RCX: ffff92cccdcd68c8
[   60.328070] RDX: ffff92ccc394cc08 RSI: ffff92ccc394cc08 RDI: ffff92cccf488208
[   60.328071] RBP: ffffa7b0404335a0 R08: 0000000000000000 R09: 0000000000000000
[   60.328072] R10: 0000000000000000 R11: 0000000000000000 R12: ffff92ccc394cc08
[   60.328073] R13: ffff92ccce142008 R14: ffff92ccce142168 R15: 0000000000000000
[   60.328075] FS:  00007fbe874c7440(0000) GS:ffff92d42edc0000(0000) knlGS:0000000000000000
[   60.328076] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   60.328077] CR2: 00007f47e02f3000 CR3: 0000000117ad8005 CR4: 00000000003706e0
[   60.328079] Call Trace:
[   60.328080]  <TASK>
[   60.328081]  ? _nv001165kms+0x82/0x3a0 [nvidia_modeset]
[   60.328112]  ? nvkms_call_rm+0x5d/0x90 [nvidia_modeset]
[   60.328128]  _nv002331kms+0x145/0x210 [nvidia_modeset]
[   60.328152]  _nv000529kms+0x160/0x1b0 [nvidia_modeset]
[   60.328173]  _nv002766kms+0x4bf6/0x4cd0 [nvidia_modeset]
[   60.328198]  ? _nv000355kms+0x100/0x100 [nvidia_modeset]
[   60.328213]  nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
[   60.328228]  ? _raw_spin_lock_irqsave+0xe/0x20
[   60.328232]  nvkms_ioctl_from_kapi+0x6e/0xd0 [nvidia_modeset]
[   60.328247]  _nv000019kms+0x368/0x890 [nvidia_modeset]
[   60.328272]  ? nvkms_free+0x26/0x30 [nvidia_modeset]
[   60.328287]  ? _nv000019kms+0x388/0x890 [nvidia_modeset]
[   60.328313]  nv_drm_atomic_apply_modeset_config.isra.0+0x2f1/0x520 [nvidia_drm]
[   60.328320]  ? nv_drm_atomic_apply_modeset_config.isra.0+0x401/0x520 [nvidia_drm]
[   60.328327]  nv_drm_atomic_commit+0xba/0x350 [nvidia_drm]
[   60.328333]  ? drm_atomic_check_only+0x1ad/0x400 [drm]
[   60.328360]  drm_atomic_commit+0x96/0xd0 [drm]
[   60.328379]  ? __pfx___drm_printfn_info+0x10/0x10 [drm]
[   60.328408]  nv_drm_atomic_helper_disable_all+0x23d/0x310 [nvidia_drm]
[   60.328414]  nv_drm_master_drop+0x28/0x70 [nvidia_drm]
[   60.328419]  drm_dropmaster_ioctl+0xe4/0x160 [drm]
[   60.328439]  ? __pfx_drm_dropmaster_ioctl+0x10/0x10 [drm]
[   60.328459]  drm_ioctl_kernel+0xc0/0x160 [drm]
[   60.328488]  ? raw_spin_rq_unlock+0x10/0x40
[   60.328492]  drm_ioctl+0x27b/0x4c0 [drm]
[   60.328521]  ? __pfx_drm_dropmaster_ioctl+0x10/0x10 [drm]
[   60.328541]  ? schedule+0x68/0x110
[   60.328545]  nv_drm_ioctl+0x48/0x3a0 [nvidia_drm]
[   60.328552]  __x64_sys_ioctl+0x9a/0xe0
[   60.328555]  do_syscall_64+0x59/0x90
[   60.328558]  ? syscall_exit_to_user_mode+0x2a/0x50
[   60.328560]  ? do_syscall_64+0x69/0x90
[   60.328561]  ? exit_to_user_mode_prepare+0x3b/0xd0
[   60.328564]  ? syscall_exit_to_user_mode+0x2a/0x50
[   60.328566]  ? do_syscall_64+0x69/0x90
[   60.328568]  ? exit_to_user_mode_prepare+0x3b/0xd0
[   60.328570]  ? syscall_exit_to_user_mode+0x2a/0x50
[   60.328572]  ? do_syscall_64+0x69/0x90
[   60.328574]  ? do_syscall_64+0x69/0x90
[   60.328575]  ? syscall_exit_to_user_mode+0x2a/0x50
[   60.328577]  ? do_syscall_64+0x69/0x90
[   60.328579]  ? do_syscall_64+0x69/0x90
[   60.328581]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   60.328584] RIP: 0033:0x7fbe8731aaff
[   60.328586] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[   60.328587] RSP: 002b:00007ffe0f929010 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   60.328589] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbe8731aaff
[   60.328590] RDX: 0000000000000000 RSI: 000000000000641f RDI: 000000000000000b
[   60.328591] RBP: 000000000000641f R08: 000055e418227180 R09: 0000000000000000
[   60.328593] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
[   60.328594] R13: 000000000000000b R14: 000055e41820a3e0 R15: 000055e418227180
[   60.328596]  </TASK>
  • If I try to go to another virtual terminal with ctrl+alt+Fx I see nothing. The backlight remains off, and I cannot see any dim image by pointing a flashlight at the screen. So it’s not just the backlight being off.
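
Since the soft lockup message repeats indefinitely, it can be easier to summarize it than to read the raw log. The following is a rough sketch in Python that pulls the CPU, stuck time, process name, and PID out of dmesg-style text. The regular expression is written against the message format shown above and is an assumption about how other kernels word it; `find_soft_lockups` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# [   60.327875] watchdog: BUG: soft lockup - CPU#7 stuck for 53s! [plymouthd:329]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text: str) -> list:
    """Return one dict per soft-lockup line found in dmesg-style text."""
    hits = []
    for m in LOCKUP_RE.finditer(log_text):
        hits.append(
            {
                "timestamp": float(m["ts"]),
                "cpu": int(m["cpu"]),
                "stuck_seconds": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
            }
        )
    return hits


sample = "[   60.327875] watchdog: BUG: soft lockup - CPU#7 stuck for 53s! [plymouthd:329]"
for hit in find_soft_lockups(sample):
    print(hit)
```

Feeding it the saved output of dmesg over ssh would show whether the same process and PID stay stuck across repeats.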

Booting with nvidia-drm.modeset=1 Behavior #1:

  • The login screen appears, ui elements work as expected.

  • Virtual Terminals at this stage are different than in the nvidia-drm.modeset=0 case:

    • Hitting ctrl+alt+Fx (where x > 1) goes to a black screen with backlight off
    • I cannot get back to VT1 this time (or if I can, I can’t see that I have, because the backlight remains off).
  • When I attempt a login and enter credentials immediately, the screen goes black with no backlight.

    • After a few moments the fans go crazy. Something seems to be working hard, again.
    • I can ssh to the laptop at this point, and top shows that Xorg is taking up 100% cpu.
    • dmesg shows that Xorg is blocked waiting on nvidia_modeset. This error is repeated forever:
[  140.254729] watchdog: BUG: soft lockup - CPU#2 stuck for 86s! [Xorg:1059]
[  140.254735] Modules linked in: rfcomm ccm snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic nvidia_uvm(POE) cmac algif_hash algif_skcipher af_alg bnep intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic nvidia_drm(POE) ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd nvidia_modeset(POE) cryptd binfmt_misc nls_iso8859_1 rapl snd_soc_avs snd_soc_hda_codec snd_hda_ext_core nvidia(POE) hid_generic mei_hdcp mei_pxp intel_rapl_msr snd_soc_core snd_compress ac97_bus snd_hda_codec_hdmi snd_pcm_dmaengine i915 ath10k_pci snd_hda_intel ath10k_core snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec ath snd_hda_core uvcvideo snd_hwdep drm_buddy videobuf2_vmalloc ttm videobuf2_memops snd_pcm videobuf2_v4l2 snd_seq_midi joydev intel_cstate mac80211 snd_seq_midi_event drm_display_helper input_leds snd_rawmidi videodev usbhid btusb dell_wmi btrtl snd_seq cec dell_smbios btbcm dcdbas btintel btmtk
[  140.254841]  videobuf2_common snd_seq_device ledtrig_audio intel_wmi_thunderbolt mxm_wmi wmi_bmof ee1004 dell_wmi_descriptor serio_raw hid mc bluetooth snd_timer processor_thermal_device_pci_legacy cfg80211 rc_core processor_thermal_device snd ecdh_generic processor_thermal_rfim ecc drm_kms_helper libarc4 soundcore processor_thermal_mbox i2c_algo_bit syscopyarea processor_thermal_rapl mei_me intel_rapl_common sysfillrect intel_pch_thermal sysimgblt mei intel_soc_dts_iosf int3403_thermal int340x_thermal_zone intel_hid int3400_thermal mac_hid sparse_keymap acpi_thermal_rel acpi_pad sch_fq_codel msr parport_pc ppdev lp ramoops parport reed_solomon pstore_blk pstore_zone drm efi_pstore ip_tables x_tables autofs4 nvme ahci nvme_core i2c_i801 alx xhci_pci crc32_pclmul psmouse i2c_smbus nvme_common mdio libahci xhci_pci_renesas video wmi
[  140.254885] CPU: 2 PID: 1059 Comm: Xorg Tainted: P           OEL     6.2.0-26-generic #26~22.04.1-Ubuntu
[  140.254887] Hardware name: Alienware Alienware 15 R3/Alienware 15 R3, BIOS 1.10.0 07/21/2020
[  140.254889] RIP: 0010:_nv001596kms+0x0/0x80 [nvidia_modeset]
[  140.254924] Code: 48 48 8b 53 28 e9 e5 fd ff ff 45 31 c0 e9 e6 fc ff ff 49 c7 44 24 48 00 00 00 00 48 8b 53 28 e9 96 fd ff ff 66 0f 1f 44 00 00 <f3> 0f 1e fa 55 48 89 e5 41 55 49 89 fd 41 54 49 89 f4 53 48 8d 5f
[  140.254925] RSP: 0018:ffffab6d43513a00 EFLAGS: 00000282
[  140.254927] RAX: ffffffffc55edce0 RBX: ffff8aeb044a0e08 RCX: ffff8aeb14df7608
[  140.254928] RDX: ffff8aeb0434c808 RSI: ffff8aeb0434c808 RDI: ffff8aeb044a0e08
[  140.254929] RBP: ffffab6d43513a48 R08: 0000000000000000 R09: 0000000000000000
[  140.254930] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8aeb0434c808
[  140.254931] R13: ffff8aeb049c2808 R14: ffff8aeb049c2968 R15: 0000000000000000
[  140.254933] FS:  00007f528f02ba80(0000) GS:ffff8af26ec80000(0000) knlGS:0000000000000000
[  140.254934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  140.254935] CR2: 00007ffd06139000 CR3: 0000000118a98004 CR4: 00000000003706e0
[  140.254937] Call Trace:
[  140.254938]  <TASK>
[  140.254939]  ? _nv001165kms+0x82/0x3a0 [nvidia_modeset]
[  140.254970]  ? nvkms_call_rm+0x5d/0x90 [nvidia_modeset]
[  140.254985]  _nv002331kms+0x145/0x210 [nvidia_modeset]
[  140.255010]  _nv000529kms+0x160/0x1b0 [nvidia_modeset]
[  140.255030]  _nv002766kms+0x4bf6/0x4cd0 [nvidia_modeset]
[  140.255055]  ? _nv000355kms+0x100/0x100 [nvidia_modeset]
[  140.255070]  nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
[  140.255084]  ? _raw_spin_lock_irqsave+0xe/0x20
[  140.255088]  nvkms_ioctl+0x121/0x190 [nvidia_modeset]
[  140.255103]  nvidia_frontend_unlocked_ioctl+0x55/0xa0 [nvidia]
[  140.255352]  __x64_sys_ioctl+0x9a/0xe0
[  140.255356]  do_syscall_64+0x59/0x90
[  140.255359]  ? handle_mm_fault+0x119/0x330
[  140.255362]  ? lock_mm_and_find_vma+0x44/0x250
[  140.255364]  ? do_user_addr_fault+0x1d0/0x640
[  140.255367]  ? exit_to_user_mode_prepare+0x3b/0xd0
[  140.255370]  ? irqentry_exit_to_user_mode+0x9/0x20
[  140.255372]  ? irqentry_exit+0x43/0x50
[  140.255374]  ? exc_page_fault+0x92/0x1b0
[  140.255376]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  140.255379] RIP: 0033:0x7f528f31aaff
[  140.255381] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
[  140.255383] RSP: 002b:00007ffd061336e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  140.255385] RAX: ffffffffffffffda RBX: 00000000c0106d00 RCX: 00007f528f31aaff
[  140.255386] RDX: 00007ffd06133740 RSI: 00000000c0106d00 RDI: 0000000000000013
[  140.255387] RBP: 00007ffd06133740 R08: 0000000000000000 R09: 00005654791362c0
[  140.255388] R10: 00007ffd0614aab0 R11: 0000000000000246 R12: 0000000000000013
[  140.255389] R13: 00007f528ea1cbc8 R14: 00007ffd06136048 R15: 0000000000000003
[  140.255392]  </TASK>
  • If I don’t log in immediately, but instead wait long enough past the screen dim event on the login screen (screen sleep timeout?), the login screen disappears, the backlight turns off, and the fans go nuts.
    • top shows that Xorg is taking up 100% cpu.
    • dmesg shows that Xorg is blocked waiting on nvidia_modeset, with the same repeated soft lockup call trace as above.
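
The two traces above enter the driver by different routes (plymouthd through drm_dropmaster_ioctl and nv_drm_master_drop, Xorg through nvidia_frontend_unlocked_ioctl and nvkms_ioctl) but both reach nvKmsIoctl and stop at the same RIP, _nv001596kms. One way to make that comparison less error-prone is to strip the frames marked with "?", which the kernel prints for addresses it found on the stack but could not verify as part of the call chain. This is a sketch under that assumption; `reliable_frames` is a hypothetical helper and the regular expression is written against the trace format shown in this post.

```python
import re

# Matches "symbol+0xOFF/0xSIZE [module]" after the timestamp, with an optional "? " marker.
FRAME_RE = re.compile(
    r"\]\s+(?P<unreliable>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)


def reliable_frames(trace_text: str) -> list:
    """Return (symbol, module) for frames between <TASK> and </TASK> not marked '?'."""
    frames = []
    in_trace = False
    for line in trace_text.splitlines():
        if "</TASK>" in line:
            in_trace = False
        elif "<TASK>" in line:
            in_trace = True
        elif in_trace:
            m = FRAME_RE.search(line)
            if m and not m["unreliable"]:
                frames.append((m["sym"], m["mod"]))
    return frames


sample = """\
[  140.254938]  <TASK>
[  140.254970]  ? nvkms_call_rm+0x5d/0x90 [nvidia_modeset]
[  140.255070]  nvKmsIoctl+0xf9/0x270 [nvidia_modeset]
[  140.255088]  nvkms_ioctl+0x121/0x190 [nvidia_modeset]
[  140.255352]  __x64_sys_ioctl+0x9a/0xe0
[  140.255392]  </TASK>
"""
for sym, mod in reliable_frames(sample):
    print(sym, mod or "(core kernel)")
```

Running it on both full traces and comparing the resulting lists should show where the plymouthd and Xorg paths converge.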

Booting with nvidia-drm.modeset=1 Behavior #2: Super rare; this has only happened once, so far.

  • The login screen appears, ui elements work as expected.
  • Upon logging in, Xorg/gnome starts up and I can use the desktop GUI as if nothing was wrong!
  • From gnome, I can swap back to the login screen with ctrl+alt+F1, and from there can swap back to my logged in gnome session with ctrl+alt+F2.
  • If I try to go to another virtual terminal with ctrl+alt+Fx (where x > 2), I can see the tail end of the kernel log, but it is not a usable terminal. There’s no login prompt and nothing shows up on typing.

Booting with nvidia-drm.modeset=0:

(I’ll attach a nvidia-bug-report.sh output for this scenario as nodrm-full-login.nvidia-bug-report.log.gz)

  • The login screen is a sea of blackness, but the backlight is on.
  • At this point, if I try to go to another virtual terminal with ctrl+alt+Fx (where x > 1), I can see the tail end of the kernel log, but it is not a usable terminal. There’s no login prompt and nothing shows up on typing.
    • I can get back to VT1 where the login screen should be, and it’s still a black screen with backlight on.
  • Back on VT1, even though there’s no login screen displayed, I can pretend the login screen is there (e.g.: hit enter to select default user + type pw + enter):
    • Xorg/gnome starts up and I can use the desktop GUI as if nothing was wrong!
  • From gnome, I can swap back to the login screen with ctrl+alt+F1 and this time it appears just fine! And I can go back to the VT with X11 which continues displaying a gnome environment as one would expect. However, trying to use other VTs gives the same result as above.
  • If I don’t log in immediately at the black login screen, but instead wait long enough (screen sleep timeout?):
    • the backlight eventually turns off
    • the “just pretend there’s a login screen there” workaround outlined above does not work
    • Foreshadowing:
      • the fans remain in a steady state
      • top doesn’t show anything taking up a suspicious amount of cpu
      • nothing is logged to dmesg

The promised logs…

drm-full-login.nvidia-bug-report.log.gz (538.7 KB)
nodrm-full-login.nvidia-bug-report.log.gz (599.5 KB)

first time, eh?

your kernel is EOL. update to LTS kernel and reinstall nvidia nonfree

first time, eh?

Yup.

update to LTS kernel and reinstall nvidia nonfree

Done! I updated to kernel 6.4.12 and reinstalled nvidia drivers v535.

UPDATE

I’ve rebooted again and now the behavior is the same as in the initial report, except that I can switch to other VTs. I’m getting hangs on nvidia_modeset again with drm enabled on 6.4.12/535. nvidia-bug-report.sh capture:
nvidia-bug-report.log.gz (443.6 KB)

With drm disabled I can log in with the "pretend you can see the login screen that the machine seems to believe it is displaying" workaround described above.

Original Post/Deprecated

~~It’s not blocking on nvidia_modeset any more, which is nice. And I can now switch to different VTs, which is even better.

Booting with nvidia-drm.modeset=1:

This is still erratic, but it’s better than before. Again there are only two different behaviors seen when rebooting a few times without changing anything between boots:

Booting with nvidia-drm.modeset=1 Behavior #0:

Same as Behavior #0 in the initial report: no login screen, fans go full-on almost immediately. I have only reproduced this once so far and it was so frozen I couldn’t ssh in. So I can’t say for sure if it was plymouthd blocked on something again, or what might have been blocking it.

Booting with nvidia-drm.modeset=1 Behavior #1:

Login screen shows up and I can log in, after which X starts and I can use the computer as usual. Nice. But now it’s logging this multiple times when X starts, so I’m probably not really getting drm?

[ 1498.569891] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to grab modeset ownership

Log captured after a successful login with that error.
drm-full-login.nvidia-bug-report.log.gz (536.1 KB)

Booting with nvidia-drm.modeset=0:

Worse than before. The login screen is still black with the backlight on, but now after the “pretend the login screen is there and login with the keyboard as usual” workaround, loginctl confirms that it did in fact log me in, but the logged-in X session is now also a black screen with the backlight on.

Logs captured after one of these logins nodrm-full-login.nvidia-bug-report.log.gz (280 KB)~~

The reason for your suffering is

Abysmal thermals and poor performance aside, any dual-graphics laptop also has issues with actually using the dGPU, even in Windows, since the discrete GPU is connected to the display through the onboard graphics. Alienware and a very few others have a different layout, a so-called MUX switch, which physically connects the dGPU to the display.
If your laptop is a traditional nvidia-through-intel/ryzen config, study the Ubuntu or, better yet, the Arch wiki on dual graphics.
If it does have a mux (not all Alienware do), research how to actually use that.

You should also be able to see the desktop on an external monitor, as the DP/HDMI ports are usually routed directly to the dGPU.
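
One way to check how the outputs are wired is to look at which DRM card each connector hangs off. This is a small sketch in Python that lists connector directories under /sys/class/drm along with their status file; it assumes the usual cardN-CONNECTOR naming (for example card0-eDP-1), and `list_connectors` is a made-up helper name. With nvidia-drm.modeset=0 the NVIDIA card may not expose any connectors here, so an empty result for that card is not conclusive.

```python
from pathlib import Path


def list_connectors(drm_root="/sys/class/drm") -> list:
    """Return (card, connector, status) for each connector directory under drm_root."""
    results = []
    root = Path(drm_root)
    if not root.is_dir():
        return results
    for entry in sorted(root.glob("card*-*")):
        status_file = entry / "status"
        if not status_file.is_file():
            continue
        # "card0-eDP-1" -> card "card0", connector "eDP-1"
        card, _, connector = entry.name.partition("-")
        results.append((card, connector, status_file.read_text().strip()))
    return results


for card, connector, status in list_connectors():
    print(f"{card}: {connector} is {status}")
```

If the internal panel (typically an eDP or LVDS connector) shows up under the Intel card and the HDMI/DP ports under the NVIDIA card, that would point to the traditional non-mux layout described above.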

@generix, any thoughts on this software issue? I’ve seen lots of useful posts from you on the forum! :)