NVIDIA+GNOME+Wayland+Turning Screen Off = Random Chance of Hanging

Hi, I’m running GNOME on Wayland with the new GBM support, and it almost works except for one issue: whenever the displays go to sleep (most often because I lock the computer), there’s a random chance that the driver locks up and refuses to wake the monitors.

Unfortunately, I don’t have any other hardware with an NVIDIA GPU to test on, but on my system sleeping the screen reproduces it fairly reliably: after waiting about 30 minutes, there is a 40%-50% chance the driver will have locked up.

When the lockup occurs, I get this in the kernel log (it keeps repeating with a longer and longer blocked time until I restart the machine):

Mar 29 22:55:01 arch kernel: INFO: task nvidia-modeset/:409 blocked for more than 1228 seconds.
Mar 29 22:55:01 arch kernel:       Tainted: P           OE     5.16.16-zen1-1-zen #1
Mar 29 22:55:01 arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 29 22:55:01 arch kernel: task:nvidia-modeset/ state:D stack:    0 pid:  409 ppid:     2 flags:0x00004000
Mar 29 22:55:01 arch kernel: Call Trace:
Mar 29 22:55:01 arch kernel:  <TASK>
Mar 29 22:55:01 arch kernel:  __schedule+0x96f/0x1130
Mar 29 22:55:01 arch kernel:  ? __schedule+0x977/0x1130
Mar 29 22:55:01 arch kernel:  schedule+0x4b/0xc0
Mar 29 22:55:01 arch kernel:  schedule_timeout+0x119/0x150
Mar 29 22:55:01 arch kernel:  __down+0xac/0x100
Mar 29 22:55:01 arch kernel:  down+0x43/0x60
Mar 29 22:55:01 arch kernel:  nvkms_kthread_q_callback+0x7d/0x100 [nvidia_modeset 6c62ffb71642f967e9713a9ea3900a358e1c5665]
Mar 29 22:55:01 arch kernel:  _main_loop+0x9e/0x150 [nvidia_modeset 6c62ffb71642f967e9713a9ea3900a358e1c5665]
Mar 29 22:55:01 arch kernel:  ? nvkms_sema_up+0x10/0x10 [nvidia_modeset 6c62ffb71642f967e9713a9ea3900a358e1c5665]
Mar 29 22:55:01 arch kernel:  kthread+0x1e3/0x210
Mar 29 22:55:01 arch kernel:  ? kthread_unuse_mm+0xa0/0xa0
Mar 29 22:55:01 arch kernel:  ret_from_fork+0x22/0x30
Mar 29 22:55:01 arch kernel:  </TASK>
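
For anyone who wants to check over SSH whether they are hitting the same hang, here is a rough sketch that scans kernel log text for the hung-task warning shown above. It assumes the message format in my log; the name `find_hung_tasks` is made up for illustration, and task names containing spaces would not match.

```python
import re

# Matches the kernel hung-task warning, for example:
# "INFO: task nvidia-modeset/:409 blocked for more than 1228 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a (task, pid, seconds) tuple for each hung-task warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("task"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits
```

One way to use it would be to save the output of `journalctl -k` to a file and pass the open file to `find_hung_tasks`; a growing seconds value for the same task suggests the thread is still stuck.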

Note that by lockup I just mean the display driver gets stuck and won’t wake the monitors; the system still works and I can SSH in. Also, only the monitors are being put to sleep. I am not suspending or hibernating the system itself.

nvidia-bug-report.log.gz (445.8 KB)

I’m running the zen kernel, but this happens on the mainline kernel too. It also happens with the 49X.XX driver versions, so it is not a regression in the 5XX.XX drivers.

How many/what monitors are connected? Does this also happen with just one monitor attached?

I’m having the same issue on Fedora 36 with GNOME. I can reproduce it with either one or two monitors attached; both are Gigabyte M27Q monitors. I am unable to reproduce this bug under Xorg, but it happens regularly under Wayland.

Interestingly, if I have both monitors plugged in, I can turn one of them off and the other keeps working fine, but when I turn that monitor back on, there is a chance that both go black and the freeze occurs.

The following appears in the log when the issue occurs, and the messages repeat about every 30 seconds.

Jun 09 17:42:33 fedora kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 10196s! [gnome-shell:3649]
Jun 09 17:42:33 fedora kernel: Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM ipt_REJECT nft_compat bridge stp llc nf_nat_tftp nft_nat nft_masq nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf>
Jun 09 17:42:33 fedora kernel:  wmi_bmof gigabyte_wmi pcspkr k10temp mei i2c_piix4 rfkill soundcore acpi_cpufreq zram dm_integrity async_xor async_tx dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel igb ghash_clmulni_intel rndis_host nvme cdc_ether ccp sp5100_tco usbnet nvme_core mii dca wmi ip6_tables ip_tables >
Jun 09 17:42:33 fedora kernel: CPU: 5 PID: 3649 Comm: gnome-shell Tainted: P           OEL    5.17.12-300.fc36.x86_64 #1
Jun 09 17:42:33 fedora kernel: Hardware name: System76 Thelio/Thelio, BIOS F33g Z5 04/27/2021
Jun 09 17:42:33 fedora kernel: RIP: 0010:_nv001526kms+0x2/0x70 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel: Code: 48 8b 53 28 e9 f8 fd ff ff 49 c7 47 48 00 00 00 00 48 8b 53 28 e9 b6 fd ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 41 54 <55> 49 89 fc 53 48 8d 5f 38 48 89 f5 48 89 df e8 8a 47 00 00 84 c0
Jun 09 17:42:33 fedora kernel: RSP: 0018:ffffb52c120af758 EFLAGS: 00000293
Jun 09 17:42:33 fedora kernel: RAX: ffffffffc0ede940 RBX: ffff9bdebd327208 RCX: 0000000838f0c005
Jun 09 17:42:33 fedora kernel: RDX: ffff9be613bf8f48 RSI: ffff9bdfb2422c08 RDI: ffff9bdebd327208
Jun 09 17:42:33 fedora kernel: RBP: ffff9bde94d52808 R08: ffffb52c120af5c8 R09: 0000000000000001
Jun 09 17:42:33 fedora kernel: R10: ffff9be2488704c0 R11: 0000000000017ffe R12: 0000000000000000
Jun 09 17:42:33 fedora kernel: R13: ffff9bdfb2422c08 R14: 0000000000000000 R15: ffff9bde88d54888
Jun 09 17:42:33 fedora kernel: FS:  00007f7305a4b5c0(0000) GS:ffff9bed7eb40000(0000) knlGS:0000000000000000
Jun 09 17:42:33 fedora kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 09 17:42:33 fedora kernel: CR2: 00005654c2a42fb0 CR3: 00000001413a8000 CR4: 0000000000750ee0
Jun 09 17:42:33 fedora kernel: PKRU: 55555554
Jun 09 17:42:33 fedora kernel: Call Trace:
Jun 09 17:42:33 fedora kernel:  <TASK>
Jun 09 17:42:33 fedora kernel:  ? _nv001109kms+0xc7/0x390 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? _nv002194kms+0x134/0x1b0 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? _nv000521kms+0xd5/0x118 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? _nv002575kms+0x27f7/0x2e30 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? alloc_vmap_area+0x84/0x810
Jun 09 17:42:33 fedora kernel:  ? kmem_cache_alloc_node+0x150/0x2f0
Jun 09 17:42:33 fedora kernel:  ? prepare_alloc_pages.constprop.0+0x176/0x190
Jun 09 17:42:33 fedora kernel:  ? __alloc_pages_bulk+0x4ea/0x6d0
Jun 09 17:42:33 fedora kernel:  ? vmap_small_pages_range_noflush+0x2f5/0x4c0
Jun 09 17:42:33 fedora kernel:  ? _nv000539kms+0x50/0x50 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? _raw_spin_lock_irqsave+0x25/0x50
Jun 09 17:42:33 fedora kernel:  ? nvkms_ioctl_from_kapi+0x47/0x80 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? _nv000020kms+0x67f/0x840 [nvidia_modeset]
Jun 09 17:42:33 fedora kernel:  ? nv_drm_atomic_apply_modeset_config.isra.0+0x292/0x510 [nvidia_drm]
Jun 09 17:42:33 fedora kernel:  ? nv_drm_atomic_commit+0xaa/0x310 [nvidia_drm]
Jun 09 17:42:33 fedora kernel:  ? drm_atomic_check_only+0x5a7/0x9f0
Jun 09 17:42:33 fedora kernel:  ? drm_atomic_connector_commit_dpms+0xcb/0xf0
Jun 09 17:42:33 fedora kernel:  ? drm_mode_obj_set_property_ioctl+0x160/0x380
Jun 09 17:42:33 fedora kernel:  ? drm_mode_obj_find_prop_id+0x40/0x40
Jun 09 17:42:33 fedora kernel:  ? drm_ioctl_kernel+0x9e/0x140
Jun 09 17:42:33 fedora kernel:  ? drm_ioctl+0x21c/0x410
Jun 09 17:42:33 fedora kernel:  ? drm_mode_obj_find_prop_id+0x40/0x40
Jun 09 17:42:33 fedora kernel:  ? handle_mm_fault+0xae/0x280
Jun 09 17:42:33 fedora kernel:  ? __x64_sys_ioctl+0x8d/0xc0
Jun 09 17:42:33 fedora kernel:  ? do_syscall_64+0x3a/0x80
Jun 09 17:42:33 fedora kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
Jun 09 17:42:33 fedora kernel:  </TASK>
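
To make it easier to see which modules the stuck call path runs through, here is a small sketch that counts stack frames per module in a trace like the one above. It assumes frames are formatted as `symbol+0xoffset/0xsize [module]`, as in this log; `frames_by_module` is a hypothetical helper name, and frames without a bracketed module are counted as core kernel.

```python
import re
from collections import Counter

# A frame looks like "? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]".
# Core kernel frames have no trailing "[module]" part.
# The "RIP:" line has the same shape, so it is counted as a frame too.
FRAME_RE = re.compile(
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)[^\]]*\])?"
)


def frames_by_module(lines):
    """Count call-trace frames per kernel module ("kernel" if none is named)."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            counts[match.group("mod") or "kernel"] += 1
    return counts
```

Running this over the trace should show how many frames belong to `nvidia_modeset` and `nvidia_drm` versus the core kernel, which may help when comparing reports from different systems.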