NVIDIA+GNOME+Wayland+Turning Screen Off = Random Chance of Hanging

Hi, I’m running GNOME on Wayland with the new GBM support, and it almost works except for one issue: when I lock my computer (or sleep the displays for any other reason; locking is just the most common trigger), there’s a random chance that the driver locks up and refuses to wake the monitors.

Unfortunately, I don’t have any other hardware with an NVIDIA GPU to test on, but on my system sleeping the screen reproduces it fairly reliably: after waiting about 30 minutes, there is a 40-50% chance the driver will have locked up.

When the lockup occurs, I get this in the kernel log (the message keeps repeating with a longer and longer blocked time until I restart the machine):

Mar 29 22:55:01 arch kernel: INFO: task nvidia-modeset/:409 blocked for more than 1228 seconds.
Mar 29 22:55:01 arch kernel:       Tainted: P           OE     5.16.16-zen1-1-zen #1
Mar 29 22:55:01 arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 29 22:55:01 arch kernel: task:nvidia-modeset/ state:D stack:    0 pid:  409 ppid:     2 flags:0x00004000
Mar 29 22:55:01 arch kernel: Call Trace:
Mar 29 22:55:01 arch kernel:  <TASK>
Mar 29 22:55:01 arch kernel:  __schedule+0x96f/0x1130
Mar 29 22:55:01 arch kernel:  ? __schedule+0x977/0x1130
Mar 29 22:55:01 arch kernel:  schedule+0x4b/0xc0
Mar 29 22:55:01 arch kernel:  schedule_timeout+0x119/0x150
Mar 29 22:55:01 arch kernel:  __down+0xac/0x100
Mar 29 22:55:01 arch kernel:  down+0x43/0x60
Mar 29 22:55:01 arch kernel:  nvkms_kthread_q_callback+0x7d/0x100 [nvidia_modeset 6c62ffb71642f967e9713a9ea3900a358e1c5665]
Mar 29 22:55:01 arch kernel:  _main_loop+0x9e/0x150 [nvidia_modeset 6c62ffb71642f967e9713a9ea3900a358e1c5665]
Mar 29 22:55:01 arch kernel:  ? nvkms_sema_up+0x10/0x10 [nvidia_modeset 6c62ffb71642f967e9713a9ea3900a358e1c5665]
Mar 29 22:55:01 arch kernel:  kthread+0x1e3/0x210
Mar 29 22:55:01 arch kernel:  ? kthread_unuse_mm+0xa0/0xa0
Mar 29 22:55:01 arch kernel:  ret_from_fork+0x22/0x30
Mar 29 22:55:01 arch kernel:  </TASK>
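For anyone who wants to check their own logs for the same symptom, here is a minimal sketch that scans kernel log lines for hung-task reports like the one above and pulls out the task name, PID, and blocked time. The function name `find_hung_tasks` and the regular expression are my own and only assume the message format shown in this log; other kernel versions may word it differently.

```python
import re

# Matches the hung-task report line as it appears in the log above, e.g.
# "INFO: task nvidia-modeset/:409 blocked for more than 1228 seconds."
# (The pattern is an assumption based on that one sample.)
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (task_name, pid, seconds) for each hung-task report in lines."""
    reports = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            reports.append((match["name"], int(match["pid"]), int(match["secs"])))
    return reports


sample = [
    "Mar 29 22:55:01 arch kernel: INFO: task nvidia-modeset/:409 blocked for more than 1228 seconds.",
    "Mar 29 22:55:01 arch kernel:  schedule+0x4b/0xc0",
]
print(find_hung_tasks(sample))
```

The same function can be fed any iterable of lines, for example the saved output of `journalctl -k`, to see whether the blocked time keeps growing between reports.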

Note that when I say lockup, I just mean the display driver gets stuck and won’t wake the monitors; the system itself still works and I can SSH in. Also, only the monitors are being put to sleep. I am not suspending or hibernating the system itself.

nvidia-bug-report.log.gz (445.8 KB)

I’m running the zen kernel, but this happens on the mainline kernel too. It also happens with the 49X.XX driver versions, so it is not a regression in the 5XX.XX drivers.

How many/what monitors are connected? Does this also happen with just one monitor attached?