Graphical corruption and Xorg crash after resume in both Linux 4.19 and 4.20

I am running Linux 4.20-rc6 patched with the PCI_PREF_BASE_UPPER32 quirk from https://devtalk.nvidia.com/default/topic/1017185/linux/problem-with-resume-from-suspend-ubuntu-16-04-gt-940mx-/post/5288884/#5288884 and the Nvidia driver 415.23 on my ASUS GL504GM laptop with a GTX1060. I use PRIME display sink and synchronization to let the nvidia driver access the internal LVDS screen, too. Upon resuming from suspend, the KDE Plasma lockscreen takes a very long time to appear. Upon logging in, KDE complains that the compositor has crashed, and there is graphical corruption such as https://imgur.com/a/X7YFtGi

When an external display is connected to the HDMI output of the GTX1060, graphical corruption also appears. If I suspend and then resume again, Xorg flat-out crashes (all the opened applications are gone). The attached Xorg.0.log.old is the log from the crashed X server. It ends with an (EE) Backtrace: line, as the X server was apparently unable to even output a stacktrace.

The attached nvidia-bug-report-4.20.log.gz is the output of nvidia-bug-report.sh on my patched 4.20-rc6.

I can also reproduce the issue under the stock Arch Linux 4.19 kernel (setting acpi_osi=! acpi_osi="Windows 2009" to work around PCIe issues makes no difference). The report right after the Xorg crash is attached as nvidia-bug-report-4.19.log.gz.

Upon switching back to the X session from the virtual terminal that I used to create the bug reports, I noticed the Xorg screen was all black save for the mouse pointer. I switched back to the VT and immediately created another bug report, attached as nvidia-bug-report-4.19-later. As I noticed lots of nvidia-related messages in the dmesg output, I also attach it as Xorg-crash-dmesg.

I am also experiencing similar crashes when I try to boot with an external display attached. In this case, the display manager gets into a “restart loop”, which makes creating a bug report fairly difficult.

As boot and sleep with an external display work in Windows, I’m quite sure that this is not caused by a hardware issue.

Is this an nvidia driver bug? How should I set up my machine so that it works with an external display and resuming?

nvidia-bug-report-4.19.log.gz (1 MB)
nvidia-bug-report-4.20.log.gz (1010 KB)
nvidia-bug-report-4.19-later.log.gz (1010 KB)
Xorg.0.log.old.txt (58.9 KB)
Xorg-crash-dmesg.txt (117 KB)

Now I found out that my X server also crashes without an external monitor connected: upon the first resume after boot, there is graphical corruption, while after the second resume, X crashes and gets into a “restart loop” because the display manager keeps trying, and failing, to start it again. I could not gather an nvidia-bug-report this time because I was unable to switch to a VT. After rebooting the machine, Xorg.0.log.old did not contain any errors. As a temporary workaround, I masked sleep and hibernation in systemd. However, the situation is extremely annoying, and I would much prefer it if the driver could be fixed.
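A minimal sketch of such a masking workaround, assuming the standard systemd sleep target names (the unit list and the helper name sleep_mask_cmd are illustrative, not taken from the attached logs). The helper only prints the command, so nothing changes until the printed line is run as root:

```shell
# Assumption: these are the standard systemd sleep-related targets.
SLEEP_UNITS="sleep.target suspend.target hibernate.target hybrid-sleep.target"

# Hypothetical helper: prints the systemctl command instead of running it.
# $1 is "mask" (disable sleep) or "unmask" (restore it).
sleep_mask_cmd() {
    echo "systemctl $1 $SLEEP_UNITS"
}

# Show the command that would mask all sleep targets.
sleep_mask_cmd mask
```

Running the printed command with "unmask" instead of "mask" should restore normal suspend behaviour once the underlying problem is fixed.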

Something is still fishy after the workaround: when I rebooted and tried to connect my external monitor, X crashed. After the display manager restarted X, it only displayed a black screen and the cursor (which I could move between displays just fine). There were no errors in Xorg.0.log or dmesg. I attached an nvidia-bug-report from this incident.
nvidia-bug-report.log.gz (496 KB)

Upgraded to 4.20-rc7 today. Suspend is still broken (I did not try HDMI, as Xorg gets into a “restart loop” even without HDMI upon the second suspend-resume cycle). While I could not run nvidia-bug-report.sh on the “restart looping” system, the dmesg from the failed boot (extracted with journalctl -b -1) is very interesting: https://gist.github.com/kris7t/7b6ed8cce67fb85d0e19fc80cd682115
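For anyone trying to pull the same information out of their own journal, here is a rough sketch of narrowing the previous boot's kernel log down to the relevant lines. The function name filter_nvidia and the pattern list are assumptions, chosen to match the kind of messages quoted further down in this post:

```shell
# Hypothetical helper: keep only kernel-log lines that look related to the crash.
# Reads from stdin, writes matching lines to stdout.
filter_nvidia() {
    grep -E 'nvidia|NVRM|scheduling while atomic'
}

# Usage (kernel messages of the previous boot only):
#   journalctl -k -b -1 | filter_nvidia
```

The -k flag restricts journalctl to kernel messages, which keeps the output close to what dmesg would have shown on the failed boot.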

Please take a look at the Xorg coredumps:

Dec 17 17:42:23 KRiS-Siri systemd-coredump[3849]: Process 3822 (Xorg) of user 0 dumped core.
                                                  
                                                  Stack trace of thread 3822:
                                                  #0  0x00007f3b461c630d _Unwind_IteratePhdrCallback (libgcc_s.so.1)
                                                  #1  0x00007f3b48a36fdf dl_iterate_phdr (libc.so.6)
                                                  #2  0x00007f3b461c7456 _Unwind_Find_FDE (libgcc_s.so.1)
                                                  #3  0x00007f3b461c39d4 uw_frame_state_for (libgcc_s.so.1)
                                                  #4  0x00007f3b461c4bd0 uw_init_context_1 (libgcc_s.so.1)
                                                  #5  0x00007f3b461c5a0c _Unwind_Backtrace (libgcc_s.so.1)
                                                  #6  0x00007f3b48a09cc6 __backtrace (libc.so.6)
                                                  #7  0x000055b10ce491fd xorg_backtrace (Xorg)
                                                  #8  0x000055b10ce49339 n/a (Xorg)
                                                  #9  0x00007f3b48937e00 __restore_rt (libc.so.6)
                                                  #10 0x00007f3b48b17d56 _dl_fixup (ld-linux-x86-64.so.2)
                                                  #11 0x00007f3b48b1e7ae _dl_runtime_resolve_xsavec (ld-linux-x86-64.so.2)
                                                  #12 0x00007f3b442eb24c n/a (libglamoregl.so)
                                                  #13 0x000055b10cec544e n/a (Xorg)
                                                  #14 0x000055b10ce95aa5 n/a (Xorg)
                                                  #15 0x000055b10ceda2df RRCrtcSet (Xorg)
                                                  #16 0x000055b10cedab21 ProcRRSetCrtcConfig (Xorg)
                                                  #17 0x00007f3b4580067d n/a (nvidia_drv.so)

as well as the “scheduling while atomic” kernel BUGs caused by the nvidia module:

Dec 17 17:41:18 KRiS-Siri kernel: [drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event
Dec 17 17:41:18 KRiS-Siri kernel: [drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event
Dec 17 17:41:18 KRiS-Siri kernel: BUG: scheduling while atomic: Xorg/1810/0x00000003
Dec 17 17:41:18 KRiS-Siri kernel: Modules linked in: ccm ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype nf_nat br_netfilter bridge stp llc rfcomm nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) ipmi_devintf ipmi_msghandler ip6t_rpfilter ip6table_raw ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_comment ipt_REJECT nf_reject_ipv4 xt_tcpudp bnep xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_raw snd_hda_codec_realtek snd_hda_codec_generic btusb btrtl uvcvideo btbcm btintel videobuf2_vmalloc videobuf2_memops bluetooth videobuf2_v4l2 videobuf2_common videodev media ecdh_generic crc16 arc4 i915 joydev mousedev nls_iso8859_1 nls_cp437 vfat msr fat kvmgt iwlmvm vfio_mdev mdev vfio_iommu_type1 vfio intel_rapl hid_multitouch i2c_algo_bit x86_pkg_temp_thermal intel_powerclamp drm_kms_helper snd_hda_intel coretemp mac80211 kvm_intel snd_hda_codec drm iTCO_wdt kvm iTCO_vendor_support snd_hda_core iwlwifi snd_hwdep snd_pcm
Dec 17 17:41:18 KRiS-Siri kernel:  irqbypass intel_cstate snd_timer idma64 intel_gtt tpm_crb snd agpgart intel_uncore cfg80211 tpm_tis syscopyarea mei_me pcspkr intel_rapl_perf tpm_tis_core intel_lpss_pci input_leds sysfillrect r8168(OE) mxm_wmi i2c_i801 soundcore i2c_hid asus_nb_wmi wmi_bmof intel_lpss processor_thermal_device sysimgblt mei int340x_thermal_zone battery ac tpm fb_sys_fops intel_pch_thermal intel_soc_dts_iosf rng_core evdev int3400_thermal acpi_thermal_rel pcc_cpufreq mac_hid asus_wireless crypto_user ip_tables x_tables btrfs libcrc32c crc32c_generic xor raid6_pq algif_skcipher af_alg dm_crypt dm_mod hid_asus asus_wmi sparse_keymap rfkill sd_mod hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw atkbd libps2 ahci libahci aesni_intel libata sdhci_pci aes_x86_64 cqhci crypto_simd xhci_pci sdhci cryptd glue_helper scsi_mod xhci_hcd mmc_core i8042 wmi serio
Dec 17 17:41:18 KRiS-Siri kernel: Preemption disabled at:
Dec 17 17:41:18 KRiS-Siri kernel: [<0000000000000000>]           (null)
Dec 17 17:41:18 KRiS-Siri kernel: CPU: 5 PID: 1810 Comm: Xorg Tainted: P     U     OE     4.20.0-rc7-kris-kris-g6f1f2c4e9 #1
Dec 17 17:41:18 KRiS-Siri kernel: Hardware name: ASUSTeK COMPUTER INC. Strix GL504GM_GL504GM/GL504GM, BIOS GL504GM.304 08/13/2018
Dec 17 17:41:18 KRiS-Siri kernel: Call Trace:
Dec 17 17:41:18 KRiS-Siri kernel:  dump_stack+0x5c/0x80
Dec 17 17:41:18 KRiS-Siri kernel:  __schedule_bug.cold.14+0x38/0x51
Dec 17 17:41:18 KRiS-Siri kernel:  __schedule+0x6f6/0x8b0
Dec 17 17:41:18 KRiS-Siri kernel:  schedule+0x32/0x90
Dec 17 17:41:18 KRiS-Siri kernel:  schedule_timeout+0x311/0x4a0
Dec 17 17:41:18 KRiS-Siri kernel:  ? resched_curr+0x23/0xd0
Dec 17 17:41:18 KRiS-Siri kernel:  ? check_preempt_curr+0x7a/0x90
Dec 17 17:41:18 KRiS-Siri kernel:  ? ttwu_do_wakeup.isra.5+0x19/0x160
Dec 17 17:41:18 KRiS-Siri kernel:  wait_for_common+0x15f/0x190
Dec 17 17:41:18 KRiS-Siri kernel:  ? wake_up_q+0x70/0x70
Dec 17 17:41:18 KRiS-Siri kernel:  do_coredump+0x35d/0xe98
Dec 17 17:41:18 KRiS-Siri kernel:  get_signal+0x294/0x5b0
Dec 17 17:41:18 KRiS-Siri kernel:  ? page_fault+0x8/0x30
Dec 17 17:41:18 KRiS-Siri kernel:  do_signal+0x36/0x640
Dec 17 17:41:18 KRiS-Siri kernel:  ? _raw_spin_unlock_irqrestore+0x20/0x40
Dec 17 17:41:18 KRiS-Siri kernel:  ? force_sig_fault+0x59/0x80
Dec 17 17:41:18 KRiS-Siri kernel:  ? page_fault+0x8/0x30
Dec 17 17:41:18 KRiS-Siri kernel:  exit_to_usermode_loop+0xbf/0xe0
Dec 17 17:41:18 KRiS-Siri kernel:  prepare_exit_to_usermode+0x64/0x90
Dec 17 17:41:18 KRiS-Siri kernel:  retint_user+0x8/0x8
Dec 17 17:41:18 KRiS-Siri kernel: RIP: 0033:0x7fd9a7913320
Dec 17 17:41:18 KRiS-Siri kernel: Code: 1f 00 81 f9 00 01 00 00 74 d3 89 d7 83 e7 7f 83 ff 01 75 67 64 8b 0c 25 d0 02 00 00 41 39 48 08 74 3e 81 e2 80 00 00 00 89 d6 <f0> 41 0f b1 38 74 16 49 8d 38 48 81 ec 80 00 00 00 e8 8a 6b 00 00
Dec 17 17:41:18 KRiS-Siri kernel: RSP: 002b:00007ffe9b6de5f0 EFLAGS: 00013246
Dec 17 17:41:18 KRiS-Siri kernel: RAX: 0000000000000000 RBX: 00007ffe9b6de950 RCX: 0000000000000712
Dec 17 17:41:18 KRiS-Siri kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
Dec 17 17:41:18 KRiS-Siri kernel: RBP: 00007ffe9b6de7a0 R08: 00007fd9a88d5930 R09: 0000000000000000
Dec 17 17:41:18 KRiS-Siri kernel: R10: 00007ffe9b6de6b0 R11: 0000000000000000 R12: 00007fd9a88d5000
Dec 17 17:41:18 KRiS-Siri kernel: R13: 00007ffe9b6de950 R14: 00007ffe9b6dec00 R15: 00007fd9a5f68a0b

Upon switching from KDE Plasma to GNOME Shell, I can now sometimes resume multiple times from sleep when no HDMI output is connected without Xorg crashing. It did not solve the problem altogether; however, at least it seems the GNOME compositor relies less on broken functionality of the driver than the Plasma one…

The problem seems to lessen if I set

options nvidia-drm modeset=0

Of course, that disables PRIME synchronization and introduces tearing issues. Please look into this issue, as it makes laptops with HDMI outputs wired to the Nvidia GPU useless as “laptops” with Linux. I have to treat my machine as a desktop PC and shut it down every time, not to mention the enormous power drain at idle despite enabling PowerMizer.
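The modeset option mentioned above normally goes into a file under /etc/modprobe.d/ (the file name is arbitrary), and if the nvidia modules are loaded from the initramfs, the initramfs may need to be regenerated before the setting takes effect. A small sketch for checking which value is actually active, assuming the loaded nvidia_drm module exposes the parameter under sysfs at the path below (the helper name modeset_state is made up for illustration):

```shell
# Hypothetical helper: report whether nvidia-drm modesetting is currently active.
# $1 optionally overrides the sysfs path (assumed default shown below).
modeset_state() {
    param="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ -r "$param" ]; then
        cat "$param"      # prints the current value, Y or N
    else
        echo "unknown"    # module not loaded or parameter not exported
    fi
}

modeset_state
```

An output of N after a reboot would suggest that modeset=0 was picked up, while Y would suggest the option file is not being read.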

On a slightly related note, please also fix PowerMizer on laptops! My GPU is stuck at the highest clock unless I kill my desktop compositor (tested with both KDE Plasma and GNOME) or limit performance with NVreg_RegistryDwords, and many other people report similar issues in these forums. Idling at the lowest PowerMizer level would let me use my machine on battery for several hours despite the lack of PRIME offload support; however, reaching it with a desktop compositor running seems all but impossible.