[BUG] nvidia_modeset causes kernel (5.*) xorg crash on RTX 2070 Super card

There seems to be a bug in the nvidia_modeset stack that, to my knowledge, is isolated to the RTX Super cards. I say this because I have two cards, a GTX 1060 and an RTX 2070 Super: the system runs fine on the GTX, but the RTX reproduces the issue. The bug is hard to reproduce, but essentially something happens in the nvidia stack that causes an infinite loop in applications relying on the GPU (specifically lightdm and Google Chrome). When this happens, the X server grinds to a halt, although SSHing into the machine still works fine. If I restart the lightdm service over SSH, the kernel logs a dump of the issue (which I’ve included below). The issue reproduces more frequently (within hours to a day) when my main monitor, connected via DisplayPort, goes to sleep and tries to wake back up (sometimes it doesn’t wake up). If I run a screensaver that prevents the display from going to sleep, I can go over a week without seeing the issue.
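
For anyone wanting to try the "keep the display awake" workaround without a screensaver, here is a minimal sketch that turns off X screen blanking and DPMS through `xset`. This is not what I actually ran (I used a screensaver), so treat it as an assumed equivalent; the function names are made up for illustration, and whether it avoids the hang on a given setup is untested.

```python
import shutil
import subprocess
from typing import List


def dpms_off_commands() -> List[List[str]]:
    """Return the xset invocations that stop the display from sleeping."""
    return [
        ["xset", "s", "off"],      # disable the X screen saver timeout
        ["xset", "s", "noblank"],  # do not blank the screen
        ["xset", "-dpms"],         # disable DPMS power management
    ]


def apply(commands: List[List[str]]) -> None:
    """Run each command in order; must be called inside the X session."""
    if shutil.which("xset") is None:
        raise RuntimeError("xset not found in PATH")
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(dpms_off_commands())
```

The settings only last for the current X session, so the script would have to run again after each login or X restart.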

For Reference:
Nvidia Driver: 435.21
X Server Version Number: 11.0
X Server Vendor Version: 1.20.5
Card: RTX 2070 Super
OS: Arch Linux (kernel 5.3.1; I have seen this on kernels older than 5.3.1 as well)
Mobo: ROG CROSSHAIR VIII FORMULA, BIOS 1001 09/09/2019
CPU: AMD Ryzen 9 3900X
3 Displays (all via DisplayPort):

  • Dell U2518D x2
  • Dell U2713H x1 (Main)

Kernel Dump

Oct 08 18:35:48  kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
Oct 08 18:35:48  kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
Oct 08 18:35:48  kernel: ------------[ cut here ]------------
Oct 08 18:35:48  kernel: Trying to vfree() bad address (00000000f91fb149)
Oct 08 18:35:48  kernel: WARNING: CPU: 11 PID: 759 at mm/vmalloc.c:2228 __vunmap+0x237/0x240
Oct 08 18:35:48  kernel: Modules linked in: edac_mce_amd kvm_amd fuse uvcvideo snd_usb_audio videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common snd_usbmidi_lib videodev snd_rawmidi snd_seq_device mc hid_logitech_hidpp joydev mousedev input_leds hid_logitech_dj nct6775 hwmon_vid snd_hda_codec_hdmi iwlmvm snd_hda_codec_realtek mac80211 nls_iso8859_1 snd_hda_codec_generic kvm nls_cp437 ledtrig_audio vfat snd_hda_intel libarc4 fat btusb snd_hda_codec btrtl ucsi_ccg iwlwifi btbcm typec_ucsi snd_hda_core crct10dif_pclmul btintel eeepc_wmi typec crc32_pclmul asus_wmi snd_hwdep ghash_clmulni_intel sparse_keymap wmi_bmof mxm_wmi bluetooth snd_pcm aesni_intel igb snd_timer aes_x86_64 crypto_simd ccp snd cryptd ecdh_generic sp5100_tco cfg80211 glue_helper pcspkr rng_core ecc soundcore i2c_algo_bit i2c_piix4 i2c_nvidia_gpu rfkill atlantic dca evdev wmi pinctrl_amd mac_hid acpi_cpufreq ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sd_mod hid_generic usbhid hid ahci libahci libata
Oct 08 18:35:48  kernel:  crc32c_intel xhci_pci scsi_mod xhci_hcd nvidia_drm(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart nvidia_uvm(OE) nvidia_modeset(POE) nvidia(POE) ipmi_devintf ipmi_msghandler vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
Oct 08 18:35:48  kernel: CPU: 11 PID: 759 Comm: Xorg Tainted: P           OE     5.3.1-arch1-1-ARCH #1
Oct 08 18:35:48  kernel: Hardware name: System manufacturer System Product Name/ROG CROSSHAIR VIII FORMULA, BIOS 1001 09/09/2019
Oct 08 18:35:48  kernel: RIP: 0010:__vunmap+0x237/0x240
Oct 08 18:35:48  kernel: Code: 69 01 49 8b 7d 20 e8 58 02 fd ff 4c 89 ef 5b 5d 41 5c 41 5d 41 5e e9 58 ce 02 00 48 89 fe 48 c7 c7 c8 3f ee 9f e8 48 41 e4 ff <0f> 0b eb aa c3 0f 1f 40 00 0f 1f 44 00 00 53 31 db 48 87 5f f8 48
Oct 08 18:35:48  kernel: RSP: 0018:ffff98d88294fb00 EFLAGS: 00010286
Oct 08 18:35:48  kernel: RAX: 0000000000000000 RBX: ffff93afb008c008 RCX: 0000000000000000
Oct 08 18:35:48  kernel: RDX: 0000000000000001 RSI: 0000000000000096 RDI: 00000000ffffffff
Oct 08 18:35:48  kernel: RBP: 0000000000000960 R08: 00000000000005dc R09: 0000000000000001
Oct 08 18:35:48  kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff93afab57f960
Oct 08 18:35:48  kernel: R13: 0000000000000004 R14: ffff93afb008f008 R15: ffff93afb008c008
Oct 08 18:35:48  kernel: FS:  00007fcf93d5ddc0(0000) GS:ffff93afbeac0000(0000) knlGS:0000000000000000
Oct 08 18:35:48  kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 08 18:35:48  kernel: CR2: 0000555e47e62000 CR3: 0000000782c0a000 CR4: 0000000000340ee0
Oct 08 18:35:48  kernel: Call Trace:
Oct 08 18:35:48  kernel:  _nv002407kms+0xea/0x150 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv000323kms+0x2d/0x1d0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv002216kms+0x2d5/0x6d0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv000564kms+0x6b/0x90 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv000325kms+0x92/0xb0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? nvKmsClose+0xab/0x170 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? nvkms_close_common+0x1e/0x60 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? nvkms_close+0x6a/0x90 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? nvidia_frontend_close+0x2b/0x50 [nvidia]
Oct 08 18:35:48  kernel:  ? __fput+0xae/0x240
Oct 08 18:35:48  kernel:  ? task_work_run+0x93/0xb0
Oct 08 18:35:48  kernel:  ? do_exit+0x300/0xb00
Oct 08 18:35:48  kernel:  ? free_one_page+0xac/0x480
Oct 08 18:35:48  kernel:  ? sched_clock_cpu+0x10/0xd0
Oct 08 18:35:48  kernel:  ? do_group_exit+0x33/0xa0
Oct 08 18:35:48  kernel:  ? get_signal+0x136/0x8d0
Oct 08 18:35:48  kernel:  ? up+0x40/0x60
Oct 08 18:35:48  kernel:  ? nvkms_ioctl_common+0x49/0x80 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? do_signal+0x43/0x680
Oct 08 18:35:48  kernel:  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
Oct 08 18:35:48  kernel:  ? exit_to_usermode_loop+0xbe/0x110
Oct 08 18:35:48  kernel:  ? do_syscall_64+0x189/0x1c0
Oct 08 18:35:48  kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 08 18:35:48  kernel: ---[ end trace 7a8cbed99bb5eb74 ]---
Oct 08 18:35:48  kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
Oct 08 18:35:48  kernel: #PF: supervisor read access in kernel mode
Oct 08 18:35:48  kernel: #PF: error_code(0x0000) - not-present page
Oct 08 18:35:48  kernel: PGD 0 P4D 0 
Oct 08 18:35:48  kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Oct 08 18:35:48  kernel: CPU: 6 PID: 374 Comm: nvidia-modeset/ Tainted: P        W  OE     5.3.1-arch1-1-ARCH #1
Oct 08 18:35:48  kernel: Hardware name: System manufacturer System Product Name/ROG CROSSHAIR VIII FORMULA, BIOS 1001 09/09/2019
Oct 08 18:35:48  kernel: RIP: 0010:_nv002228kms+0x40/0x80 [nvidia_modeset]
Oct 08 18:35:48  kernel: Code: 01 c8 39 c1 73 4f 0f 1f 40 00 89 ca 48 69 d2 08 0a 00 00 49 03 90 d0 02 00 00 48 8d 42 38 48 81 c2 d8 00 00 00 0f 1f 44 00 00 <80> 38 00 74 0a 80 78 01 00 75 04 c6 46 02 00 48 83 c0 14 48 39 c2
Oct 08 18:35:48  kernel: RSP: 0018:ffff98d8809cf5c0 EFLAGS: 00010206
Oct 08 18:35:48  kernel: RAX: 0000000000000038 RBX: ffff98d8809cf734 RCX: 0000000000000000
Oct 08 18:35:48  kernel: RDX: 00000000000000d8 RSI: ffff98d8809cf7ee RDI: ffff93afb008f008
Oct 08 18:35:48  kernel: RBP: ffff93afb008f008 R08: ffff93afb008c008 R09: 00000000000001e0
Oct 08 18:35:48  kernel: R10: 000000000000008f R11: ffff98d8809cf8b8 R12: ffff93afb01cc608
Oct 08 18:35:48  kernel: R13: ffff98d8809cfa64 R14: 0000000000000000 R15: 0000000000000000
Oct 08 18:35:48  kernel: FS:  0000000000000000(0000) GS:ffff93afbe980000(0000) knlGS:0000000000000000
Oct 08 18:35:48  kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 08 18:35:48  kernel: CR2: 0000000000000038 CR3: 0000000feb2de000 CR4: 0000000000340ee0
Oct 08 18:35:48  kernel: Call Trace:
Oct 08 18:35:48  kernel:  ? _nv000066kms+0x18f/0x1e0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv002237kms+0x249/0x4f0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv002365kms+0x53/0x60 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv000650kms+0xf6/0x360 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? cpumask_next_and+0x19/0x20
Oct 08 18:35:48  kernel:  ? load_balance+0x1ba/0xb40
Oct 08 18:35:48  kernel:  ? update_curr+0x108/0x1f0
Oct 08 18:35:48  kernel:  ? __switch_to_asm+0x34/0x70
Oct 08 18:35:48  kernel:  ? _nv002642kms+0x502/0x8d0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv002642kms+0x4d3/0x8d0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Oct 08 18:35:48  kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Oct 08 18:35:48  kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Oct 08 18:35:48  kernel:  ? _nv000650kms+0x40/0x40 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv000652kms+0x2a/0x40 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Oct 08 18:35:48  kernel:  ? nvkms_ioctl_common+0x3b/0x80 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? nvkms_ioctl_from_kapi+0xa/0x10 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? _nv000367kms+0x7a/0x210 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? nv_drm_connector_get_modes+0xd4/0x150 [nvidia_drm]
Oct 08 18:35:48  kernel:  ? drm_helper_probe_single_connector_modes+0x17b/0x6e0 [drm_kms_helper]
Oct 08 18:35:48  kernel:  ? nv_drm_output_poll_changed+0x85/0xd0 [nvidia_drm]
Oct 08 18:35:48  kernel:  ? drm_kms_helper_hotplug_event+0x26/0x30 [drm_kms_helper]
Oct 08 18:35:48  kernel:  ? nv_drm_event_callback+0x4a/0x90 [nvidia_drm]
Oct 08 18:35:48  kernel:  ? nvKmsKapiHandleEventQueueChange+0xc7/0x100 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? preempt_count_add+0x68/0xa0
Oct 08 18:35:48  kernel:  ? _main_loop+0x83/0x130 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? kthread+0xfb/0x130
Oct 08 18:35:48  kernel:  ? _raw_q_schedule+0x70/0x70 [nvidia_modeset]
Oct 08 18:35:48  kernel:  ? kthread_park+0x80/0x80
Oct 08 18:35:48  kernel:  ? ret_from_fork+0x22/0x40
Oct 08 18:35:48  kernel: Modules linked in: edac_mce_amd kvm_amd fuse uvcvideo snd_usb_audio videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common snd_usbmidi_lib videodev snd_rawmidi snd_seq_device mc hid_logitech_hidpp joydev mousedev input_leds hid_logitech_dj nct6775 hwmon_vid snd_hda_codec_hdmi iwlmvm snd_hda_codec_realtek mac80211 nls_iso8859_1 snd_hda_codec_generic kvm nls_cp437 ledtrig_audio vfat snd_hda_intel libarc4 fat btusb snd_hda_codec btrtl ucsi_ccg iwlwifi btbcm typec_ucsi snd_hda_core crct10dif_pclmul btintel eeepc_wmi typec crc32_pclmul asus_wmi snd_hwdep ghash_clmulni_intel sparse_keymap wmi_bmof mxm_wmi bluetooth snd_pcm aesni_intel igb snd_timer aes_x86_64 crypto_simd ccp snd cryptd ecdh_generic sp5100_tco cfg80211 glue_helper pcspkr rng_core ecc soundcore i2c_algo_bit i2c_piix4 i2c_nvidia_gpu rfkill atlantic dca evdev wmi pinctrl_amd mac_hid acpi_cpufreq ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sd_mod hid_generic usbhid hid ahci libahci libata
Oct 08 18:35:48  kernel:  crc32c_intel xhci_pci scsi_mod xhci_hcd nvidia_drm(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart nvidia_uvm(OE) nvidia_modeset(POE) nvidia(POE) ipmi_devintf ipmi_msghandler vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
Oct 08 18:35:48  kernel: CR2: 0000000000000038
Oct 08 18:35:48  kernel: ---[ end trace 7a8cbed99bb5eb75 ]---
Oct 08 18:35:48  kernel: RIP: 0010:_nv002228kms+0x40/0x80 [nvidia_modeset]
Oct 08 18:35:48  kernel: Code: 01 c8 39 c1 73 4f 0f 1f 40 00 89 ca 48 69 d2 08 0a 00 00 49 03 90 d0 02 00 00 48 8d 42 38 48 81 c2 d8 00 00 00 0f 1f 44 00 00 <80> 38 00 74 0a 80 78 01 00 75 04 c6 46 02 00 48 83 c0 14 48 39 c2
Oct 08 18:35:48  kernel: RSP: 0018:ffff98d8809cf5c0 EFLAGS: 00010206
Oct 08 18:35:48  kernel: RAX: 0000000000000038 RBX: ffff98d8809cf734 RCX: 0000000000000000
Oct 08 18:35:48  kernel: RDX: 00000000000000d8 RSI: ffff98d8809cf7ee RDI: ffff93afb008f008
Oct 08 18:35:48  kernel: RBP: ffff93afb008f008 R08: ffff93afb008c008 R09: 00000000000001e0
Oct 08 18:35:48  kernel: R10: 000000000000008f R11: ffff98d8809cf8b8 R12: ffff93afb01cc608
Oct 08 18:35:48  kernel: R13: ffff98d8809cfa64 R14: 0000000000000000 R15: 0000000000000000
Oct 08 18:35:48  kernel: FS:  0000000000000000(0000) GS:ffff93afbe980000(0000) knlGS:0000000000000000
Oct 08 18:35:48  kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 08 18:35:48  kernel: CR2: 0000000000000038 CR3: 0000000feb2de000 CR4: 0000000000340ee0
Oct 08 18:35:48  kernel: audit: type=1131 audit(1570584948.328:202): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=lightdm comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 08 18:35:48  audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=lightdm comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Oct 08 18:35:48  systemd[1]: lightdm.service: Succeeded.
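
Since the interesting lines are buried in a long trace, here is a minimal sketch of a filter that pulls out the failure markers seen in the dump above. The patterns are copied from this log only; the labels, the `Finding` type, and the `scan` function are names I made up, and other driver versions may word these messages differently.

```python
import re
import sys
from typing import Iterable, List, NamedTuple


class Finding(NamedTuple):
    kind: str     # label for the marker that matched (my own naming)
    message: str  # the matched part of the log line


# Markers taken from the dump above; first matching pattern wins per line.
PATTERNS = [
    ("modeset_error", re.compile(r"nvidia-modeset: ERROR: (.*)")),
    ("bad_vfree", re.compile(r"(Trying to vfree\(\) bad address .*)")),
    ("null_deref", re.compile(r"(BUG: kernel NULL pointer dereference.*)")),
    ("oops", re.compile(r"(Oops: .*)")),
]


def scan(lines: Iterable[str]) -> List[Finding]:
    """Return one Finding for every line that contains a known marker."""
    findings = []
    for line in lines:
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                findings.append(Finding(kind, match.group(1).strip()))
                break
    return findings


if __name__ == "__main__":
    # Reads kernel log text from stdin and prints only the marker lines.
    for finding in scan(sys.stdin):
        print(f"{finding.kind}: {finding.message}")
```

If saved as, say, `scan_nvkms.py` (a hypothetical filename), it could be fed the previous boot's kernel messages with `journalctl -k -b -1 | python scan_nvkms.py`.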

Try the 440.26 beta driver.

OK, I will this weekend. Hopefully I can confirm the fix (if it works) within a week. Also, no RAM was being leaked. I saw the malloc issue too, but it seems it also had trouble freeing RAM:

Trying to vfree() bad address (00000000f91fb149)

Had another instance of this occur on the 440 driver. It took about a week to reproduce, but this time I could at least use sleep mode on my monitors. I’ve attached the nvidia-bug-report.sh output.
nvidia-bug-report.log.gz (14 MB)