With current NVIDIA drivers (at least 570.144 and 575.51.02), I get frequent kernel BUGs when suspending the system. At that point the system becomes unusable; the only way to recover is to sync, unmount, and reboot using Alt+SysRq.
This is on a hybrid graphics laptop running “modern” X11 with the NVIDIA dGPU as an additional XRandR provider, so there is no way to unload the nvidia driver before suspending, even when I’m not using it.
This used to work ‘better’ (i.e. I did not get regular crashes when trying to suspend) until a few weeks or months ago.
I’ve already tried enabling nvidia-persistenced, with no change. NVreg_PreserveVideoMemoryAllocations is set to 1 by default on my system; I tried 0, again with no change. With the 570.144 driver, the system appears to oops consistently on the second suspend, i.e. after having successfully resumed from suspend once. With 575.51.02, so far it consistently oopses on the first attempt to suspend.
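To double-check which value of the parameter the loaded driver actually picked up, something like the following sketch could help. It assumes the driver exposes its current settings in /proc/driver/nvidia/params as `Name: value` lines, with the `NVreg_` prefix dropped; the helper names are made up for illustration.

```python
from typing import Dict, Optional


def parse_nvidia_params(text: str) -> Dict[str, str]:
    """Parse 'Name: value' lines (assumed format of the params file) into a dict."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            params[name.strip()] = value.strip()
    return params


def preserve_video_memory_enabled(text: str) -> Optional[bool]:
    """Return True/False for the setting, or None if it is not listed."""
    value = parse_nvidia_params(text).get("PreserveVideoMemoryAllocations")
    return None if value is None else value == "1"
```

With the nvidia module loaded, this would be called as `preserve_video_memory_enabled(open("/proc/driver/nvidia/params").read())`. A result of `None` would suggest the file uses a different key name than assumed here.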
I’ll try downgrading to 550.163.01 next (the next lower version that my distribution readily offers).
As far as I’ve seen, the general shape of the stack trace is always the same: nv_uvm_suspend -...-> nv_kthread_q_flush -...-> nvstatusToString -...-> *BANG*
nvidia-bug-report.log.gz (759.9 KB)
The kernel message:
list_add corruption. prev is NULL.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:25!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 6 UID: 0 PID: 13547 Comm: nvidia-sleep.sh Tainted: P O 6.12.21-gentoo #1
Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
Hardware name: LENOVO 80RV/Lenovo ideapad 700-17ISK, BIOS E5CN63WW 06/14/2018
RIP: 0010:__list_add_valid_or_report.cold+0xc/0x5b
Code: 83 ce ff bf 10 00 00 00 e8 a3 fe ff ff 48 c7 c7 05 38 27 82 e8 07 7a fe ff e9 fc 29 ac ff 48 c7 c7 98 73 2f 82 e8 f6 79 fe ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 98 74 2f 82 e8 df 79 fe
RSP: 0018:ffffc9000b737be8 EFLAGS: 00010046
RAX: 0000000000000022 RBX: ffffc90020d9c2f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8226f0f2 RDI: 00000000ffffffff
RBP: ffffc9000b737c20 R08: 0000000000000000 R09: ffffffff8250bb08
R10: ffffffff8253bb48 R11: 0000000000000003 R12: 0000000000000246
R13: ffffc90020d9c300 R14: 0000000000000000 R15: ffff8881064c8000
FS: 00007f470bf48c40(0000) GS:ffff88883ed00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000557a16e62128 CR3: 0000000353d2a006 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? die+0x25/0x40
? do_trap+0xc1/0x110
? do_error_trap+0x65/0x90
? __list_add_valid_or_report.cold+0xc/0x5b
? exc_invalid_op+0x4c/0x60
? __list_add_valid_or_report.cold+0xc/0x5b
? asm_exc_invalid_op+0x16/0x20
? __list_add_valid_or_report.cold+0xc/0x5b
? __list_add_valid_or_report.cold+0xc/0x5b
nvstatusToString+0x1e8/0x270 [nvidia_uvm]
nv_kthread_q_flush+0x66/0x110 [nvidia_uvm]
? nvstatusToString+0x260/0x270 [nvidia_uvm]
uvm_tools_exit+0xff/0x2c0 [nvidia_uvm]
uvm_suspend_entry+0x8c/0x290 [nvidia_uvm]
nv_uvm_suspend+0x25/0x40 [nvidia]
nv_set_system_power_state+0x3be/0x470 [nvidia]
nv_teardown_pat_support+0x4ad/0x1de0 [nvidia]
proc_reg_write+0x4d/0x90
? preempt_count_add+0x42/0xa0
vfs_write+0xf2/0x490
ksys_write+0x64/0xe0
do_syscall_64+0x80/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f470c08aa74
Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d f5 65 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
RSP: 002b:00007ffceab968a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f470c08aa74
RDX: 0000000000000008 RSI: 0000557a16e61d20 RDI: 0000000000000001
RBP: 00007f470c16a5c0 R08: 0000000000000410 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000000008
R13: 0000557a16e61d20 R14: 0000000000000008 R15: 00007f470c167ea0
</TASK>
Modules linked in: binfmt_misc rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device nvidia_uvm(PO) bfq ec_sys uhid algif_hash algif_skcipher af_alg bnep tpm_crb tpm_tis tpm_tis_core tpm libaescfb rng_core thermal tiny_power_button snd_hda_codec_hdmi snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component vboxnetadp(O) vboxnetflt(O) vboxdrv(O) btusb btrtl uvcvideo uvc btintel videobuf2_vmalloc btbcm videobuf2_memops btmtk videobuf2_v4l2 videobuf2_common cmac videodev mc bluetooth ecdh_generic ecc joydev mousedev nls_ascii nls_cp437 vfat fat intel_rapl_msr coretemp intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp evdev iwlmvm kvm_intel mac80211 mei_hdcp mei_pxp ee1004 snd_hda_intel snd_intel_dspcfg kvm crct10dif_pclmul polyval_clmulni polyval_generic snd_hda_codec ghash_clmulni_intel libarc4 snd_hda_core sha512_ssse3 sha256_ssse3 snd_hwdep sha1_ssse3 i915 snd_pcm sha1_generic intel_pmc_core snd_timer rapl i2c_algo_bit iwlwifi intel_lpss_pci
pinctrl_sunrisepoint snd intel_vsec drm_buddy intel_cstate intel_lpss pinctrl_intel pmt_telemetry intel_uncore ac ideapad_laptop idma64 soundcore acpi_pad wdat_wdt pwm_lpss i2c_i801 drm_display_helper pmt_class psmouse efi_pstore pcspkr mei_me button i2c_smbus cec cfg80211 intel_pch_thermal input_leds rc_core mac_hid intel_gtt mei serio_raw nvidia_drm(PO) asus_wmi nvidia_modeset(PO) battery platform_profile sparse_keymap rfkill intel_wmi_thunderbolt wmi_bmof nvidia(PO) drm_ttm_helper ttm video wmi backlight dummy loop fuse dm_mod configfs nfnetlink efivarfs dmi_sysfs ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid sd_mod atkbd r8169 libps2 vivaldi_fmap ahci xhci_pci realtek nvme libahci crc32_pclmul led_class crc32c_intel xhci_hcd mdio_devres libata nvme_core hwmon libphy scsi_mod usbcore scsi_common usb_common i8042 serio
---[ end trace 0000000000000000 ]---
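For comparing traces across driver versions, a small sketch like this one can reduce a pasted oops to its reliable frames. It skips the entries prefixed with `?`, which the kernel prints for addresses found on the stack that the unwinder could not confirm as part of the call chain. The regex and the function name are my own and only cover the `symbol+0xoff/0xsize [module]` layout seen in the log above.

```python
import re

# Matches lines such as "nv_uvm_suspend+0x25/0x40 [nvidia]" or "? die+0x25/0x40".
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def reliable_frames(trace: str):
    """Return (symbol, module) pairs for frames not marked with '?'.

    The module is None for frames that belong to the core kernel.
    """
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)
        if m and not m.group("unreliable"):
            frames.append((m.group("sym"), m.group("mod")))
    return frames
```

Running it over the Call Trace section of two reports and comparing the resulting lists shows quickly whether the crash path is the same in both.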