Nvidia-uvm module bug on suspend

The following kernel bug report occurs intermittently when entering system suspend.

------------[ cut here ]------------
list_add corruption. prev is NULL.
WARNING: CPU: 0 PID: 11528 at lib/list_debug.c:25 __list_add_valid_or_report+0x42/0xa0
Modules linked in: hid_logitech_hidpp uhid rfcomm snd_seq_dummy snd_hrtimer snd_seq ccm cmac algif_hash algif_skcipher af_alg bnep snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi snd_seq_device nvidia_drm(POE) nvidia_uvm(POE) nvidia_modeset(POE) btusb btrtl btintel uvcvideo btbcm videobuf2_vmalloc btmtk uvc videobuf2_memops videobuf2_v4l2 bluetooth videodev videobuf2_common ecdh_generic mc crc16 intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_soc_avs snd_hda_codec_hdmi snd_soc_hda_codec iwlmvm snd_hda_ext_core snd_ctl_led kvm snd_hda_codec_realtek snd_soc_core mac80211 snd_hda_codec_generic irqbypass crct10dif_pclmul snd_compress crc32_pclmul ac97_bus joydev polyval_clmulni snd_pcm_dmaengine polyval_generic mousedev dell_rbtn libarc4 gf128mul snd_hda_intel ghash_clmulni_intel snd_intel_dspcfg sha512_ssse3 snd_intel_sdw_acpi aesni_intel dell_laptop snd_hda_codec crypto_simd snd_hda_core cryptd hid_multitouch dell_wmi snd_hwdep iTCO_wdt rapl nls_iso8859_1 iwlwifi
 intel_pmc_bxt dell_smbios snd_pcm ee1004 processor_thermal_device_pci_legacy intel_cstate vfat iTCO_vendor_support dcdbas processor_thermal_device mei_wdt mei_pxp mei_hdcp fat intel_rapl_msr dell_smm_hwmon intel_uncore psmouse dell_wmi_descriptor ledtrig_audio wmi_bmof pcspkr intel_wmi_thunderbolt i2c_i801 processor_thermal_rfim snd_timer cfg80211 i2c_smbus processor_thermal_mbox snd intel_lpss_pci mei_me i2c_hid_acpi processor_thermal_rapl intel_lpss intel_rapl_common int3403_thermal int3400_thermal soundcore rfkill mei i2c_hid intel_hid idma64 intel_soc_dts_iosf intel_pch_thermal acpi_thermal_rel int340x_thermal_zone nvidia(POE) sparse_keymap acpi_pad mac_hid i2c_dev fuse crypto_user loop dm_mod ip_tables x_tables usbhid btrfs i915 blake2b_generic libcrc32c crc32c_generic xor raid6_pq serio_raw i2c_algo_bit atkbd rtsx_pci_sdmmc drm_buddy libps2 mmc_core vivaldi_fmap ttm nvme intel_gtt crc32c_intel mxm_wmi nvme_core drm_display_helper xhci_pci rtsx_pci nvme_common cec xhci_pci_renesas i8042 video serio
 wmi
CPU: 0 PID: 11528 Comm: nvidia-sleep.sh Tainted: P          IOE      6.6.1-arch1-1 #1 be166a630cd909acf8820643140e9106c6ea80e6
Hardware name: Dell Inc. Precision 5520/0X41RR, BIOS 1.18.0 11/17/2019
RIP: 0010:__list_add_valid_or_report+0x42/0xa0
Code: 75 41 4c 8b 02 49 39 c0 75 4c 48 39 fa 74 60 49 39 f8 74 5b b8 01 00 00 00 c3 cc cc cc cc 48 c7 c7 98 33 49 95 e8 7e 14 a6 ff <0f> 0b 31 c0 c3 cc cc cc cc 48 c7 c7 c0 33 49 95 e8 69 14 a6 ff 0f
RSP: 0018:ffffc900079d3ba8 EFLAGS: 00010082
RAX: 0000000000000000 RBX: ffffc900011d12b0 RCX: 0000000000000027
RDX: ffff88846e421708 RSI: 0000000000000001 RDI: ffff88846e421700
RBP: ffffc900079d3be0 R08: 0000000000000000 R09: ffffc900079d3a30
R10: 0000000000000003 R11: ffffffff95cca3c8 R12: 0000000000000246
R13: ffffc900011d12c0 R14: 0000000000000000 R15: ffff88811fb88000
FS:  00007f07fffd3740(0000) GS:ffff88846e400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0800226650 CR3: 00000004622fe001 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __list_add_valid_or_report+0x42/0xa0
 ? __warn+0x81/0x130
 ? __list_add_valid_or_report+0x42/0xa0
 ? report_bug+0x171/0x1a0
 ? prb_read_valid+0x1b/0x30
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? __list_add_valid_or_report+0x42/0xa0
 ? __list_add_valid_or_report+0x42/0xa0
 _raw_q_schedule+0x3d/0xa0 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 nv_kthread_q_flush+0x7b/0x140 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 ? __pfx__q_flush_function+0x10/0x10 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 uvm_suspend+0x9f/0x190 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 uvm_suspend_entry.part.0+0x4e/0xa0 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 ? kmem_cache_free+0x22/0x3a0
 nv_uvm_suspend+0x2e/0x50 [nvidia 3ce5cfbe99895ad472b4c9f14570b8cea8f3f96a]
 nv_set_system_power_state+0x3bb/0x470 [nvidia 3ce5cfbe99895ad472b4c9f14570b8cea8f3f96a]
 nv_procfs_write_suspend+0xe8/0x160 [nvidia 3ce5cfbe99895ad472b4c9f14570b8cea8f3f96a]
 proc_reg_write+0x5a/0xa0
 vfs_write+0xef/0x420
 ksys_write+0x6f/0xf0
 do_syscall_64+0x5d/0x90
 ? handle_mm_fault+0xa2/0x360
 ? do_user_addr_fault+0x30f/0x660
 ? exc_page_fault+0x7f/0x180
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7f0800151034
Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 35 c3 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
RSP: 002b:00007ffc4486d4e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f0800151034
RDX: 0000000000000008 RSI: 000055d8112e65b0 RDI: 0000000000000001
RBP: 000055d8112e65b0 R08: 0000000000000410 R09: 0000000000000001
R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000008
R13: 00007f08002265c0 R14: 00007f0800223f20 R15: 0000000000000000
 </TASK>
---[ end trace 0000000000000000 ]---
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 11528 Comm: nvidia-sleep.sh Tainted: P        W IOE      6.6.1-arch1-1 #1 be166a630cd909acf8820643140e9106c6ea80e6
Hardware name: Dell Inc. Precision 5520/0X41RR, BIOS 1.18.0 11/17/2019
RIP: 0010:__list_del_entry_valid_or_report+0x4/0xe0
Code: 48 89 c1 48 89 fe 48 c7 c7 88 34 49 95 e8 24 14 a6 ff 0f 0b eb a4 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa <48> 8b 17 48 8b 4f 08 48 85 d2 74 3e 48 85 c9 74 51 48 b8 00 01 00
RSP: 0018:ffffc900079d3b80 EFLAGS: 00010007
RAX: ffffc900011d12d0 RBX: 0000000000000000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 0000000000000292 RDI: 0000000000000000
RBP: ffffc900079d3be0 R08: 0000000000000000 R09: ffffc900079d3a30
R10: 0000000000000003 R11: ffffffff95cca3c8 R12: 0000000000000246
R13: ffffc900011d12c0 R14: 0000000000000000 R15: ffff88811fb88000
FS:  00007f07fffd3740(0000) GS:ffff88846e400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000004622fe001 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __die+0x23/0x70
 ? page_fault_oops+0x171/0x4e0
 ? __list_add_valid_or_report+0x42/0xa0
 ? __warn+0x9b/0x130
 ? exc_page_fault+0x7f/0x180
 ? asm_exc_page_fault+0x26/0x30
 ? __list_del_entry_valid_or_report+0x4/0xe0
 __up.isra.0+0xe/0x50
 up+0x44/0x60
 _raw_q_schedule+0x64/0xa0 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 nv_kthread_q_flush+0x7b/0x140 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 ? __pfx__q_flush_function+0x10/0x10 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 uvm_suspend+0x9f/0x190 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 uvm_suspend_entry.part.0+0x4e/0xa0 [nvidia_uvm 56e3a52a4ae3c6eebb72d1602e42807b69a9ce07]
 ? kmem_cache_free+0x22/0x3a0
 nv_uvm_suspend+0x2e/0x50 [nvidia 3ce5cfbe99895ad472b4c9f14570b8cea8f3f96a]
 nv_set_system_power_state+0x3bb/0x470 [nvidia 3ce5cfbe99895ad472b4c9f14570b8cea8f3f96a]
 nv_procfs_write_suspend+0xe8/0x160 [nvidia 3ce5cfbe99895ad472b4c9f14570b8cea8f3f96a]
 proc_reg_write+0x5a/0xa0
 vfs_write+0xef/0x420
 ksys_write+0x6f/0xf0
 do_syscall_64+0x5d/0x90
 ? handle_mm_fault+0xa2/0x360
 ? do_user_addr_fault+0x30f/0x660
 ? exc_page_fault+0x7f/0x180
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7f0800151034
Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d 35 c3 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
RSP: 002b:00007ffc4486d4e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f0800151034
RDX: 0000000000000008 RSI: 000055d8112e65b0 RDI: 0000000000000001
RBP: 000055d8112e65b0 R08: 0000000000000410 R09: 0000000000000001
R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000008
R13: 00007f08002265c0 R14: 00007f0800223f20 R15: 0000000000000000
 </TASK>
Modules linked in: hid_logitech_hidpp uhid rfcomm snd_seq_dummy snd_hrtimer snd_seq ccm cmac algif_hash algif_skcipher af_alg bnep snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi snd_seq_device nvidia_drm(POE) nvidia_uvm(POE) nvidia_modeset(POE) btusb btrtl btintel uvcvideo btbcm videobuf2_vmalloc btmtk uvc videobuf2_memops videobuf2_v4l2 bluetooth videodev videobuf2_common ecdh_generic mc crc16 intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_soc_avs snd_hda_codec_hdmi snd_soc_hda_codec iwlmvm snd_hda_ext_core snd_ctl_led kvm snd_hda_codec_realtek snd_soc_core mac80211 snd_hda_codec_generic irqbypass crct10dif_pclmul snd_compress crc32_pclmul ac97_bus joydev polyval_clmulni snd_pcm_dmaengine polyval_generic mousedev dell_rbtn libarc4 gf128mul snd_hda_intel ghash_clmulni_intel snd_intel_dspcfg sha512_ssse3 snd_intel_sdw_acpi aesni_intel dell_laptop snd_hda_codec crypto_simd snd_hda_core cryptd hid_multitouch dell_wmi snd_hwdep iTCO_wdt rapl nls_iso8859_1 iwlwifi
 intel_pmc_bxt dell_smbios snd_pcm ee1004 processor_thermal_device_pci_legacy intel_cstate vfat iTCO_vendor_support dcdbas processor_thermal_device mei_wdt mei_pxp mei_hdcp fat intel_rapl_msr dell_smm_hwmon intel_uncore psmouse dell_wmi_descriptor ledtrig_audio wmi_bmof pcspkr intel_wmi_thunderbolt i2c_i801 processor_thermal_rfim snd_timer cfg80211 i2c_smbus processor_thermal_mbox snd intel_lpss_pci mei_me i2c_hid_acpi processor_thermal_rapl intel_lpss intel_rapl_common int3403_thermal int3400_thermal soundcore rfkill mei i2c_hid intel_hid idma64 intel_soc_dts_iosf intel_pch_thermal acpi_thermal_rel int340x_thermal_zone nvidia(POE) sparse_keymap acpi_pad mac_hid i2c_dev fuse crypto_user loop dm_mod ip_tables x_tables usbhid btrfs i915 blake2b_generic libcrc32c crc32c_generic xor raid6_pq serio_raw i2c_algo_bit atkbd rtsx_pci_sdmmc drm_buddy libps2 mmc_core vivaldi_fmap ttm nvme intel_gtt crc32c_intel mxm_wmi nvme_core drm_display_helper xhci_pci rtsx_pci nvme_common cec xhci_pci_renesas i8042 video serio
 wmi
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
RIP: 0010:__list_del_entry_valid_or_report+0x4/0xe0
Code: 48 89 c1 48 89 fe 48 c7 c7 88 34 49 95 e8 24 14 a6 ff 0f 0b eb a4 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa <48> 8b 17 48 8b 4f 08 48 85 d2 74 3e 48 85 c9 74 51 48 b8 00 01 00
RSP: 0018:ffffc900079d3b80 EFLAGS: 00010007
RAX: ffffc900011d12d0 RBX: 0000000000000000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 0000000000000292 RDI: 0000000000000000
RBP: ffffc900079d3be0 R08: 0000000000000000 R09: ffffc900079d3a30
R10: 0000000000000003 R11: ffffffff95cca3c8 R12: 0000000000000246
R13: ffffc900011d12c0 R14: 0000000000000000 R15: ffff88811fb88000
FS:  00007f07fffd3740(0000) GS:ffff88846e400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000004622fe001 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
note: nvidia-sleep.sh[11528] exited with irqs disabled
note: nvidia-sleep.sh[11528] exited with preempt_count 1
PM: suspend entry (deep)
Filesystems sync: 0.054 seconds

I have the module option NVreg_DynamicPowerManagement=0x02 set; no other options are set.

The system still responds after this happens, but many other drivers (networking, etc.) stop working, so it is mostly unusable and requires a reboot.

The workaround is to blacklist nvidia-uvm. The driver version is 545.29.02.

nvidia-bug-report.log.gz (714.0 KB)

Please check whether setting the nvidia module parameter
NVreg_PreserveVideoMemoryAllocations=1
avoids the crash, although the uvm module rarely survives a suspend/resume cycle.

I tried this and got exactly the same trace. However, I found that a process has to be listed in nvidia-smi for the crash to trigger.

As in, is this a feature or a bug?

I don’t really understand your config; it seems there’s nothing running on the nvidia gpu. Does enabling nvidia-persistenced to start on boot avoid the crash?

I can’t get much to work after triggering the bug, but I do have a terminal. I’m attaching a bug report taken in a configuration that would seem to trigger the issue if I were to perform a system suspend. After the bug is triggered, the bug reporter hangs at 100% cpu usage while accessing /proc/driver/nvidia/./version.

nvidia-bug-report.log.gz (669.3 KB)

I have found that I can avoid the bug by manually stopping all processes running on the gpu.
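A minimal sketch of how that check could be scripted before suspending, assuming nvidia-smi is on the PATH. The helper names parse_compute_apps and list_compute_apps are hypothetical, and the PIDs in the sample are made up; the parser expects the output of `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader`.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name' lines into a list of (pid, name) tuples."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first comma only, in case the process name contains one.
        pid, _, name = line.partition(",")
        apps.append((int(pid), name.strip()))
    return apps


def list_compute_apps():
    """Ask nvidia-smi for the compute processes currently on the GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)


# Made-up sample in the format nvidia-smi prints with csv,noheader.
sample = "4242, python3\n4310, /usr/bin/blender\n"
for pid, name in parse_compute_apps(sample):
    print(f"GPU compute process still running: {pid} ({name})")
```

If list_compute_apps() returns an empty list, nothing is listed as a compute process, which matches the state in which suspend worked for me.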

No. Exactly the same.

Ok, so as a summary:

  • the nvidia gpu is not used for graphics, only compute
  • suspend works when no compute job is running
  • crash happens on suspend (not resume)

Yes. After much trial and error, this seems to be the state that triggers it.

I’ve also noted this in the logs after the kernel state dump:

note: nvidia-sleep.sh[64711] exited with irqs disabled
note: nvidia-sleep.sh[64711] exited with preempt_count 1
PM: suspend entry (deep)

So it seems to be connected to the nvidia-sleep.sh process that runs when suspend is triggered. After this, the system seems to try to continue suspending but apparently fails.
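For anyone who wants to look for the same notes in their own kernel log, here is a rough sketch that picks them out of saved log text. The function name find_exit_notes is hypothetical, and the regular expression is only based on the three lines quoted above, so other kernels may word these notes differently.

```python
import re

# Matches lines such as: note: nvidia-sleep.sh[64711] exited with irqs disabled
EXIT_NOTE = re.compile(
    r"note: (?P<comm>\S+)\[(?P<pid>\d+)\] exited with (?P<state>.+)"
)


def find_exit_notes(log_text):
    """Return (command, pid, state) for every 'exited with ...' note."""
    notes = []
    for line in log_text.splitlines():
        match = EXIT_NOTE.search(line)
        if match:
            notes.append(
                (match["comm"], int(match["pid"]), match["state"].strip())
            )
    return notes


sample = """note: nvidia-sleep.sh[64711] exited with irqs disabled
note: nvidia-sleep.sh[64711] exited with preempt_count 1
PM: suspend entry (deep)
"""
for comm, pid, state in find_exit_notes(sample):
    print(comm, pid, state)
```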

Maybe your tmpfs is not large enough to store the vmem contents. Please try setting the nvidia module option NVreg_TemporaryFilePath=/var/tmp so it writes to disk.

Here is /proc/driver/nvidia/params:

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 1
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 2
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: "/var/tmp"
ExcludedGpus: ""
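
A small sketch for pulling the options discussed in this thread out of that listing. The function name parse_nvidia_params is hypothetical; it only assumes the "Name: value" layout shown above, with string values in double quotes.

```python
def parse_nvidia_params(text):
    """Turn 'Name: value' lines into a dict, dropping surrounding quotes."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Name: value' shape
        params[key.strip()] = value.strip().strip('"')
    return params


# A few lines copied from the listing above.
sample = '''PreserveVideoMemoryAllocations: 1
DynamicPowerManagement: 2
TemporaryFilePath: "/var/tmp"
'''
params = parse_nvidia_params(sample)
for name in ("PreserveVideoMemoryAllocations",
             "DynamicPowerManagement",
             "TemporaryFilePath"):
    print(name, "=", params.get(name))
```

On a live system the text could be read from /proc/driver/nvidia/params instead of the sample string.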

Ok, so you had already set that without any change.

Yes. There is more than 50GB free on that partition.
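
As a sketch, the free-space check can be reproduced like this. The helper name has_room and the 8 GiB figure are assumptions for illustration; the real amount of video memory to be saved depends on the GPU and the running jobs.

```python
import shutil
import tempfile


def has_room(path, needed_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= needed_bytes


# Placeholder requirement: 8 GiB of video memory to save (an assumed value).
needed = 8 * 1024**3

# The thread uses /var/tmp; the system temp directory is used here only so
# that the example also runs where /var/tmp does not exist.
target = tempfile.gettempdir()
print(target, "has room:", has_room(target, needed))
```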

Did you try earlier driver versions, and did suspend work or fail the same way there? It would be better to send a bug report to linux-bugs[at]nvidia.com.

I’ve always seen this since I started needing this driver for gpu computing. I just didn’t understand the conditions needed to trigger it. The earliest version that I have available to test is 535.54.03, and it produces a very similar kernel bug trace.

OK.

Looks like this email address is monitored by /dev/null.

A similar issue has been reported in another thread and is being discussed within the team.

BUG: nvidia_uvm needs to be removed and re-inserted in order to work after wakeup from suspend - Graphics / Linux / Linux - NVIDIA Developer Forums