Getting kernel NULL pointer dereference when unloading (modprobe -r) nvidia_drm

Summary
I always get a kernel NULL pointer dereference when unloading the nvidia drm module.

Description
This is a laptop configured in an Optimus arrangement with PRIME Render Offload, with Intel as the main GPU.

When nvidia-drm is loaded with modesetting, the system never reaches the deeper C-states and wastes 2-5 watts in every power mode (idle, plugged in, on battery, etc.), even when the NVIDIA GPU is reportedly turned off. Hence, I always load it without modesetting (i.e., default parameters).

But when I use the NVIDIA GPU as the graphics processor without modesetting, it always shows screen tearing in fullscreen mode, as if there were no VSYNC, and the Vulkan extension VK_EXT_external_memory_dma_buf is not supported. So my plan was to run the laptop with modesetting off most of the time and, when an application needs NVIDIA as the graphics processor, unload the module and reload it with modesetting on. When the application is done, I would unload it again and reload it with modesetting off.

Additional information
CachyOS provides several flavors of its kernel builds. One of them tracks the latest release candidate of the kernel and is available built with either clang or gcc.

Their kernel names follow a pattern. For example, 6.12.0-rc6-1-cachyos-rc-gcc means: based on the 6th rc of the 6.12 Linux kernel, first package release, built with the CachyOS patches, release candidate flavor, compiled with gcc.
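As a rough illustration of that naming scheme, the shell sketch below splits such a release string into the fields just described. The function name describe_kernel and the field labels are made up for this example, and it assumes the string always has the rc-style layout shown above; a non-rc kernel name would not line up with these fields.

```shell
#!/bin/sh
# Split a CachyOS rc-style kernel release string on "-".
# describe_kernel and its field labels are hypothetical, for illustration only.
describe_kernel() {
    old_ifs=$IFS
    IFS=-
    # Unquoted on purpose so the shell splits the string on "-".
    set -- $1
    IFS=$old_ifs
    # The compiler suffix may be absent (e.g. 6.12.0-rc6-1-cachyos-rc).
    echo "base=$1 rc=$2 pkgrel=$3 patchset=$4 flavor=$5 compiler=${6:-unspecified}"
}

describe_kernel "6.12.0-rc6-1-cachyos-rc-gcc"
```

On a running system the same function could be fed the output of uname -r.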

I used the prepackaged DKMS module from the linux-cachyos-rc-gcc-nvidia-open package.

This bug is reproducible even with the clang-built versions.
This bug is NOT reproducible on the proprietary variant of the kernel module.
This bug is NOT reproducible on the latest lts kernel version (as of this writing, 6.6.59).

Reproduction Steps

  1. Use the nvidia open gpu kernel modules. This bug does not occur on the Proprietary version.
  2. Blacklist nvidia-drm to prevent it from being loaded up from boot.
  3. Load nvidia-drm with default parameters (i.e., modeset=0 fbdev=0).
  4. Wait for a minute.
  5. Unload nvidia-drm (e.g., modprobe -r nvidia-drm)
  6. Wait for a minute.
  7. Load nvidia-drm with modesetting on (i.e., modeset=1 fbdev=0).
  8. Wait for a minute.
  9. Unload nvidia-drm (e.g., modprobe -r nvidia-drm)
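
Steps 3 to 9 can be sketched as a small shell script. This is only an outline of the sequence above, not a tested tool: DRY_RUN and WAIT_SECONDS are hypothetical knobs added for the example, and by default the script only prints the commands it would run. Actually loading and unloading the module needs root and DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of reproduction steps 3-9.
# DRY_RUN and WAIT_SECONDS are hypothetical settings for this example.
DRY_RUN=${DRY_RUN:-1}            # 1 = only print the commands
WAIT_SECONDS=${WAIT_SECONDS:-60} # "wait for a minute"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load nvidia-drm with the given modeset value, wait, then unload it.
cycle() {
    run modprobe nvidia-drm modeset="$1" fbdev=0
    run sleep "$WAIT_SECONDS"
    run modprobe -r nvidia-drm
}

reproduce() {
    cycle 0                   # steps 3-5: modesetting off
    run sleep "$WAIT_SECONDS" # step 6
    cycle 1                   # steps 7-9: modesetting on
}

reproduce
```

Running it with DRY_RUN=0 WAIT_SECONDS=5 shortens the waits, which should be enough given that the waiting time did not seem to matter in later testing.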

Nvidia Module Version
565.57.01 (NVIDIA Open GPU Kernel Modules)

Does not occur with the Proprietary variant

Other Information

NVIDIA GPU: NVIDIA GeForce RTX 3050 Laptop GPU
Linux Distro: CachyOS
Linux Kernel version: 6.12.0-rc6-1-cachyos-rc-gcc
Architecture: x86_64
Hardware: GIGABYTE G5 GD (11th Gen Intel)
Desktop Environment: KDE Plasma Wayland 6.2.80

Kernel Log:

[  196.807698] nvidia_modeset: module uses symbols nvidia_get_rm_ops from proprietary module nvidia, inheriting taint.
[  196.815581] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  565.57.01  Release Build  (notroot@7db9e82b58f7)  Mon Nov  4 15:48:40 UTC 2024
[  196.828019] nvidia_drm: module uses symbols nvKmsKapiF32ToF16 from proprietary module nvidia_modeset, inheriting taint.
[  196.830152] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  198.737455] [drm] Initialized nvidia-drm 0.0.0 for 0000:01:00.0 on minor 1
[  198.737782] nvidia 0000:01:00.0: [drm] Cannot find any crtc or sizes

[  198.738113] Registered the nv-hotplug-helper DRM client.
[  241.344496] [drm] [nvidia-drm] [GPU ID 0x00000100] Unloading driver
[  241.747380] nvidia-modeset: Unloading
[  241.827699] nvidia_modeset: module uses symbols nvidia_get_rm_ops from proprietary module nvidia, inheriting taint.
[  241.847639] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  565.57.01  Release Build  (notroot@7db9e82b58f7)  Mon Nov  4 15:48:40 UTC 2024
[  241.858849] nvidia_drm: module uses symbols nvKmsKapiF32ToF16 from proprietary module nvidia_modeset, inheriting taint.
[  241.861882] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  241.861886] [drm] Initialized nvidia-drm 0.0.0 for 0000:01:00.0 on minor 1
[  241.861890] Failed to initialize the nv-hotplug-helper DRM client.
[  268.583029] BUG: kernel NULL pointer dereference, address: 00000000000000a8
[  268.583035] #PF: supervisor read access in kernel mode
[  268.583036] #PF: error_code(0x0000) - not-present page
[  268.583037] PGD 0 P4D 0 
[  268.583039] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[  268.583041] CPU: 8 UID: 0 PID: 70806 Comm: modprobe Tainted: P S   U     OE      6.12.0-rc6-1-cachyos-rc-gcc #1 4192acde4edb66f2f1f68c607ab66f58138591f5
[  268.583044] Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  268.583045] Hardware name: GIGABYTE G5 GD/G5 GD, BIOS FB10 03/22/2022
[  268.583046] RIP: 0010:drm_client_dev_unregister+0xd/0xf0
[  268.583051] Code: dd 4c 00 49 c7 c4 f4 ff ff ff eb 87 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 47 30 <8b> 80 a8 00 00 00 23 47 68 a8 02 75 05 c3 cc cc cc cc 41 57 41 56
[  268.583052] RSP: 0018:ffffa9b2a9d8bc70 EFLAGS: 00010246
[  268.583054] RAX: 0000000000000000 RBX: ffff998cee754000 RCX: 0000000000000002
[  268.583055] RDX: 0000000000000000 RSI: ffffa9b2a9d8bcd0 RDI: ffff998cee754000
[  268.583056] RBP: ffff998cee754000 R08: 000000000000006d R09: ffffa9b2a9d8bcc8
[  268.583057] R10: fefefefefefefeff R11: 0000000000000037 R12: 0000000000000800
[  268.583057] R13: 00000000000000b0 R14: 0000000000000000 R15: 0000000000000000
[  268.583058] FS:  00007dc0ba996740(0000) GS:ffff998eef600000(0000) knlGS:0000000000000000
[  268.583059] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.583060] CR2: 00000000000000a8 CR3: 0000000298c76002 CR4: 0000000000f72ef0
[  268.583062] PKRU: 55555554
[  268.583063] Call Trace:
[  268.583064]  <TASK>
[  268.583067]  ? __die_body.cold+0x8/0x12
[  268.583069]  ? page_fault_oops+0x15a/0x2e0
[  268.583072]  ? exc_page_fault+0x81/0x190
[  268.583075]  ? asm_exc_page_fault+0x26/0x30
[  268.583079]  ? drm_client_dev_unregister+0xd/0xf0
[  268.583081]  drm_dev_unregister+0x21/0x1c0
[  268.583084]  nv_drm_remove_devices+0x2d/0x60 [nvidia_drm 713ad65fe3ef08e6e23794e19a16790721d8c08f]
[  268.583097]  __do_sys_delete_module+0x1d1/0x310
[  268.583100]  do_syscall_64+0x82/0x190
[  268.583103]  ? __x64_sys_openat+0x1f5/0x230
[  268.583105]  ? syscall_exit_to_user_mode+0x10/0x210
[  268.583107]  ? do_syscall_64+0x8e/0x190
[  268.583109]  ? __x64_sys_openat+0x1f5/0x230
[  268.583110]  ? syscall_exit_to_user_mode+0x10/0x210
[  268.583112]  ? do_syscall_64+0x8e/0x190
[  268.583113]  ? syscall_exit_to_user_mode+0x10/0x210
[  268.583115]  ? do_syscall_64+0x8e/0x190
[  268.583117]  ? exc_page_fault+0x81/0x190
[  268.583118]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  268.583120] RIP: 0033:0x7dc0ba2fe26b
[  268.583167] Code: 73 01 c3 48 8b 0d bd 4a 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 4a 0c 00 f7 d8 64 89 01 48
[  268.583169] RSP: 002b:00007ffcdd9e1e98 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
[  268.583170] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007dc0ba2fe26b
[  268.583171] RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005ef675df0f38
[  268.583172] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  268.583173] R10: 00007dc0ba36f900 R11: 0000000000000246 R12: 0000000000000000
[  268.583174] R13: 00007ffcdd9e1ec0 R14: 00005ef675df0ed0 R15: 0000000000000000
[  268.583175]  </TASK>
[  268.583176] Modules linked in: nvidia_drm(POE-) nvidia_modeset(POE) uhid ccm blowfish_generic blowfish_x86_64 blowfish_common des_generic des3_ede_x86_64 libdes cast5_avx_x86_64 cast5_generic cast_common lrw camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic xts snd_seq_dummy snd_hrtimer snd_seq rfcomm snd_seq_device cmac algif_hash algif_skcipher af_alg bnep vfat fat ext4 mbcache jbd2 pkcs8_key_parser nvidia_uvm(POE) snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_acpi_intel_match soundwire_generic_allocation snd_soc_acpi soundwire_bus snd_soc_avs snd_soc_hda_codec snd_hda_ext_core snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine
[  268.583211]  snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common iwlmvm intel_uncore_frequency snd_hda_codec_realtek intel_uncore_frequency_common snd_hda_codec_generic intel_tcc_cooling snd_hda_scodec_component x86_pkg_temp_thermal intel_powerclamp joydev coretemp mac80211 mousedev kvm_intel libarc4 snd_hda_intel ptp pps_core snd_intel_dspcfg uvcvideo snd_intel_sdw_acpi btusb kvm videobuf2_vmalloc btrtl snd_hda_codec uvc btintel videobuf2_memops iwlwifi hid_multitouch videobuf2_v4l2 snd_hda_core btbcm hid_generic videobuf2_common rapl btmtk snd_hwdep mei_pxp mei_hdcp ee1004 snd_pcm videodev r8169 intel_cstate bluetooth snd_timer cfg80211 realtek i2c_i801 mc intel_lpss_pci intel_pmc_core spi_nor mdio_devres mei_me i2c_smbus snd intel_lpss i2c_hid_acpi intel_hid pmt_telemetry crc16 intel_uncore psmouse pcspkr mtd i2c_mux libphy soundcore intel_vsec mei rfkill idma64 i2c_hid sparse_keymap pmt_class pinctrl_tigerlake acpi_pad mac_hid nvidia(POE) i2c_dev crypto_user loop nfnetlink zram 842_decompress 842_compress
[  268.583258]  lz4hc_compress lz4_compress ip_tables x_tables btrfs blake2b_generic xor raid6_pq xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec xfs libcrc32c crc32c_generic dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel serio_raw sha512_ssse3 atkbd sha256_ssse3 libps2 sha1_ssse3 sdhci_pci aesni_intel vivaldi_fmap gf128mul nvme cqhci sdhci crypto_simd nvme_core i8042 spi_intel_pci cryptd mmc_core spi_intel nvme_auth serio i915 i2c_algo_bit drm_buddy video mxm_wmi wmi ttm intel_gtt drm_display_helper cec
[  268.583288] Unloaded tainted modules: nvidia_modeset(POE):1 nvidia_drm(POE):1 [last unloaded: nvidia_modeset(POE)]
[  268.583293] CR2: 00000000000000a8
[  268.583294] ---[ end trace 0000000000000000 ]---
[  268.583295] RIP: 0010:drm_client_dev_unregister+0xd/0xf0
[  268.583297] Code: dd 4c 00 49 c7 c4 f4 ff ff ff eb 87 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 47 30 <8b> 80 a8 00 00 00 23 47 68 a8 02 75 05 c3 cc cc cc cc 41 57 41 56
[  268.583298] RSP: 0018:ffffa9b2a9d8bc70 EFLAGS: 00010246
[  268.583300] RAX: 0000000000000000 RBX: ffff998cee754000 RCX: 0000000000000002
[  268.583300] RDX: 0000000000000000 RSI: ffffa9b2a9d8bcd0 RDI: ffff998cee754000
[  268.583301] RBP: ffff998cee754000 R08: 000000000000006d R09: ffffa9b2a9d8bcc8
[  268.583302] R10: fefefefefefefeff R11: 0000000000000037 R12: 0000000000000800
[  268.583303] R13: 00000000000000b0 R14: 0000000000000000 R15: 0000000000000000
[  268.583304] FS:  00007dc0ba996740(0000) GS:ffff998eef600000(0000) knlGS:0000000000000000
[  268.583305] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.583306] CR2: 00000000000000a8 CR3: 0000000298c76002 CR4: 0000000000f72ef0
[  268.583307] PKRU: 55555554
[  268.583307] note: modprobe[70806] exited with irqs disabled

Contents of /etc/modprobe.d/nvidia.conf:

options nvidia NVreg_EnableGpuFirmware=1
options nvidia NVreg_EnablePCIeGen3=1
options nvidia NVreg_UsePageAttributeTable=1
options nvidia NVreg_InitializeSystemMemoryAllocations=0
options nvidia NVreg_DynamicPowerManagementVideoMemoryThreshold=2097152
options nvidia NVreg_DynamicPowerManagement=2
options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia NVreg_EnableResizableBar=1
blacklist nvidia_drm
blacklist nvidia_modeset

Hi @AmbiguousDolphin ,

Thank you for reporting the issue. I have filed a bug - NVBug 4948653 to track this internally. We are trying to reproduce this on our systems. I’ll share Engineering feedback when available.


I need to correct my earlier statement. Further testing on Linux kernel 6.12-rc6 shows that the proprietary variant is affected as well. I don’t know why it did not occur in my first 2 tests. It happened on the third try. Tester’s error, I guess?

This is the error produced when replicating this on 6.12-rc6 and the proprietary module.
It appears to be the same error as before, affecting nv_drm_remove_devices.

[34892.677935] Oops: general protection fault, probably for non-canonical address 0xf7894cf720a23438: 0000 [#1] PREEMPT SMP NOPTI
[34892.677942] CPU: 1 UID: 0 PID: 13905 Comm: modprobe Tainted: P S   U     OE      6.12.0-rc6-1-cachyos-rc #1 8d96b88e6f86eab41c7a7e0840888c1adbc3ce89
[34892.677949] Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[34892.677951] Hardware name: GIGABYTE G5 GD/G5 GD, BIOS FB10 03/22/2022
[34892.677952] RIP: 0010:drm_panic_unregister+0x44/0x70
[34892.677958] Code: b7 c8 02 00 00 48 81 c3 c8 02 00 00 eb 0b 0f 1f 84 00 00 00 00 00 4d 8b 36 49 39 de 74 21 49 8b 86 c8 04 00 00 48 85 c0 74 ec <48> 83 78 50 00 74 e5 49 8d be 20 05 00 00 e8 b9 98 60 ff eb d7 5b
[34892.677961] RSP: 0018:ffffb979cdac3cc0 EFLAGS: 00010286
[34892.677964] RAX: f7894cf720a233e8 RBX: ffff9d8d0404ce20 RCX: 0000000000000000
[34892.677966] RDX: ffffb979cdac3d1c RSI: 0000000000000800 RDI: ffff9d8d0404cb58
[34892.677968] RBP: ffffb979cdac3dc0 R08: 000000000000006d R09: fefefefefefefeff
[34892.677970] R10: 65646f6dc0006d72 R11: ffffffffc744e1a0 R12: ffffffffc754c0c0
[34892.677971] R13: 0000000000000000 R14: ffffffffc03fe1f0 R15: 0000000000000000
[34892.677973] FS:  0000749ea129f740(0000) GS:ffff9d8f2f280000(0000) knlGS:0000000000000000
[34892.677976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34892.677978] CR2: 00007ffe18c9bd68 CR3: 00000001a9f1e002 CR4: 0000000000f72ef0
[34892.677980] PKRU: 55555554
[34892.677981] Call Trace:
[34892.677984]  <TASK>
[34892.677987]  ? __die_body+0x6a/0xb0
[34892.677991]  ? die_addr+0xa4/0xd0
[34892.677994]  ? exc_general_protection+0x165/0x210
[34892.678000]  ? asm_exc_general_protection+0x26/0x30
[34892.678006]  ? __pfx_active_work+0x10/0x10 [i915 6019df75f496f93ae11183b1788a86884b3ec2aa]
[34892.678124]  ? _nv000028kms+0x130/0x130 [nvidia_modeset fab7da08bccc9d7814af6bf62f6711a0dbacee8a]
[34892.678150]  ? drm_panic_unregister+0x44/0x70
[34892.678153]  drm_dev_unregister+0x1a/0x220
[34892.678159]  nv_drm_remove_devices+0x3f/0x70 [nvidia_drm 156869ab895e441e9640aa5f75a1aeb8926f76b7]
[34892.678165]  __se_sys_delete_module+0x257/0x3a0
[34892.678171]  ? syscall_exit_to_user_mode+0x97/0xc0
[34892.678175]  do_syscall_64+0x88/0x170
[34892.678177]  ? kmem_cache_free+0x19b/0x350
[34892.678181]  ? syscall_exit_to_user_mode+0x97/0xc0
[34892.678183]  ? do_syscall_64+0x94/0x170
[34892.678186]  ? __x64_sys_read+0x79/0xe0
[34892.678191]  ? syscall_exit_to_user_mode+0x97/0xc0
[34892.678193]  ? do_syscall_64+0x94/0x170
[34892.678195]  ? exc_page_fault+0x6b/0x110
[34892.678198]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[34892.678203] RIP: 0033:0x749ea0b3126b
[34892.678241] Code: 73 01 c3 48 8b 0d bd 4a 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 4a 0c 00 f7 d8 64 89 01 48
[34892.678244] RSP: 002b:00007ffe18c9dd68 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
[34892.678247] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000749ea0b3126b
[34892.678249] RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005c5bca253f28
[34892.678251] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[34892.678252] R10: 0000749ea0ba2900 R11: 0000000000000246 R12: 0000000000000000
[34892.678254] R13: 00007ffe18c9dd90 R14: 00005c5bca253ec0 R15: 0000000000000000
[34892.678257]  </TASK>
[34892.678258] Modules linked in: nvidia_drm(POE-) nvidia_modeset(POE) ccm blowfish_generic blowfish_x86_64 blowfish_common des_generic des3_ede_x86_64 libdes cast5_avx_x86_64 cast5_generic cast_common lrw camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic xts snd_seq_dummy snd_hrtimer snd_seq rfcomm snd_seq_device uhid cmac algif_hash algif_skcipher af_alg bnep vfat fat ext4 vboxnetflt(OE) vboxnetadp(OE) mbcache jbd2 vboxdrv(OE) pkcs8_key_parser nvidia_uvm(POE) snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda snd_sof_intel_hda_mlink snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_acpi_intel_match soundwire_generic_allocation snd_soc_acpi soundwire_bus snd_soc_avs snd_soc_hda_codec snd_hda_ext_core snd_soc_core
[34892.678308]  ac97_bus snd_pcm_dmaengine snd_compress snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common iwlmvm intel_uncore_frequency snd_hda_codec_realtek intel_uncore_frequency_common intel_tcc_cooling snd_hda_scodec_component x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp mac80211 joydev mousedev uvcvideo kvm_intel snd_hda_intel uvc libarc4 snd_intel_dspcfg videobuf2_vmalloc ptp snd_intel_sdw_acpi btusb videobuf2_memops pps_core snd_hda_codec kvm videobuf2_v4l2 btbcm hid_multitouch snd_hda_core btintel videobuf2_common hid_generic snd_hwdep r8169 mei_pxp btrtl ee1004 mei_hdcp iwlwifi videodev snd_pcm btmtk realtek rapl clevo_acpi(OE) tuxedo_io(OE) clevo_wmi(OE) intel_cstate intel_uncore bluetooth cfg80211 mc intel_lpss_pci mei_me snd_timer mdio_devres intel_pmc_core i2c_i801 tuxedo_keyboard(OE) spi_nor intel_lpss pmt_telemetry snd intel_hid i2c_smbus i2c_hid_acpi tuxedo_compatibility_check(OE) crc16 pcspkr psmouse mtd i2c_mux libphy soundcore mei idma64 rfkill intel_vsec i2c_hid
[34892.678377]  pinctrl_tigerlake pmt_class sparse_keymap led_class_multicolor acpi_pad mac_hid nvidia(POE) i2c_dev crypto_user loop nfnetlink zram 842_decompress 842_compress lz4hc_compress lz4_compress ip_tables x_tables btrfs raid6_pq xor xe gpu_sched drm_ttm_helper drm_gpuvm drm_exec drm_suballoc_helper xfs libcrc32c crc32c_generic dm_crypt cbc encrypted_keys trusted tee asn1_encoder dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 serio_raw atkbd sha256_ssse3 libps2 sdhci_pci sha1_ssse3 vivaldi_fmap aesni_intel cqhci nvme gf128mul sdhci crypto_simd nvme_core i8042 spi_intel_pci cryptd mmc_core spi_intel nvme_auth serio i915 drm_buddy intel_gtt ttm mxm_wmi i2c_algo_bit video wmi drm_display_helper cec
[34892.678429] Unloaded tainted modules: nvidia_modeset(POE):1 nvidia_drm(POE):1 tuxedo_nb02_nvidia_power_ctrl(OE):1 [last unloaded: nvidia_modeset(POE)]
[34892.678439] ---[ end trace 0000000000000000 ]---
[34892.678441] RIP: 0010:drm_panic_unregister+0x44/0x70
[34892.678444] Code: b7 c8 02 00 00 48 81 c3 c8 02 00 00 eb 0b 0f 1f 84 00 00 00 00 00 4d 8b 36 49 39 de 74 21 49 8b 86 c8 04 00 00 48 85 c0 74 ec <48> 83 78 50 00 74 e5 49 8d be 20 05 00 00 e8 b9 98 60 ff eb d7 5b
[34892.678446] RSP: 0018:ffffb979cdac3cc0 EFLAGS: 00010286
[34892.678449] RAX: f7894cf720a233e8 RBX: ffff9d8d0404ce20 RCX: 0000000000000000
[34892.678450] RDX: ffffb979cdac3d1c RSI: 0000000000000800 RDI: ffff9d8d0404cb58
[34892.678452] RBP: ffffb979cdac3dc0 R08: 000000000000006d R09: fefefefefefefeff
[34892.678453] R10: 65646f6dc0006d72 R11: ffffffffc744e1a0 R12: ffffffffc754c0c0
[34892.678455] R13: 0000000000000000 R14: ffffffffc03fe1f0 R15: 0000000000000000
[34892.678456] FS:  0000749ea129f740(0000) GS:ffff9d8f2f280000(0000) knlGS:0000000000000000
[34892.678458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34892.678460] CR2: 00007ffe18c9bd68 CR3: 00000001a9f1e002 CR4: 0000000000f72ef0
[34892.678462] PKRU: 55555554

I’ll hazard a guess that there is a use-after-free happening, or possibly a double free. I think so because both variants fault while releasing memory, which should always just work (unless the memory has already been freed).

I also find it suspicious that the fault here is in drm_panic_unregister, because that function does not appear in the call stack of the open source variant. There the fault is in drm_client_dev_unregister instead. In both traces the caller is drm_dev_unregister.
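
To compare the two traces without reading the whole dump, the faulting kernel symbols can be pulled out of the log with a short filter. This is just a convenience sketch: faulting_symbols is a name made up for this example, and it assumes the x86_64 log format shown above, where kernel-mode RIP lines carry the 0010 code segment.

```shell
#!/bin/sh
# Print the unique kernel-mode faulting symbols found in a kernel log on stdin.
# faulting_symbols is a hypothetical helper name for this example.
faulting_symbols() {
    # "RIP: 0010:" marks kernel-mode RIP lines in the logs above;
    # user-mode lines ("RIP: 0033:") are skipped.
    sed -n 's/.*RIP: 0010:\([A-Za-z0-9_.]*\)+.*/\1/p' | sort -u
}

# Demo on two lines copied from the logs in this thread.
faulting_symbols <<'EOF'
[  268.583046] RIP: 0010:drm_client_dev_unregister+0xd/0xf0
[34892.677952] RIP: 0010:drm_panic_unregister+0x44/0x70
EOF
```

On a live system the input could come from dmesg instead (dmesg | faulting_symbols), which may require root depending on the kernel configuration.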

EDIT: Nov 10, 2024
Further testing seems to reveal the following:

  1. This error occurs almost always when unloading nvidia_drm while it is loaded without modesetting (i.e., modeset=0).
  2. I initially thought it might be because the GPU is in its lowest power state (hence the "wait for a minute" steps), but that does not seem to be the case. After waking up the GPU (for example by calling nvidia-smi) and then immediately unloading the module, the error still occurred. Power states do not appear to affect the error.
  3. In hindsight, I would never need to unload the module (thus avoiding this whole error) if the GPU just turned off and used no additional power when idle. My guess is that something in modeset=1 unnecessarily increments the module's usage counter in PRIME setups, similar to what nvidia-persistenced does, and prevents the device from truly being put to sleep.
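
One way to check the guess in item 3 would be to look at the kernel's own bookkeeping before and after loading with modeset=1. The sketch below reads the PCI runtime power state and the module reference count from sysfs. It assumes the GPU sits at 0000:01:00.0 (the address in the log above) and that the kernel exposes power/runtime_status and /sys/module/nvidia_drm/refcnt, which depends on the kernel configuration. The function name gpu_pm_report and its optional sysfs-root argument are made up for this example.

```shell
#!/bin/sh
# Report the GPU's runtime PM status and nvidia_drm's reference count.
# gpu_pm_report and its optional sysfs-root argument are hypothetical.
gpu_pm_report() {
    sysfs=${1:-/sys}
    # PCI address taken from the kernel log in this thread (assumption).
    dev="$sysfs/bus/pci/devices/0000:01:00.0"
    for f in "$dev/power/runtime_status" "$sysfs/module/nvidia_drm/refcnt"; do
        if [ -r "$f" ]; then
            echo "$f: $(cat "$f")"
        else
            # Module not loaded, or the attribute is not exposed.
            echo "$f: unavailable"
        fi
    done
}

gpu_pm_report
```

Comparing the output with modeset=0 and modeset=1 while nothing is using the GPU would show whether the device still reaches the suspended state and whether the reference count differs.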
