Loading nvidia-drm causes `intel_gpu_top` to fail

This is a side issue I discovered while troubleshooting my main problem: I cannot suspend my laptop under X11 (it works fine on Wayland) while the nvidia driver is loaded. From what I can deduce, the driver causes kernel NULL pointer dereferences on seemingly random files in sysfs and makes nvidia modeset fail.

Anyway.
I am running this laptop (Clevo-based, G5 GD) on Arch Linux with Linux 6.8.2 and nvidia 550.67, with default settings except for nvidia_drm.modeset=1:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.67  Tue Mar 12 23:54:15 UTC 2024
GCC version:  gcc version 13.2.1 20230801 (GCC)

I discovered that simply loading nvidia-drm breaks some applications, such as intel_gpu_top.

$ sudo intel_gpu_top
Killed

Going through strace shows that intel_gpu_top tried to read this file.

newfstatat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/link/clkpm", {st_mode=S_IFREG|0644, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/link/clkpm", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
read(3,  <unfinished ...>)              = ?
+++ killed by SIGKILL +++
Killed

Running lspci shows that this file belongs to the NVIDIA GPU in the laptop, based on the PCI address:

01:00.0 VGA compatible controller: NVIDIA Corporation GA107M [GeForce RTX 3050 Mobile] (rev a1)
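
The same mapping can be checked from sysfs itself, since each PCI device directory exposes a `vendor` attribute (0x10de is NVIDIA's vendor ID). A minimal sketch, assuming a POSIX shell; the function name `list_pci_by_vendor` is hypothetical and not part of any tool mentioned in this thread:

```shell
# list_pci_by_vendor VENDOR [ROOT]
# Print the PCI address of every device under ROOT whose sysfs "vendor"
# attribute equals VENDOR (for example 0x10de for NVIDIA).
# ROOT defaults to /sys/bus/pci/devices; the function runs in a subshell
# so its variables do not leak into the caller.
list_pci_by_vendor() (
    want=$1
    root=${2:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        # Skip entries without a readable vendor attribute.
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "$want" ]; then
            basename "$dev"
        fi
    done
)
```

Running `list_pci_by_vendor 0x10de` lists the addresses of the NVIDIA functions, which can then be compared against the path seen in the strace output.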

One could argue that the Intel tool has no business touching the NVIDIA GPU's files in sysfs, but the problem also affects any command that simply reads the file.

$ cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/link/clkpm
Killed
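
To tell a read that was killed apart from one that merely failed, the exit status of the reader can be inspected: shells report death by signal as 128 plus the signal number, so SIGKILL shows up as 137. A minimal sketch, assuming a POSIX shell; `probe_read` is a hypothetical helper name. On an affected system every probe of the bad file would still trigger the kernel fault, so this is only meant for confirming the behaviour, not for routine use:

```shell
# probe_read PATH
# Read PATH with cat in a child process and report how the read ended.
# Because cat runs as a child, a SIGKILL hits cat and not the calling shell.
probe_read() (
    status=0
    cat "$1" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        # Death by signal is reported as 128 + signal number.
        echo "killed by signal $((status - 128))"
    else
        echo "error (exit $status)"
    fi
)
```

For example, `probe_read /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/link/clkpm` would be expected to report signal 9 while the problem is present.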

Each time a process is killed by reading this file, a BUG message appears in the kernel log:

[ 3353.906820] BUG: kernel NULL pointer dereference, address: 0000000000000035
[ 3353.906832] #PF: supervisor read access in kernel mode
[ 3353.906836] #PF: error_code(0x0000) - not-present page
[ 3353.906841] PGD 0 P4D 0 
[ 3353.906846] Oops: 0000 [#6] PREEMPT SMP NOPTI
[ 3353.906853] CPU: 1 PID: 19190 Comm: cat Tainted: P S   UD    OE      6.8.2-zen1-1-zen #1 f39f6ad161ee51987d86f9a59935c226bcd77920
[ 3353.906861] Hardware name: GIGABYTE G5 GD/G5 GD, BIOS FB10 03/22/2022
[ 3353.906865] RIP: 0010:clkpm_show+0x47/0x70
[ 3353.906875] Code: 43 48 2d c0 00 00 00 48 8b 50 10 48 8b 42 10 48 85 c0 74 16 48 8b 42 38 48 85 c0 74 0d 80 78 6c 00 74 2a 48 8b 80 b0 00 00 00 <0f> b6 50 35 48 c7 c6 9d 6e 20 b2 83 e2 01 e8 e6 56 d2 ff 48 98 c3
[ 3353.906880] RSP: 0018:ffff9beec66bfd68 EFLAGS: 00010202
[ 3353.906886] RAX: 0000000000000000 RBX: ffffffffb2b46a20 RCX: ffffffffb2b46a20
[ 3353.906890] RDX: ffff89d482349000 RSI: ffffffffb2b46a20 RDI: ffff89d48158e000
[ 3353.906894] RBP: ffffffffb1d5d850 R08: ffff89d48235d0c0 R09: ffff89d72845c6c0
[ 3353.906897] R10: ffff9beec66bfda0 R11: ffff89d48158e000 R12: ffff9beec66bfe20
[ 3353.906901] R13: ffff9beec66bfdf8 R14: 0000000000000001 R15: ffff9beec66bfe90
[ 3353.906905] FS:  00007faac0e90740(0000) GS:ffff89d7ef440000(0000) knlGS:0000000000000000
[ 3353.906909] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3353.906913] CR2: 0000000000000035 CR3: 000000010534c002 CR4: 0000000000f70ef0
[ 3353.906917] PKRU: 55555554
[ 3353.906920] Call Trace:
[ 3353.906925]  <TASK>
[ 3353.906930]  ? __die+0x10f/0x120
[ 3353.906937]  ? page_fault_oops+0x171/0x4e0
[ 3353.906944]  ? mas_store_prealloc+0x80/0x130
[ 3353.906954]  ? exc_page_fault+0x7f/0x180
[ 3353.906962]  ? asm_exc_page_fault+0x26/0x30
[ 3353.906972]  ? clkpm_show+0x47/0x70
[ 3353.906978]  dev_attr_show+0x19/0x60
[ 3353.906987]  sysfs_kf_seq_show+0xa8/0x100
[ 3353.906995]  seq_read_iter+0x11f/0x480
[ 3353.907001]  vfs_read+0x272/0x470
[ 3353.907011]  __x64_sys_read+0x74/0xf0
[ 3353.907019]  do_syscall_64+0x86/0x170
[ 3353.907027]  ? do_user_addr_fault+0x5cd/0xb50
[ 3353.907034]  ? exc_page_fault+0x7f/0x180
[ 3353.907042]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3353.907049] RIP: 0033:0x7faac0d1fa81
[ 3353.907114] Code: f7 d8 64 89 02 b8 ff ff ff ff eb b2 67 e8 07 a4 01 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 80 3d 05 e6 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 53 48 83 ec 20 48 89
[ 3353.907119] RSP: 002b:00007ffd330c2448 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 3353.907125] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007faac0d1fa81
[ 3353.907129] RDX: 0000000000020000 RSI: 00007faac0eb2000 RDI: 0000000000000003
[ 3353.907132] RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000990
[ 3353.907136] R10: 0000000000000022 R11: 0000000000000246 R12: 00007faac0eb2000
[ 3353.907139] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[ 3353.907144]  </TASK>
[ 3353.907146] Modules linked in: uinput nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) uhid ccm snd_seq_dummy rfcomm snd_hrtimer snd_seq snd_seq_device cmac algif_hash algif_skcipher af_alg bnep btusb btrtl btintel btbcm btmtk bluetooth ecdh_generic tcp_bbr sch_cake uvcvideo videobuf2_vmalloc uvc videobuf2_memops videobuf2_v4l2 videodev videobuf2_common vboxnetflt(OE) mc vboxnetadp(OE) vboxdrv(OE) pkcs8_key_parser snd_sof_pci_intel_tgl snd_hda_codec_hdmi snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_msr intel_rapl_common snd_sof intel_uncore_frequency intel_uncore_frequency_common snd_sof_utils intel_tcc_cooling snd_soc_hdac_hda x86_pkg_temp_thermal snd_hda_ext_core intel_powerclamp snd_soc_acpi_intel_match coretemp snd_soc_acpi soundwire_generic_allocation kvm_intel iwlmvm snd_hda_codec_realtek soundwire_bus joydev kvm snd_hda_codec_generic mousedev snd_soc_core mac80211 irqbypass crct10dif_pclmul snd_compress crc32_pclmul
[ 3353.907256]  ac97_bus polyval_clmulni snd_pcm_dmaengine polyval_generic gf128mul snd_hda_intel libarc4 snd_intel_dspcfg ptp ghash_clmulni_intel pps_core snd_intel_sdw_acpi sha512_ssse3 sha256_ssse3 snd_hda_codec sha1_ssse3 aesni_intel snd_hda_core crypto_simd hid_multitouch snd_hwdep r8169 cryptd iwlwifi hid_generic iTCO_wdt snd_pcm intel_pmc_bxt realtek rapl spi_nor mei_pxp mei_hdcp ee1004 iTCO_vendor_support snd_timer intel_cstate mdio_devres intel_uncore cfg80211 clevo_wmi(OE) snd psmouse mxm_wmi mtd intel_lpss_pci pcspkr i2c_i801 mei_me libphy intel_lpss i2c_hid_acpi intel_pmc_core soundcore i2c_smbus mei clevo_acpi(OE) i2c_hid rfkill idma64 tuxedo_io(OE) intel_vsec tuxedo_keyboard(OE) tuxedo_compatibility_check(OE) pmt_telemetry led_class_multicolor intel_hid pmt_class pinctrl_tigerlake sparse_keymap acpi_pad mac_hid i2c_dev crypto_user acpi_call(OE) fuse dm_mod loop nfnetlink zram ip_tables x_tables serio_raw sdhci_pci atkbd libps2 cqhci vivaldi_fmap sdhci spi_intel_pci mmc_core xhci_pci spi_intel i8042
[ 3353.907385]  xhci_pci_renesas serio nvme nvme_core nvme_auth vfat fat ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 i915 intel_gtt xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec i2c_algo_bit drm_buddy video wmi ttm drm_display_helper cec
[ 3353.907425] Unloaded tainted modules: nvidia(POE):4 nvidia_modeset(POE):2 nvidia_drm(POE):2 nvidia_uvm(POE):1 tuxedo_nb02_nvidia_power_ctrl(OE):1 [last unloaded: nvidia(POE)]
[ 3353.907445] CR2: 0000000000000035
[ 3353.907450] ---[ end trace 0000000000000000 ]---
[ 3353.907453] RIP: 0010:clkpm_show+0x47/0x70
[ 3353.907460] Code: 43 48 2d c0 00 00 00 48 8b 50 10 48 8b 42 10 48 85 c0 74 16 48 8b 42 38 48 85 c0 74 0d 80 78 6c 00 74 2a 48 8b 80 b0 00 00 00 <0f> b6 50 35 48 c7 c6 9d 6e 20 b2 83 e2 01 e8 e6 56 d2 ff 48 98 c3
[ 3353.907464] RSP: 0018:ffff9beec264bd60 EFLAGS: 00010202
[ 3353.907469] RAX: 0000000000000000 RBX: ffffffffb2b46a20 RCX: ffffffffb2b46a20
[ 3353.907472] RDX: ffff89d482349000 RSI: ffffffffb2b46a20 RDI: ffff89d484078000
[ 3353.907476] RBP: ffffffffb1d5d850 R08: ffff89d48235d0c0 R09: ffff89d52baa2000
[ 3353.907479] R10: ffff9beec264bd98 R11: ffff89d484078000 R12: ffff9beec264be18
[ 3353.907483] R13: ffff9beec264bdf0 R14: 0000000000000001 R15: ffff9beec264be88
[ 3353.907486] FS:  00007faac0e90740(0000) GS:ffff89d7ef440000(0000) knlGS:0000000000000000
[ 3353.907490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3353.907494] CR2: 0000000000000035 CR3: 000000010534c002 CR4: 0000000000f70ef0
[ 3353.907497] PKRU: 55555554
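
The two most useful fields in an oops like this are the kernel-mode RIP line (the faulting function, here clkpm_show) and CR2 (the faulting address, here 0x35). In this trace RAX is 0, and the marked instruction bytes `<0f> b6 50 35` appear to decode to a byte load from RAX+0x35, which would match CR2 and suggests that clkpm_show followed a NULL structure pointer. A minimal sketch for pulling those two fields out of a kernel log, assuming a POSIX shell with awk; `oops_summary` is a hypothetical helper name:

```shell
# oops_summary
# Read kernel log text on stdin and print the first kernel-mode RIP
# (code segment 0010) together with the first CR2 value that follows.
oops_summary() {
    awk '
        /RIP: 0010:/ && !rip { sub(/.*RIP: 0010:/, ""); rip = $0 }
        /CR2: /      && !cr2 { sub(/.*CR2: /, "");      cr2 = $1 }
        END { if (rip) printf "rip=%s cr2=%s\n", rip, cr2 }
    '
}
```

It can be fed from the live log, for example `sudo dmesg | oops_summary`, or from a saved copy of the log.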

I tested this several times with the nvidia drivers loaded and unloaded, and the results are consistent: while the nvidia drivers (nvidia_drm specifically) are loaded, the read breaks; once they are unloaded, everything works as intended.

I’ll probably try this with a different kernel and with the “open” flavor, but I don’t have much hope that it will fix this (or the suspend problem I mentioned at the beginning, which made me look into this).

Any suggestions as to how to resolve this?

EDIT: Attached nvidia-bug-report.log.gz (6.4 MB)

EDIT: Removed the host name and user name.

EDIT2: I tried without the modeset parameter for nvidia_drm, so that cat /sys/module/nvidia_drm/parameters/modeset returns N, but the issue persists.

Hi there @AmbiguousDolphin, welcome back to the NVIDIA developer forums.

I think it might be best if we move this topic over into the dedicated Linux category since this feels beyond a simple driver issue.

If you are ok with that I can move this post.

Be sure to also read this pinned message and follow the instructions.

Thanks!

Thanks for the response. I apologize for assigning the wrong category.

I moved it to the Linux Category and even uploaded the output of nvidia-bug-report.sh.

No worries, it was not the wrong category, it is just that I think that for your particular issue there are more knowledgeable people in this area of the forums.

The only other reference to this I found:
https://forum.manjaro.org/t/intel-gpu-top-kernel-null-pointer-dereference/145277
That was on a Manjaro 6.4 kernel; the user said he also tried a 5.15 kernel.
I have the same notebook as you, just with a 3050 Ti, running a 5.15 Gentoo kernel, and I don’t get that oops; it works fine.
So I’m leaning toward this being an Arch/Manjaro kernel issue.

Thanks for the additional info. I’ll try this on an Ubuntu install with a similarly recent Linux 6.8 kernel.

Just a question, though: have you tried a 6.8 kernel on your Gentoo install?

Now 6.8.4, 550.67:

cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/link/clkpm
1

lspci -t -v -d 10de:*
-[0000:00]---01.0-[01]--+-00.0  NVIDIA Corporation GA107M [GeForce RTX 3050 Ti Mobile]
                        \-00.1  NVIDIA Corporation Device 2291

So it is not a general issue tied to the kernel version or the hardware.

Yep. You are right about that.

I am not entirely sure why, but after trying multiple distributions (all Arch-based: CachyOS (6.8), EndeavourOS (6.4)), all of them worked fine, except for my current installation of Arch (6.8).

I ended up reinstalling the whole thing and now it works fine.