Hello,
I had a setup that worked fine:
Fedora 40, kernel 6.8.2, Nvidia 535 (latest version of 535)
After upgrading the Nvidia driver to 550 (latest version), I get a crash on kernels 6.8.2, 6.10, and 6.11 (latest kernel).
To reproduce, I boot the desktop without logging in to X and use nvidia-smi to poll the stats (to get the temperature) periodically. The desktop crashes within a few minutes.
Below is the trace:
[ 1330.403801] list_add corruption. next->prev should be prev (ffff9ea911a6d788), but was 0000000000000000. (next=ffff9ea956d91a58).
[ 1330.405296] ------------[ cut here ]------------
[ 1330.406269] kernel BUG at lib/list_debug.c:29!
[ 1330.407206] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 1330.408081] CPU: 6 UID: 1000 PID: 130616 Comm: nvidia-smi Kdump: loaded Tainted: P OE 6.11.11-300.fc41.x86_64 #1
[ 1330.409006] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1330.409947] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Pro4, BIOS P1.40 08/12/2019
[ 1330.410796] RIP: 0010:__list_add_valid_or_report.cold+0x4b/0x5b
[ 1330.411660] Code: fb ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 68 47 c3 91 e8 32 a6 fb ff 0f 0b 48 89 c1 48 c7 c7 10 47 c3 91 e8 21 a6 fb ff <0f> 0b 48 c7 c7 e8 46 c3 91 e8 13 a6 fb ff 0f 0b 48 89 fe 48 c7 c7
[ 1330.412616] RSP: 0018:ffffae59d0e0fab0 EFLAGS: 00010246
[ 1330.413480] RAX: 0000000000000075 RBX: ffff9ea956d91800 RCX: 0000000000000000
[ 1330.414329] RDX: 0000000000000000 RSI: ffff9ebf9ef21900 RDI: ffff9ebf9ef21900
[ 1330.415183] RBP: ffff9ea956d91a58 R08: 0000000000000000 R09: 0000000000000008
[ 1330.416017] R10: ffff9ea10c9fa000 R11: ffffae59c2049220 R12: ffff9ea956d91a58
[ 1330.416886] R13: ffff9ea911a6d788 R14: ffff9ea911a6d000 R15: ffff9ea911a6d770
[ 1330.417681] FS: 00007fc371d4e740(0000) GS:ffff9ebf9ef00000(0000) knlGS:0000000000000000
[ 1330.418468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1330.419293] CR2: 00007fff1822ffe0 CR3: 0000000f42e1e000 CR4: 0000000000350ef0
[ 1330.420170] Call Trace:
[ 1330.420982] <TASK>
[ 1330.421756] ? __die_body.cold+0x19/0x27
[ 1330.422516] ? die+0x2e/0x50
[ 1330.423282] ? do_trap+0xca/0x110
[ 1330.424064] ? do_error_trap+0x6a/0x90
[ 1330.424804] ? __list_add_valid_or_report.cold+0x4b/0x5b
[ 1330.425551] ? exc_invalid_op+0x50/0x70
[ 1330.426291] ? __list_add_valid_or_report.cold+0x4b/0x5b
[ 1330.427038] ? asm_exc_invalid_op+0x1a/0x20
[ 1330.427769] ? __list_add_valid_or_report.cold+0x4b/0x5b
[ 1330.428511] nvidia_open+0x2b4/0x500 [nvidia]
[ 1330.429494] chrdev_open+0xcb/0x240
[ 1330.430190] ? __pfx_chrdev_open+0x10/0x10
[ 1330.430927] do_dentry_open+0x25a/0x4f0
[ 1330.431623] vfs_open+0x34/0xf0
[ 1330.432312] path_openat+0xb54/0x11f0
[ 1330.433014] ? srso_return_thunk+0x5/0x5f
[ 1330.433683] ? __audit_filter_op+0xd8/0x160
[ 1330.434387] do_filp_open+0xc4/0x170
[ 1330.435048] do_sys_openat2+0xae/0xe0
[ 1330.435691] ? __audit_syscall_entry+0xee/0x140
[ 1330.436321] __x64_sys_openat+0x55/0xa0
[ 1330.437031] do_syscall_64+0x82/0x160
[ 1330.437691] ? srso_return_thunk+0x5/0x5f
[ 1330.438348] ? do_syscall_64+0x8e/0x160
[ 1330.438966] ? srso_return_thunk+0x5/0x5f
[ 1330.439633] ? __count_memcg_events+0x75/0x130
[ 1330.440288] ? srso_return_thunk+0x5/0x5f
[ 1330.440930] ? count_memcg_events.constprop.0+0x1a/0x30
[ 1330.441509] ? srso_return_thunk+0x5/0x5f
[ 1330.442153] ? handle_mm_fault+0x21b/0x330
[ 1330.442935] ? srso_return_thunk+0x5/0x5f
[ 1330.443698] ? do_user_addr_fault+0x55a/0x7b0
[ 1330.444403] ? srso_return_thunk+0x5/0x5f
[ 1330.445092] ? exc_page_fault+0x7e/0x180
[ 1330.445798] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1330.446490] RIP: 0033:0x7fc371e3e6e3
[ 1330.447157] Code: 83 e2 40 75 52 89 f0 f7 d0 a9 00 00 41 00 74 47 80 3d 90 99 10 00 00 74 62 89 da 4c 89 e6 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 81 00 00 00 48 8b 55 b8 64 48 2b 14 25 28
[ 1330.447687] RSP: 002b:00007fff18230640 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[ 1330.448201] RAX: ffffffffffffffda RBX: 0000000000080802 RCX: 00007fc371e3e6e3
[ 1330.448693] RDX: 0000000000080802 RSI: 00007fff182306e0 RDI: 00000000ffffff9c
[ 1330.449184] RBP: 00007fff182306b0 R08: 0000000000000064 R09: 00000000ffffffff
[ 1330.449674] R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff182306e0
[ 1330.450155] R13: 0000000000000000 R14: 0000000000000802 R15: 00007fc371ba8ea0
[ 1330.450633] </TASK>
[ 1330.451096] Modules linked in: nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_ttm_helper ttm video iommufd rfcomm vmnet(OE) parport_pc vmmon(OE) parport nft_nat nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 vxlan ip6_udp_tunnel udp_tunnel veth ip_set xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_filter ip6table_raw ip6table_mangle ip6_tables iptable_raw iptable_mangle xt_CT xt_mark rpcsec_gss_krb5 nfsv4 dns_resolver nfs netfs xt_conntrack xt_comment nft_compat iptable_filter iptable_nat ip_tables br_netfilter overlay rpcrdma rdma_cm iw_cm ib_cm ib_core nft_reject_ipv4 snd_seq_dummy snd_hrtimer mpt3sas raid_class scsi_transport_sas mptctl mptbase openvswitch nsh nf_conncount psample 8021q garp mrp bridge stp llc nft_masq nft_chain_nat nf_nat nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct vmw_vsock_vmci_transport nf_log_syslog vmw_vmci nft_log bnep nf_tables nct6775 nct6775_core hwmon_vid binfmt_misc btusb btrtl btintel btbcm btmtk vfat
[ 1330.451191] bluetooth bcache snd_hda_codec_realtek fat snd_usb_audio rfkill snd_hda_scodec_component snd_usbmidi_lib snd_hda_codec_hdmi intel_rapl_msr ee1004 snd_hda_codec_generic amd_atl intel_rapl_common snd_ump xfs edac_mce_amd snd_hda_intel snd_rawmidi mc snd_intel_dspcfg kvm_amd snd_intel_sdw_acpi snd_hda_codec snd_hda_core kvm snd_hwdep snd_seq snd_seq_device rapl snd_pcm snd_timer wmi_bmof snd acpi_cpufreq k10temp igb i2c_piix4 i2c_smbus i2c_algo_bit soundcore dca tcp_htcp nfsd loop nfs_acl lockd auth_rpcgss grace dm_multipath sunrpc nfnetlink raid0 crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic raid1 ghash_clmulni_intel nvme sha512_ssse3 sha256_ssse3 sha1_ssse3 sp5100_tco nvme_core megaraid_sas nvme_auth wmi target_core_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua nf_conntrack_pptp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 vhost_net tun tap fuse ecryptfs vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb
[ 1330.453686] Unloaded tainted modules: nvidia(POE):1 nvidia_uvm(POE):1 nvidia_modeset(POE):1 nvidia_drm(POE):1 vmnet(OE):3 vmmon(OE):3 [last unloaded: vfio]
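For anyone trying to reproduce this, the polling described at the top can be approximated with the sketch below. It is not the exact script that was running when the crash happened. The `--query-gpu` and `--format` options are standard nvidia-smi flags, but the one-second interval, the iteration count, and the helper names (`parse_temperatures`, `read_temperatures`, `poll`) are assumptions made for illustration. Each nvidia-smi invocation opens the NVIDIA character device, which is the nvidia_open path shown in the trace.

```python
import subprocess
import time

# Real nvidia-smi query flags; everything else in this file is illustrative.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(output):
    """Return one integer temperature per GPU, skipping non-numeric lines."""
    temps = []
    for line in output.splitlines():
        value = line.strip()
        if value.isdigit():
            temps.append(int(value))
    return temps


def read_temperatures():
    """Run nvidia-smi once and parse its output."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)


def poll(interval_s=1.0, iterations=600, reader=read_temperatures):
    """Yield one reading per iteration, sleeping interval_s between reads.

    interval_s and iterations are assumed values, not taken from the report.
    """
    for _ in range(iterations):
        yield reader()
        time.sleep(interval_s)
```

Running `for temps in poll(): print(temps)` from a text console, without logging in to X, should exercise the same open path repeatedly. A shorter interval may shorten the time to the crash, but that is a guess.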
Could this be related to the iso_exit() on module unload fix? The changelog entry below shows the same kind of list_add corruption BUG:
https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.58
Bluetooth: Call iso_exit() on module unload
commit d458cd1221e9e56da3b2cc5518ad3225caa91f20 upstream.
If iso_init() has been called, iso_exit() must be called on module
unload. Without that, the struct proto that iso_init() registered with
proto_register() becomes invalid, which could cause unpredictable
problems later. In my case, with CONFIG_LIST_HARDENED and
CONFIG_BUG_ON_DATA_CORRUPTION enabled, loading the module again usually
triggers this BUG():
list_add corruption. next->prev should be prev (ffffffffb5355fd0),
but was 0000000000000068. (next=ffffffffc0a010d0).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:29!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 4159 Comm: modprobe Not tainted 6.10.11-4+bt2-ao-desktop #1
RIP: 0010:__list_add_valid_or_report+0x61/0xa0
...
__list_add_valid_or_report+0x61/0xa0
proto_register+0x299/0x320
hci_sock_init+0x16/0xc0 [bluetooth]
bt_init+0x68/0xd0 [bluetooth]
__pfx_bt_init+0x10/0x10 [bluetooth]
do_one_initcall+0x80/0x2f0
do_init_module+0x8b/0x230
__do_sys_init_module+0x15f/0x190
do_syscall_64+0x68/0x110
...
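To make the message in both traces easier to read, here is a simplified Python model of the sanity check that runs before a node is linked into a kernel doubly linked list. It is only a sketch: the real check lives in lib/list_debug.c and is written in C, and the `Node`, `ListCorruption`, `list_add_valid`, and `list_add` names below are stand-ins for illustration. The "stale node" scenario at the end is one possible way to end up with next->prev being zero, as in the nvidia_open trace. It is an assumption, not a confirmed diagnosis of the driver.

```python
class ListCorruption(Exception):
    """Stands in for the kernel BUG() raised when a list check fails."""


class Node:
    def __init__(self, name):
        self.name = name
        # An initialized, empty list head points at itself.
        self.prev = self
        self.next = self


def list_add_valid(new, prev, next_):
    """Check neighbours before linking `new` between `prev` and `next_`."""
    if next_.prev is not prev:
        found = getattr(next_.prev, "name", None)
        raise ListCorruption(
            f"list_add corruption. next->prev should be prev ({prev.name}), "
            f"but was {found}. (next={next_.name})."
        )
    if prev.next is not next_:
        found = getattr(prev.next, "name", None)
        raise ListCorruption(
            f"list_add corruption. prev->next should be next ({next_.name}), "
            f"but was {found}. (prev={prev.name})."
        )
    if new is prev or new is next_:
        raise ListCorruption("list_add double add.")


def list_add(new, head):
    """Insert `new` right after `head`, validating the neighbours first."""
    prev, next_ = head, head.next
    list_add_valid(new, prev, next_)
    next_.prev = new
    new.next = next_
    new.prev = prev
    prev.next = new


def demo_stale_node():
    """Simulate a linked node whose memory was cleared without unlinking it."""
    head, stale, fresh = Node("head"), Node("stale"), Node("fresh")
    list_add(stale, head)
    stale.prev = None  # cleared pointers, but head.next still refers to stale
    stale.next = None
    try:
        list_add(fresh, head)
    except ListCorruption as exc:
        return str(exc)
    return None
```

In this model, `demo_stale_node()` returns a message of the same shape as the one in the traces, because the head still points at a node whose prev pointer no longer points back at the head. That is the pattern the Bluetooth commit describes for a struct left registered across module unload and reload.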