Hello NVidia forums!
For a couple of months now, my NVIDIA driver has been crashing at seemingly random times while in use.
dmesg_output.txt (104.0 KB)
The interesting bit:
[34913.105993] general protection fault, probably for non-canonical address 0x957e4d36efcde60d: 0000 [#1] SMP NOPTI
[34913.106006] CPU: 21 PID: 969637 Comm: code Tainted: P OE 5.13.0-25-generic #26-Ubuntu
[34913.106013] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99E-ITX/ac, BIOS P3.80 04/06/2018
[34913.106016] RIP: 0010:_nv028370rm+0x35/0x90 [nvidia]
[34913.106859] Code: 57 10 31 c0 48 85 d2 74 2e 48 8b 4f 08 31 c0 48 85 c9 74 0d 48 63 41 14 48 89 d6 48 29 c6 48 89 f0 48 3b 57 18 48 89 07 74 1b <48> 8b 42 08 48 89 47 10 b8 01 00 00 00 48 83 c4 08 c3 66 0f 1f 84
[34913.106865] RSP: 0018:ffffa66dce32bb50 EFLAGS: 00010a16
[34913.106871] RAX: 957e4d370332cc4b RBX: ffff95cf91bf9c30 RCX: ffff95cf45553980
[34913.106875] RDX: 957e4d36efcde605 RSI: 957e4d370332cc4b RDI: ffff95caced45d00
[34913.106878] RBP: ffff95caced45d00 R08: 0000000000000020 R09: ffff95caced45d08
[34913.106882] R10: ffff95d016214008 R11: ffff95d0825aac00 R12: ffff95caddf11d38
[34913.106885] R13: 520d791c3191216d R14: ffff95caddf11d38 R15: ffff95cf45554c10
[34913.106889] FS: 0000000000000000(0000) GS:ffff95d17f740000(0000) knlGS:0000000000000000
[34913.106894] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34913.106897] CR2: 0000369400744000 CR3: 00000005d7010001 CR4: 00000000003706e0
[34913.106902] Call Trace:
[34913.106908] ? _nv034904rm+0xa8/0xe0 [nvidia]
[34913.107703] ? _nv014538rm+0x31b/0x7f0 [nvidia]
[34913.108500] ? _nv035206rm+0xac/0xe0 [nvidia]
[34913.109094] ? _nv036729rm+0xb0/0x140 [nvidia]
[34913.109899] ? _nv036728rm+0x30f/0x4f0 [nvidia]
[34913.110689] ? _nv036723rm+0x60/0x70 [nvidia]
[34913.111434] ? _nv036724rm+0x7b/0xb0 [nvidia]
[34913.112176] ? _nv035114rm+0x40/0xe0 [nvidia]
[34913.112711] ? _nv000693rm+0x68/0x80 [nvidia]
[34913.113366] ? rm_cleanup_file_private+0xea/0x170 [nvidia]
[34913.114000] ? fsnotify+0x2bd/0x370
[34913.114011] ? nvidia_close+0x156/0x320 [nvidia]
[34913.114399] ? __call_rcu+0xa4/0x260
[34913.114408] ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[34913.114794] ? __fput+0x9c/0x250
[34913.114800] ? ____fput+0xe/0x10
[34913.114804] ? task_work_run+0x6d/0xa0
[34913.114810] ? do_exit+0x224/0x3d0
[34913.114817] ? do_group_exit+0x3b/0xb0
[34913.114823] ? __x64_sys_exit_group+0x18/0x20
[34913.114828] ? do_syscall_64+0x61/0xb0
[34913.114836] ? syscall_exit_to_user_mode+0x27/0x50
[34913.114844] ? __x64_sys_close+0x11/0x40
[34913.114851] ? do_syscall_64+0x6e/0xb0
[34913.114855] ? do_syscall_64+0x6e/0xb0
[34913.114861] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[34913.114868] Modules linked in: rfcomm xt_multiport xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc veth nft_counter xt_tcpudp nft_compat nf_tables nfnetlink cmac algif_hash algif_skcipher overlay af_alg bnep binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common sb_edac zfs(PO) x86_pkg_temp_thermal intel_powerclamp zunicode(PO) zzstd(O) coretemp zlua(O) zavl(PO) icp(PO) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi kvm_intel zcommon(PO) znvpair(PO) btusb spl(O) snd_hda_intel btrtl btbcm snd_intel_dspcfg snd_intel_sdw_acpi btintel kvm bluetooth snd_hda_codec snd_hda_core ecdh_generic wl(POE) ecc rapl snd_hwdep snd_pcm ucsi_ccg mxm_wmi typec_ucsi snd_timer mei_me intel_cstate efi_pstore cfg80211 typec snd mei soundcore mac_hid nvidia_uvm(POE) sch_fq_codel msr parport_pc ppdev
[34913.114978] lp parport sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper crct10dif_pclmul syscopyarea sysfillrect crc32_pclmul sysimgblt fb_sys_fops ghash_clmulni_intel cec aesni_intel rc_core crypto_simd cryptd igb i2c_i801 drm i2c_smbus e1000e dca ahci lpc_ich xhci_pci i2c_nvidia_gpu i2c_algo_bit libahci xhci_pci_renesas wmi
[34913.115071] ---[ end trace 59d1b267b655e4e3 ]---
[34913.213164] RIP: 0010:_nv028370rm+0x35/0x90 [nvidia]
[34913.213651] Code: 57 10 31 c0 48 85 d2 74 2e 48 8b 4f 08 31 c0 48 85 c9 74 0d 48 63 41 14 48 89 d6 48 29 c6 48 89 f0 48 3b 57 18 48 89 07 74 1b <48> 8b 42 08 48 89 47 10 b8 01 00 00 00 48 83 c4 08 c3 66 0f 1f 84
[34913.213654] RSP: 0018:ffffa66dce32bb50 EFLAGS: 00010a16
[34913.213658] RAX: 957e4d370332cc4b RBX: ffff95cf91bf9c30 RCX: ffff95cf45553980
[34913.213660] RDX: 957e4d36efcde605 RSI: 957e4d370332cc4b RDI: ffff95caced45d00
[34913.213662] RBP: ffff95caced45d00 R08: 0000000000000020 R09: ffff95caced45d08
[34913.213664] R10: ffff95d016214008 R11: ffff95d0825aac00 R12: ffff95caddf11d38
[34913.213665] R13: 520d791c3191216d R14: ffff95caddf11d38 R15: ffff95cf45554c10
[34913.213668] FS: 0000000000000000(0000) GS:ffff95d17f740000(0000) knlGS:0000000000000000
[34913.213670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34913.213672] CR2: 0000369400744000 CR3: 0000000243626003 CR4: 00000000003706e0
[34913.213675] Fixing recursive fault but reboot is needed!
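As a sanity check on the trace, here is a minimal Python sketch of why the kernel calls the faulting address non-canonical. It only uses values copied from the log above. The helper names are made up for illustration. It assumes 4-level paging (48-bit virtual addresses), which matches CR4 bit 12 (LA57) being clear in the dump. The faulting bytes `<48> 8b 42 08` decode to `mov rax, [rdx+8]`, so the bad address should be RDX + 8, meaning the driver was following a pointer that looks like garbage. This does not say where the garbage came from.

```python
# Values copied from the register dump in the trace above.
FAULT_ADDR = 0x957e4d36efcde60d   # "non-canonical address" from the GPF line
RDX = 0x957e4d36efcde605          # RDX at the time of the fault
CR4 = 0x00000000003706e0          # CR4 from the dump
RDI = 0xffff95caced45d00          # an ordinary kernel pointer, for comparison


def virtual_address_bits(cr4: int) -> int:
    """Return 57 if CR4.LA57 (bit 12) is set, otherwise 48."""
    return 57 if cr4 & (1 << 12) else 48


def is_canonical(addr: int, bits: int = 48) -> bool:
    """True if bits 63..(bits-1) are all copies of bit (bits-1)."""
    upper = addr >> (bits - 1)
    all_ones = (1 << (64 - bits + 1)) - 1
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    bits = virtual_address_bits(CR4)
    print(f"virtual address width assumed: {bits} bits")
    print(f"fault address canonical: {is_canonical(FAULT_ADDR, bits)}")
    print(f"RDI canonical:           {is_canonical(RDI, bits)}")
    # The faulting instruction is mov rax, [rdx+8].
    print(f"RDX + 8 == fault address: {RDX + 8 == FAULT_ADDR}")
```

R13 in the same dump (520d791c3191216d) also does not look like a kernel pointer, though it may simply not be holding a pointer at that point.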
I’m running a GTX 1660 Super:
03:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] (rev a1)
03:00.1 Audio device: NVIDIA Corporation TU116 High Definition Audio Controller (rev a1)
03:00.2 USB controller: NVIDIA Corporation TU116 USB 3.1 Host Controller (rev a1)
03:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU116 USB Type-C UCSI Controller (rev a1)
This “recursive fault” shows up after the system has been left running for a while: sometimes days, sometimes minutes.
It makes the NVIDIA driver unresponsive; eventually CPUs lock up and the entire system stops responding.
The system stays responsive for a short while after the fault, but utilities like nvidia-smi stop working, and I’ve had various desktops crash and auto-restart themselves.
So far I’ve confirmed that it occurs with both the 470 and 495 drivers, and across multiple Linux kernels (I’m unsure of the exact scope).
I’ve had this happen while running chromium, msedge and now vscode.
If anyone has an idea of what’s wrong, I would greatly appreciate it.
Thanks.