NVIDIA RTX 3060 "Falls off the Bus" on the current Linux kernel with any NVIDIA driver (nouveau/nvidia/open)

I can’t use the bug report script in the SSH session, so here’s the journalctl output instead: http://0x0.st/8ahw.txt
After three different journalctl captures, I reseated the GPU, and this one finally provided a stack trace from the crashed GPU:

Mar 02 02:06:15 TheSaekoMeinaShrine kernel: ------------[ cut here ]------------
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: WARNING: CPU: 11 PID: 1081 at nvidia/nv.c:4946 nvidia_dev_put+0xa4/0xb0 [nvidia]
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: Modules linked in: ntfs3 snd_seq_dummy snd_hrtimer snd_seq snd_seq_device nct6775 nct6775_core hwmon_vid vfat fat amd_atl intel_rapl_msr intel_rapl_common nvidia_drm(POE) nvidia_uvm(POE) nvidia_modeset(POE) kvm_amd snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi kvm crct10dif_pclmul snd_hda_intel crc32_pclmul snd_intel_dspcfg polyval_clmulni snd_intel_sdw_acpi polyval_generic ghash_clmulni_intel snd_hda_codec sha512_ssse3 jc42 snd_hda_core eeepc_wmi sha256_ssse3 asus_wmi sha1_ssse3 ee1004 snd_hwdep platform_profile aesni_intel snd_pcm i8042 gf128mul r8169 sp5100_tco crypto_simd sparse_keymap snd_timer realtek drm_ttm_helper serio mdio_devres cryptd snd i2c_piix4 ttm rfkill rapl pcspkr wmi_bmof i2c_smbus video libphy ccp soundcore k10temp nvidia(POE) gpio_amdpt mousedev gpio_generic joydev mac_hid crypto_user loop dm_mod nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_generic usbhid nvme nvme_core uas crc32c_intel usb_storage nvme_auth wmi
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: CPU: 11 UID: 1000 PID: 1081 Comm: Discord Tainted: P           OE      6.13.5-arch1-1 #1 a7601aaf9729ecd670c97714fd422c8e98fdc244
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: Hardware name: ASUS System Product Name/TUF GAMING B550-PLUS, BIOS 3607 03/18/2024
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RIP: 0010:nvidia_dev_put+0xa4/0xb0 [nvidia]
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: Code: 89 de 4c 89 e7 e8 8c 93 cb 00 85 c0 75 1c 5b 48 89 ef 5d 41 5c e9 0c 6a 4d df 5b 48 c7 c7 50 e2 6d c1 5d 41 5c e9 fc 69 4d df <0f> 0b eb e0 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RSP: 0018:ffffb15d8680fd40 EFLAGS: 00010202
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RAX: 0000000000000026 RBX: ffff91731f2fe000 RCX: ffffb15d8680fcc0
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffb15d8680fc70
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RBP: ffff91731f2fe6a8 R08: 0000000000000000 R09: ffffb15d8680fce8
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: R10: 00000000802a001e R11: 0000000000000000 R12: ffff91731c358000
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: R13: ffffffffc16de3a0 R14: ffff9173005bfd80 R15: 0000000000000000
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: FS:  0000000000000000(0000) GS:ffff9181ef180000(0000) knlGS:0000000000000000
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: CR2: 000077452e2d8d38 CR3: 000000011e486000 CR4: 0000000000f50ef0
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: PKRU: 55555558
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: Call Trace:
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  <TASK>
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? nvidia_dev_put+0xa4/0xb0 [nvidia d3c25ee3e4d528ddd85c07f2efb39903df4c0eb5]
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? __warn.cold+0x93/0xf6
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? nvidia_dev_put+0xa4/0xb0 [nvidia d3c25ee3e4d528ddd85c07f2efb39903df4c0eb5]
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? report_bug+0xff/0x140
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? handle_bug+0x58/0x90
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? exc_invalid_op+0x17/0x70
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? asm_exc_invalid_op+0x1a/0x20
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? nvidia_dev_put+0xa4/0xb0 [nvidia d3c25ee3e4d528ddd85c07f2efb39903df4c0eb5]
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  nvidia_close+0x182/0x270 [nvidia d3c25ee3e4d528ddd85c07f2efb39903df4c0eb5]
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  __fput+0xe1/0x2a0
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  task_work_run+0x5c/0x90
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  do_exit+0x31b/0xad0
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  do_group_exit+0x30/0x80
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  __x64_sys_exit_group+0x18/0x20
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  x64_sys_call+0xff0/0x1500
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  do_syscall_64+0x82/0x190
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? syscall_exit_to_user_mode+0x37/0x1c0
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? do_syscall_64+0x8e/0x190
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? do_syscall_64+0x8e/0x190
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? do_syscall_64+0x8e/0x190
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RIP: 0033:0x7409a6ad8f1d
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: Code: Unable to access opcode bytes at 0x7409a6ad8ef3.
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RSP: 002b:00007ffe049afb38 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RAX: ffffffffffffffda RBX: 00001d2c00a23880 RCX: 00007409a6ad8f1d
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RDX: 00000000000000e7 RSI: fffffffffffffa10 RDI: 0000000000000001
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: RBP: 00007ffe049afbd0 R08: 0000000000000000 R09: 0000000000000000
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe049afb60
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: R13: 000061bc2acc0818 R14: 00007ffe049afb48 R15: 00000000000003fe
Mar 02 02:06:15 TheSaekoMeinaShrine kernel:  </TASK>
Mar 02 02:06:15 TheSaekoMeinaShrine kernel: ---[ end trace 0000000000000000 ]---
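
The call trace above is easier to read with the speculative frames filtered out. In a kernel trace, entries prefixed with `?` are addresses found on the stack that may not be part of the real call chain. Below is a minimal Python sketch that keeps only the unprefixed frames. It assumes the journal was saved as plain text in the format shown above; the function name `reliable_frames` and both regular expressions are my own illustration, not part of any NVIDIA or kernel tool.

```python
import re

# Captures the message part of a journal line, i.e. everything after "kernel: ".
KERNEL_LINE = re.compile(r"kernel: (.*)$")
# Matches one call-trace frame such as " ? foo+0x1/0x2 [mod]" or " foo+0x1/0x2".
# Group 1 is set when the frame carries the "?" (unreliable) marker.
FRAME = re.compile(r"^\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(log_text):
    """Return function names from the first <TASK>...</TASK> section,
    skipping frames that the kernel marked with '?'."""
    frames = []
    in_task = False
    for line in log_text.splitlines():
        match = KERNEL_LINE.search(line)
        if not match:
            continue
        message = match.group(1)
        if message.strip() == "<TASK>":
            in_task = True
            continue
        if message.strip() == "</TASK>":
            break
        if in_task:
            frame = FRAME.match(message)
            if frame and not frame.group(1):
                frames.append(frame.group(2))
    return frames
```

Run on the trace above, the list starts with `nvidia_close`, `__fput`, `task_work_run`, `do_exit`, which shows that the warning fired while the Discord process was exiting and closing its NVIDIA device file. That is a symptom seen after the failure, not the cause of the GPU dropping out.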

If you want to see the other logs, including Xorg, you can find them in the Arch Forums thread "NVIDIA RTX 3060 and Monitor Turning Off" (Kernel & Hardware).

So it looks like the problem was the PSU. From what I can tell, the PSU was intermittently failing to deliver power to the GPU, which made the GPU behave as if it were “unseated” even though it never left the PCIe slot. Here is why I think so: after one of the restarts, I had Xorg set the refresh rate to 165 Hz. After a reboot it was still set to 165 Hz, but the system crashed less than a minute later. On the next boot, I quickly set it back to 59 Hz and rebooted again. About 5 minutes later, though, the entire system started to whine, and the noise kept getting louder; it was absolutely terrifying. At its peak there was a click, and the whole system shut down.

I grabbed my rubber gloves and flipped the PSU switch, since the system was still getting power, as evidenced by my phone and mouse still charging, and then I took the system apart. There was clearly a faint burning-plastic smell coming from somewhere. Since the GPU hadn’t been under load recently, I don’t think it was the source, and when I smelled the GPU itself, it didn’t smell. The motherboard in general seemed to smell a bit, and with the PSU I couldn’t tell.

After taking out the GPU, I tried to power the system on safely, but the motherboard and PSU did not respond to the power-on signal, which means the motherboard and PSU likely failed together.
