“GPU has fallen off the bus” errors on a daily basis (RTX 4090)

Since I bought a new PC this year, my NVIDIA GPUs (or one of them) have not been stable and keep crashing at random, mainly with the ‘fallen off the bus’ error, especially after running any CUDA software.
I’ve upgraded the BIOS to the latest version and this still happens.
The GPUs are not overheating; the temperature is low thanks to the watercooling system.

The last time I took it for repair at the Scan shop in the UK, they couldn’t reproduce the issue using standard or stress benchmarking tools.

When this happens, switching terminals or attempting to reboot leaves the system frozen, although Magic SysRq combinations still work (so I can restart the system with the REISUB sequence).

Specs:

  • GPUs: NVIDIA GeForce RTX 4090 x5, watercooled
  • Motherboard: Pro WS WRX90E-SAGE SE
  • BIOS: Upgraded to latest firmware 0803 (30/08/2024)
  • OS: Ubuntu 22.04 with 550.127 NVIDIA firmware and drivers

More details below.

Error from today literally after 10 minutes of uptime:

pcieport 0000:20:01.1: DPC: containment event, status:0x1f01 source:0x0000
NVRM: GPU at PCI:0000:21:00: GPU-0535a00b-ecd6-8908-7591-0b6e0d4df252
pcieport 0000:20:01.1: DPC: unmasked uncorrectable error detected
NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
pcieport 0000:20:01.1: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:20:01.1:   device [1022:14ab] error status/mask=00000020/00000000
NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.
pcieport 0000:20:01.1:    [ 5] SDES                   (First)
nvidia 0000:21:00.0: AER: can't recover (no error_detected callback)
snd_hda_intel 0000:21:00.1: AER: can't recover (no error_detected callback)
pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
pcieport 0000:20:01.1: retraining failed
pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
pcieport 0000:20:01.1: retraining failed
nvidia 0000:21:00.0: not ready 1023ms after DPC; waiting
nvidia 0000:21:00.0: not ready 2047ms after DPC; waiting
nvidia 0000:21:00.0: not ready 4095ms after DPC; waiting
nvidia 0000:21:00.0: not ready 8191ms after DPC; waiting
nvidia 0000:21:00.0: not ready 16383ms after DPC; waiting
nvidia 0000:21:00.0: not ready 32767ms after DPC; waiting
nvidia 0000:21:00.0: not ready 65535ms after DPC; giving up
pcieport 0000:20:01.1: AER: subordinate device reset failed
pcieport 0000:20:01.1: AER: device recovery failed
nvidia-modeset: ERROR: GPU:4: Failed detecting connected display devices
...
pcieport 0000:00:05.1: AER: Correctable error message received from 0000:02:00.0
i40e 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
i40e 0000:02:00.0:   device [8086:15ff] error status/mask=00001000/00000000
i40e 0000:02:00.0:    [12] Timeout
$ nvidia-smi
Unable to determine the device handle for GPU0000:21:00.0: Unknown Error
$ lspci -vvv
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 25

00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 56
        IOMMU group: 26
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00001000-00001fff [size=4K]
        Memory behind bridge: f3000000-f40fffff [size=17M]
        Prefetchable memory behind bridge: 00000100f0000000-0000010101ffffff [size=288M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: <access denied>
        Kernel driver in use: pcieport
...
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ZOTAC International (MCO) Ltd. Device 1675
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 318
        IOMMU group: 35
        Region 0: Memory at f3000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 100f0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 10100000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 1000 [size=128]
        Expansion ROM at f4000000 [virtual] [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

01:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 1675
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 54
        IOMMU group: 35
        Region 0: Memory at f4080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
...
20:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        IOMMU group: 39

20:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01) (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 61
        IOMMU group: 40
        Bus: primary=20, secondary=21, subordinate=21, sec-latency=0
        I/O behind bridge: 00004000-00004fff [size=4K]
        Memory behind bridge: b1000000-b20fffff [size=17M]
        Prefetchable memory behind bridge: 0000014800000000-0000014811ffffff [size=288M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: <access denied>
        Kernel driver in use: pcieport
...
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

21:00.1 Audio device: NVIDIA Corporation Device 22ba (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
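
For reading the `error status/mask` values in the kernel log above, here is a minimal Python sketch that lists which status bits are set. The bit names follow the abbreviations the Linux kernel prints next to each bit (for example `SDES` at bit 5 and `Timeout` at bit 12 in this log). The tables are deliberately partial, the mask field is not applied, and the helper name `decode_aer_status` is my own, so treat it as an illustration rather than a complete decoder.

```python
import re

# Partial tables: bit position -> abbreviation as printed by the Linux AER code.
# Only a handful of bits are listed; anything else is reported as "bit N".
UNCORRECTABLE = {4: "DLP", 5: "SDES", 12: "TLP", 14: "CmpltTO", 18: "MalfTLP", 20: "UnsupReq"}
CORRECTABLE = {0: "RxErr", 6: "BadTLP", 7: "BadDLLP", 8: "Rollover", 12: "Timeout"}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_aer_status(line, correctable=False):
    """Return the names of the status bits set in an 'error status/mask=' line."""
    match = STATUS_RE.search(line)
    if match is None:
        return []
    status = int(match.group(1), 16)
    names = CORRECTABLE if correctable else UNCORRECTABLE
    return [names.get(bit, f"bit {bit}") for bit in range(32) if status & (1 << bit)]


if __name__ == "__main__":
    # Both lines are copied from the log above.
    gpu_port = "pcieport 0000:20:01.1:   device [1022:14ab] error status/mask=00000020/00000000"
    nic = "i40e 0000:02:00.0:   device [8086:15ff] error status/mask=00001000/00000000"
    print(decode_aer_status(gpu_port))
    print(decode_aer_status(nic, correctable=True))
```

`SDES` is the Surprise Down Error status bit, meaning the link to the device dropped unexpectedly, which fits the failed link retraining that follows it in the log.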

Other related errors when system is frozen:

nvidia-modeset: ERROR: GPU:4: Error while waiting for GPU progress: 0x0000c77d:0 2:0:4048:4040
INFO: task nvidia-modeset/:2308 blocked for more than 122 seconds.
      Tainted: P           OE      6.8.0-49-generic #49~22.04.1-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvidia-modeset/ state:D stack:0     pid:2308  tgid:2308  ppid:2      flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x27c/0x6a0
 schedule+0x33/0x110
 schedule_timeout+0x157/0x170
 ___down_common+0xfd/0x160
 ? srso_alias_return_thunk+0x5/0xfbef5
 __down_common+0x22/0xd0
 __down+0x1d/0x30
 down+0x54/0x80
 nvkms_kthread_q_callback+0x9a/0x160 [nvidia_modeset]
 _main_loop+0x7f/0x140 [nvidia_modeset]
 ? __pfx__main_loop+0x10/0x10 [nvidia_modeset]
 kthread+0xef/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x44/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
...
runnable tasks:
 S            task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
 S irq/315-nvidia  2316      4563.283918 E      4566.281895         3.000000     50053.035885   1432738    49         0.000000     50053.035885         0.000000         0.000000 0 0 /
 S         nvidia  2317      9256.322720 E      9259.321759         3.000000         0.005277         2   120         0.000000         0.005277         0.000000         0.000000 0 0 /

Logs:

Some bug-report files have been stuck at processing/uploading for a few days, so I can’t upload them for some reason.

One of the most common causes of “GPU has fallen off the bus” errors is an unstable or underprovisioned power supply.

For some reason, only the lspci attachment is accessible. Is it always the same card failing, or any of the 5?

If the latter, have you tried testing using progressively fewer cards?

Did the Scan shop test using the same applications you’re having trouble with?

If only one card is failing, try swapping slots with one of the others and seeing if the fault moves to another PCIe ID.
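
To answer the "same card or any of them" question from the logs, one option is to tally Xid events per PCI address. The following is a rough Python sketch under the assumption that the log lines look like the `NVRM: Xid (PCI:...)` line quoted earlier in this thread; the function name `count_xids` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+)")


def count_xids(lines):
    """Count (pci_address, xid_code) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the first post; a real run would iterate over kern.log.
    sample = [
        "NVRM: GPU at PCI:0000:21:00: GPU-0535a00b-ecd6-8908-7591-0b6e0d4df252",
        "NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
        "NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.",
    ]
    for (address, xid), n in sorted(count_xids(sample).items()):
        print(f"PCI {address} Xid {xid}: {n}")
```

If every Xid 79 lands on the same address before a slot swap and on a different address afterwards, the fault followed the card; if the address stays the same, the slot, riser or cabling is the more likely suspect.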

  • There are 2x EVGA SuperNOVA 2000 PSUs, which should provide enough power.
  • Each GPU reports less than 30W out of 450W (via nvidia-smi), and the temperature is around 32-36C for each, which is very low.
  • Initially the shop suggested and used the Furmark GUI program for benchmarking, but it was using only 1 GPU. So instead I tested with the wilicc/gpu-burn project from GitHub, which runs a stress test on all GPUs at once, and I could reproduce the issue almost every time after a few minutes. I suggested they use it as well, but for some reason they couldn’t reproduce it and found no issues. After a few days of testing (including 3h burn tests and testing each card individually), they only managed to reproduce it once, while swapping GPUs, and they claimed it was because one card was not fully seated. After another 3h burn, no issues were found. Since then, I have still been getting these fallen-off-the-bus errors very often.
  • I don’t want to do too much slot swapping myself, as I’m not that proficient with watercooling systems. I could eventually disable cards via the BIOS or on the motherboard if possible, but I don’t have clear, repeatable steps to reproduce the problem, so I wouldn’t know whether that worked, as it may crash at random.
  • I’m going to play with some driver settings and see if that helps.
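
Since the power and temperature figures in the list above are one-off readings, it may help to log them periodically so the last values before a crash are on record. This is only a sketch: it assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper name `parse_gpu_csv` and the dictionary keys are my own. Short power spikes between polls would not show up.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse the CSV rows into dicts; non-numeric readings become None."""
    def to_float(value):
        try:
            return float(value)
        except ValueError:
            return None

    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        index, bus_id, power, temp = (field.strip() for field in fields)
        rows.append({"index": index, "bus_id": bus_id,
                     "power_w": to_float(power), "temp_c": to_float(temp)})
    return rows


if __name__ == "__main__":
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        print(f"nvidia-smi could not be run: {exc}")
    else:
        for row in parse_gpu_csv(result.stdout):
            print(row)
```

Running it from cron or a loop and appending the output to a file gives a simple history to compare against the timestamps in kern.log.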

I made the following changes to test the configuration suggested by a variety of sources:

  1. In /etc/default/grub, I changed the following line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash initcall_debug loglevel=7 pcie_aspm=off"
  2. I added the following options for the nvidia driver in /etc/modprobe.d/nvidia.conf (suggested by GPT):
options nvidia NVreg_RegisterPCIDriverOnEarlyBoot=1
options nvidia NVreg_EnablePCIeGen3=1
options nouveau modeset=0
blacklist nouveau
  3. I tried to change the ASPM policy, but it didn’t work, probably because pcie_aspm is already disabled.
$ cat /sys/module/pcie_aspm/parameters/policy
[default] performance powersave powersupersave 
$ echo “performance” | sudo tee /sys/module/pcie_aspm/parameters/policy
“performance”
tee: /sys/module/pcie_aspm/parameters/policy: Operation not permitted
  4. The file at /etc/modprobe.d/nvidia-graphics-drivers-kms.conf is unchanged:
$ cat /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
# This file was generated by nvidia-driver-550
# Set value to 0 to disable modesetting
options nvidia-drm modeset=1
  5. I’ve enabled all SysRq functions, so I can use some of the combinations during the next crash:
$ echo 1 | sudo tee /proc/sys/kernel/sysrq
1
$ cat /proc/sys/kernel/sysrq
1
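
To confirm that the GRUB change in the list above actually reached the running kernel (it only applies after regenerating the GRUB config and rebooting), a small check like the following can help. It is a sketch in Python: the two function names are hypothetical, and it only reads `/proc/cmdline` and the same `policy` file shown in the transcript.

```python
import re


def cmdline_disables_aspm(cmdline):
    """True if the kernel command line contains the pcie_aspm=off parameter."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed (active) entry of the pcie_aspm policy file, or None."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    checks = (
        ("/proc/cmdline", cmdline_disables_aspm),
        ("/sys/module/pcie_aspm/parameters/policy", active_aspm_policy),
    )
    for path, func in checks:
        try:
            with open(path) as handle:
                print(path, "->", func(handle.read()))
        except OSError as exc:
            print(path, "not readable:", exc)
```

If the first check prints `True`, the `Operation not permitted` error when writing the policy is expected as far as I know, because the kernel refuses policy changes while ASPM is disabled on the command line.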

After the changes above (especially 1 and 2), I’ve noticed it’s a bit more stable (it hasn’t crashed within 1 day yet), but I may be wrong and need more testing. As I mentioned before, it crashes at random: sometimes within 15 minutes or a few hours after a reboot, other times only after a few days.


By the way, before the changes, for the past few days this timeout error was bombarding kern.log hundreds of times a minute; not anymore.

pcieport 0000:00:05.1: AER: Correctable error message received from 0000:02:00.0
i40e 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
i40e 0000:02:00.0:   device [8086:15ff] error status/mask=00001000/00000000
i40e 0000:02:00.0:    [12] Timeout               

Based on another thread, it was suggested to set pcie_aspm=off, even though the cards were already reporting ASPM as disabled, so I’m not sure whether this could help. I’ll let you know. I’m monitoring kern.log constantly as I use the PC (sudo tail -f /var/log/kern.log).
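
Instead of watching `tail -f` by eye, the PCIe bus errors can be summarised per device to see whether the correctable-error flood has really stopped. A minimal Python sketch, assuming the `PCIe Bus Error:` line format shown in this thread; `summarize_bus_errors` is a name I made up.

```python
import re
from collections import Counter

# Matches e.g.
# i40e 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
BUS_ERROR_RE = re.compile(
    r"(\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): PCIe Bus Error: "
    r"severity=([^,]+), type=([^,]+)"
)


def summarize_bus_errors(lines):
    """Count PCIe bus error lines per (address, driver, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = BUS_ERROR_RE.search(line)
        if match:
            driver, address, severity, layer = match.groups()
            counts[(address, driver, severity, layer)] += 1
    return counts


if __name__ == "__main__":
    # Lines copied from the logs in this thread; a real run would read kern.log.
    sample = [
        "i40e 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)",
        "pcieport 0000:20:01.1: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)",
    ]
    for key, n in sorted(summarize_bus_errors(sample).items()):
        print(key, n)
```

Comparing the counts from a log taken before the changes with one taken after would show whether the i40e timeouts are gone or just less frequent.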

It worked for a few days without a crash. It crashed today while I was using the web browser; the screen froze in the middle of a video (both monitors, each connected to a different GPU).

watchdog: BUG: soft lockup - CPU#25 stuck for 26s! [chrome:92688]
CPU: 25 PID: 92688 Comm: chrome Tainted: P           OE      6.8.0-49-generic #49~22.04.1-Ubuntu
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 0803 08/30/2024
RIP: 0010:_nv046532rm+0x102/0x180 [nvidia]
Code: 00 0f 1f 80 00 00 00 00 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 <f3> 90 f3 90 f3 90 83 e8 01 75 d3 48 89 df e8 6b ec ff ff 4c 39 f0
...
Call Trace:
 <IRQ>
 ? show_regs+0x6d/0x80
 ? watchdog_timer_fn+0x206/0x290
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x10f/0x2a0
 ? clockevents_program_event+0xb3/0x140
 ? hrtimer_interrupt+0xf6/0x250
 ? __sysvec_apic_timer_interrupt+0x4e/0x150
 ? sysvec_apic_timer_interrupt+0x8d/0xd0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
 ? _nv046532rm+0x102/0x180 [nvidia]
 ? _nv046532rm+0x115/0x180 [nvidia]
 ? down+0x36/0x80
 ? _nv013493rm+0x33/0xb0 [nvidia]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? _nv013496rm+0x2c4/0x4e0 [nvidia]
 ? _nv043117rm+0x83e/0x1120 [nvidia]
 ? _nv050883rm+0xb06/0x2330 [nvidia]
 ? _nv004567rm+0xcd/0x1c0 [nvidia]
 ? _nv003877rm+0x4b/0x80 [nvidia]
 ? _nv045275rm+0x98/0x1b0 [nvidia]
 ? _nv010971rm+0x27b/0x5f0 [nvidia]
 ? _nv047216rm+0x2a5/0xac0 [nvidia]
 ? _nv047214rm+0x224/0x3a0 [nvidia]
 ? _nv045395rm+0x16f/0x320 [nvidia]
 ? _nv045396rm+0x5c/0x90 [nvidia]
 ? _nv014117rm+0x26/0x30 [nvidia]
 ? _nv014106rm+0xc1/0x130 [nvidia]
 ? _nv014139rm+0x52/0x90 [nvidia]
 ? security_capable+0x44/0x80
 ? _nv012663rm+0xc8/0x120 [nvidia]
 ? _nv000681rm+0x63/0x70 [nvidia]
 ? _nv000599rm+0x31/0x40 [nvidia]
 ? _nv000731rm+0x240/0xeb0 [nvidia]
 ? rm_ioctl+0x58/0xb0 [nvidia]
 ? nvidia_unlocked_ioctl+0x69c/0x920 [nvidia]
 ? __x64_sys_ioctl+0xa0/0xf0
 ? x64_sys_call+0xa68/0x24b0
 ? do_syscall_64+0x81/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? check_heap_object+0x18b/0x1e0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? nvidia_unlocked_ioctl+0x166/0x920 [nvidia]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __x64_sys_ioctl+0xbb/0xf0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? syscall_exit_to_user_mode+0x83/0x260
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x8d/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? irqentry_exit_to_user_mode+0x78/0x260
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? irqentry_exit+0x43/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? exc_page_fault+0x94/0x1b0
 ? entry_SYSCALL_64_after_hwframe+0x78/0x80
 </TASK>
watchdog: BUG: soft lockup - CPU#25 stuck for 52s! [chrome:92688]
...
watchdog: BUG: soft lockup - CPU#25 stuck for 4459s! [chrome:92688]
RIP: 0010:_nv046532rm+0xe8/0x180 [nvidia]
Code: 39 d6 4c 0f 42 f2 48 83 c0 01 49 39 c0 75 c3 4d 85 f6 75 3b eb 46 b8 64 00 00 00 0f 1f 80 00 00 00 00 f3 90 f3 90 f3 90 f3 90 <f3> 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90
RSP: 0018:ff53f77946517898 EFLAGS: 00000202
...
S irq/313-nvidia  2540     11307.617080 E     11310.615437         3.000000     13716.169231    310887    49         0.000000     13716.169231         0.000000         0.000000 0 0 /

The same call trace repeated for the few hours of the freeze.
I think this time it was a different crash, not fallen off the bus, but still NVIDIA related.
I’ll continue testing.
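
To see at a glance how long each of these freezes lasted, the `soft lockup` lines can be reduced to the longest duration per CPU and task. A small Python sketch, assuming the watchdog message format shown above; the function name `longest_lockups` is hypothetical.

```python
import re

# Matches e.g. "watchdog: BUG: soft lockup - CPU#25 stuck for 26s! [chrome:92688]"
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")


def longest_lockups(lines):
    """Map (cpu, task, pid) to the longest 'stuck for' duration seen, in seconds."""
    longest = {}
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            key = (int(match.group(1)), match.group(3), int(match.group(4)))
            longest[key] = max(longest.get(key, 0), int(match.group(2)))
    return longest


if __name__ == "__main__":
    # The three watchdog lines from the log above.
    sample = [
        "watchdog: BUG: soft lockup - CPU#25 stuck for 26s! [chrome:92688]",
        "watchdog: BUG: soft lockup - CPU#25 stuck for 52s! [chrome:92688]",
        "watchdog: BUG: soft lockup - CPU#25 stuck for 4459s! [chrome:92688]",
    ]
    for (cpu, task, pid), seconds in longest_lockups(sample).items():
        print(f"CPU#{cpu} {task}[{pid}] stuck for up to {seconds}s")
```

For the log above this reduces to a single entry for CPU#25 and chrome with a maximum of 4459 seconds, roughly 74 minutes.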

Another similar freeze, but not fallen off the bus, while working in Chrome. There was no indication before the freeze; the mouse pointer just slowed down for a few seconds, then stopped.

Logs from kern.log, with some stack traces produced via SysRq during the freeze:

watchdog: BUG: soft lockup - CPU#46 stuck for 26s! [chrome:396570]
CPU: 46 PID: 396570 Comm: chrome Tainted: P           OE      6.8.0-49-generic #49~22.04.1-Ubuntu
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 0803 08/30/2024
RIP: 0010:_nv046532rm+0xea/0x180 [nvidia]
Code: 4c 0f 42 f2 48 83 c0 01 49 39 c0 75 c3 4d 85 f6 75 3b eb 46 b8 64 00 00 00 0f 1f 80 00 00 00 00 f3 90 f3 90 f3 90 f3 90 f3 90 <f3> 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90
RSP: 0018:ff31bdd7ef673748 EFLAGS: 00000206
...
FS:  00007670eac6a500(0000) GS:ff1f5a663dd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00001874186f4008 CR3: 000000062512e003 CR4: 0000000000f71ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? show_regs+0x6d/0x80
 ? watchdog_timer_fn+0x206/0x290
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x10f/0x2a0
 ? rcu_core+0x1d2/0x390
 ? hrtimer_interrupt+0xf6/0x250
 ? __sysvec_apic_timer_interrupt+0x4e/0x150
 ? sysvec_apic_timer_interrupt+0x8d/0xd0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
 ? _nv046532rm+0xea/0x180 [nvidia]
 ? _nv046532rm+0x115/0x180 [nvidia]
 ? down+0x36/0x80
 ? _nv013493rm+0x33/0xb0 [nvidia]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? _nv013496rm+0x2c4/0x4e0 [nvidia]
 ? _nv043117rm+0x83e/0x1120 [nvidia]
 ? _nv050883rm+0xb06/0x2330 [nvidia]
 ? _nv004567rm+0xcd/0x1c0 [nvidia]
 ? _nv003877rm+0x4b/0x80 [nvidia]
 ? _nv045275rm+0x98/0x1b0 [nvidia]
 ? _nv010971rm+0x27b/0x5f0 [nvidia]
 ? _nv047216rm+0x2a5/0xac0 [nvidia]
 ? _nv047214rm+0x224/0x3a0 [nvidia]
 ? _nv045395rm+0x16f/0x320 [nvidia]
 ? _nv045396rm+0x5c/0x90 [nvidia]
 ? _nv014117rm+0x26/0x30 [nvidia]
 ? _nv014139rm+0x52/0x90 [nvidia]
 ? security_capable+0x44/0x80
 ? _nv012663rm+0xc8/0x120 [nvidia]
 ? _nv000681rm+0x63/0x70 [nvidia]
 ? _nv000599rm+0x31/0x40 [nvidia]
 ? _nv000731rm+0x240/0xeb0 [nvidia]
 ? rm_ioctl+0x58/0xb0 [nvidia]
 ? nvidia_unlocked_ioctl+0x69c/0x920 [nvidia]
 ? __x64_sys_ioctl+0xa0/0xf0
 ? x64_sys_call+0xa68/0x24b0
 ? do_syscall_64+0x81/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? nvidia_unlocked_ioctl+0x166/0x920 [nvidia]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __x64_sys_ioctl+0xbb/0xf0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? syscall_exit_to_user_mode+0x83/0x260
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x8d/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? check_heap_object+0x18b/0x1e0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? nvidia_unlocked_ioctl+0x166/0x920 [nvidia]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __x64_sys_ioctl+0xbb/0xf0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? syscall_exit_to_user_mode+0x83/0x260
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x8d/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? rcu_core_si+0xe/0x20
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? handle_softirqs+0xd8/0x340
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? irqentry_exit_to_user_mode+0x78/0x260
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? irqentry_exit+0x43/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? entry_SYSCALL_64_after_hwframe+0x78/0x80
 </TASK>
...
watchdog: BUG: soft lockup - CPU#46 stuck for 235s! [chrome:396570]

A similar freeze to last time. The screen was frozen for over 8h.

watchdog: BUG: soft lockup - CPU#43 stuck for 26s! [CanvasRenderer:1651934]
Modules linked in: xt_nat xt_tcpudp veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv
 raid6_pq libcrc32c dm_mirror dm_region_hash dm_log hid_generic cdc_ether usbnet mii usbhid uas hid usb_storage mfd_aaeon asus_wmi video
CPU: 43 PID: 1651934 Comm: CanvasRenderer Tainted: P           OE      6.8.0-49-generic #49~22.04.1-Ubuntu
Hardware name: ASUS System Product Name/Pro WS WRX90E-SAGE SE, BIOS 0803 08/30/2024
RIP: 0010:_nv046532rm+0xf8/0x180 [nvidia]
Code: 85 f6 75 3b eb 46 b8 64 00 00 00 0f 1f 80 00 00 00 00 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 f3 90 <f3>
RSP: 0018:ff6dd339e32977a8 EFLAGS: 00000206
RAX: 000000000000003c RBX: ff3c91f1ee7fe688 RCX: ff3c91f252e10e68
RDX: ff3c91f215ad8008 RSI: ff6dd339a46c1004 RDI: ff3c91f23f459008
RBP: ff3c91f6a9155220 R08: 0000000000095b45 R09: ff3c91f252e00008
...
Call Trace:
 <IRQ>
 ? show_regs+0x6d/0x80
 ? watchdog_timer_fn+0x206/0x290
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x10f/0x2a0
 ? rcu_core+0x1d2/0x390
 ? hrtimer_interrupt+0xf6/0x250
 ? __sysvec_apic_timer_interrupt+0x4e/0x150
 ? sysvec_apic_timer_interrupt+0x8d/0xd0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
 ? _nv046532rm+0xf8/0x180 [nvidia]
 ? _nv046532rm+0x115/0x180 [nvidia]
 ? down+0x36/0x80
 ? _nv013493rm+0x33/0xb0 [nvidia]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? _nv013496rm+0x2c4/0x4e0 [nvidia]
 ? _nv043117rm+0x83e/0x1120 [nvidia]
...
watchdog: BUG: soft lockup - CPU#43 stuck for 29597s! [CanvasRenderer:1651934]
Modules linked in: xt_nat xt_tcpudp veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv
 raid6_pq libcrc32c dm_mirror dm_region_hash dm_log hid_generic cdc_ether usbnet mii usbhid uas hid usb_storage mfd_aaeon asus_wmi video
CPU: 43 PID: 1651934 Comm: CanvasRenderer Tainted: P           OEL     6.8.0-49-generic #49~22.04.1-Ubuntu
RIP: 0010:_nv046532rm+0xee/0x180 [nvidia]
CR2: 0000762764ce2000 CR3: 000000054dc66006 CR4: 0000000000f71ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? show_regs+0x6d/0x80
 ? watchdog_timer_fn+0x206/0x290
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x10f/0x2a0
 ? clockevents_program_event+0xb3/0x140
 ? hrtimer_interrupt+0xf6/0x250
 ? __sysvec_apic_timer_interrupt+0x4e/0x150
 ? sysvec_apic_timer_interrupt+0x8d/0xd0
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
 ? _nv046532rm+0xee/0x180 [nvidia]
 ? _nv046532rm+0x115/0x180 [nvidia]
 ? down+0x36/0x80
 ? _nv013493rm+0x33/0xb0 [nvidia]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? _nv013496rm+0x2c4/0x4e0 [nvidia]
 ? _nv043117rm+0x83e/0x1120 [nvidia]
 ? _nv050883rm+0xb06/0x2330 [nvidia]
 ? _nv004567rm+0xcd/0x1c0 [nvidia]
 ? _nv003877rm+0x4b/0x80 [nvidia]
 ? _nv045275rm+0x98/0x1b0 [nvidia]
 ? _nv010971rm+0x27b/0x5f0 [nvidia]
 ? _nv047216rm+0x2a5/0xac0 [nvidia]
 ? _nv047214rm+0x224/0x3a0 [nvidia]
 ? _nv045395rm+0x16f/0x320 [nvidia]
 ? _nv045396rm+0x5c/0x90 [nvidia]
 ? _nv014117rm+0x26/0x30 [nvidia]
 ? _nv014139rm+0x52/0x90 [nvidia]
 ? security_capable+0x44/0x80
 ? _nv012663rm+0xc8/0x120 [nvidia]
 ? _nv000681rm+0x63/0x70 [nvidia]
 ? _nv000599rm+0x31/0x40 [nvidia]
 ? _nv000731rm+0x240/0xeb0 [nvidia]
 ? rm_ioctl+0x58/0xb0 [nvidia]
 ? nvidia_unlocked_ioctl+0x69c/0x920 [nvidia]
 ? __x64_sys_ioctl+0xa0/0xf0
 ? x64_sys_call+0xa68/0x24b0
 ? do_syscall_64+0x81/0x170
 ? do_syscall_64+0x81/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __mod_memcg_lruvec_state+0xa9/0x1b0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __mod_lruvec_state+0x36/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __lruvec_stat_mod_folio+0x70/0xc0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? set_ptes.constprop.0+0x2b/0xb0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_anonymous_page+0x1a3/0x430
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? handle_pte_fault+0x1cb/0x1d0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __handle_mm_fault+0x64e/0x790
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __count_memcg_events+0x80/0x130
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? count_memcg_events.constprop.0+0x2a/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? handle_mm_fault+0xad/0x380
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_user_addr_fault+0x337/0x670
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? irqentry_exit_to_user_mode+0x78/0x260
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? irqentry_exit+0x43/0x50
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? exc_page_fault+0x94/0x1b0
 ? entry_SYSCALL_64_after_hwframe+0x78/0x80
 </TASK>

The above configuration helped, so it happens less often. However, fallen off the bus happened again after a few weeks.

NVRM: GPU at PCI:0000:21:00: GPU-0535a00b-ecd6-8908-7591-0b6e0d4df252
NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.
NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
nvidia-modeset: ERROR: GPU:4: Failed detecting connected display devices
$ nvidia-smi
Unable to determine the device handle for GPU0000:21:00.0: Unknown Error

Report log: Processing: nvidia-bug-report-20241212.log.gz…

The error:

Unable to determine the device handle for GPU0000:21:00.0

is common to this post and your first post in this thread. The next step, if you haven’t already, would be to remove it from the system. It will probably mean a bit of trial and error, using lspci to see which is gone, to find out which card is on 0000:21:00.0.
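
For the lspci part of that trial and error, a device that has dropped off the bus usually shows up the way 21:00.0 does in the first post, with `(rev ff)` on its header line. The Python sketch below picks those lines out of saved `lspci` output; treating `(rev ff)` as the marker is an assumption based on that dump, and `unresponsive_devices` is a hypothetical helper name.

```python
import re

# A device header line in lspci output starts with bus:device.function,
# optionally preceded by a PCI domain (e.g. "0000:21:00.0").
HEADER_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (.+)$")


def unresponsive_devices(lspci_text):
    """Return (address, description) for devices whose header line shows 'rev ff'."""
    found = []
    for line in lspci_text.splitlines():
        match = HEADER_RE.match(line)
        if match and "(rev ff)" in match.group(2):
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    # Header lines copied from the lspci dump in the first post.
    sample = "\n".join([
        "01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1) (prog-if 00 [VGA controller])",
        "21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev ff) (prog-if ff)",
        "        !!! Unknown header type 7f",
        "21:00.1 Audio device: NVIDIA Corporation Device 22ba (rev ff) (prog-if ff)",
    ])
    for address, description in unresponsive_devices(sample):
        print(address, description)
```

The kernel log also prints the GPU UUID next to the PCI address, so recording the UUID-to-address mapping while all cards are healthy may save some of the guesswork when identifying the physical card.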

I realise it’s watercooled, but carefully unplugging the power connector and then unplugging the card from the slot, with a folded sheet of paper wrapped around the card across the slot edge connector to stop shorts occurring, should work.