Since I bought a new PC this year, my NVIDIA GPUs (or at least one of them) have been unstable and keep crashing under random load, mainly with the ‘fallen off the bus’ error, especially after running any CUDA software.
I’ve upgraded the BIOS to the latest version and this still happens.
The GPUs are not overheating; temperatures stay low thanks to the watercooling loop.
I recently took the machine to the Scan shop in the UK for repair, but they couldn’t reproduce the issue with standard or stress benchmarking tools.
When this happens, any attempt to switch virtual terminals or reboot leaves the system frozen, although Magic SysRq combinations still work, so I can restart with the REISUB sequence (sketched below).
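For reference, this is roughly how I recover the machine when it locks up; the sysctl step assumes your distro does not already enable all SysRq functions (Ubuntu ships a restricted default):
$ cat /proc/sys/kernel/sysrq           # current bitmask; 1 = all functions enabled
$ echo 1 | sudo tee /proc/sys/kernel/sysrq
# On the frozen console, hold Alt+SysRq and press, with a short pause between keys:
#   R  E  I  S  U  B
# (unRaw keyboard, tErminate, kIll, Sync, remoUnt read-only, reBoot)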
Specs:
- GPUs: NVIDIA GeForce RTX 4090 x5, watercooled
- Motherboard: Pro WS WRX90E-SAGE SE
- BIOS: Upgraded to latest firmware 0803 (30/08/2024)
- OS: Ubuntu 22.04 with NVIDIA driver 550.127
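For completeness, these are the commands I used (or would use) to confirm the versions listed above; nothing exotic, just standard tooling:
$ nvidia-smi --query-gpu=name,driver_version --format=csv
$ sudo dmidecode -s bios-version        # should report 0803
$ lsb_release -d                        # Ubuntu 22.04
$ uname -r                              # 6.8.0-49-generic here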
More details below.
Error from today, literally after 10 minutes of uptime:
pcieport 0000:20:01.1: DPC: containment event, status:0x1f01 source:0x0000
NVRM: GPU at PCI:0000:21:00: GPU-0535a00b-ecd6-8908-7591-0b6e0d4df252
pcieport 0000:20:01.1: DPC: unmasked uncorrectable error detected
NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
pcieport 0000:20:01.1: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:20:01.1: device [1022:14ab] error status/mask=00000020/00000000
NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.
pcieport 0000:20:01.1: [ 5] SDES (First)
nvidia 0000:21:00.0: AER: can't recover (no error_detected callback)
snd_hda_intel 0000:21:00.1: AER: can't recover (no error_detected callback)
pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
pcieport 0000:20:01.1: retraining failed
pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
pcieport 0000:20:01.1: retraining failed
nvidia 0000:21:00.0: not ready 1023ms after DPC; waiting
nvidia 0000:21:00.0: not ready 2047ms after DPC; waiting
nvidia 0000:21:00.0: not ready 4095ms after DPC; waiting
nvidia 0000:21:00.0: not ready 8191ms after DPC; waiting
nvidia 0000:21:00.0: not ready 16383ms after DPC; waiting
nvidia 0000:21:00.0: not ready 32767ms after DPC; waiting
nvidia 0000:21:00.0: not ready 65535ms after DPC; giving up
pcieport 0000:20:01.1: AER: subordinate device reset failed
pcieport 0000:20:01.1: AER: device recovery failed
nvidia-modeset: ERROR: GPU:4: Failed detecting connected display devices
...
pcieport 0000:00:05.1: AER: Correctable error message received from 0000:02:00.0
i40e 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
i40e 0000:02:00.0: device [8086:15ff] error status/mask=00001000/00000000
i40e 0000:02:00.0: [12] Timeout
$ nvidia-smi
Unable to determine the device handle for GPU0000:21:00.0: Unknown Error
$ lspci -vvv
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
IOMMU group: 25
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin ? routed to IRQ 56
IOMMU group: 26
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 00001000-00001fff [size=4K]
Memory behind bridge: f3000000-f40fffff [size=17M]
Prefetchable memory behind bridge: 00000100f0000000-0000010101ffffff [size=288M]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: <access denied>
Kernel driver in use: pcieport
...
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1) (prog-if 00 [VGA controller])
Subsystem: ZOTAC International (MCO) Ltd. Device 1675
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 318
IOMMU group: 35
Region 0: Memory at f3000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 100f0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at 10100000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 1000 [size=128]
Expansion ROM at f4000000 [virtual] [disabled] [size=512K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device: NVIDIA Corporation Device 22ba (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. Device 1675
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 54
IOMMU group: 35
Region 0: Memory at f4080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
...
20:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 149f (rev 01)
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
IOMMU group: 39
20:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 14ab (rev 01) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin ? routed to IRQ 61
IOMMU group: 40
Bus: primary=20, secondary=21, subordinate=21, sec-latency=0
I/O behind bridge: 00004000-00004fff [size=4K]
Memory behind bridge: b1000000-b20fffff [size=17M]
Prefetchable memory behind bridge: 0000014800000000-0000014811ffffff [size=288M]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: <access denied>
Kernel driver in use: pcieport
...
21:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
21:00.1 Audio device: NVIDIA Corporation Device 22ba (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
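For what it’s worth, this is how I check the link state of the bridge above the dead GPU after the error (bus addresses are the ones from my dump above; root is needed, otherwise the capability blocks show ‘<access denied>’ as above):
$ sudo lspci -vvv -s 20:01.1 | grep -iE 'lnkcap|lnksta'
$ sudo lspci -vvv -s 21:00.0 | grep -iE 'lnkcap|lnksta'   # returns nothing useful once the GPU reads back as rev ff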
Other related errors when the system is frozen:
nvidia-modeset: ERROR: GPU:4: Error while waiting for GPU progress: 0x0000c77d:0 2:0:4048:4040
INFO: task nvidia-modeset/:2308 blocked for more than 122 seconds.
Tainted: P OE 6.8.0-49-generic #49~22.04.1-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvidia-modeset/ state:D stack:0 pid:2308 tgid:2308 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x27c/0x6a0
schedule+0x33/0x110
schedule_timeout+0x157/0x170
___down_common+0xfd/0x160
? srso_alias_return_thunk+0x5/0xfbef5
__down_common+0x22/0xd0
__down+0x1d/0x30
down+0x54/0x80
nvkms_kthread_q_callback+0x9a/0x160 [nvidia_modeset]
_main_loop+0x7f/0x140 [nvidia_modeset]
? __pfx__main_loop+0x10/0x10 [nvidia_modeset]
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
...
runnable tasks:
S task PID tree-key switches prio wait-time sum-exec sum-sleep
S irq/315-nvidia 2316 4563.283918 E 4566.281895 3.000000 50053.035885 1432738 49 0.000000 50053.035885 0.000000 0.000000 0 0 /
S nvidia 2317 9256.322720 E 9259.321759 3.000000 0.005277 2 120 0.000000 0.005277 0.000000 0.000000 0 0 /
Logs:
- lspci-vvv.20251125.log (51.7 KB)
- Processing: nvidia-bug-report.20241125-02.log.gz…
- Processing: nvidia-bug-report.20241125.log.gz…
- Uploading: nvidia-bug-report.20241117.log.gz…
- Uploading: nvidia-bug-report.20240616.log.gz…
- Uploading: nvidia-bug-report.20240510.log.gz…
Some of the bug-report files have been stuck at processing/uploading for a few days, so for some reason I can’t attach them.
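In case it helps, the reports above were generated with the script that ships with the NVIDIA driver; roughly:
$ sudo nvidia-bug-report.sh
# writes nvidia-bug-report.log.gz into the current directory;
# the dated filenames above are just my renames of those archives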