GPU has fallen off the bus twice in a month

Twice in a month we have lost our GPU on a brand-new HPE Cray XD670.
After the first incident I upgraded all of the system firmware, hoping that would fix the problem, but it happened again.
The machine runs Rocky Linux 9.5, and I updated everything after the first incident, so everything should be up to date.
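For reference, this is roughly how I verify that the driver, VBIOS and GSP firmware on the node match what I expect after an update (illustrative commands assuming the standard NVIDIA tooling is installed, not output from the failing node):

nvidia-smi --query-gpu=driver_version,vbios_version --format=csv
nvidia-smi -q | grep -i "gsp firmware"     # GSP firmware version in use
dnf list installed 'nvidia*' 'kernel*'     # driver and kernel packages on the host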

/proc/cmdline is BOOT_IMAGE=(hd1,gpt2)/boot/vmlinuz-5.14.0-503.19.1.el9_5.x86_64 root=UUID=.... ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M ipv6.disable=1 net.ifnames=0 selinux=0 console=ttyS0,115200n8 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau psi=1

uname -a is Linux XXXX 5.14.0-503.19.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

modinfo nvidia is

filename:       /lib/modules/5.14.0-503.19.1.el9_5.x86_64/extra/nvidia.ko.xz
import_ns:      DMA_BUF
alias:          char-major-195-*
version:        565.57.01
supported:      external
license:        Dual MIT/GPL
firmware:       nvidia/565.57.01/gsp_tu10x.bin
firmware:       nvidia/565.57.01/gsp_ga10x.bin
rhelversion:    9.5
srcversion:     A009FF0B705D0A73BFBE867
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        drm
retpoline:      Y
name:           nvidia
vermagic:       5.14.0-503.19.1.el9_5.x86_64 SMP preempt mod_unload modversions 

People use the GPU through Podman containers, running as unprivileged users (roughly as sketched below).
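I don't have every user's exact command line, but GPU access from the containers looks roughly like this, assuming the NVIDIA Container Toolkit with CDI is configured (the image name and workload below are placeholders, not what our users actually run):

# one-time setup as root: generate the CDI spec for the installed driver
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# what an unprivileged user then runs (placeholder image and command)
podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi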
The first time, it failed with:

21 Dec  2024, 05:02:40.022 NVRM: Xid (PCI:0000:18:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
21 Dec  2024, 05:02:40.022 NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
21 Dec  2024, 05:02:40.022 NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
21 Dec  2024, 05:02:40.022 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
21 Dec  2024, 05:02:40.022 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
21 Dec  2024, 05:02:40.022 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
21 Dec  2024, 05:02:40.022 NVRM: Xid (PCI:0000:18:00): 154, pid='<unknown>', name=<unknown>, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
21 Dec  2024, 05:02:40.022 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
21 Dec  2024, 05:02:40.022 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
21 Dec  2024, 05:02:40.022 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
21 Dec  2024, 05:02:40.022 NVRM: GPU Board Serial Number: 1654923008111
21 Dec  2024, 05:02:40.022 NVRM: GPU 0000:18:00.0: GPU serial number is 1654923008111.
21 Dec  2024, 05:02:40.022 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
21 Dec  2024, 05:02:40.022 NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
21 Dec  2024, 05:02:40.022 NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
21 Dec  2024, 05:02:40.022 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
21 Dec  2024, 05:02:40.022 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
21 Dec  2024, 05:02:40.022 NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ journal.c:2239
21 Dec  2024, 05:02:40.021 NVRM: GPU at PCI:0000:18:00: GPU-ece4b50d-b21b-4a6a-8ec6-fecc012bb807

The second failure was:

04 Jan  2025, 20:31:44.699 NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
04 Jan  2025, 20:31:44.699 NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
04 Jan  2025, 20:31:44.699 NVRM: GPU Board Serial Number: 1654923008111
04 Jan  2025, 20:31:44.699 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
04 Jan  2025, 20:31:44.699 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
04 Jan  2025, 20:31:44.699 NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
04 Jan  2025, 20:31:44.699 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
04 Jan  2025, 20:31:44.699 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
04 Jan  2025, 20:31:44.699 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
04 Jan  2025, 20:31:44.699 NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ journal.c:2239
04 Jan  2025, 20:31:44.699 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
04 Jan  2025, 20:31:44.699 NVRM: Xid (PCI:0000:18:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
04 Jan  2025, 20:31:44.699 NVRM: GPU 0000:18:00.0: GPU serial number is 1654923008111.
04 Jan  2025, 20:31:44.699 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
04 Jan  2025, 20:31:44.699 NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
04 Jan  2025, 20:31:44.699 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
04 Jan  2025, 20:31:44.699 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
04 Jan  2025, 20:31:44.699 NVRM: Xid (PCI:0000:18:00): 154, pid='<unknown>', name=<unknown>, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
04 Jan  2025, 20:31:44.698 NVRM: GPU at PCI:0000:18:00: GPU-ece4b50d-b21b-4a6a-8ec6-fecc012bb807

nvidia-bug-report.log.gz (4.4 MB)
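For completeness: the attached report was collected with the command the driver suggests in the log above, run as root right after the failure, i.e.

sudo nvidia-bug-report.sh

which writes nvidia-bug-report.log.gz into the current directory.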
