GPU has fallen off the bus twice in a month

Twice in a month we have lost our GPU on a brand-new HPE Cray XD670.
After the first incident I upgraded all of the system firmware, hoping that would fix the problem, but it happened again.
The machine runs Rocky Linux 9.5, and I updated everything after the first incident, so everything should be up to date.
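For reference, this is roughly how I verify that the driver, VBIOS and GSP firmware on the node match what I expect after an update (illustrative commands assuming the standard NVIDIA tooling is installed, not output from the failing node):

nvidia-smi --query-gpu=driver_version,vbios_version --format=csv
nvidia-smi -q | grep -i "gsp firmware"     # GSP firmware version in use
dnf list installed 'nvidia*' 'kernel*'     # driver and kernel packages on the host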

/proc/cmdline is BOOT_IMAGE=(hd1,gpt2)/boot/vmlinuz-5.14.0-503.19.1.el9_5.x86_64 root=UUID=.... ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M ipv6.disable=1 net.ifnames=0 selinux=0 console=ttyS0,115200n8 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau psi=1

uname -a is Linux XXXX 5.14.0-503.19.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Dec 19 12:55:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

modinfo nvidia is

filename:       /lib/modules/5.14.0-503.19.1.el9_5.x86_64/extra/nvidia.ko.xz
import_ns:      DMA_BUF
alias:          char-major-195-*
version:        565.57.01
supported:      external
license:        Dual MIT/GPL
firmware:       nvidia/565.57.01/gsp_tu10x.bin
firmware:       nvidia/565.57.01/gsp_ga10x.bin
rhelversion:    9.5
srcversion:     A009FF0B705D0A73BFBE867
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        drm
retpoline:      Y
name:           nvidia
vermagic:       5.14.0-503.19.1.el9_5.x86_64 SMP preempt mod_unload modversions 

People use the GPU through Podman containers, running as unprivileged users (roughly as sketched below).
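I don't have every user's exact command line, but GPU access from the containers looks roughly like this, assuming the NVIDIA Container Toolkit with CDI is configured (the image name and workload below are placeholders, not what our users actually run):

# one-time setup as root: generate the CDI spec for the installed driver
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# what an unprivileged user then runs (placeholder image and command)
podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi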
The first time, it failed with:

21 Dec  2024, 05:02:40.022 NVRM: Xid (PCI:0000:18:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
21 Dec  2024, 05:02:40.022 NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
21 Dec  2024, 05:02:40.022 NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
21 Dec  2024, 05:02:40.022 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
21 Dec  2024, 05:02:40.022 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
21 Dec  2024, 05:02:40.022 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
21 Dec  2024, 05:02:40.022 NVRM: Xid (PCI:0000:18:00): 154, pid='<unknown>', name=<unknown>, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
21 Dec  2024, 05:02:40.022 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
21 Dec  2024, 05:02:40.022 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
21 Dec  2024, 05:02:40.022 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
21 Dec  2024, 05:02:40.022 NVRM: GPU Board Serial Number: 1654923008111
21 Dec  2024, 05:02:40.022 NVRM: GPU 0000:18:00.0: GPU serial number is 1654923008111.
21 Dec  2024, 05:02:40.022 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
21 Dec  2024, 05:02:40.022 NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
21 Dec  2024, 05:02:40.022 NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
21 Dec  2024, 05:02:40.022 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
21 Dec  2024, 05:02:40.022 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
21 Dec  2024, 05:02:40.022 NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ journal.c:2239
21 Dec  2024, 05:02:40.021 NVRM: GPU at PCI:0000:18:00: GPU-ece4b50d-b21b-4a6a-8ec6-fecc012bb807

The second failure was:

04 Jan  2025, 20:31:44.699 NVRM: GPU 0000:18:00.0: GPU has fallen off the bus.
04 Jan  2025, 20:31:44.699 NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
04 Jan  2025, 20:31:44.699 NVRM: GPU Board Serial Number: 1654923008111
04 Jan  2025, 20:31:44.699 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
04 Jan  2025, 20:31:44.699 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
04 Jan  2025, 20:31:44.699 NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
04 Jan  2025, 20:31:44.699 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
04 Jan  2025, 20:31:44.699 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
04 Jan  2025, 20:31:44.699 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
04 Jan  2025, 20:31:44.699 NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ journal.c:2239
04 Jan  2025, 20:31:44.699 NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
04 Jan  2025, 20:31:44.699 NVRM: Xid (PCI:0000:18:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
04 Jan  2025, 20:31:44.699 NVRM: GPU 0000:18:00.0: GPU serial number is 1654923008111.
04 Jan  2025, 20:31:44.699 NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
04 Jan  2025, 20:31:44.699 NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
04 Jan  2025, 20:31:44.699 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
04 Jan  2025, 20:31:44.699 NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
04 Jan  2025, 20:31:44.699 NVRM: Xid (PCI:0000:18:00): 154, pid='<unknown>', name=<unknown>, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
04 Jan  2025, 20:31:44.698 NVRM: GPU at PCI:0000:18:00: GPU-ece4b50d-b21b-4a6a-8ec6-fecc012bb807

nvidia-bug-report.log.gz (4.4 MB)
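For completeness: the attached report was collected with the command the driver suggests in the log above, run as root right after the failure, i.e.

sudo nvidia-bug-report.sh

which writes nvidia-bug-report.log.gz into the current directory.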
