GPU has fallen off the bus (L40S)

I have a machine with 8 L40S GPUs, and one or two of them consistently disappear. The GPUs are not overheating: temperature and utilization are low when this happens. Running nvidia-smi returns "Unable to determine the device handle for GPU… Unknown Error" for the GPUs that have crashed. I have tried nvidia-smi drain and nvidia-smi --gpu-reset, but nothing brings the GPUs back other than a system reboot. Even then, the recovery is only temporary and the GPUs crash again.

Possibly related errors from the bug report:

Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU at PCI:0000:25:00: GPU-4b88f852-d76f-f94d-d556-cc4dcdaeb183

Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU Board Serial Number: 1323523031514

Sep 05 17:29:20 debian-s1906 kernel: NVRM: Xid (PCI:0000:25:00): 79, GPU has fallen off the bus.

Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU 0000:25:00.0: GPU has fallen off the bus.

Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU 0000:25:00.0: GPU serial number is 1323523031514.

Sep 05 17:29:20 debian-s1906 kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.

Sep 05 17:29:20 debian-s1906 kernel: NVRM: A GPU crash dump has been created. If possible, please run
                                 NVRM: nvidia-bug-report.sh as root to collect this data before
                                 NVRM: the NVIDIA kernel module is unloaded.

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273

Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ journal.c:2270

Sep 05 17:29:20 debian-s1906 kernel: NVRM: Xid (PCI:0000:25:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)

nvidia-bug-report.log.gz (3.5 MB)

Hi @neterall1, thanks for reporting this issue. "GPU has fallen off the bus" errors can be caused by loose connections or PSU-related issues, as well as by a bug in the driver. Could you check the connections and reseat the GPUs? I don’t see anything in the bug report, but I will check again and update. Thanks.
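When checking connections, it may help to know whether the same PCI address fails every time or whether the failures move between GPUs. The following is only a rough sketch, not an NVIDIA tool: it assumes the kernel log lines look like the Xid lines quoted above, and the function name count_xids is made up for illustration. It tallies Xid events per PCI address.

```python
import re
from collections import Counter

# Matches kernel log lines of the form seen in this thread, e.g.:
#   NVRM: Xid (PCI:0000:25:00): 79, GPU has fallen off the bus.
# The pattern is an assumption based on those lines only.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(lines):
    """Return a Counter keyed by (pci_address, xid_code).

    `lines` is any iterable of log lines, such as an open text file.
    Lines that do not contain an Xid report are ignored.
    """
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            pci_address = match.group(1)
            xid_code = int(match.group(2))
            counts[(pci_address, xid_code)] += 1
    return counts
```

One way to use it is to save the kernel log to a file (for example with journalctl -k > kernel.log) and call count_xids(open("kernel.log")). If Xid 79 keeps showing up against one address, that slot, riser, or power cable would be the first thing to reseat or swap; if it moves around, a PSU or driver issue seems more plausible.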