I have a machine with eight L40S GPUs, and one or two of them consistently disappear. The GPUs are not overheating: temperature and utilization are both low when this happens. For the GPUs that have crashed, nvidia-smi returns "Unable to determine the device handle for GPU… Unknown Error". I have tried nvidia-smi drain and nvidia-smi --gpu-reset, but nothing brings the GPUs back other than a system reboot. Even then, the fix is only temporary and the GPUs crash again.
Possibly related errors from the bug report:
Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU at PCI:0000:25:00: GPU-4b88f852-d76f-f94d-d556-cc4dcdaeb183
Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU Board Serial Number: 1323523031514
Sep 05 17:29:20 debian-s1906 kernel: NVRM: Xid (PCI:0000:25:00): 79, GPU has fallen off the bus.
Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU 0000:25:00.0: GPU has fallen off the bus.
Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU 0000:25:00.0: GPU serial number is 1323523031514.
Sep 05 17:29:20 debian-s1906 kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
Sep 05 17:29:20 debian-s1906 kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:273
Sep 05 17:29:20 debian-s1906 kernel: NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ journal.c:2270
Sep 05 17:29:20 debian-s1906 kernel: NVRM: Xid (PCI:0000:25:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
nvidia-bug-report.log.gz (3.5 MB)
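In case it helps anyone compare logs, here is a minimal Python sketch for pulling the Xid events out of kernel log lines like the ones above and grouping them by PCI address. The regular expression is an assumption based only on the two Xid lines in this excerpt, and the helper names (xid_events, xids_by_gpu) are made up for illustration; other driver versions may format these lines differently.

```python
import re
from collections import defaultdict

# Pattern inferred from the excerpt above, e.g.
#   "... kernel: NVRM: Xid (PCI:0000:25:00): 79, GPU has fallen off the bus."
# This is an assumption about the line format, not an official grammar.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+), (.*)")


def xid_events(lines):
    """Yield (pci_address, xid_code, message) for each line that reports an Xid."""
    for line in lines:
        match = XID_RE.search(line)
        if match:
            yield match.group(1), int(match.group(2)), match.group(3).strip()


def xids_by_gpu(lines):
    """Group Xid codes by PCI address, keeping the order they appear in the log."""
    grouped = defaultdict(list)
    for pci, code, _message in xid_events(lines):
        grouped[pci].append(code)
    return dict(grouped)


# Three lines copied from the log excerpt in this post.
sample_log = [
    "Sep 05 17:29:20 debian-s1906 kernel: NVRM: Xid (PCI:0000:25:00): 79, GPU has fallen off the bus.",
    "Sep 05 17:29:20 debian-s1906 kernel: NVRM: GPU 0000:25:00.0: GPU has fallen off the bus.",
    "Sep 05 17:29:20 debian-s1906 kernel: NVRM: Xid (PCI:0000:25:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)",
]

for pci, codes in xids_by_gpu(sample_log).items():
    print(pci, codes)  # prints: 0000:25:00 [79, 154]
```

Running the same functions over the full kernel log (for example, the lines printed by journalctl -k) should show whether the same PCI address drops off the bus every time or whether the failures move between GPUs.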