This is a machine with two RTX 3090s. nvidia-smi fails with an unknown error:

nvidia-smi
Unable to determine the device handle for GPU0000:06:00.0: Unknown Error
lspci | grep NVIDIA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)
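If it helps with diagnosis: the PCIe link state of the GPU that later drops (0000:05:00.0 in the dmesg output below) can be checked with something along these lines, both before and after a failure, to see whether the link has degraded or gone down entirely:

sudo lspci -vvv -s 05:00.0 | grep -iE 'lnkcap|lnksta'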
nvidia-debugdump --list
Found 2 NVIDIA devices
Device ID: 0
Device name: NVIDIA GeForce RTX 3090
GPU internal ID: GPU-a3d80cbb-da8a-2369-8f5b-17116261085c
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error
dmesg | grep NVRM
[ 6.962381] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 560.35.03 Release Build (dvs-builder@U16-I1-N07-12-3) Fri Aug 16 21:42:42 UTC 2024
[ 4468.225307] NVRM: GPU at PCI:0000:05:00: GPU-75ee6c97-91d3-2803-0a3d-a33b640ddc4f
[ 4468.225323] NVRM: Xid (PCI:0000:05:00): 79, pid='', name=, GPU has fallen off the bus.
[ 4468.225332] NVRM: GPU 0000:05:00.0: GPU has fallen off the bus.
[ 4468.225363] NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
[ 4468.225371] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 4468.225674] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[ 4468.225684] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[ 4468.225746] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[ 4468.225749] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[ 4468.225755] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[ 4468.225758] NVRM: nvCheckOkFailedNoLog: Check failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from nvdEngineDumpCallbackHelper(pGpu, pPrbEnc, pNvDumpState, pEngineCallback) @ nv_debug_dump.c:274
[ 4468.225805] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ journal.c:2239
[153140.656837] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
[153140.657702] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 78!
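As I understand it, Xid 79 ("GPU has fallen off the bus") means the driver lost contact with the card over PCIe, which usually points at power delivery, slot/riser seating, or overheating rather than a pure software problem. To capture the conditions leading up to the next drop, I plan to log temperature, power draw, and PCIe link state in the background; a minimal sketch (the 5-second interval and the log file name are arbitrary choices):

nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,pcie.link.gen.current,pcie.link.width.current --format=csv -l 5 >> gpu-monitor.csv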
Bug report attached: nvidia-bug-report.log.gz (465.9 KB)
The same failure keeps recurring. What can I do to track down the cause and keep the GPU from dropping off the bus?
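Two things I am considering, but would appreciate confirmation on before trying:

Would enabling persistence mode and capping the power limit reduce the chance of a transient power spike knocking the card off the bus? A sketch of what I have in mind (the 300 W value is only a placeholder, not a recommendation):

sudo nvidia-smi -pm 1      # enable persistence mode on all GPUs
sudo nvidia-smi -pl 300    # cap the power limit (placeholder value, applies to all GPUs)

And once a GPU has dropped, is a PCI remove/rescan expected to bring it back without a reboot, or does Xid 79 always require a full power cycle? Something like the following, with the device address taken from the dmesg output above and the driver ideally unloaded first:

echo 1 | sudo tee /sys/bus/pci/devices/0000:05:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan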