My server is a Dell 7525.
OS: Ubuntu 22.04
Kernel: Linux 5.19.0-35-generic (buildd@lcy02-amd64-020), built with gcc (Ubuntu 11.3.0-1ubuntu1~22.04)
Driver Version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GPUs:
2 × NVIDIA Corporation GA102 [GeForce RTX 3090]
When I run the commands:
# nvidia-smi
Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error
# nvidia-debugdump --list
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
The same problem has happened twice now. After rebooting, the system works normally for several hours and then breaks down again.
The debug file is:
nvidia-bug-report.log.gz (119.9 KB)
How can I find the cause of the bug in the report file? It’s too complex and unreadable for me.
The nvidia-bug-report.log is truncated; it doesn’t contain any dmesg logs from the crash.
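A quick way to scan it yourself (a rough sketch, assuming the standard report name and that the machine is still in the failed state) is to grep the report and the live kernel log for the NVRM/Xid lines, and then regenerate the report before rebooting so that it contains the dmesg output from the crash:
zcat nvidia-bug-report.log.gz | grep -iE "NVRM|Xid|fallen off the bus"
sudo dmesg | grep -iE "NVRM|Xid"
sudo journalctl -k -b | grep -iE "NVRM|Xid"
sudo nvidia-bug-report.sh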
I rebooted the machine, but it repeatedly outputs:
nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4040
Then I restarted again and it works normally. The full nvidia bug report is:
nvidia-bug-report.log.gz (626.2 KB)
In the report file, I found:
NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.
Maybe it is a power problem?
Thanks, I have regenerated the bug log below. Can you help take a look at the full log?
You’re getting an Xid 79, “GPU has fallen off the bus”. The most common reasons are overheating or lack of power. Monitor temperatures, reseat the power connectors and the card in its slot, and check or replace the PSU.
To check for power issues, you can use nvidia-smi -lgc to lock the GPU clocks and prevent boost situations, e.g.
nvidia-smi -lgc 300,1500
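If you also want a record of what the board is doing right up to the crash (a minimal sketch; the file name gpu-monitor.csv is just an example), you can log power draw, temperature and clocks every few seconds:
nvidia-smi --query-gpu=timestamp,index,power.draw,power.limit,temperature.gpu,clocks.sm --format=csv -l 5 > gpu-monitor.csv
nvidia-smi -rgc removes the clock lock again once you are done testing. If power.draw regularly spikes close to power.limit right before the crashes, that supports the lack-of-power theory.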
Did you already fix this issue?
I have got the same:
nvidia-debugdump --list
Found 4 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
[root@node02]# less /var/log/messages | grep NVRM
Sep 17 20:27:58 node02 kernel: NVRM: GPU at PCI:0000:41:00: GPU-c80141d8-2ecb-d4bd-f000-943a0b30b0d5
Sep 17 20:27:58 node02 kernel: NVRM: GPU Board Serial Number: 1654622009200
Sep 17 20:27:58 node02 kernel: NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU serial number is 1654622009200.
Sep 17 20:27:58 node02 kernel: NVRM: A GPU crash dump has been created. If possible, please run nvidia-bug-report.sh as root to collect this data before the NVIDIA kernel module is unloaded.
nvidia-smi just reports:
Unable to determine the device handle for GPU 0000:41:00.0: Unknown Error
I can exclude overheating, as the temperatures were monitored and were fine the whole time (max 65°C). I will try whether reseating helps. Which other issues could cause this?
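Besides power and seating, Xid 79 can also come from the PCIe link itself dropping (riser cable, slot, or mainboard). As a sketch, using the bus ID 41:00.0 from your log, you could check the link state while the card is still up and look for PCIe/AER messages in the kernel log:
sudo lspci -vvv -s 41:00.0 | grep -iE "LnkCap|LnkSta"
sudo journalctl -k -b | grep -iE "aer|pcie bus error"
If the link shows a degraded speed or width, or AER errors appear shortly before the Xid 79, the slot/riser is a more likely culprit than the PSU.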