My server is a Dell 7525.
OS:
Linux version 5.19.0-35-generic (buildd@lcy02-amd64-020)
Ubuntu 11.3.0-1ubuntu1~22.04
Driver Version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GPUs:
2 × NVIDIA Corporation GA102 [GeForce RTX 3090]
When I run these commands:
# nvidia-smi
Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error
# nvidia-debugdump --list
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
The same problem has happened twice. After rebooting, the system works normally for several hours and then breaks down again.
The debug file is: nvidia-bug-report.log.gz (119.9 KB)
How can I find the cause of the bug in the report file? It's too complex and unreadable for me.
You're getting an Xid 79, "GPU has fallen off the bus". The most common reasons are overheating or lack of power. Monitor temperatures, reseat the power connectors and the card in its slot, and check/replace the PSU.
To check for power issues, you can use nvidia-smi -lgc to lock the graphics clocks and prevent boost-related power spikes, e.g.
nvidia-smi -lgc 300,1500
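To correlate a crash with thermal or power events, it also helps to log the GPU stats continuously. A minimal sketch, assuming the standard `nvidia-smi --query-gpu` fields; the poll interval and log path are just examples:

```python
import csv
import subprocess
import time

# Fields exposed by `nvidia-smi --query-gpu` (standard field names)
FIELDS = "timestamp,index,temperature.gpu,power.draw,clocks.sm"

def parse_row(line):
    """Parse one CSV row into (timestamp, gpu_index, temp_C, power_W, sm_MHz)."""
    ts, idx, temp, power, clock = [f.strip() for f in line.split(",")]
    return ts, int(idx), float(temp), float(power), float(clock)

def read_gpu_stats():
    """Query nvidia-smi once and return one parsed row per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [parse_row(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    # Poll every 5 s and append to a CSV log (path is just an example)
    with open("/tmp/gpu-monitor.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            for row in read_gpu_stats():
                writer.writerow(row)
            f.flush()
            time.sleep(5)
```

If the last logged rows before an Xid 79 show a power spike or a temperature climb, that points at the PSU or cooling respectively.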
nvidia-debugdump --list
Found 4 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
[root@node02]# less /var/log/messages | grep NVRM
Sep 17 20:27:58 node02 kernel: NVRM: GPU at PCI:0000:41:00: GPU-c80141d8-2ecb-d4bd-f000-943a0b30b0d5
Sep 17 20:27:58 node02 kernel: NVRM: GPU Board Serial Number: 1654622009200
Sep 17 20:27:58 node02 kernel: NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU serial number is 1654622009200.
Sep 17 20:27:58 node02 kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
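For a long syslog, the Xid events can be tallied with a short script instead of eyeballing the grep output. A sketch, assuming the NVRM line format shown above; the log path is an example (on systemd-only machines, feed it `journalctl -k` output instead):

```python
import re
from collections import Counter

# Matches lines like:
#   NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def tally_xids(lines):
    """Count (PCI address, Xid code) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts

if __name__ == "__main__":
    # Log path is an example
    with open("/var/log/messages", errors="replace") as f:
        for (pci, xid), n in tally_xids(f).most_common():
            print(f"{n:4d}x Xid {xid} on {pci}")
```

If the same PCI address shows up every time, the problem is tied to one card or one slot rather than the system as a whole.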
nvidia-smi just reports:
Unable to determine the device handle for GPU 0000:41:00.0: Unknown Error
I can exclude overheating, as the temperatures were monitored and fine the whole time (max 65 °C). I'll try whether reseating helps - what other issues could cause this?
Hi, I'm getting the same error and couldn't figure out the cause. Can you please help? This is my log after running nvidia-bug-report.sh: nvidia-bug-report.log.gz (1.6 MB). I have two L40S GPUs and the problem always appears in only one of them.
The L40S shut down without any other error. Please check its power connectors and swap it with the other card. If that yields nothing, it's likely broken - check the warranty.
You might also want to monitor temperatures, though they looked fine in the other log.
After several experiments, I realized that the problem is temperature-related. The GPU with the error reaches 99 °C and then shuts down, which I assume is the automatic thermal protection.