Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error

My server is a Dell 7525.
OS:
Linux version 5.19.0-35-generic (buildd@lcy02-amd64-020)
Ubuntu 11.3.0-1ubuntu1~22.04
Driver Version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GPUs:
2 × NVIDIA Corporation GA102 [GeForce RTX 3090]
When I run these commands:

# nvidia-smi
Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error
# nvidia-debugdump --list
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

The same problem has happened twice. After rebooting, the system works normally for several hours and then breaks down again.
The debug file is:
nvidia-bug-report.log.gz (119.9 KB)

How can I find the cause of the bug in the report file? It’s too complex and unreadable for me.

The nvidia-bug-report.log is truncated; it doesn’t contain any dmesg logs from the crash.
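
As a rough pointer, the most telling entries are usually the NVRM/Xid lines from the kernel log section; something like this scans the gzipped report for them directly (zgrep reads the .gz without unpacking it):

# zgrep -E "NVRM|Xid" nvidia-bug-report.log.gz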

I rebooted the machine, but it repeatedly outputs:

nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4040

Then I restarted again and it works normally. The full nvidia bug report is:
nvidia-bug-report.log.gz (626.2 KB)

In the report file, I found:

NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.

Maybe it’s a power problem?

Thanks, I regenerated the bug log below. Could you help take a look at the full log?

You’re getting an Xid 79, “GPU has fallen off the bus”. The most common reasons are overheating or lack of power. Monitor temperatures, reseat the power connectors and the card in its slot, and check/replace the PSU.
To check for power issues, you can use nvidia-smi -lgc to lock the GPU clocks and prevent boost situations, e.g.:
nvidia-smi -lgc 300,1500
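
While testing with locked clocks, it can also help to log temperature, power draw, and clock speed so you can see what the GPU was doing right before it drops off the bus. A minimal sketch using standard nvidia-smi query fields (the output file name is just an example):

# nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw,clocks.sm --format=csv -l 5 >> gpu-monitor.csv

The clock lock can be removed again later with nvidia-smi -rgc.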

Did you already fix this issue?

I’ve got the same problem:

nvidia-debugdump --list
Found 4 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
[root@node02]# less /var/log/messages | grep NVRM
Sep 17 20:27:58 node02 kernel: NVRM: GPU at PCI:0000:41:00: GPU-c80141d8-2ecb-d4bd-f000-943a0b30b0d5
Sep 17 20:27:58 node02 kernel: NVRM: GPU Board Serial Number: 1654622009200
Sep 17 20:27:58 node02 kernel: NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU serial number is 1654622009200.
Sep 17 20:27:58 node02 kernel: NVRM: A GPU crash dump has been created. If possible, please run#012NVRM: nvidia-bug-report.sh as root to collect this data before#012NVRM: the NVIDIA kernel module is unloaded.

nvidia-smi just reports:

Unable to determine the device handle for GPU 0000:41:00.0: Unknown Error

I can exclude overheating, as the temperatures were monitored and were fine the whole time (max 65°C). I will try whether reseating helps. Which other issues could cause this?