Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error

My server is a Dell 7525.
Linux version 5.19.0-35-generic (buildd@lcy02-amd64-020)
Ubuntu 11.3.0-1ubuntu1~22.04
Driver Version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
2 × NVIDIA Corporation GA102 [GeForce RTX 3090]
When I run the commands:

# nvidia-smi
Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error
# nvidia-debugdump --list
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

The same problem has happened twice. After each reboot, the system works normally for several hours and then fails again.
The debug file is:
nvidia-bug-report.log.gz (119.9 KB)

How can I find the cause of the bug in the report files? They are too complex and unreadable for me.

The nvidia-bug-report.log is truncated; it doesn’t contain any dmesg logs from the crash.

I rebooted the machine, but it repeatedly prints:

nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4040

Then I restarted again and it worked normally. The full nvidia bug report is:
nvidia-bug-report.log.gz (626.2 KB)

In the report file, I found:

NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.
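
For anyone else who finds the report hard to read, here is a minimal sketch of pulling only the Xid events out of a log. It assumes the lines follow the `NVRM: Xid (PCI:...): <code>,` format shown above. The function names `xid_events` and `xid_codes` are made up for this example and are not part of any NVIDIA tool.

```shell
#!/bin/sh
# Print every NVRM Xid line from the given log file(s), or from stdin if no
# file is given. (Hypothetical helper, not an NVIDIA utility.)
xid_events() {
    grep -h 'NVRM: Xid' "$@"
}

# Print each distinct Xid code together with how often it occurs.
xid_codes() {
    xid_events "$@" |
        sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' |
        sort -n | uniq -c
}

# Example usage:
#   zcat nvidia-bug-report.log.gz | xid_codes
# Kernel log of the previous boot (assumes a persistent systemd journal):
#   journalctl -k -b -1 | xid_events
```

The code printed by `xid_codes` (79 in this case) can then be looked up in NVIDIA's Xid documentation.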

Could this be a power problem?

Thanks, I have regenerated the bug log below. Can you help look at the full log?

You’re getting an Xid 79, "GPU has fallen off the bus". The most common causes are overheating or insufficient power. Monitor temperatures, reseat the power connectors and the card in its slot, and check or replace the PSU.
To check for power issues, you can use nvidia-smi -lgc to lock the GPU clocks to a limited range so the card cannot boost, e.g.
nvidia-smi -lgc 300,1500
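
A minimal sketch of the monitoring side, assuming `nvidia-smi` is on the PATH and that the clock commands are run as root. The function names, the `gpu-stats.csv` file name and the 5-second interval are made up for illustration; the query fields are standard `nvidia-smi --query-gpu` fields.

```shell
#!/bin/sh
# Append timestamp, bus id, temperature, power draw and SM clock for all GPUs
# to a CSV file every INTERVAL seconds. Runs until interrupted, so the last
# lines written before a failure show the state just before the GPU dropped.
log_gpu_stats() {
    out=${1:-gpu-stats.csv}
    interval=${2:-5}
    nvidia-smi \
        --query-gpu=timestamp,pci.bus_id,temperature.gpu,power.draw,clocks.sm \
        --format=csv,noheader,nounits -l "$interval" >> "$out"
}

# Report the highest temperature (C) and power draw (W) seen per GPU in a
# log written by log_gpu_stats.
summarize_gpu_stats() {
    awk -F', ' '
        {
            if ($3 + 0 > t[$2]) t[$2] = $3 + 0
            if ($4 + 0 > p[$2]) p[$2] = $4 + 0
        }
        END { for (g in t) printf "%s max_temp=%s max_power=%s\n", g, t[g], p[g] }
    ' "$1" | sort
}

# Example usage, as root:
#   nvidia-smi -lgc 300,1500          # cap the clocks as suggested above
#   log_gpu_stats gpu-stats.csv 5 &   # log in the background
#   ... run the usual workload for longer than it normally takes to fail ...
#   nvidia-smi -rgc                   # restore default clock behaviour
#   summarize_gpu_stats gpu-stats.csv
```

If the GPU stays on the bus with the clocks capped but drops off again at default clocks, that would point toward power delivery rather than temperature, though it is not proof on its own.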