Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error

My server is a Dell 7525.
OS:
Linux version 5.19.0-35-generic (buildd@lcy02-amd64-020)
Ubuntu 11.3.0-1ubuntu1~22.04
Driver Version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GPUs:
2 × NVIDIA Corporation GA102 [GeForce RTX 3090]
When I run these commands:

# nvidia-smi
Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error
# nvidia-debugdump --list
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

The same problem has happened twice: after a reboot the system works normally for several hours and then breaks down again.
The debug file is:
nvidia-bug-report.log.gz (119.9 KB)

How can I look for the cause of the bug in the report file? It's too complex and unreadable for me.

The nvidia-bug-report.log is truncated; it doesn't contain any dmesg logs from the crash.
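To triage the report yourself, a quick first step is to search it for NVRM/Xid kernel messages. A minimal sketch (the sample text below is fabricated so the commands run anywhere; on your machine, point zgrep at the real nvidia-bug-report.log.gz instead):

```shell
# Stand-in for the dmesg section of a real bug report.
cat > sample-dmesg.txt <<'EOF'
[12345.678] NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[12345.679] NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.
[12345.680] usb 1-1: new high-speed USB device
EOF

# Keep only the NVIDIA kernel messages; the Xid number identifies the error class.
grep -E 'NVRM|Xid' sample-dmesg.txt

# Against the real compressed report:
#   zgrep -E 'NVRM|Xid' nvidia-bug-report.log.gz
```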

I rebooted the machine, but it repeatedly outputs:

nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c67d:0 2:0:4048:4040

Then I restarted and it worked normally. The full nvidia bug report is:
nvidia-bug-report.log.gz (626.2 KB)

In the report file, I found:

NVRM: Xid (PCI:0000:21:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
NVRM: GPU 0000:21:00.0: GPU has fallen off the bus.

Could it be a power problem?

Thanks, I regenerated the bug log below. Can you help look at the full log?

You're getting an Xid 79, "GPU has fallen off the bus". The most common reasons are overheating or lack of power. Monitor temperatures, reseat the power connectors and the card in its slot, and check or replace the PSU.
To check for power issues, you can use nvidia-smi -lgc to lock the GPU clocks and prevent boost situations, e.g.
nvidia-smi -lgc 300,1500
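Alongside locking clocks, it helps to log temperature and power draw around the crash and flag outliers afterwards. A sketch (the CSV sample below is fabricated for illustration; on a live system you would produce it with the nvidia-smi query shown in the comment):

```shell
# On a live machine, capture a log with something like:
#   nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw \
#              --format=csv,noheader -l 5 > gpu.log
# Fabricated sample standing in for that output:
cat > gpu.log <<'EOF'
2023/09/17 20:27:10, 64, 348.12 W
2023/09/17 20:27:15, 71, 392.40 W
2023/09/17 20:27:20, 93, 441.03 W
EOF

# Print samples whose temperature (field 2) exceeds 85 °C.
awk -F', ' '$2 + 0 > 85 { print "hot sample:", $0 }' gpu.log
# → hot sample: 2023/09/17 20:27:20, 93, 441.03 W
```

If the last samples before a crash show normal temperatures but high power draw, that points toward the PSU or the power connectors rather than cooling.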

Did you already fix this issue?

I have got the same:

nvidia-debugdump --list
Found 4 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
[root@node02]# less /var/log/messages | grep NVRM
Sep 17 20:27:58 node02 kernel: NVRM: GPU at PCI:0000:41:00: GPU-c80141d8-2ecb-d4bd-f000-943a0b30b0d5
Sep 17 20:27:58 node02 kernel: NVRM: GPU Board Serial Number: 1654622009200
Sep 17 20:27:58 node02 kernel: NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU serial number is 1654622009200.
Sep 17 20:27:58 node02 kernel: NVRM: A GPU crash dump has been created. If possible, please run#012NVRM: nvidia-bug-report.sh as root to collect this data before#012NVRM: the NVIDIA kernel module is unloaded.

nvidia-smi just reports:

Unable to determine the device handle for GPU 0000:41:00.0: Unknown Error

I can exclude overheating, as the temperatures were monitored and fine the whole time (65 °C max). I'll try whether reseating helps - which other issues could cause this?

I have got the same. Did you solve the problem?


Hi, I’m getting the same error and couldn’t figure out the cause. Can you please help? This is my log after running nvidia-bug-report.sh.
nvidia-bug-report.log.gz (1.6 MB)

The Titan turned off. I suspect either a power issue (check power connectors, swap cables) or a broken card (test it individually in another system).

Hi, I'm getting the same error and couldn't figure out the cause. Can you please help? This is my log after running nvidia-bug-report.sh. I have two L40S cards, and the problem always appears on only one GPU.

nvidia-bug-report.log.gz (2.1 MB)

Please create a log once the error has triggered.

Hi, I ran the process again to replicate the error. Thanks for your help.

nvidia-bug-report.log.gz (1.6 MB)

The L40 shut down without any other error. Please check its power connectors or swap it with the other card. If that yields nothing, it's likely broken; check the warranty.
You might also want to monitor temperatures, though they looked fine in the other log.


Hi, I have the same error on one of two RTX 3090 GPUs.
How can I solve the problem?

Thanks for your answer.

After several experiments, I realized the problem is temperature related: the GPU with the error reaches 99 °C and then shuts down, which I assume is the automatic thermal protection.
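Besides improving the airflow, one stopgap while thermal protection keeps tripping is to cap the board power with nvidia-smi; lower power draw usually means lower temperatures. A sketch (needs root and an NVIDIA GPU; 250 W is an arbitrary example value - it must stay within the min/max range the first command reports):

```shell
# Show the current, default, and min/max enforceable power limits per GPU.
nvidia-smi -q -d POWER

# Cap the power limit to 250 W (example value, not a recommendation).
sudo nvidia-smi -pl 250
```

The setting is lost on reboot, so a permanent fix still means better cooling (or reseating/replacing the card).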