My server is a Dell 7525.
OS:
Linux version 5.19.0-35-generic (buildd@lcy02-amd64-020)
Ubuntu 11.3.0-1ubuntu1~22.04
Driver Version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GPUs:
2 × NVIDIA Corporation GA102 [GeForce RTX 3090]
When I run these commands:
# nvidia-smi
Unable to determine the device handle for GPU 0000:21:00.0: Unknown Error
# nvidia-debugdump --list
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
The same problem has happened twice. After rebooting, the system works normally for several hours and then breaks down again.
The debug file is: nvidia-bug-report.log.gz (119.9 KB)
How can I find the cause of the bug in the report file? It's too complex and unreadable for me.
You're getting an Xid 79, "GPU has fallen off the bus". The most common reasons are overheating or lack of power. Monitor temperatures, reseat the power connectors and the card in its slot, and check/replace the PSU.
To check for power issues, you can use nvidia-smi -lgc to lock the graphics clocks and prevent boost-related power spikes, e.g.
nvidia-smi -lgc 300,1500
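To correlate a crash with thermal or power events, it also helps to log the GPU stats continuously. A minimal sketch, assuming the standard `nvidia-smi --query-gpu` fields; the poll interval and log path are just examples:

```python
import csv
import subprocess
import time

# Fields exposed by `nvidia-smi --query-gpu` (standard field names)
FIELDS = "timestamp,index,temperature.gpu,power.draw,clocks.sm"

def parse_row(line):
    """Parse one CSV row into (timestamp, gpu_index, temp_C, power_W, sm_MHz)."""
    ts, idx, temp, power, clock = [f.strip() for f in line.split(",")]
    return ts, int(idx), float(temp), float(power), float(clock)

def read_gpu_stats():
    """Query nvidia-smi once and return one parsed row per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [parse_row(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    # Poll every 5 s and append to a CSV log (path is just an example)
    with open("/tmp/gpu-monitor.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            for row in read_gpu_stats():
                writer.writerow(row)
            f.flush()
            time.sleep(5)
```

If the last logged rows before an Xid 79 show a power spike or a temperature climb, that points at the PSU or cooling respectively.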
nvidia-debugdump --list
Found 4 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
[root@node02]# less /var/log/messages | grep NVRM
Sep 17 20:27:58 node02 kernel: NVRM: GPU at PCI:0000:41:00: GPU-c80141d8-2ecb-d4bd-f000-943a0b30b0d5
Sep 17 20:27:58 node02 kernel: NVRM: GPU Board Serial Number: 1654622009200
Sep 17 20:27:58 node02 kernel: NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU has fallen off the bus.
Sep 17 20:27:58 node02 kernel: NVRM: GPU 0000:41:00.0: GPU serial number is 1654622009200.
Sep 17 20:27:58 node02 kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
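For a long syslog, the Xid events can be tallied with a short script instead of eyeballing the grep output. A sketch, assuming the NVRM line format shown above; the log path is an example (on systemd-only machines, feed it `journalctl -k` output instead):

```python
import re
from collections import Counter

# Matches lines like:
#   NVRM: Xid (PCI:0000:41:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def tally_xids(lines):
    """Count (PCI address, Xid code) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts

if __name__ == "__main__":
    # Log path is an example
    with open("/var/log/messages", errors="replace") as f:
        for (pci, xid), n in tally_xids(f).most_common():
            print(f"{n:4d}x Xid {xid} on {pci}")
```

If the same PCI address shows up every time, the problem is tied to one card or one slot rather than the system as a whole.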
nvidia-smi just reports:
Unable to determine the device handle for GPU 0000:41:00.0: Unknown Error
I can exclude overheating, as the temperatures were monitored and fine the whole time (max 65 °C). I'll try whether reseating helps - what other issues could cause this?
Hi, I'm getting the same error and couldn't figure out the cause. Can you please help? This is my log after running nvidia-bug-report.sh: nvidia-bug-report.log.gz (1.6 MB). I have two L40S GPUs and the problem always appears in only one of them.
The L40S shut down without any other error. Please check its power connectors and swap it with the other card. If that yields nothing, it's likely broken - check the warranty.
You might also want to monitor temperatures, though they looked fine in the other log.
After several experiments, I realized that the problem is temperature-related. The GPU with the error reaches 99 °C and then shuts down, which I assume is the automatic thermal protection.