Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

OS: CentOS Linux 7 (Core)
Driver Version: 470.82.01
GPUs: 2x Tesla T4

When I run nvidia-smi in my terminal, I get the error Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error. This is the output of nvidia-debugdump --list:

Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

Here is the detailed bug report:
nvidia-bug-report.log.gz (1.5 MB)

PS: I have checked other developers’ questions and their logs, but my bug seems quite different from theirs, which is why I opened a new topic.


You’re getting a fatal PCIe error on the root port, so the GPU is disconnected. Please try reseating the GPU in its slot, try a different slot, check for a BIOS upgrade, and check/replace the mainboard.

[  695.203791] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: id=0010
[  695.203805] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)
[  695.203811] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00000020/00000000
[  695.203815] pcieport 0000:00:02.0:    [ 5] Surprise Down Error    (First)
[  695.203821] pcieport 0000:00:02.0: broadcast error_detected message
[  695.203825] nvidia 0000:02:00.0: device has no AER-aware driver
[  695.830814] NVRM: GPU at PCI:0000:02:00: GPU-b4e8ed5d-9b8a-c48c-1ecd-aac240753b23
[  695.830817] NVRM: Xid (PCI:0000:02:00): 79, pid=29245, GPU has fallen off the bus.
[  695.830819] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[  695.830836] NVRM: GPU 0000:02:00.0: GPU serial number is \xffffffff\xffffffff\xffffffff\xffffffff… (all 0xff, truncated).
[  695.830844] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[  696.207412] pcieport 0000:00:02.0: Root Port link has been reset
[  696.207422] pcieport 0000:00:02.0: AER: Device recovery failed
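
If you want to check for the same pattern on your own system, the relevant messages are in the kernel log; something along these lines should surface them (the bus address 02:00.0 is taken from this report, substitute your own):

# Look for AER / Xid / "fallen off the bus" messages in the kernel log
sudo dmesg | grep -iE 'AER|Xid|fallen off the bus'
# Inspect the link capability/status of the GPU's slot
sudo lspci -vvv -s 02:00.0 | grep -iE 'LnkCap|LnkSta'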

Thanks for the advice. What do you mean by “check for a BIOS upgrade”? I have googled a lot but haven’t found a satisfying answer. Could you be more specific?

A system BIOS update from the manufacturer of the server/mainboard.

May I ask how you eventually solved your problem? I think I have exactly the same issue. After trying many of the approaches mentioned in previous posts, I still get the same error. Thank you very much.

Sorry for the late reply… I ended up moving the card to a different slot.

Hi, I’ve now come across the same issue, but the problem only shows up when all of the GPUs are in use (using only some of them doesn’t trigger it). Is this the same problem you solved?

I had the same issue, with one of the GPUs failing to load for some reason. After draining the affected GPU by its PCI ID, the rest of the system works perfectly. I believe the issue is caused by the connected monitor: I have four Tesla V100 GPUs, one of which is connected to the monitor.

sudo nvidia-smi drain -p 0000:02:00.0 -m 1

To re-enable it:

sudo nvidia-smi drain -p 0000:02:00.0 -m 0
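
If you’re unsure which bus ID to pass, a standard query lists them in the format drain expects (a sketch; the exact output varies by driver version):

# List each GPU's index, name and PCI bus id
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv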

$ sudo nvidia-smi drain -p 0000:1E:00.0 -m 1
Successfully set GPU 00000000:1E:00.0 drain state to: draining.

$ sudo nvidia-smi drain -p 0000:1E:00.0 -m 0
Successfully set GPU 00000000:1E:00.0 drain state to: not draining.

I can’t enable it again. I don’t know what went wrong.

You should look at the detailed log in the bug report. Many different bugs can produce the same output in your terminal, but you can usually find the real cause in the log.
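
For example, something like the following regenerates the report and pulls out the most telling lines (assuming the script drops the archive in the current directory, which it does by default):

# Generate the report as root, then search it for Xid / AER / bus errors
sudo nvidia-bug-report.sh
zgrep -iE 'Xid|AER|fallen off the bus' nvidia-bug-report.log.gz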

Please check my reply below.

I’ve got the exact same issue with my 4090. When running nvidia-debugdump --list I get:

Found 2 NVIDIA devices
	Device ID:              0
	Device name:            NVIDIA GeForce RTX 4090   (*PrimaryCard)
	GPU internal ID:        GPU-42e965ce-ce70-ded4-a964-7be45311e1e6

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error

I’ve uploaded my bug report.
nvidia-bug-report.log.gz (651.2 KB)

The weird thing for me is that my PyTorch code will run fine for a few minutes, sometimes hours, then all of a sudden I get this error and the code hangs (presumably because something is broken deep down in the CUDA stack or kernel driver). I would expect it to either work or not work, not to sometimes break at runtime later on.
I’ve tried:

  • Downgrading CUDA → didn’t fix
  • Uninstalling CUDA and the drivers, then reinstalling them with the Ubuntu package manager → didn’t fix
  • Disabling PCIe power management in the BIOS → didn’t fix

This is really frustrating. We’ve spent a lot of money on this machine… Any help would be greatly appreciated. Thank you.

Hi, I’ve hit exactly the same issue as yours: after running PyTorch code for a while, the specific card (RTX 4090) crashes and I have to reboot the system. Do you have any idea whether it is due to the PCIe slot or the card itself?

Xid (PCI:0000:42:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Running ML workloads causes heavy spikes in power usage, so consider getting a stronger PSU.
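
If a stronger PSU isn’t an immediate option, capping the board power can take the edge off the spikes (300 W is only an example value; check the limits your card actually supports first):

# Show the card's current, default and min/max power limits
nvidia-smi -q -d POWER | grep -i 'power limit'
# Cap the board power to an example value within the supported range
sudo nvidia-smi -pl 300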

In my case the GPU overheated. I could see its temperature rising to 97 °C via nvitop (pip install nvitop) before the error message occurred.

OS: Rocky Linux 9.3
Driver Version: 535.113.01
GPUs: A5000 x8

The same error occurs periodically. When I run nvidia-smi, an error appears: Unable to determine the device handle for GPU 0000:DB:00.0: Unknown Error.

Bug report
nvidia-bug-report.log.gz (2.6 MB)

DB:00.0 appears to be the PCIe bus ID, and it’s always the card at this location that disappears, but it’s unclear whether the card itself is faulty or whether there’s an issue with the server. Rebooting usually fixes it.
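
To tell the two apart, one rough approach is to note that the bus ID stays with the slot while the serial number stays with the card: record which serial currently sits at DB:00.0, swap two cards, and see whether the next failure follows the serial or the slot.

# Map bus IDs to serial numbers before swapping cards
nvidia-smi --query-gpu=pci.bus_id,serial,name --format=csv
# Check the link capability/status of the suspect slot
sudo lspci -vvv -s db:00.0 | grep -iE 'LnkCap|LnkSta'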

Please monitor gpu temperatures.
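
For example, a periodic log like this (interval and fields are just a suggestion) leaves you with readings from right before a card drops off the bus:

# Append a temperature/power sample to a CSV file every 10 seconds
nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw --format=csv -l 10 >> gpu_health.csv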

I recently had this error. The most useful search term is “GPU has fallen off the bus”, because that’s what is actually happening and how it is described. Here is what I found, and my solution:

  1. a reboot would get the GPU back on the root bus, but some ML workloads would still occasionally disconnect it again;
  2. running device monitoring with “nvidia-smi dmon -i 0 -s puv -d 5 -o TD” (open a terminal, run the command, and watch the output in real time) was useful for seeing temperatures, memory use, and power and temperature violations. It allowed me to rule out temperature as a cause, but it did signal power problems; and
  3. reseating the cards and replugging the power connectors “solved” the problem. That said, even though there are no longer any failures, there still appear to be power violations, leading me to conclude that either my PSU is starting to fail or the GPU is.

The fact is, GPUs are pretty sensitive to things like voltage, because it affects the data flow, which has to be very precise at the clock speeds GPUs run at. Things that have a large effect on voltage are (a quick throttle-reason check is sketched after the list):

  1. condition of the GPU
  2. temperature of the GPU
  3. condition of the PSU
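
A quick way to see whether a card is reporting power or thermal stress itself is to read its throttle reasons (standard nvidia-smi query fields; a sketch, adjust as needed):

# Show which slowdown/throttle reasons are currently active on each GPU
nvidia-smi --query-gpu=index,clocks_throttle_reasons.active,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown --format=csv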