Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

OS: CentOS Linux 7 (Core)
Driver Version: 470.82.01
GPUs: 2x Tesla T4

When I run nvidia-smi in my terminal, I get the error Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error. This is the output of nvidia-debugdump --list:

Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

Here is the detailed bug report:
nvidia-bug-report.log.gz (1.5 MB)

PS: I have checked other developers’ questions and their logs, but my bug seems quite different from theirs, which is why I opened a new topic.


You’re getting a fatal PCIe error on the root port, so the GPU is disconnected. Please try reseating the GPU in its slot, try a different slot, check for a BIOS upgrade, and check/replace the mainboard.

[  695.203791] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: id=0010
[  695.203805] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)
[  695.203811] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00000020/00000000
[  695.203815] pcieport 0000:00:02.0:    [ 5] Surprise Down Error    (First)
[  695.203821] pcieport 0000:00:02.0: broadcast error_detected message
[  695.203825] nvidia 0000:02:00.0: device has no AER-aware driver
[  695.830814] NVRM: GPU at PCI:0000:02:00: GPU-b4e8ed5d-9b8a-c48c-1ecd-aac240753b23
[  695.830817] NVRM: Xid (PCI:0000:02:00): 79, pid=29245, GPU has fallen off the bus.
[  695.830819] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[  695.830836] NVRM: GPU 0000:02:00.0: GPU serial number is \xffffffff\xffffffff\xffffffff\xffffffff… (all 0xff, truncated).
[  695.830844] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[  696.207412] pcieport 0000:00:02.0: Root Port link has been reset
[  696.207422] pcieport 0000:00:02.0: AER: Device recovery failed
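
If you want to check for the same pattern on your own system, the relevant messages are in the kernel log; something along these lines should surface them (the bus address 02:00.0 is taken from this report, substitute your own):

# Look for AER / Xid / "fallen off the bus" messages in the kernel log
sudo dmesg | grep -iE 'AER|Xid|fallen off the bus'
# Inspect the link capability/status of the GPU's slot
sudo lspci -vvv -s 02:00.0 | grep -iE 'LnkCap|LnkSta'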

Thanks for the advice. What do you mean by “check for a BIOS upgrade”? I have googled a lot but haven’t found a satisfying answer. Could you be more specific?

A system BIOS update from the manufacturer of the server/mainboard.

May I ask how you eventually solved your problem? I think I have exactly the same issue. After trying many of the approaches mentioned in previous posts, I still get the same error. Thank you very much.

Sorry for the late reply… I ended up moving the card to a different slot.

Hi, I’ve now come across the same issue, but the problem only shows up when all of the GPUs are in use (using only some of them doesn’t trigger it). Is this the same problem you solved?

I had the same issue, with one of the GPUs failing to load for some reason. After draining the affected GPU by its PCI ID, the rest of the system works perfectly. I believe the issue is caused by the connected monitor: I have four Tesla V100 GPUs, one of which is connected to the monitor.

sudo nvidia-smi drain -p 0000:02:00.0 -m 1

To re-enable it:

sudo nvidia-smi drain -p 0000:02:00.0 -m 0
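
If you’re unsure which bus ID to pass, a standard query lists them in the format drain expects (a sketch; the exact output varies by driver version):

# List each GPU's index, name and PCI bus id
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv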

$ sudo nvidia-smi drain -p 0000:1E:00.0 -m 1
Successfully set GPU 00000000:1E:00.0 drain state to: draining.

$ sudo nvidia-smi drain -p 0000:1E:00.0 -m 0
Successfully set GPU 00000000:1E:00.0 drain state to: not draining.

I can’t enable it again. I don’t know what went wrong.

You should look at the detailed log in the bug report. Many different bugs can produce the same output in your terminal, but you can usually find the real cause in the log.
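
For example, something like the following regenerates the report and pulls out the most telling lines (assuming the script drops the archive in the current directory, which it does by default):

# Generate the report as root, then search it for Xid / AER / bus errors
sudo nvidia-bug-report.sh
zgrep -iE 'Xid|AER|fallen off the bus' nvidia-bug-report.log.gz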

Please check my reply below.

I’ve got the exact same issue with my 4090. When running nvidia-debugdump --list I get:

Found 2 NVIDIA devices
	Device ID:              0
	Device name:            NVIDIA GeForce RTX 4090   (*PrimaryCard)
	GPU internal ID:        GPU-42e965ce-ce70-ded4-a964-7be45311e1e6

Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x1): Unknown Error

I’ve uploaded my bug report.
nvidia-bug-report.log.gz (651.2 KB)

The weird thing for me is that my PyTorch code will run fine for a few minutes, sometimes hours, then all of a sudden I get this error and the code hangs (presumably because something is broken deep down in the CUDA stack or kernel driver). I would expect it to either work or not work, not to sometimes break at runtime later on.
I’ve tried:

  • Downgrading CUDA → didn’t fix
  • Uninstalling CUDA and the drivers, then reinstalling them with the Ubuntu package manager → didn’t fix
  • Disabling PCIe power management in the BIOS → didn’t fix

This is really frustrating. We’ve spent a lot of money on this machine… Any help would be greatly appreciated. Thank you.

Hi, I’ve hit exactly the same issue as yours: after running PyTorch code for a while, the specific card (RTX 4090) crashes and I have to reboot the system. Do you have any idea whether it is due to the PCIe slot or the card itself?

Xid (PCI:0000:42:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Running ML workloads causes heavy spikes in power usage, so consider getting a stronger PSU.
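
If a stronger PSU isn’t an immediate option, capping the board power can take the edge off the spikes (300 W is only an example value; check the limits your card actually supports first):

# Show the card's current, default and min/max power limits
nvidia-smi -q -d POWER | grep -i 'power limit'
# Cap the board power to an example value within the supported range
sudo nvidia-smi -pl 300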

In my case the GPU overheated. I could see its temperature rising to 97 °C via nvitop (pip install nvitop) before the error message occurred.

OS: Rocky Linux 9.3
Driver Version: 535.113.01
GPUs: A5000 x8

The same error occurs periodically. When I run nvidia-smi, an error appears: Unable to determine the device handle for GPU 0000:DB:00.0: Unknown Error.

Bug report
nvidia-bug-report.log.gz (2.6 MB)

DB:00.0 appears to be the PCIe bus ID, and it’s always the card at this location that disappears, but it’s unclear whether the card itself is faulty or whether there’s an issue with the server. Rebooting usually fixes it.
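
To tell the two apart, one rough approach is to note that the bus ID stays with the slot while the serial number stays with the card: record which serial currently sits at DB:00.0, swap two cards, and see whether the next failure follows the serial or the slot.

# Map bus IDs to serial numbers before swapping cards
nvidia-smi --query-gpu=pci.bus_id,serial,name --format=csv
# Check the link capability/status of the suspect slot
sudo lspci -vvv -s db:00.0 | grep -iE 'LnkCap|LnkSta'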

Please monitor gpu temperatures.
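
For example, a periodic log like this (interval and fields are just a suggestion) leaves you with readings from right before a card drops off the bus:

# Append a temperature/power sample to a CSV file every 10 seconds
nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw --format=csv -l 10 >> gpu_health.csv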

I recently had this error. The most useful search term is “GPU has fallen off the bus”, because that’s what is actually happening and how it is described. Here is what I found, and my solution:

  1. a reboot would get the GPU back on the root bus, but some ML workloads would still occasionally disconnect it again;
  2. running device monitoring with “nvidia-smi dmon -i 0 -s puv -d 5 -o TD” (open a terminal, run the command, and watch the output in real time) was useful for seeing temperatures, memory use, and power and temperature violations. It allowed me to rule out temperature as a cause, but it did signal power problems; and
  3. reseating the cards and replugging the power connectors “solved” the problem. That said, even though there are no longer any failures, there still appear to be power violations, leading me to conclude that either my PSU is starting to fail or the GPU is.

The fact is, GPUs are pretty sensitive to things like voltage, because it affects the data flow, which has to be very precise at the clock speeds GPUs run at. Things that have a large effect on voltage are (a quick throttle-reason check is sketched after the list):

  1. condition of the GPU
  2. temperature of the GPU
  3. condition of the PSU
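
A quick way to see whether a card is reporting power or thermal stress itself is to read its throttle reasons (standard nvidia-smi query fields; a sketch, adjust as needed):

# Show which slowdown/throttle reasons are currently active on each GPU
nvidia-smi --query-gpu=index,clocks_throttle_reasons.active,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown --format=csv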