Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

OS: CentOS Linux 7 (Core)
Driver Version: 470.82.01
GPUs: 2x Tesla T4

When I run nvidia-smi in my terminal, I get the error "Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error". This is the output of nvidia-debugdump --list:

Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

Here is the detailed bug report:
nvidia-bug-report.log.gz (1.5 MB)
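
For reference, these are the commands behind the output above and the attached report (nvidia-bug-report.sh has to be run as root):

nvidia-smi
nvidia-debugdump --list
sudo nvidia-bug-report.sh    # writes nvidia-bug-report.log.gz in the current directory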

PS: I have checked other developers’ questions and their logs. However, my bug seems quite different from theirs, which is why I opened a new question.

You’re getting a fatal PCIe error on the root bus, so the GPU is disconnected. Please try reseating the GPU in its slot, try a different slot, check for a BIOS upgrade, and check/replace the mainboard.

[  695.203791] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: id=0010
[  695.203805] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)
[  695.203811] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00000020/00000000
[  695.203815] pcieport 0000:00:02.0:    [ 5] Surprise Down Error    (First)
[  695.203821] pcieport 0000:00:02.0: broadcast error_detected message
[  695.203825] nvidia 0000:02:00.0: device has no AER-aware driver
[  695.830814] NVRM: GPU at PCI:0000:02:00: GPU-b4e8ed5d-9b8a-c48c-1ecd-aac240753b23
[  695.830817] NVRM: Xid (PCI:0000:02:00): 79, pid=29245, GPU has fallen off the bus.
[  695.830819] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[  695.830836] NVRM: GPU 0000:02:00.0: GPU serial number is \xffffffff\xffffffff\xffffffff\xffffffff…
[  695.830844] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[  696.207412] pcieport 0000:00:02.0: Root Port link has been reset
[  696.207422] pcieport 0000:00:02.0: AER: Device recovery failed
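
If you want to confirm the same failure pattern on your own machine, these are generic Linux checks (substitute your own PCI addresses) that show the kernel AER/Xid messages and the current PCIe link state:

sudo dmesg | grep -iE 'AER|Xid|fallen off the bus'    # kernel-side error messages
sudo lspci -vvv -s 0000:02:00.0 | grep -i 'LnkSta'    # link state of the GPU
sudo lspci -vvv -s 0000:00:02.0 | grep -i 'LnkSta'    # link state of its root port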

Thanks for the advice. What do you mean by “check for a BIOS upgrade”? I have googled a lot but haven’t found a satisfying answer. Can you be more specific?

A system BIOS update from the manufacturer of the server/mainboard.
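
To see which BIOS version is currently installed (so you can compare it against the vendor’s download page), dmidecode works on most servers:

sudo dmidecode -s bios-vendor
sudo dmidecode -s bios-version
sudo dmidecode -s bios-release-date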

Dear sir, may I ask how you eventually solved your problem? I think I have exactly the same problem as you. After trying many of the approaches mentioned in previous posts, I still get the same error. Thank you very much.

Sorry for replying so late… I ended up moving the GPU to a different slot.

Hi, I’ve now come across the same issue, but the problem only shows up when all of the GPUs are in use (using only some of them doesn’t trigger it). Is it the same problem you solved?

I had the same issue with one of the GPUs failing to load for some reason. After draining the GPU that reported the error, the rest of the system works perfectly. I believe the issue is caused by the connected monitor: I have four Tesla V100 GPUs, one of which is connected to the monitor.

sudo nvidia-smi drain -p 0000:02:00.0 -m 1

To re-enable it:

sudo nvidia-smi drain -p 0000:02:00.0 -m 0
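
If you want to check whether the drain actually took effect, nvidia-smi can also report the drain state (the -q/--query option is listed in the nvidia-smi drain help text; double-check it on your driver version):

sudo nvidia-smi drain -q    # list drain state of all GPUs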

$ sudo nvidia-smi drain -p 0000:1E:00.0 -m 1
Successfully set GPU 00000000:1E:00.0 drain state to: draining.

$ sudo nvidia-smi drain -p 0000:1E:00.0 -m 0
Successfully set GPU 00000000:1E:00.0 drain state to: not draining.

I cannot re-enable it. I don’t know what went wrong.

You should look at the detailed log in the bug report. Many different bugs can produce the same output in your terminal, but you can usually find the real cause in the log.
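
For example, assuming the default nvidia-bug-report.log.gz filename mentioned earlier in this thread, searching it for the usual failure signatures is a quick first step:

zcat nvidia-bug-report.log.gz | grep -iE 'Xid|AER|fallen off the bus'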

Please check my reply below