When I run nvidia-smi in my terminal, I get the error Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error. This is the output of nvidia-debugdump --list:
Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
PS: I have checked other developers' questions and their logs, but my issue seems quite different from theirs, which is why I opened a new question.
You’re getting a fatal PCIe error on the root bus, so the GPU is disconnected. Please try reseating the GPU in its slot, try a different slot, check for a BIOS upgrade, and check or replace the mainboard.
[ 695.203791] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: id=0010
[ 695.203805] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)
[ 695.203811] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00000020/00000000
[ 695.203815] pcieport 0000:00:02.0: [ 5] Surprise Down Error (First)
[ 695.203821] pcieport 0000:00:02.0: broadcast error_detected message
[ 695.203825] nvidia 0000:02:00.0: device has no AER-aware driver
[ 695.830814] NVRM: GPU at PCI:0000:02:00: GPU-b4e8ed5d-9b8a-c48c-1ecd-aac240753b23
[ 695.830817] NVRM: Xid (PCI:0000:02:00): 79, pid=29245, GPU has fallen off the bus.
[ 695.830819] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[ 695.830836] NVRM: GPU 0000:02:00.0: GPU serial number is \xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff.
[ 695.830844] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 696.207412] pcieport 0000:00:02.0: Root Port link has been reset
[ 696.207422] pcieport 0000:00:02.0: AER: Device recovery failed
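A quick way to confirm from the OS side whether the card is still reachable after such an error is to check the bus and the root port link state. A minimal sketch, assuming the root port and GPU addresses from the log above (00:02.0 and 02:00.0):

# Is the GPU still enumerated on the bus at all?
lspci -s 02:00.0

# Link status of the root port the GPU hangs off; LnkSta shows the negotiated
# speed/width (and DLActive-, where reported, means the link is down)
sudo lspci -vvv -s 00:02.0 | grep -E 'LnkCap|LnkSta'

# Relevant kernel messages: AER reports and NVIDIA Xid errors
dmesg | grep -iE 'AER|pcieport|Xid|fallen off the bus'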
Thanks for the advice. What do you mean by “check for a BIOS upgrade”? I have googled a lot but haven’t found a satisfying answer. Could you be more specific?
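In case it helps: “check for a BIOS upgrade” generally means comparing the firmware version your board is currently running against the latest release on the motherboard vendor’s support/download page, and flashing the newer one if yours is behind. A minimal sketch for reading the current version on Linux (assumes dmidecode is installed):

# Current BIOS/UEFI firmware version and release date as reported by the board
sudo dmidecode -s bios-version
sudo dmidecode -s bios-release-date

# Board vendor and model, to find the matching download page
sudo dmidecode -s baseboard-manufacturer
sudo dmidecode -s baseboard-product-name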
May I ask how you eventually solved your problem? I think I have exactly the same issue. After trying many of the approaches mentioned in previous posts, I still get the same error. Thank you very much.
Hi, I’ve now come across the same issue, but the problem only shows up when all of the GPUs are in use (using only some of them doesn’t trigger it). Is this the same problem you solved?
I had the same issue: one of the GPUs failed to load for some reason. After draining the failing GPU by its PCI ID (sketched below), the rest of the system works perfectly. I believe the issue is caused by the connected monitor: I have four Tesla V100 GPUs, and one of them is driving the monitor.
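A rough sketch of that drain workaround, assuming the failing card is at 0000:02:00.0 (check nvidia-smi drain -h on your driver version, as the exact options can vary):

# Mark the failing GPU as drained so the driver and CUDA applications skip it
sudo nvidia-smi drain -p 0000:02:00.0 -m 1

# Verify the remaining GPUs still enumerate cleanly
nvidia-smi -L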
You should look at the very detailed log produced by the bug report. Many different underlying bugs can result in the same output in your terminal, and you can usually find the real cause in that log.
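For reference, a minimal sketch of generating and searching that report (the script ships with the driver and writes nvidia-bug-report.log.gz to the current directory):

# Collect the full report before the NVIDIA kernel module is unloaded
sudo nvidia-bug-report.sh

# Search it for the usual suspects: Xid codes, AER messages, bus drop-offs
zgrep -iE 'xid|aer|fallen off the bus' nvidia-bug-report.log.gz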