Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

OS: CentOS Linux 7 (Core)
Driver Version: 470.82.01
GPUs: 2x Tesla T4

When I run nvidia-smi in my terminal, I get the error Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error. This is the output of nvidia-debugdump --list:

Found 2 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
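For reference, the failing call can be reproduced outside nvidia-smi with the NVML Python bindings; this is only a minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed:

    # Minimal sketch (assumes nvidia-ml-py is installed). It makes the same
    # nvmlDeviceGetHandleByIndex() call that nvidia-smi and nvidia-debugdump
    # make, so the failing index and NVML error are visible.
    import pynvml

    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        print("Found %d NVIDIA devices" % count)
        for i in range(count):
            try:
                pynvml.nvmlDeviceGetHandleByIndex(i)
                print("GPU %d: handle acquired" % i)
            except pynvml.NVMLError as err:
                # A GPU that has fallen off the bus typically fails here
                # with "Unknown Error".
                print("GPU %d: nvmlDeviceGetHandleByIndex() failed: %s" % (i, err))
    finally:
        pynvml.nvmlShutdown()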

Here is the detailed bug report:
nvidia-bug-report.log.gz (1.5 MB)

PS: I have checked other developers' questions and their logs. However, my issue seems quite different from theirs, which is why I opened a new question.

You’re getting a fatal PCIe error on the root port, so the GPU is disconnected. Please try reseating the GPU in its slot, try a different slot, check for a BIOS upgrade, and check/replace the mainboard.

[  695.203791] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: id=0010
[  695.203805] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)
[  695.203811] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00000020/00000000
[  695.203815] pcieport 0000:00:02.0:    [ 5] Surprise Down Error    (First)
[  695.203821] pcieport 0000:00:02.0: broadcast error_detected message
[  695.203825] nvidia 0000:02:00.0: device has no AER-aware driver
[  695.830814] NVRM: GPU at PCI:0000:02:00: GPU-b4e8ed5d-9b8a-c48c-1ecd-aac240753b23
[  695.830817] NVRM: Xid (PCI:0000:02:00): 79, pid=29245, GPU has fallen off the bus.
[  695.830819] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[  695.830836] NVRM: GPU 0000:02:00.0: GPU serial number is \xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff\xffffffff.
[  695.830844] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[  696.207412] pcieport 0000:00:02.0: Root Port link has been reset
[  696.207422] pcieport 0000:00:02.0: AER: Device recovery failed
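A quick way to confirm the GPU has dropped off the bus is to check whether 0000:02:00.0 is still enumerated in sysfs; a minimal sketch (the bus address is taken from the log above):

    # Minimal sketch: checks whether the GPU at 0000:02:00.0 (the address from
    # the NVRM/AER messages above) is still enumerated by the kernel.
    from pathlib import Path

    BDF = "0000:02:00.0"  # bus/device/function from the kernel log
    dev = Path("/sys/bus/pci/devices") / BDF

    if not dev.exists():
        print(BDF + ": not enumerated -- the device has dropped off the PCIe bus")
    else:
        vendor = (dev / "vendor").read_text().strip()   # 0x10de for NVIDIA
        print(BDF + ": present in sysfs, vendor=" + vendor)
        # The first bytes of config space are world-readable; all-0xff means
        # the link is down even though the sysfs node still exists.
        if (dev / "config").read_bytes()[:4] == b"\xff\xff\xff\xff":
            print(BDF + ": config space reads 0xff -- PCIe link is down")

If the node is still present but config space reads back all 0xff (like the serial number in the log above), the link is down and a power cycle after reseating the card is usually needed before the driver can use the GPU again.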

Thanks for the advice. What do you mean by “check for a bios upgrade”? I have googled a lot but haven’t found a satisfactory answer. Can you be more specific?

A system BIOS update from the manufacturer of the server/mainboard.
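To find out what to look for on the manufacturer's support page, check the firmware and board identifiers currently installed; a minimal sketch reading the standard Linux DMI entries in sysfs:

    # Minimal sketch: prints the current BIOS version and board model from the
    # standard Linux DMI sysfs entries, so the vendor's download page can be
    # checked for a newer release.
    from pathlib import Path

    DMI = Path("/sys/class/dmi/id")
    for field in ("bios_vendor", "bios_version", "bios_date",
                  "board_vendor", "board_name", "product_name"):
        path = DMI / field
        value = path.read_text().strip() if path.exists() else "(unavailable)"
        print(field + ": " + value)

Compare bios_version against the latest release listed for that board or server model and flash the newer one if available.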

Dear sir, may I ask how you eventually solved your problem? I think I have exactly the same issue. After trying many of the approaches mentioned in previous posts, I still have the same problem. Thank you very much.

Sorry for replying so late… In the end I fixed it by moving the GPU to a different slot.