Unable to determine the device handle for GPU: GPU is lost

Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

I am using a 1080 Ti. How can I solve this error?

dmesg shows these errors:

[ 3890.407740] device vethd51e459 left promiscuous mode
[ 3890.407742] docker0: port 1(vethd51e459) entered disabled state
[31439.960406] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0018
[31439.960411] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[31439.960412] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000
[31439.960413] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[31439.960415] pcieport 0000:00:03.0: broadcast error_detected message
[31439.960420] pcieport 0000:00:03.0: AER: Device recovery failed
[31439.995406] NVRM: GPU at PCI:0000:05:00: GPU-deffe79b-df67-187a-8081-d1a80bcd3c9f
[31439.995409] NVRM: GPU Board Serial Number:
[31439.995411] NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.
[31439.995411] NVRM: GPU at 0000:05:00.0 has fallen off the bus.
[31439.995412] NVRM: GPU is on Board .
[31439.995418] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[31439.995465] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0018
[31439.995491] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[31439.995493] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000
[31439.995495] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[31439.995500] pcieport 0000:00:03.0: broadcast error_detected message
[31439.995504] pcieport 0000:00:03.0: AER: Device recovery failed
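
When a log like this gets long, it can help to pull out just the NVIDIA Xid events. The following is a minimal sketch that assumes the dmesg output has already been saved as text; the function name `find_xid_events` and the regular expression are illustrative and are based only on the line format visible in the log above.

```python
import re

# Matches NVIDIA kernel-driver lines of the form seen in the log above, e.g.
#   NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<msg>.*)"
)


def find_xid_events(dmesg_text):
    """Return a list of (pci_address, xid_code, message) tuples."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("msg").strip())
            )
    return events


# Sample lines copied from the log in this thread.
sample = """\
[31439.960420] pcieport 0000:00:03.0: AER: Device recovery failed
[31439.995406] NVRM: GPU at PCI:0000:05:00: GPU-deffe79b-df67-187a-8081-d1a80bcd3c9f
[31439.995411] NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.
[31439.995411] NVRM: GPU at 0000:05:00.0 has fallen off the bus.
"""

for pci, code, msg in find_xid_events(sample):
    print(f"{pci}: Xid {code}: {msg}")
# prints: PCI:0000:05:00: Xid 79: GPU has fallen off the bus.
```

To run it against a live system, the text could come from `dmesg` output saved to a file and read with `open(...).read()`.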

Xid 79 usually points to overheating or an insufficient/flaky power supply. Reseat the card(s), check the power connectors, and check the temperature using nvidia-smi -q.
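
Since nvidia-smi can no longer talk to a GPU once it is lost, temperature and power readings are only useful if they are sampled while the card is still working. Here is a rough sketch of a poller, assuming `nvidia-smi` is on the PATH and supports the `--query-gpu` CSV output; the helper names and the sample values in the usage line are made up for illustration.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,temperature.gpu,power.draw,power.limit"


def _to_float(value):
    """Convert a CSV field to float; return None for entries like [N/A]."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_readings(csv_text):
    """Parse 'csv,noheader,nounits' output into a list of dicts, one per GPU."""
    readings = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip anything that is not a 4-field data row
        index, temp, draw, limit = parts
        readings.append(
            {
                "index": index,
                "temperature_c": _to_float(temp),
                "power_draw_w": _to_float(draw),
                "power_limit_w": _to_float(limit),
            }
        )
    return readings


def query_gpu_readings():
    """Run nvidia-smi once and return the parsed readings.

    Raises subprocess.CalledProcessError if nvidia-smi exits with an error,
    which is expected once the GPU has been lost.
    """
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_readings(result.stdout)


# Hypothetical output for illustration only, not taken from this thread.
print(parse_gpu_readings("0, 65, 180.25, 250.00\n"))
```

Calling `query_gpu_readings()` in a loop with a short sleep and appending the results to a file would leave a record of the last readings before the failure.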

I am facing a similar issue, though not immediately after power-on. The error appears after a few PCIe transactions have succeeded.

Can someone help me with this?

[ 157.767564] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767568] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767573] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767581] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767597] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767599] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767607] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767609] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767612] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767617] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767619] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767620] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767628] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767630] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767632] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767636] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767638] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767640] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767648] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767650] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767651] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767656] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767657] pcieport 0000:00:03.0: AER: Device recovery failed

Is this error related to a PCIe link error (PCIe transaction) or to a power supply issue?

That’s too little info. Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post will reveal a paperclip icon.

Thank you for your quick reply.

Where can I find the nvidia-bug-report.sh script?

If you know the path, please let me know, or could you share the file?

Based on the error type, can we identify whether this is a power rail issue or a PCIe transaction error?
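
As a partial illustration of what the error type does tell you, the `error status/mask` word in the AER lines can be decoded bit by bit. The sketch below assumes the status word is the PCIe Uncorrectable Error Status register (the lines in this thread are all "Uncorrected" errors; corrected errors use a different bit layout), and the function name `decode_aer_status` is made up for the example. Note that a Completion Timeout only says that a request received no reply in time, so on its own it does not distinguish a power problem from a link problem.

```python
import re

# Bit positions in the PCIe AER Uncorrectable Error Status register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_aer_status(line):
    """Return [(bit, name), ...] for every bit set in the status word of a dmesg line."""
    match = STATUS_RE.search(line)
    if match is None:
        return []
    status = int(match.group(1), 16)
    return [
        (bit, AER_UNCORRECTABLE_BITS.get(bit, "unknown/reserved"))
        for bit in range(32)
        if status & (1 << bit)
    ]


# Line copied from the first log in this thread.
line = "[31439.960412] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000"
print(decode_aer_status(line))
# prints: [(14, 'Completion Timeout')]
```

This matches the `[14] Completion Timeout (First)` line that the kernel prints right after the status word.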

nvidia-bug-report.sh comes with the driver.
Whether and where it is installed depends on your distro and driver package, which you didn’t mention.
You posted only PCI errors, which can come from any device.
Too little info.
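
Because the install location varies, one way to look for the script is to search the PATH. This is only a sketch: the function name and the `extra_dirs` parameter are hypothetical, and any extra directories passed in are guesses the caller supplies rather than documented install locations.

```python
import os
import shutil


def find_bug_report_script(extra_dirs=()):
    """Return the full path to nvidia-bug-report.sh, or None if it is not found.

    Searches the directories in PATH first, then any caller-supplied extra_dirs.
    """
    search_path = os.pathsep.join([os.environ.get("PATH", os.defpath), *extra_dirs])
    return shutil.which("nvidia-bug-report.sh", path=search_path)


path = find_bug_report_script()
if path is None:
    print("nvidia-bug-report.sh not found on PATH; check how the driver was installed")
else:
    print(f"found: {path} (run it as root)")
```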