Unable to determine the device handle for GPU: GPU is lost

Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

I am using a 1080 Ti. How can I solve this error?

dmesg shows these errors:

[ 3890.407740] device vethd51e459 left promiscuous mode
[ 3890.407742] docker0: port 1(vethd51e459) entered disabled state
[31439.960406] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0018
[31439.960411] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[31439.960412] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000
[31439.960413] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[31439.960415] pcieport 0000:00:03.0: broadcast error_detected message
[31439.960420] pcieport 0000:00:03.0: AER: Device recovery failed
[31439.995406] NVRM: GPU at PCI:0000:05:00: GPU-deffe79b-df67-187a-8081-d1a80bcd3c9f
[31439.995409] NVRM: GPU Board Serial Number:
[31439.995411] NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.
[31439.995411] NVRM: GPU at 0000:05:00.0 has fallen off the bus.
[31439.995412] NVRM: GPU is on Board .
[31439.995418] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[31439.995465] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0018
[31439.995491] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[31439.995493] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000
[31439.995495] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[31439.995500] pcieport 0000:00:03.0: broadcast error_detected message
[31439.995504] pcieport 0000:00:03.0: AER: Device recovery failed

XID 79 points to overheating or an insufficient/flaky power supply. Also reseat the card(s) and check the power connectors. Check the temperature using nvidia-smi -q.
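
For reference, the temperature/power queries look like this (standard nvidia-smi options; the 1-second polling interval is just an example):

# Dump the temperature and power sections of the full query
nvidia-smi -q -d TEMPERATURE,POWER

# Or poll temperature, power draw, and active throttle reasons every second while under load
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks_throttle_reasons.active --format=csv -l 1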

I am also facing a similar issue, but not immediately after power-on; it appears after a few successful PCIe transactions.

Can someone help me with this?

[ 157.767564] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767568] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767573] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767581] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767597] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767599] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767607] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767609] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767612] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767617] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767619] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767620] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767628] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767630] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767632] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767636] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767638] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767640] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767648] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767650] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767651] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767656] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767657] pcieport 0000:00:03.0: AER: Device recovery failed

Is this error related to a PCIe link error (a PCIe transaction issue) or a power supply issue?
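
If it helps, more link/AER detail can be dumped like this (a sketch; I am assuming the GPU sits behind root port 0000:00:03.0 as the log above suggests):

# Show the PCIe topology to confirm which device sits behind root port 0000:00:03.0
lspci -tv

# Dump the root port's capabilities, including AER status and link speed/width (needs root)
sudo lspci -vvv -s 00:03.0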

That’s too little info. Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post will reveal a paperclip icon.
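
For reference, it is run like this and writes its archive into the current directory:

sudo nvidia-bug-report.sh
# produces nvidia-bug-report.log.gz in the current working directory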

Thank you for your quick reply.

Where can I find the nvidia-bug-report.sh script file?

If you know the path, please let me know, or can you share the file?

Based on the error type, can we identify whether it is related to a power-rail issue or a PCIe transaction error?

nvidia-bug-report.sh comes with the driver.
Whether/where it is installed depends on the distro/package, which you didn't mention.
You posted only PCI errors, which can come from any device.
Too little info.
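
A quick way to look for it (the owning package name varies by distro and driver version):

# Check whether the script is already in PATH
which nvidia-bug-report.sh

# On Debian/Ubuntu, find the package that ships it
dpkg -S nvidia-bug-report.sh

# .run-file installs typically place it in /usr/bin
ls -l /usr/bin/nvidia-bug-report.sh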

I got the same error. A reboot makes the error go away, but I am still wondering why.

Here is my bug report

nvidia-bug-report.log.gz (683.7 KB)

XID 79, either due to lack of power or overheating. The Teslas don’t have their own fans; they rely on a proper server chassis to provide airflow. You’re running them on a consumer board, so you’ll have to get some add-on fans.

Hi, I’m facing a similar issue. I see XID 79, but also others like XID 56. The web UI does not allow me to upload the bug report to this message due to Connection refused - connect(2) for “ade0844122c9.tiefighter04.sd.sjc6.discourse.cloud” port 16726. I’m pasting a snippet at the bottom.

Of the 2 GPUs in the server, only one gives problems, while the other works smoothly. I’ve been running this server for a year without problems, so it should not be a lack of power or a motherboard issue. The temperature is around 85-89 °C most of the time, with the GPU fan at 70-75%.

Any ideas how to solve this? Would it be advisable to try increasing the GPU fan speed? Updating the drivers? Enabling persistence mode? Thanks in advance!
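
For reference, these are the knobs I could try (the 200 W power limit is only an example value, and the fan line assumes Coolbits is enabled in xorg.conf):

# Enable persistence mode (keeps the driver initialized between jobs)
sudo nvidia-smi -pm 1

# Lower the power limit to reduce heat; the allowed range is shown by nvidia-smi -q -d POWER
sudo nvidia-smi -i 1 -pl 200

# Manual fan control needs an X server with Coolbits, e.g.:
# nvidia-settings -a "[gpu:1]/GPUFanControlState=1" -a "[fan:1]/GPUTargetFanSpeed=90"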

A representative nvidia-smi -i 1 output is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:01:00.0 Off |                  N/A |
| 72%   87C    P2   211W / 250W |  11777MiB / 12196MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7380      C   python                          11773MiB |
+-----------------------------------------------------------------------------+

nvidia-bug-report.log:

May 15 09:58:19 deepblue kernel: [115231.676657] NVRM: GPU at PCI:0000:02:00: GPU-a55b515a-1f01-a26d-a3f4-3470894aa349
May 15 09:58:19 deepblue kernel: [115231.676659] NVRM: GPU Board Serial Number: 0321517086901
May 15 09:58:19 deepblue kernel: [115231.676660] NVRM: Xid (PCI:0000:02:00): 79, pid=0, GPU has fallen off the bus.
May 15 09:58:19 deepblue kernel: [115231.676664] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
May 15 09:58:19 deepblue kernel: [115231.676665] NVRM: GPU 0000:02:00.0: GPU is on Board 0321517086901.
May 15 09:58:19 deepblue kernel: [115231.676671] NVRM: A GPU crash dump has been created. If possible, please run
May 15 09:58:19 deepblue kernel: [115231.676671] NVRM: nvidia-bug-report.sh as root to collect this data before
May 15 09:58:19 deepblue kernel: [115231.676671] NVRM: the NVIDIA kernel module is unloaded.

may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=1798, CMDre 00000000 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000001 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000002 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000003 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000004 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000005 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000006 00000ffc ffffffff 00000007 00ffffff

Unable to determine the device handle for GPU 0000:68:00.0: GPU is lost. Reboot the system to recover this GPU

I am using an RTX 2080 Ti.

nvidia-bug-report.log.gz (2.0 MB)

Could you @generix please give me some advice on how to address this issue? Thanks a lot.

Hi there,

I have the same issue.
Although I set my NVIDIA graphics mode to On-Demand: [screenshot]

If I use Performance mode, it freezes randomly, which is what is reported here: Ubuntu 18.04 completely freezes after a few minutes of being booted - #19 by Mart

I have an HP ZBook Firefly 14 G7 with Ubuntu 18.

Regards,
Yaqub

nvidia-bug-report.log.gz (446.7 KB)