Unable to determine the device handle for GPU: GPU is lost

Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

I am using a 1080 Ti. How can I solve this error?

dmesg shows these errors:

[ 3890.407740] device vethd51e459 left promiscuous mode
[ 3890.407742] docker0: port 1(vethd51e459) entered disabled state
[31439.960406] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0018
[31439.960411] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[31439.960412] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000
[31439.960413] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[31439.960415] pcieport 0000:00:03.0: broadcast error_detected message
[31439.960420] pcieport 0000:00:03.0: AER: Device recovery failed
[31439.995406] NVRM: GPU at PCI:0000:05:00: GPU-deffe79b-df67-187a-8081-d1a80bcd3c9f
[31439.995409] NVRM: GPU Board Serial Number:
[31439.995411] NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.
[31439.995411] NVRM: GPU at 0000:05:00.0 has fallen off the bus.
[31439.995412] NVRM: GPU is on Board .
[31439.995418] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[31439.995465] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0018
[31439.995491] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[31439.995493] pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00004000/00000000
[31439.995495] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[31439.995500] pcieport 0000:00:03.0: broadcast error_detected message
[31439.995504] pcieport 0000:00:03.0: AER: Device recovery failed

XID 79 points to overheating or an insufficient/flaky power supply. Also reseat the card(s) and check the power connectors. Check the temperature using nvidia-smi -q.
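
For reference, the temperature/power queries look like this (standard nvidia-smi options; the 1-second polling interval is just an example):

# Dump the temperature and power sections of the full query
nvidia-smi -q -d TEMPERATURE,POWER

# Or poll temperature, power draw, and active throttle reasons every second while under load
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks_throttle_reasons.active --format=csv -l 1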

I am also facing a similar issue, but not immediately after power-on; it appears after a few successful PCIe transactions.

Can someone help me with this?

[ 157.767564] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767568] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767573] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767581] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767597] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767599] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767607] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767609] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767612] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767617] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767619] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767620] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767628] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767630] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767632] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767636] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767638] pcieport 0000:00:03.0: AER: Device recovery failed
[ 157.767640] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0060
[ 157.767648] pcieport 0000:00:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0018(Requester ID)
[ 157.767650] pcieport 0000:00:03.0: device [10de:10e6] error status/mask=00004000/00000000
[ 157.767651] pcieport 0000:00:03.0: [14] Completion Timeout (First)
[ 157.767656] pcieport 0000:00:03.0: broadcast error_detected message
[ 157.767657] pcieport 0000:00:03.0: AER: Device recovery failed

Is this error related to a PCIe link error (a PCIe transaction issue) or a power supply issue?
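
If it helps, more link/AER detail can be dumped like this (a sketch; I am assuming the GPU sits behind root port 0000:00:03.0 as the log above suggests):

# Show the PCIe topology to confirm which device sits behind root port 0000:00:03.0
lspci -tv

# Dump the root port's capabilities, including AER status and link speed/width (needs root)
sudo lspci -vvv -s 00:03.0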

That’s too little info. Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post will reveal a paperclip icon.
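
For reference, it is run like this and writes its archive into the current directory:

sudo nvidia-bug-report.sh
# produces nvidia-bug-report.log.gz in the current working directory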

Thank you for your quick reply.

Where can I find the nvidia-bug-report.sh script file?

If you know the path, please let me know, or can you share the file?

Based on the error type, can we identify whether it is related to a power-rail issue or a PCIe transaction error?

nvidia-bug-report.sh comes with the driver.
Whether/where it is installed depends on the distro/package, which you didn't mention.
You posted only PCI errors, which can come from any device.
Too little info.
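
A quick way to look for it (the owning package name varies by distro and driver version):

# Check whether the script is already in PATH
which nvidia-bug-report.sh

# On Debian/Ubuntu, find the package that ships it
dpkg -S nvidia-bug-report.sh

# .run-file installs typically place it in /usr/bin
ls -l /usr/bin/nvidia-bug-report.sh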

I got the same error. A reboot makes the error go away, but I am still wondering why.

Here is my bug report

nvidia-bug-report.log.gz (683.7 KB)

XID 79, either due to lack of power or overheating. The Teslas don’t have their own fans; they rely on a proper server chassis to provide airflow. You’re running them on a consumer board, so you’ll have to get some add-on fans.

Hi, I’m facing a similar issue. I see XID 79, but also others like XID 56. The web UI does not allow me to upload the bug report to this message due to Connection refused - connect(2) for “ade0844122c9.tiefighter04.sd.sjc6.discourse.cloud” port 16726. I’m pasting a snippet at the bottom.

Of the 2 GPUs in the server, only one gives problems, while the other works smoothly. I’ve been running this server for a year without problems, so it should not be a lack of power or a motherboard issue. The temperature is around 85-89 °C most of the time, with the GPU fan at 70-75%.

Any ideas how to solve this? Would it be advisable to try increasing the GPU fan speed? Updating the drivers? Enabling persistence mode? Thanks in advance!
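
For reference, these are the knobs I could try (the 200 W power limit is only an example value, and the fan line assumes Coolbits is enabled in xorg.conf):

# Enable persistence mode (keeps the driver initialized between jobs)
sudo nvidia-smi -pm 1

# Lower the power limit to reduce heat; the allowed range is shown by nvidia-smi -q -d POWER
sudo nvidia-smi -i 1 -pl 200

# Manual fan control needs an X server with Coolbits, e.g.:
# nvidia-settings -a "[gpu:1]/GPUFanControlState=1" -a "[fan:1]/GPUTargetFanSpeed=90"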

A representative nvidia-smi -i 1 output is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:01:00.0 Off |                  N/A |
| 72%   87C    P2   211W / 250W |  11777MiB / 12196MiB |     93%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7380      C   python                          11773MiB |
+-----------------------------------------------------------------------------+

nvidia-bug-report.log:

May 15 09:58:19 deepblue kernel: [115231.676657] NVRM: GPU at PCI:0000:02:00: GPU-a55b515a-1f01-a26d-a3f4-3470894aa349
May 15 09:58:19 deepblue kernel: [115231.676659] NVRM: GPU Board Serial Number: 0321517086901
May 15 09:58:19 deepblue kernel: [115231.676660] NVRM: Xid (PCI:0000:02:00): 79, pid=0, GPU has fallen off the bus.
May 15 09:58:19 deepblue kernel: [115231.676664] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
May 15 09:58:19 deepblue kernel: [115231.676665] NVRM: GPU 0000:02:00.0: GPU is on Board 0321517086901.
May 15 09:58:19 deepblue kernel: [115231.676671] NVRM: A GPU crash dump has been created. If possible, please run
May 15 09:58:19 deepblue kernel: [115231.676671] NVRM: nvidia-bug-report.sh as root to collect this data before
May 15 09:58:19 deepblue kernel: [115231.676671] NVRM: the NVIDIA kernel module is unloaded.

may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=1798, CMDre 00000000 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000001 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000002 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000003 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000004 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000005 00000ffc ffffffff 00000007 00ffffff
may 14 01:53:00 deepblue kernel: NVRM: Xid (PCI:0000:02:00): 56, pid=265, CMDre 00000006 00000ffc ffffffff 00000007 00ffffff

Unable to determine the device handle for GPU 0000:68:00.0: GPU is lost. Reboot the system to recover this GPU

I am using an RTX 2080 Ti.

nvidia-bug-report.log.gz (2.0 MB)

Could you @generix please give me some advice on how to address this issue? Thanks a lot.

Hi there,

I have the same issue.
Although I set my NVIDIA graphics mode to On-Demand: [screenshot]

If I use Performance mode, it freezes randomly, which is what is reported here: Ubuntu 18.04 completely freezes after a few minutes of being booted - #19 by Mart

I have an HP ZBook Firefly 14 G7 with Ubuntu 18.

Regards,
Yaqub

nvidia-bug-report.log.gz (446.7 KB)