Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

fkeufss · January 16, 2022, 1:37pm

OS: Ubuntu 20.04
Driver Version: 470.86
GPUs: 1 x RTX3090

I recently set up a new machine for deep learning experiments, but the GPU often crashes during training. After a crash, running nvidia-smi prints: Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error. This is the output of nvidia-debugdump --list:

Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

I have tried several fixes suggested in previous posts, including upgrading the BIOS, reconnecting the power cable, and monitoring the temperature, but none of them helped. The problem persists.

Can somebody help me with this problem? I have attached nvidia-bug-report.log.gz and the output of the dmesg command for your review. Thank you very much.

dmesg.log (91.1 KB)
nvidia-bug-report.log.gz (711.6 KB)

generix · January 17, 2022, 9:35am

You might want to check whether limiting the PCIe speed to Gen3 or even Gen2 makes it more reliable.
Have you already tried reseating the graphics board in its slot?
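
To confirm that a speed limit set in the BIOS actually took effect, the negotiated link speed can be read back from sysfs. The following is a minimal sketch, assuming a Linux kernel that exposes the `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` attributes for the device; the function name `read_link_info` and its `sysfs_root` parameter are illustrative, and the address `0000:02:00.0` is taken from the error message in the first post.

```python
from pathlib import Path

# Attribute names exposed by the Linux PCI core for each device in sysfs.
LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_link_info(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCIe link attributes of the device at address `bdf`.

    Attributes that are missing on this system are returned as None.
    `sysfs_root` is a parameter only so the function can be tested
    against a temporary directory.
    """
    device_dir = Path(sysfs_root) / bdf
    info = {}
    for name in LINK_ATTRS:
        attr_file = device_dir / name
        info[name] = attr_file.read_text().strip() if attr_file.exists() else None
    return info


if __name__ == "__main__":
    # GPU address from the nvidia-smi error message in this thread.
    for key, value in read_link_info("0000:02:00.0").items():
        print(f"{key}: {value}")
```

Note that the GPU may lower its link speed while idle, so the value is best read while a training job is running.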

fkeufss · January 17, 2022, 11:29am

Sir Generix, thank you for your reply.

What critical information can you find in my log files?

I have tried many approaches, including upgrading the BIOS, reseating the GPU card, and using two separate 8-pin power cables instead of one Y-shaped cable, but the problem remains. The crashes are quite random: sometimes after the program has run for a few hours, sometimes within a few minutes. I use a 2000W PSU for the single card, so I assume the power is sufficient. I suspected a thermal issue, but the temperature looks stable when I watch nvidia-smi in real time. I will try the solution you proposed.
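
Watching nvidia-smi by eye can miss a short spike right before a crash, so it may help to log the readings instead. The following is a rough sketch, assuming the `--query-gpu` fields `timestamp`, `temperature.gpu` and `power.draw` are supported by the installed driver; the helper names `parse_sample` and `sample_gpu` are made up for illustration, and the parser assumes numeric values (a field reported as `[N/A]` would raise an error).

```python
import subprocess
import time

QUERY = "timestamp,temperature.gpu,power.draw"


def parse_sample(line):
    """Parse one 'timestamp, temperature, power' CSV line into a dict."""
    timestamp, temp, power = [field.strip() for field in line.split(",")]
    return {"timestamp": timestamp, "temp_c": int(temp), "power_w": float(power)}


def sample_gpu():
    """Query nvidia-smi once and return one dict per GPU.

    Raises subprocess.CalledProcessError if nvidia-smi exits with an
    error, which is what happens once the GPU has dropped off the bus.
    """
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_sample(line) for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Print one reading per second until nvidia-smi fails or Ctrl-C is pressed.
    while True:
        for sample in sample_gpu():
            print(sample)
        time.sleep(1)
```

The last lines printed before the failure show the temperature and power draw closest to the crash.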

generix · January 17, 2022, 11:35am

The PCIe root port (pcieport 0000:00:01.0) is reporting errors:

[ 2667.938348] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 2667.938353] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 2667.938354] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 2667.938355] pcieport 0000:00:01.0:    [12] Timeout               
[ 2950.259872] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 2950.259877] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 2950.259878] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 2950.259879] pcieport 0000:00:01.0:    [12] Timeout               
[ 3864.697827] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 3864.697832] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 3864.697833] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 3864.697834] pcieport 0000:00:01.0:    [12] Timeout               
[ 3996.562367] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 3996.562372] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 3996.562373] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 3996.562374] pcieport 0000:00:01.0:    [12] Timeout               
[ 5198.761067] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 5198.761084] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 5198.761085] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 5198.761086] pcieport 0000:00:01.0:    [12] Timeout               
[ 5211.651160] pcieport 0000:00:01.0: DPC: containment event, status:0x1f11 source:0x0000
[ 5211.651162] pcieport 0000:00:01.0: DPC: unmasked uncorrectable error detected
[ 5211.651166] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5211.651168] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00100000/00010000
[ 5211.651170] pcieport 0000:00:01.0:    [20] UnsupReq               (First)
[ 5211.651180] pcieport 0000:00:01.0: AER:   TLP Header: 34000000 02000010 00000000 00000000

The corrected Data Link Layer timeouts keep recurring until, at 5211 s, an uncorrected error triggers a DPC containment event: the link breaks down completely and the GPU is shut down.
This is most often caused by continuous high bus load and the mainboard (PCIe chipset) breaking down under it.
It can often be worked around by reducing the bus speed in the BIOS or by using a different mainboard (model).
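
To see whether a change such as a lower bus speed reduces the error rate, the AER messages can be counted per port and severity. The following is a minimal sketch, assuming the kernel prints the `PCIe Bus Error: severity=..., type=...` lines in the same format as the excerpt above; the function name `summarize_aer` and the script name in the usage note are illustrative.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
AER_LINE = re.compile(
    r"pcieport (?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def summarize_aer(dmesg_text):
    """Count AER bus errors by (port, severity, layer) in dmesg output."""
    counts = Counter()
    for match in AER_LINE.finditer(dmesg_text):
        key = (match["port"], match["severity"].strip(), match["layer"].strip())
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Read kernel log text from standard input and print one line per group.
    for (port, severity, layer), count in summarize_aer(sys.stdin.read()).items():
        print(f"{port}  {severity:<24} {layer:<18} {count}")
```

With the code saved as, say, aer_summary.py, it could be run as `dmesg | python3 aer_summary.py` (reading dmesg may require root on some systems). A growing count of corrected errors under load points to the same link problem described above.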

fkeufss · January 17, 2022, 1:46pm

Thanks. I will keep trying to figure it out and will post my final solution if the problem is solved.

fkeufss · January 18, 2022, 9:24am

Do you think this problem could be related to the driver or CUDA version? My driver is 470.86. I have tried CUDA 11.4 and CUDA 11.1, but both failed with the same problem.

895981904 · March 12, 2024, 1:28am

Hello, did you solve this problem?

fkeufss · March 12, 2024, 2:45pm

The problem was solved by replacing the motherboard.
