Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

OS: Ubuntu 20.04
Driver Version: 470.86
GPUs: 1 x RTX3090

Recently I set up a new machine for deep learning experiments. However, the GPU often crashes during training. When I then run nvidia-smi, the terminal shows: Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error. This is the output of nvidia-debugdump --list:

Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

I have tried several fixes suggested in previous posts, including upgrading the BIOS, reconnecting the power cable, and monitoring the temperature, but none of them helped. The problem persists.

Can somebody help me with this problem? I have attached nvidia-bug-report.log.gz and the output of the dmesg command for your review. Thank you very much.

dmesg.log (91.1 KB)
nvidia-bug-report.log.gz (711.6 KB)

You might want to check whether limiting the PCIe link speed to Gen3 or even Gen2 makes it more reliable.
Did you already try reseating the graphics board in its slot?
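
To confirm which link speed the card has actually negotiated, before and after changing the BIOS setting, the Linux sysfs attributes for the device can be read. The following is a minimal Python sketch, assuming the GPU sits at 0000:02:00.0 as in the error message and that the kernel exposes the usual link attributes under /sys/bus/pci/devices; read_link_info is a helper name made up for this example. As a rough guide, 5.0 GT/s corresponds to Gen2, 8.0 GT/s to Gen3 and 16.0 GT/s to Gen4. The reported speed may be lower while the GPU is idle, so it is probably best checked under load.

```python
from pathlib import Path


def read_link_info(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Return the negotiated and maximum PCIe link speed/width for one device.

    bdf is the PCI address (domain:bus:device.function). Attributes that the
    kernel does not expose for this device are returned as None.
    """
    dev = Path(sysfs_root) / bdf
    info = {}
    for name in ("current_link_speed", "max_link_speed",
                 "current_link_width", "max_link_width"):
        attr = dev / name
        info[name] = attr.read_text().strip() if attr.exists() else None
    return info


if __name__ == "__main__":
    # Assumption: the GPU address is taken from the nvidia-smi error message.
    for key, value in read_link_info("0000:02:00.0").items():
        print(f"{key}: {value}")
```

If current_link_speed still shows the higher generation after the BIOS change, the setting was probably applied to a different slot or not saved.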

Sir Generix, thank you for your reply.

Is there any critical information you can find in my log files?

I have tried many approaches, including upgrading the BIOS, reseating the GPU card, and using two separate 8-pin power cables instead of one Y-shaped cable, but the problem remains. It crashes quite randomly, sometimes after the program has run for a few hours and sometimes within a few minutes. I use a 2000W PSU for the single card, so I guess the power is sufficient. I suspected a thermal issue, but the temperature looks stable when I watch nvidia-smi in real time. I can try the solution you proposed.

The PCIe root port is reporting errors:

[ 2667.938348] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 2667.938353] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 2667.938354] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 2667.938355] pcieport 0000:00:01.0:    [12] Timeout               
[ 2950.259872] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 2950.259877] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 2950.259878] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 2950.259879] pcieport 0000:00:01.0:    [12] Timeout               
[ 3864.697827] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 3864.697832] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 3864.697833] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 3864.697834] pcieport 0000:00:01.0:    [12] Timeout               
[ 3996.562367] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 3996.562372] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 3996.562373] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 3996.562374] pcieport 0000:00:01.0:    [12] Timeout               
[ 5198.761067] pcieport 0000:00:01.0: AER: Corrected error received: 0000:00:01.0
[ 5198.761084] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 5198.761085] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00001000/00002000
[ 5198.761086] pcieport 0000:00:01.0:    [12] Timeout               
[ 5211.651160] pcieport 0000:00:01.0: DPC: containment event, status:0x1f11 source:0x0000
[ 5211.651162] pcieport 0000:00:01.0: DPC: unmasked uncorrectable error detected
[ 5211.651166] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 5211.651168] pcieport 0000:00:01.0:   device [8086:4c01] error status/mask=00100000/00010000
[ 5211.651170] pcieport 0000:00:01.0:    [20] UnsupReq               (First)
[ 5211.651180] pcieport 0000:00:01.0: AER:   TLP Header: 34000000 02000010 00000000 00000000

finally leading to the bus breaking down completely and the GPU being shut down.
This is most often caused by the mainboard (PCIe chipset) failing under continuous high bus load.
It can often be worked around by reducing the bus speed in the BIOS or by using a different mainboard (model).
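
To see how often these errors occur over a longer run, and whether lowering the bus speed reduces them, the "PCIe Bus Error" lines in the dmesg output can be tallied. Below is a rough Python sketch; the regular expression is written against the line format shown above, and summarize_aer and SAMPLE are names made up for this example, not part of any NVIDIA or kernel tool.

```python
import re
from collections import Counter

# Matches the kernel's AER summary lines, for example:
# "pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)"
AER_LINE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def summarize_aer(log_text: str) -> Counter:
    """Count AER reports per (device, severity, layer) in dmesg-style text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = AER_LINE.search(line)
        if m:
            counts[(m["dev"], m["severity"], m["layer"])] += 1
    return counts


# Two lines copied from the log above, used here as sample input.
SAMPLE = """\
[ 5198.761084] pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[ 5211.651166] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
"""

for (dev, severity, layer), n in summarize_aer(SAMPLE).items():
    print(f"{dev}  {severity:<24} {layer:<18} x{n}")
```

Passing the contents of a saved dmesg log to summarize_aer instead of SAMPLE gives the counts for a whole session; a falling number of Corrected entries after the BIOS change would suggest the workaround is helping.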

Thanks. I will keep trying to figure it out and will post my final solution if the problem gets solved.

Do you think this problem could be related to the driver or CUDA version? My driver is 470.86. I have tried CUDA 11.4 and CUDA 11.1, but both failed with the same problem.