OS: Ubuntu 20.04
Driver Version: 470.86
GPUs: 1 x RTX3090
I recently set up a new machine for deep learning experiments, but the GPU often crashes during training. After a crash, running nvidia-smi prints: Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error. This is the output of nvidia-debugdump --list:
Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
I have tried many methods suggested in previous posts, including upgrading the BIOS, reconnecting the power cable, and monitoring the temperature, but none of them helped; the problem persists.
Can somebody help me with this problem? I have attached nvidia-bug-report.log.gz and the log from the dmesg command for your review. Thank you very much.
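If it helps narrow things down, the driver-related lines can be filtered out of the dmesg log with a short script like the one below (a minimal sketch; dmesg.log is just a placeholder filename, and "NVRM: Xid" is the usual marker for driver-reported GPU errors):

```python
#!/usr/bin/env python3
"""Pull NVIDIA driver messages (e.g. "NVRM: Xid" errors) out of a dmesg dump.

Minimal sketch: assumes the dmesg output was saved to a plain-text file,
here called dmesg.log (the filename is only an example).
"""
import re
import sys

LOG_FILE = sys.argv[1] if len(sys.argv) > 1 else "dmesg.log"

# Lines containing "NVRM" come from the NVIDIA kernel module; "Xid" entries
# are the driver's own error reports and usually pinpoint the failure type.
pattern = re.compile(r"NVRM|Xid|nvidia", re.IGNORECASE)

with open(LOG_FILE, "r", errors="replace") as fh:
    for line in fh:
        if pattern.search(line):
            print(line.rstrip())
```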
You might want to try whether limiting the PCIe speed to Gen3 or even Gen2 makes it more reliable.
Did you already try reseating the graphics board in its slot?
What critical information can you find in my log files?
I have tried many approaches, including upgrading the BIOS, reseating the GPU card, and using two separate 8-pin power cables instead of one Y-shaped cable, but the problem remains. It crashes quite randomly, sometimes after the program has run for a few hours and sometimes within just a few minutes. I use a 2000W PSU for the single card, so I assume the power supply is sufficient. I also suspected a thermal issue, but the temperature stays stable when I watch nvidia-smi in real time. I could try the solution you proposed.
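In the meantime, to capture the state of the card right before a crash, a small logger along these lines could poll nvidia-smi periodically (a rough sketch; the query fields are standard nvidia-smi --query-gpu options, while the 5-second interval and the gpu_watch.csv filename are arbitrary choices):

```python
#!/usr/bin/env python3
"""Periodically log GPU temperature, power draw, and PCIe link state.

Rough sketch: polls `nvidia-smi --query-gpu=...` every few seconds and appends
timestamped CSV rows, so the last readings before a crash are preserved.
Stop it with Ctrl+C; interval and filename are arbitrary choices.
"""
import subprocess
import time
from datetime import datetime

QUERY = "temperature.gpu,power.draw,pcie.link.gen.current,pcie.link.width.current"
INTERVAL_S = 5
OUT_FILE = "gpu_watch.csv"

with open(OUT_FILE, "a") as log:
    while True:
        try:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            # When the driver loses the GPU, nvidia-smi fails; record that too.
            line = out.stdout.strip() if out.returncode == 0 else f"ERROR: {out.stderr.strip()}"
        except subprocess.TimeoutExpired:
            line = "ERROR: nvidia-smi timed out"
        log.write(f"{datetime.now().isoformat()}, {line}\n")
        log.flush()
        time.sleep(INTERVAL_S)
```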
This is most often caused by continuous high bus loads and the mainboard (PCIe chipset) breaking down, finally leading to it failing completely and the GPU being shut down.
It can often be worked around by reducing the bus speed in the BIOS or by using a different mainboard (model).
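If you do force a lower speed in the BIOS, you can verify what the link actually negotiated with a quick query like this (a minimal sketch using standard nvidia-smi --query-gpu fields; note the link may report a lower generation while idle due to power management, so check it under load):

```python
#!/usr/bin/env python3
"""One-shot check of the PCIe link generation the GPU is currently running at.

Sketch for verifying that a BIOS change (forcing Gen3/Gen2) took effect.
"""
import subprocess

fields = "pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())
```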
Do you think this problem could be related to the driver or CUDA version? My driver is 470.86. I have tried CUDA 11.4 and CUDA 11.1, but both failed with the same problem.