Unable to determine the device handle for GPU 0000:06:00.0: Unknown Error. The GPU only recovers after a restart, and the restart requires a long press of the power button. A reboot issued over SSH doesn’t work: the machine gets stuck at the login screen and I can’t get into the system.
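In case it helps with diagnosis, here is a minimal sketch of how the kernel log from the boot before the forced restart could be scanned for NVIDIA driver Xid messages. It assumes a systemd machine with a persistent journal (so that `journalctl -k -b -1` returns the previous boot), and the helper names `find_xid_events` and `read_previous_kernel_log` are hypothetical, made up for this example. I have not confirmed that an Xid line is actually logged when this error appears.

```python
import re
import subprocess

# Matches kernel log lines of the form "NVRM: Xid (PCI:0000:06:00): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return a (device, xid_code, full_line) tuple for every Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                (match.group("device"), int(match.group("code")), line.strip())
            )
    return events


def read_previous_kernel_log():
    """Read kernel messages from the previous boot.

    Assumption: systemd with a persistent journal; otherwise the output
    is empty and no events are found.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Calling `find_xid_events(read_previous_kernel_log())` after a forced restart would list any Xid codes the driver logged before the hang, which can then be looked up in NVIDIA's Xid documentation.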
I am training deep learning models with the PyTorch framework. When this problem first appeared, I thought it was an issue with one particular graphics card (I have two 4090s), so I swapped the positions of the GPUs, but the problem persisted. Initially it was always gpu:1 that failed, but recently gpu:0 has started showing the same problem. Therefore, I suspect it is not an issue with any specific card.
I am using good cooling, and I monitor the temperature with nvitop, which never shows it going above the 60-75 degrees Celsius range.
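Since the machine hangs and the last readings on screen are lost, a small logger that writes temperature and power draw to disk could show what the GPUs were doing right before a failure. Below is a rough sketch that polls `nvidia-smi --query-gpu` instead of nvitop. The function names, the log path, and the 5-second interval are assumptions for illustration only.

```python
import os
import subprocess
import time

QUERY_FIELDS = ["index", "temperature.gpu", "power.draw"]


def _to_float(text):
    """Convert a CSV field to float; return None for values like '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_csv(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != len(QUERY_FIELDS):
            continue  # skip lines that do not have the expected columns
        index, temp, power = parts
        rows.append(
            {"index": int(index), "temp_c": _to_float(temp), "power_w": _to_float(power)}
        )
    return rows


def sample_gpus():
    """Query all GPUs once through nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


def log_samples(path, interval_s=5.0):
    """Append samples to `path` until interrupted or nvidia-smi fails."""
    with open(path, "a") as log_file:
        while True:
            for row in sample_gpus():
                log_file.write(
                    f"{time.time():.0f},{row['index']},{row['temp_c']},{row['power_w']}\n"
                )
            # Flush and fsync so the last samples survive a hard power-off.
            log_file.flush()
            os.fsync(log_file.fileno())
            time.sleep(interval_s)
```

Running something like `log_samples("gpu_log.csv")` in a separate terminal during training would leave a record on disk that can be compared against the time of the next failure.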
Enabling persistence mode on the GPUs does not resolve the issue. Do I need to replace some component?
This problem has been troubling me for half a year.
nvidia-bug-report.log.gz (1.0 MB)