RTX 4090 frequently encounters the issue "Unable to determine the device handle for GPU 0000:06:00.0: Unknown Error". It seems not to be an overheatin

The error is "Unable to determine the device handle for GPU 0000:06:00.0: Unknown Error". The system recovers after a restart, but only a hard restart via a long press of the power button works; a reboot over SSH gets stuck at the login screen, and I can't enter the system.
I am training deep learning models with the PyTorch framework. When this problem first appeared, I thought it was an issue with one particular graphics card (I have two 4090s), so I swapped the positions of the GPUs, but the problem persisted. Initially, it was always gpu:1 that had the issue, but recently gpu:0 started having the same problem as well. Therefore, I suspect it is not an issue with any specific card.
I am using excellent cooling, and I monitor the temperature with nvitop, which shows it peaking at no more than 60-75 degrees Celsius, so overheating seems unlikely.
Enabling persistence mode on the GPU does not resolve the issue. Do I need to replace some component?

This problem has been troubling me for half a year.
nvidia-bug-report.log.gz (1.0 MB)

Insufficient/broken PSU.

Thank you very much. I will try replacing my PSU to see whether the same problem still occurs. Initially, I thought a 2 kW PSU would be enough, since dual 4090s only draw about 900 W.

With ML workloads, the GPUs produce excessive power spikes, so the total rated wattage of the PSU alone doesn't tell you whether it can cope.
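To get a rough picture of what the cards draw during training, something like the following sketch could log per-GPU peak temperature and power by polling `nvidia-smi`. It assumes `nvidia-smi` is on the PATH and that the `index`, `temperature.gpu`, and `power.draw` query fields are supported by the installed driver; the function names (`read_sample`, `parse_sample`, `update_peaks`, `monitor`) are made up for this example. Note that polling at this rate is probably far too coarse to catch very short transient spikes, so a clean log here would not rule out the PSU.

```python
import subprocess
import time

QUERY = "index,temperature.gpu,power.draw"


def read_sample():
    """Run nvidia-smi once and return its CSV output as text."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_sample(text):
    """Parse lines like '0, 62, 412.35' into (index, temp_c, power_w) tuples."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        try:
            rows.append((int(parts[0]), float(parts[1]), float(parts[2])))
        except ValueError:
            # Skip rows with non-numeric fields, e.g. "[N/A]".
            continue
    return rows


def update_peaks(peaks, rows):
    """Keep the highest temperature and power seen so far for each GPU."""
    for idx, temp_c, power_w in rows:
        old_temp, old_power = peaks.get(idx, (0.0, 0.0))
        peaks[idx] = (max(old_temp, temp_c), max(old_power, power_w))
    return peaks


def monitor(duration_s=60.0, interval_s=0.5):
    """Poll nvidia-smi for duration_s seconds and return per-GPU peaks."""
    peaks = {}
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        update_peaks(peaks, parse_sample(read_sample()))
        time.sleep(interval_s)
    return peaks
```

Calling `monitor(duration_s=600)` in a second terminal while a training job runs would return a dict mapping each GPU index to its `(peak_temp_c, peak_power_w)` pair. If a GPU drops off mid-run, `nvidia-smi` will most likely exit with an error and `read_sample` will raise `subprocess.CalledProcessError`, which at least timestamps the failure.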

Could you please tell me how I should choose a new PSU? I have no knowledge in this area. What should I look out for?

I can't help you with that; it's mostly trial and error.