Unable to determine the device handle for GPU0000:65:00.0: Unknown Error

Every time I train a deep learning model, my GPU seems to shut down. Afterwards I get:

$ nvidia-smi
Unable to determine the device handle for GPU0000:65:00.0: Unknown Error.

After I restart the machine, the GPU works again. I tried reinstalling the latest driver, but the problem still occurs.

I checked similar questions and my nvidia-bug-report log, and found that I am hitting this error: NVRM: Xid (PCI:0000:65:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
However, I have not figured out how to solve it.

Can someone give any suggestions? Thanks a lot!
nvidia-bug-report.log (3.0 MB)

You’re getting an Xid 79, "GPU has fallen off the bus". The most common causes are overheating or insufficient power. Monitor temperatures, reseat the power connectors and the card in its slot, and check or replace the PSU.
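For the monitoring part, here is a rough sketch of what I would run, not a definitive recipe. The function names xid_codes and log_gpu_stats and the default file name gpu_stats.csv are made up for illustration; the nvidia-smi query fields are standard ones, but check nvidia-smi --help-query-gpu on your driver version.

```shell
#!/usr/bin/env bash

# Print the Xid error codes found in kernel log text read from stdin.
# Matches lines shaped like the one in the bug report above:
#   NVRM: Xid (PCI:0000:65:00): 79, pid=..., name=..., GPU has fallen off the bus.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Log temperature, power draw and SM clock once per second until interrupted.
# The optional first argument is the output file (hypothetical default name).
log_gpu_stats() {
    nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm \
               --format=csv -l 1 | tee "${1:-gpu_stats.csv}"
}
```

Start log_gpu_stats in a second terminal before training, then look at the last rows written before the card disappears. After a crash, something like sudo dmesg | xid_codes (or journalctl -k | xid_codes) should list the Xid numbers that were logged.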
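To check for power issues, you can use nvidia-smi -lgc to lock the GPU clocks to a fixed range and keep the card from boosting (if the crashes stop with the clocks capped, insufficient power is the likely cause), e.g.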
nvidia-smi -lgc 300,1500
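If you want to cap the clocks only for a single training run, a small wrapper along these lines might help. This is just a sketch: run_with_locked_clocks and the NVIDIA_SMI override variable are names I made up, train.py is a placeholder for your own job, and locking clocks normally needs root. nvidia-smi -rgc is the counterpart that resets the clocks.

```shell
#!/usr/bin/env bash

# Lock the GPU clocks to a range, run a command, then reset the clocks.
# Usage: run_with_locked_clocks <min_mhz> <max_mhz> <command> [args...]
# NVIDIA_SMI is a hypothetical override so a different binary can be used.
run_with_locked_clocks() {
    local min_mhz="$1" max_mhz="$2"
    shift 2
    local smi="${NVIDIA_SMI:-nvidia-smi}"

    # Cap the clocks; give up if the driver refuses (e.g. not run as root).
    "$smi" -lgc "${min_mhz},${max_mhz}" || return 1

    # Run the workload and remember its exit status.
    "$@"
    local status=$?

    # Restore default clock behaviour (this will fail if the GPU is already gone).
    "$smi" -rgc
    return "$status"
}
```

For example, from a root shell: run_with_locked_clocks 300 1500 python train.py. If the run survives with the cap in place but crashes without it, that points towards the PSU or the power cabling.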