Every time I train a deep learning model, my GPU seems to shut down. After that, I get:
$ nvidia-smi
Unable to determine the device handle for GPU0000:65:00.0: Unknown Error.
After I restart the machine, the GPU works again. I tried reinstalling the latest driver, but the problem still occurs.
I checked similar questions and my nvidia-bug-report log, and found that I am hitting the following error:
NVRM: Xid (PCI:0000:65:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
However, I have not figured out how to solve it.
Could someone offer any suggestions? Thanks a lot!
nvidia-bug-report.log (3.0 MB)
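
In case it helps anyone checking their own logs for the same error, here is a minimal sketch that pulls Xid events out of log text. The regular expression is an assumption based only on the single NVRM line quoted above, so other driver versions may format the message differently. The names `XID_RE` and `find_xid_events` are made up for illustration.

```python
import re

# Assumed pattern, derived only from the one NVRM line quoted in this post.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<rest>.*)"
)


def find_xid_events(lines):
    """Return a list of (pci_device, xid_code, message) for each matching line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("dev"), int(match.group("code")), match.group("rest").strip())
            )
    return events


# The exact line from my bug report, used as sample input.
sample = [
    "NVRM: Xid (PCI:0000:65:00): 79, pid='<unknown>', name=<unknown>, "
    "GPU has fallen off the bus."
]
events = find_xid_events(sample)
```

Since the function accepts any iterable of strings, an open file handle for the bug report or the kernel log should work as input too.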