Unable to determine the device handle for GPU

When I run my torch model, this error appears about every 20 minutes. I don't know what is wrong!
nvidia-bug-report.log (821.2 KB)

Every 1.0s: nvidia-smi zfx: Sat Oct 15 23:59:46 2022
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

Hello @justforted, welcome to the NVIDIA developer forums.

Thank you for sharing the log file.

Could you please also share your complete setup? I can see an RTX 3080 Ti, but which CPU and what platform? Desktop or server? And is it correct that you are running Ubuntu 20.04 with an RTX 3080 Ti?

The errors I found in the log suggest that there might be something physically wrong with the GPU.

Oct 9 14:08:58 zfx kernel: [ 8683.941534] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
Oct 9 14:08:58 zfx kernel: [ 8683.941536] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
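
To check whether the same event shows up again in later logs, something like the following could be used. It is only a minimal sketch: the regular expression is modeled on the two log lines quoted above, and the function name `find_xid_events` is made up for illustration, not part of any NVIDIA tool.

```python
import re

# Pattern modeled on the "NVRM: Xid (PCI:...): <code>," line quoted above.
# Assumption: other Xid lines in the kernel log follow the same layout.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    return [
        (match.group("bus"), int(match.group("code")))
        for match in XID_RE.finditer(log_text)
    ]
```

The text passed in can be the output of `dmesg` or the kernel log section of nvidia-bug-report.log. A hit with code 79 corresponds to the "GPU has fallen off the bus" message shown here.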

Did you monitor the GPU temperatures while running your torch models? Could it be that the GPU is running too hot?
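
One way to record the temperature during a training run is to poll `nvidia-smi` from a small script. This is a rough sketch, not an official tool: the chosen query fields, the one-second interval, and the helper names (`parse_gpu_stats`, `read_gpu_stats`, `log_gpu_stats`) are assumptions made for illustration.

```python
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`; this selection is an assumption.
QUERY_FIELDS = ["temperature.gpu", "power.draw", "utilization.gpu"]


def parse_gpu_stats(csv_line):
    """Parse one line of `--format=csv,noheader,nounits` output into a dict."""
    values = [value.strip() for value in csv_line.split(",")]
    stats = {}
    for name, raw in zip(QUERY_FIELDS, values):
        try:
            stats[name] = float(raw)
        except ValueError:
            stats[name] = None  # non-numeric value, e.g. "[N/A]"
    return stats


def read_gpu_stats(gpu_index=0):
    """Query nvidia-smi once for the given GPU index."""
    cmd = [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout.strip().splitlines()[0])


def log_gpu_stats(interval_s=1.0, samples=60):
    """Print a timestamped reading every interval_s seconds."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_gpu_stats())
        time.sleep(interval_s)
```

Running `log_gpu_stats()` in a second terminal while the model trains gives a timestamped trace, so the last readings before the failure can be compared with the time of the Xid message in the kernel log.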

Yes: Ubuntu 20.04 + driver x86_64-515.76.run + CUDA 11.6 + pytorch12. I didn't monitor the temperatures, but even right after rebooting the machine, the GPU fails with the error above once I have run 3 epochs.
————————————————
After monitoring the temperature, the highest reading was just 63 °C. I don't know why this error occurs!

nvidia-bug-report.log (1.1 MB)