When training a network with PyTorch, execution sometimes freezes after a random amount of time (a few minutes), and running “nvidia-smi” then prints this message:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU
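To catch this state from a script instead of checking by hand, something like the following could work. It is only a minimal sketch: the helper names `gpu_is_lost` and `query_nvidia_smi` are made up for illustration, and it assumes that matching the "GPU is lost" substring quoted above is enough to recognise the failure.

```python
import subprocess

# Substring taken from the error message quoted above (assumed stable).
LOST_MARKER = "GPU is lost"


def gpu_is_lost(smi_output):
    """Return True if the given nvidia-smi output contains the lost-GPU message."""
    return LOST_MARKER in smi_output


def query_nvidia_smi(timeout_s=30.0):
    """Run nvidia-smi and return its combined stdout/stderr as text.

    Hypothetical helper: returns a short diagnostic string instead of
    raising if nvidia-smi is missing or does not answer in time.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        return "nvidia-smi could not be run: {}".format(exc)
    return result.stdout
```

Calling `gpu_is_lost(query_nvidia_smi())` periodically from a separate process would at least record when the GPU drops out, which may help correlate the failure with the training log.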
I’m working with:
CentOS Linux release 7.5.1804 (Core)
RTX 2080Ti
CUDA Version 10.0.130
PyTorch 1.0.1.post2
nvidia-bug-report.log.gz (1.44 MB)