GPU is lost during execution of pytorch code

When training a network with PyTorch, the execution sometimes freezes after a random amount of time (a few minutes), and running “nvidia-smi” gives this message:
Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

I’m working with:
CentOS Linux release 7.5.1804 (Core)
RTX 2080Ti
CUDA Version 10.0.130
PyTorch 1.0.1.post2

nvidia-bug-report.log.gz (1.44 MB) — the log file

It’s sometimes an indication of either a temperature or a power issue with the GPU: the GPU temperature may be getting too high, or the power delivery to the GPU may be inadequate.

These are just possibilities, of course, and such issues usually can’t be conclusively diagnosed in a forum like this. It usually requires trial and experimentation: for example, monitoring the temperature, trying a larger power supply, etc.
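As a starting point for the monitoring suggestion, something like the sketch below could log temperature and power draw alongside training. It wraps the real `nvidia-smi --query-gpu=temperature.gpu,power.draw` query; the 85 °C warning threshold is an assumption, not a documented limit — adjust it for your card.

```python
import subprocess

def read_gpu_stats(sample=None):
    """Return (temperature_C, power_draw_W) for GPU 0.

    `sample` lets the CSV parsing be exercised without a GPU present;
    when it is None, nvidia-smi is queried for real.
    """
    if sample is None:
        sample = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=temperature.gpu,power.draw",
             "--format=csv,noheader,nounits"],
            text=True)
    # nvidia-smi emits e.g. "84, 257.3" with the flags above
    temp_s, power_s = sample.strip().splitlines()[0].split(", ")
    return int(temp_s), float(power_s)

# Parsing a captured nvidia-smi line (no GPU needed for this example):
temp, power = read_gpu_stats("84, 257.3\n")
if temp >= 85:  # assumed threshold; pick one appropriate for your GPU
    print(f"GPU running hot: {temp} C at {power} W")
```

Running such a loop (e.g. every few seconds, or `watch -n 5 nvidia-smi`) right up until the freeze can show whether temperature or power draw spikes just before the GPU falls off the bus.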