When training either of two neural networks, one in TensorFlow and the other in Theano, the execution sometimes freezes after a random amount of time (usually a few hours, occasionally only minutes), and running “nvidia-smi” then prints:
“Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU”
I monitored the GPU during a 13-hour run and everything looked stable - see the attached graph (also here: https://unsee.cc/pusogiba/).
The same behavior also occurs on a second GPU in the same machine.
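For reference, my monitoring amounts to periodically polling nvidia-smi and recording a few stats. A minimal sketch of that approach is below; the script itself is illustrative (the queried fields are standard `--query-gpu` properties listed by `nvidia-smi --help-query-gpu`, but the polling loop and function names are just my own):

```python
import csv
import io
import subprocess
import time

# Standard nvidia-smi --query-gpu properties (see `nvidia-smi --help-query-gpu`).
FIELDS = ["timestamp", "temperature.gpu", "power.draw",
          "utilization.gpu", "memory.used"]

def parse_smi_line(line):
    """Parse one CSV line of `nvidia-smi --format=csv,noheader,nounits`
    output into a dict keyed by the queried field names."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(FIELDS, (v.strip() for v in values)))

def log_gpu_stats(interval_s=60):
    """Poll nvidia-smi every `interval_s` seconds and print parsed stats,
    one record per GPU per poll."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    while True:
        out = subprocess.check_output(cmd, universal_newlines=True)
        for line in out.strip().splitlines():  # one line per GPU
            print(parse_smi_line(line))
        time.sleep(interval_s)

# To start polling on a machine with an NVIDIA GPU:
# log_gpu_stats(interval_s=60)
```

Redirecting the output to a file gives the time series I plotted in the graph above.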
I’m working with:
- Ubuntu 14.04.5 LTS
- GPUs are Titan Xp
- CUDA 8.0
- cuDNN 5.1
- TensorFlow 1.3
- Theano 0.8.2
I’m not sure how to approach this problem. Can anyone suggest what might cause it and how to diagnose or fix it?