I am running some TensorFlow neural nets on a Titan X GPU. I had no issues for months, but a couple of days ago the driver started crashing when running CUDA code. The crash happens consistently, but at a different point in the run each time. When it happens, the display becomes unresponsive. I can still ssh into the machine, where nvidia-smi shows a message along the lines of “…GPU is lost…”
I have gone through NVIDIA support, removed the Nouveau drivers, and installed several different NVIDIA drivers to try to get this working, but the problem persists. I suspect a kernel change may be the cause, since I ran an Ubuntu update. Has anyone had similar issues, or does anyone have any tips?
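
In case it helps narrow things down, here is a minimal sketch of how the kernel log could be scanned for NVIDIA Xid error lines after a crash. It assumes the driver writes lines of the form `NVRM: Xid (<device>): <code>, <detail>` to the kernel log. The function name `find_xid_events` and the regular expression are my own guesses for illustration, not part of any NVIDIA tool.

```python
import re

# Assumed log format: "NVRM: Xid (<PCI device>): <numeric code>, <free-text detail>"
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def find_xid_events(log_text):
    """Return a list of (device, code, detail) tuples for each Xid line found.

    log_text is plain text, e.g. the saved output of `dmesg`.
    Lines that do not match the assumed format are skipped.
    """
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                (
                    match.group("device"),
                    int(match.group("code")),
                    match.group("detail").strip(),
                )
            )
    return events
```

The idea would be to save the output of `dmesg` to a file over ssh right after a crash, pass its contents to `find_xid_events`, and include the resulting codes along with the output of `uname -r` when asking for help, since the Xid code usually says more than the nvidia-smi message alone.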