Ubuntu 16.04 CUDA8 crashing graphics driver

I am running some TensorFlow neural nets on a Titan X GPU. I had no issues for months, but a couple of days ago the driver started crashing when running CUDA code on it. This happens consistently, but at different points in the run. The display becomes unresponsive. I can still ssh into the machine, where nvidia-smi reports a message along the lines of “…GPU is lost…”

I have gone through NVIDIA support, removed the Nouveau drivers, and installed different NVIDIA drivers to see if I can get this working, but the problem persists. I suspect it may be down to a change in kernel, as I recently did an Ubuntu update. Has anyone had similar issues, or does anyone have any tips?

Is the GPU or system overheating? Please also make sure the GPU is firmly seated in the PCIe slot and that the power cables are also firmly connected.

I don’t think it is. I have nvidia-smi running to monitor the card, and it never reports a temperature over 85C. Looking at it, I would guess an average of about 80C, with no obvious spikes before the crash.
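In case it is useful to anyone watching for the same thing, here is a rough sketch of how the temperature could be logged to a file, so the last readings before a crash can be checked afterwards. It assumes nvidia-smi is on the PATH; the function names and the log file name gpu_log.csv are just placeholders I made up, not anything from the driver or from TensorFlow.

```python
import subprocess
import time

# Assumption: nvidia-smi is installed and on PATH.
# --query-gpu and --format=csv,noheader,nounits are standard nvidia-smi options.
QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_line(line):
    """Split one CSV line such as '2017/01/01 12:00:00.000, 80'
    into (timestamp string, temperature in degrees C as int)."""
    timestamp, temp = [field.strip() for field in line.split(",")]
    return timestamp, int(temp)


def log_gpu(path="gpu_log.csv", interval_s=5, samples=None):
    """Append one reading per GPU to `path` every `interval_s` seconds.

    `samples=None` means keep going until interrupted (hypothetical default).
    """
    taken = 0
    with open(path, "a") as log:
        while samples is None or taken < samples:
            try:
                out = subprocess.check_output(QUERY, universal_newlines=True)
                for line in out.strip().splitlines():
                    timestamp, temp = parse_line(line)
                    log.write("{},{}\n".format(timestamp, temp))
            except (subprocess.CalledProcessError, ValueError) as err:
                # nvidia-smi exits with an error or prints something
                # unparseable once the GPU is no longer reachable.
                log.write("{},ERROR {}\n".format(time.strftime("%Y/%m/%d %H:%M:%S"), err))
            log.flush()  # make sure the reading reaches the file before any hang
            taken += 1
            time.sleep(interval_s)
```

Running log_gpu() in a separate terminal or ssh session while the training job runs should leave a record of the temperature right up to the point the card drops out.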

I have checked that the power cables are firmly connected and that the card is firmly seated in the PCIe slot. I have started another run now to check. The last one ran for 2 hours before crashing.

The reason I ask is that if the GPU just suddenly stops responding like that – especially when it’s an isolated case and it used to work – it’s almost always faulty hardware somewhere.

The card has now run my CUDA computation for 10 hours. I am going to chalk this one up to a loose power cable.