TensorFlow freezes during training (Linux OS)

Hi,
We use TensorFlow to train CNNs. Most of the time this works without any issues, but occasionally, after training has run for a while, the entire OS freezes.

After a restart, TensorFlow sometimes (but not always) loses the ability to use the GPU.

We suspect the cause lies in Linux, TensorFlow, or CUDA.

We have already tried different images, batch sizes, etc. The code isn't the problem either.

The freezing itself isn't the main problem. The main problem is that we have to reinstall the complete system to use the GPU again.

Our configuration is:
Linux Ubuntu 16.04
AMD Ryzen 7 1800X Eight-Core 3.6 GHz
32 GB RAM
GeForce GTX 1080 Ti
Latest CUDA

We also cross-posted this on Stack Overflow:
https://stackoverflow.com/questions/49752930/tensorflow-freezes-during-training-linux-os

We hope you can help us and thank you in advance!

Greetings

Run some other GPU stress tests to see if such loads also trigger system freezes.
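
While such a stress test (or the training itself) runs, it may help to log GPU temperature and power draw to disk, so that the last reading before a freeze survives the reboot. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH; the field selection, the function names, and the file name `gpu_log.csv` are my own choices for illustration.

```python
import os
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`; this selection is an assumption.
FIELDS = ["timestamp", "name", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_row(line):
    """Split one CSV row from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def read_gpu_rows():
    """Query nvidia-smi once; return one dict per GPU, or [] if the query fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, timeout=10, check=True
        ).stdout
    except (OSError, subprocess.SubprocessError):
        # nvidia-smi missing, timed out, or returned an error
        return []
    return [parse_row(line) for line in out.splitlines() if line.strip()]


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append readings to `path`, syncing after each poll so data survives a hard freeze."""
    with open(path, "a") as f:
        while True:
            for row in read_gpu_rows():
                f.write(",".join(row[k] for k in FIELDS) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the write to disk, not just to the page cache
            time.sleep(interval_s)
```

Calling `log_forever()` in a second terminal before starting the load should leave a file whose last lines show how hot the card was and how much power it drew shortly before the machine locked up.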

What PSU does the machine have? Is the mainboard BIOS fully up to date?

Does a cold power cycle (mainboard off power for some minutes) fix the GPU recognition problem?
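
To narrow down the recognition problem, it may also help to check at which level the GPU disappears: the kernel driver, `nvidia-smi`, or TensorFlow. The following is a rough sketch, not a definitive diagnosis. It assumes the proprietary NVIDIA driver (which exposes `/proc/driver/nvidia/version` when its kernel module is loaded) and a TensorFlow build that provides `device_lib.list_local_devices()`; the function names and result labels are made up for illustration.

```python
import os
import subprocess


def driver_module_loaded():
    """True if the NVIDIA kernel module is loaded (it exposes this proc file)."""
    return os.path.exists("/proc/driver/nvidia/version")


def gpus_seen_by_driver():
    """GPU lines from `nvidia-smi -L`, or [] if the tool is missing or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10, check=True
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return []
    return [line.strip() for line in out.splitlines() if line.startswith("GPU ")]


def gpus_seen_by_tensorflow():
    """GPU device names TensorFlow reports, or None if TensorFlow cannot be imported."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]


def diagnose(module_loaded, driver_gpus, tf_gpus):
    """Map the three observations to a coarse label (a heuristic, not a verdict)."""
    if not driver_gpus:
        # Nothing listed by nvidia-smi: the problem sits below TensorFlow.
        return "driver-loaded-no-gpu" if module_loaded else "driver-not-loaded"
    if tf_gpus is None:
        return "tensorflow-not-importable"
    if not tf_gpus:
        # Driver sees the card but TensorFlow does not: look at CUDA/cuDNN/TensorFlow.
        return "tensorflow-cannot-see-gpu"
    return "ok"


def run_check():
    """Collect all three observations on this machine and return the label."""
    return diagnose(driver_module_loaded(), gpus_seen_by_driver(), gpus_seen_by_tensorflow())
```

Running `print(run_check())` once after a normal reboot and once after a cold power cycle would show whether the card vanishes at the driver level (which points toward hardware, BIOS, or driver) or only inside TensorFlow (which points toward the CUDA or TensorFlow installation).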

Christian