GPU crashes when running tensorflow-gpu and clock speed goes to idle at 0 MHz

I am trying to run tensorflow-gpu using Anaconda. I have a GeForce GTX 960M card, which has no problem at all running games. What I’ve noticed is that tensorflow-gpu runs fine on the very first run. But as soon as TensorFlow stops running, the GPU tries to idle, dropping from 1097 MHz to 0 MHz, and this causes the GPU to crash. nvidia-smi then reports “GPU is lost”. I then have to disable and re-enable my GPU in Device Manager to get it working again.

I’ve done some testing with various scripts while monitoring my GPU usage with MSI Afterburner, GPU-Z, nvidia-smi and Task Manager. The only pattern I see is that if the GPU goes to idle while TensorFlow is still holding GPU memory, the card crashes.
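The monitoring described above could also be scripted, so that the power state, clock speed and memory use are logged side by side. The following is only a minimal sketch, not something from my original testing: it assumes nvidia-smi is on the PATH and that the driver accepts the pstate, clocks.sm and memory.used query fields. The function names parse_gpu_status and query_gpu_status are made up for illustration.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option (assumed to be supported
# by the installed driver; some laptop GPUs do not report all of them).
QUERY_FIELDS = ["pstate", "clocks.sm", "memory.used"]


def parse_gpu_status(line):
    """Parse one CSV line such as 'P0, 1097, 312' into a dict."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != len(QUERY_FIELDS):
        raise ValueError("unexpected nvidia-smi output: %r" % line)

    def to_int(text):
        # A field may come back as 'N/A' or '[Not Supported]' instead of a number.
        try:
            return int(text)
        except ValueError:
            return None

    return {
        "pstate": parts[0],
        "sm_clock_mhz": to_int(parts[1]),
        "memory_used_mib": to_int(parts[2]),
    }


def query_gpu_status():
    """Run nvidia-smi once and return the parsed status of the first GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_gpu_status(output.strip().splitlines()[0])
```

Calling query_gpu_status() about once a second while a TensorFlow script exits should show whether the memory reaches zero before or after the card leaves the P0 state. Once the GPU is lost, nvidia-smi itself will most likely fail, so the last successful reading is the interesting one.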

One workaround that temporarily prevents this for very small programs is the “allow_growth” option, used as follows:

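import tensorflow as tf

# Allocate GPU memory on demand instead of reserving nearly all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# The option only takes effect once the config is passed to the session (TF 1.x API).
sess = tf.Session(config=config)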

However, this only works if the operation is small enough to use only about 0.1 GB of GPU memory. In that case, the GPU memory is cleared to zero quickly, and only after that does the GPU go to idle. If the program uses even 0.3 GB of GPU memory, my GPU crashes, since the memory does not clear to 0 GB before the clock speed drops to 0 MHz (lower power state).

I was finally able to figure out the issue thanks to someone on another forum. It was a driver issue: the latest drivers provided by Nvidia cause the crash, whereas the older drivers provided by my laptop manufacturer do not.

Since TensorFlow does not run on my old drivers, I could not use it for further troubleshooting. Instead, I downloaded eDrawings Viewer and opened some random assembly drawings I found online. With the latest Nvidia drivers, the card is at the P0 state while I manipulate the models, but if I do nothing and let the software idle, the card drops to a lower power state and crashes. When I repeated the same exercise with my ASUS manufacturer-certified drivers (this software, unlike TF, is compatible with the older drivers), my GPU did NOT crash.

I also discovered that eDrawings Viewer does not crash even with the latest Nvidia drivers if I go into the Nvidia Control Panel and select “Prefer Maximum Performance” under Power Management Mode. The card then stays at the P0 state whenever the software is open, even after idling for minutes. Unfortunately, since python.exe does not have a graphical interface, this option does not work in my case. As a workaround, I can still run TensorFlow without crashing the GPU by keeping eDrawings Viewer running in the background (or really any program that uses a graphical interface), which keeps my card at the P0 state.