GPU crashes when running tensorflow-gpu and clock speed goes to idle at 0 MHz

r8drascal · November 13, 2018, 3:10pm

I am trying to run tensorflow-gpu under Anaconda on a GeForce GTX 960M, a card that has no problem at all running games. What I’ve noticed is that tf-gpu runs fine on the very first run. But as soon as TensorFlow stops running, the GPU tries to idle, dropping from 1097 MHz to 0 MHz, and crashes. nvidia-smi then reports “GPU is lost”, and I have to disable and re-enable the GPU in Device Manager to get it working again.

I’ve done some testing with various scripts while monitoring GPU usage with MSI Afterburner, GPU-Z, nvidia-smi and Task Manager. The only pattern I see is that if the GPU drops to idle while TensorFlow is still holding memory, the card crashes.
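For anyone who wants to log the same readings from a script instead of watching several tools at once, here is a minimal sketch that polls nvidia-smi. It assumes nvidia-smi is on the PATH and that only the first GPU matters; the function names (parse_sample, read_sample, watch) are made up for this example. Values are kept as strings because some cards report fields as not available.

```python
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu`: performance state,
# current graphics clock (MHz), and used memory (MiB).
QUERY_FIELDS = "pstate,clocks.gr,memory.used"


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a dict of raw strings."""
    pstate, clock_mhz, mem_mib = [part.strip() for part in line.split(",")]
    return {"pstate": pstate, "clock_mhz": clock_mhz, "mem_mib": mem_mib}


def read_sample():
    """Query nvidia-smi once and return the reading for the first GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + QUERY_FIELDS,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    # Assumption: a single-GPU machine, so only the first line is used.
    return parse_sample(out.strip().splitlines()[0])


def watch(interval_s=1.0, samples=60):
    """Print one timestamped sample per interval."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_sample())
        time.sleep(interval_s)
```

Calling watch() in a second terminal while the TensorFlow script runs should show whether memory is still in use at the moment the performance state changes. Once the GPU is lost, the nvidia-smi call itself will probably fail, so the last printed sample is the interesting one.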

One workaround that temporarily prevents this for very small programs is the “allow_growth” option:

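import tensorflow as tf

# Allocate GPU memory on demand instead of reserving almost all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)  # the option only takes effect once the config is passed to a session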

However, this only works if the operation is small enough to use only about 0.1 GB of GPU memory. In that case, the GPU memory is cleared to zero quickly, and only afterwards does the GPU go to idle. If the program uses even 0.3 GB, my GPU crashes, because the memory does not clear to 0 GB before the clock speed drops to 0 MHz (lower power state).

r8drascal · November 29, 2018, 3:30pm

I was finally able to figure out the issue thanks to someone from another forum. It was a driver issue: the latest drivers provided by Nvidia cause the crash, whereas the old drivers provided by my laptop manufacturer do not.

Since TensorFlow does not run on my old drivers, I could not use it for further troubleshooting. Instead, I downloaded eDrawings Viewer and opened some random assembly drawings I found online. With the latest Nvidia drivers, the card was at the P0 state while I manipulated the models, but when I left the software idle, the card dropped to a lower power state and crashed. When I repeated the exercise with my ASUS manufacturer-certified drivers (eDrawings, unlike TF, is compatible with the older drivers), my GPU did NOT crash.

I also discovered that eDrawings Viewer does not crash even with the latest Nvidia drivers if I go into the Nvidia Control Panel and select “Prefer Maximum Performance” under Power Management Mode. The card then stays at the P0 state whenever the software is open, even after idling for minutes. Unfortunately, since python.exe does not have a graphical interface, this option does not work in my case. As a workaround, I can still run TensorFlow without a crash by keeping eDrawings Viewer (or really any program that uses a graphical interface) running in the background, which keeps my card at the P0 state.
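To avoid forgetting to start the background program, the workaround can be wrapped in a small helper. This is only a sketch: background_program is a name made up for this example, the viewer path and train() in the commented usage are hypothetical placeholders, and the helper does nothing about power states by itself. It just starts the program before the TensorFlow code and stops it afterwards.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def background_program(command):
    """Run `command` for the duration of the with-block, then stop it."""
    proc = subprocess.Popen(command)
    try:
        yield proc
    finally:
        proc.terminate()  # ask the helper program to exit
        proc.wait()       # reap it so no process is left behind


# Hypothetical usage; the path and train() are placeholders, not real names:
# with background_program([r"C:\path\to\viewer.exe"]):
#     train()
```

Note that the helper program is terminated as soon as the with-block ends, so the TensorFlow session should be closed and its GPU memory released inside the block, not after it.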
