GPU memory allocated but GPU usage 0%

GPUs: 2 × Tesla K80
CUDA: 10.0
tensorflow-gpu: 1.15
Keras: 2.2.4

During training, one GPU's memory is allocated but its utilisation shows 0% most of the time; it spikes to 100% for a few seconds, then drops back to 0%. This is reflected in the epoch duration: training takes a long time, which suggests that on this GPU only memory is allocated and it is not actually being used. The other GPU is always at 100% utilisation.

Please refer to the screenshot below. A process ID is shown for both GPUs, and memory is allocated on both.


What is the issue here? Why is GPU memory allocated while GPU utilisation stays near 0%?

It depends on your DL/ML library and on your model architecture. As far as I can tell, training work is split between the CPU and the GPU. While there is work that has to run on the CPU, your GPU sits idle, and vice versa (some algorithms cannot run in parallel on the GPU, so the CPU is preferable for them).
With idle gaps of only a few hundred milliseconds, it is easy to catch the GPU at 0% utilisation when sampling. So there is no real issue, except that your GPU0 is at 91°C, which is not safe.
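One way to tell a GPU that is merely idle between batches from one that is never used is to sample utilisation repeatedly rather than glancing at a single nvidia-smi reading. A minimal sketch, using only the standard library and nvidia-smi's CSV query mode (the query fields below are standard nvidia-smi options; the polling parameters are arbitrary choices):

```python
import subprocess
import time

# One CSV line per GPU: "index, utilization %, temperature C"
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,temperature.gpu",
         "--format=csv,noheader,nounits"]

def parse_gpu_sample(text):
    """Parse one nvidia-smi CSV sample into (index, util %, temp C) tuples."""
    samples = []
    for line in text.strip().splitlines():
        index, util, temp = (field.strip() for field in line.split(","))
        samples.append((int(index), int(util), int(temp)))
    return samples

def sample_utilization(n=30, interval=0.5):
    """Poll nvidia-smi n times. A GPU that is only waiting on the CPU shows
    occasional spikes to 100%; one that is never used stays at 0% throughout."""
    history = []
    for _ in range(n):
        out = subprocess.check_output(QUERY, universal_newlines=True)
        history.append(parse_gpu_sample(out))
        time.sleep(interval)
    return history

# Parsing a canned sample (matching the situation described above):
print(parse_gpu_sample("0, 0, 91\n1, 100, 74\n"))  # -> [(0, 0, 91), (1, 100, 74)]
```

If GPU1 shows periodic 100% spikes across many samples, it is participating in training and the 0% readings are just sampling between bursts; if it stays flat at 0% for the whole run, the model is likely not placed on it at all.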

While an operating temperature of 91°C is far from ideal, it should still be safe. nvidia-smi -q shows Max Operating Temp, Slowdown Temp, and Shutdown Temp for a GPU. The last of these is the temperature at which a GPU will shut down to prevent permanent damage. For the K80 specifically, Shutdown Temp appears to be 93°C.

However, the Slowdown Temp for K80 seems to be 88°C, so by operating the GPU above this temperature one throws away performance. It would be highly advisable to investigate how airflow across this K80 could be improved.
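The thresholds above can be checked programmatically by parsing the text report from nvidia-smi -q -d TEMPERATURE. A small sketch, assuming the typical "GPU <name> Temp : NN C" line format (exact field names can vary across driver versions, and the sample string below mirrors the K80 figures quoted above rather than real captured output):

```python
import re

def read_temps(report):
    """Extract 'GPU <name> Temp : NN C' fields from the output of
    `nvidia-smi -q -d TEMPERATURE` into a {name: degrees C} dict."""
    temps = {}
    for name, value in re.findall(r"GPU (\w[\w ]*?) Temp\s*:\s*(\d+) C", report):
        temps[name] = int(value)
    return temps

# Hypothetical report matching the K80 numbers discussed above.
sample = """
    GPU Current Temp                  : 91 C
    GPU Shutdown Temp                 : 93 C
    GPU Slowdown Temp                 : 88 C
"""

temps = read_temps(sample)
if temps["Current"] >= temps["Slowdown"]:
    # Above Slowdown Temp the GPU clocks down, costing performance.
    print("throttling: %(Current)d C >= slowdown %(Slowdown)d C" % temps)
```

Running such a check periodically during training would flag exactly the situation described here, where the card sits above its Slowdown Temp and silently loses performance.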