Nvidia Tesla V100 goes to 100% utilization and gets stuck without any progress

I’m connected through SSH to a Linux machine that has two NVIDIA Tesla V100S 32GB GPUs. I’m doing model selection on a deep learning model using TensorFlow in a virtual environment.

I’m not the only user on the system, but the same problem also happened when both GPUs were completely free, with the full 64 GB available.

I’m also using TensorFlow’s facilities to clear the session (and free GPU memory) after each iteration. Despite that, GPU utilization starts around 5%, fluctuates between 6% and 8% for a while, and then around the 10th–20th iteration it suddenly jumps to 100% and gets stuck with no further progress.
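For context, each iteration of my loop ends with a cleanup that looks roughly like this (a minimal sketch: param_grid, build_model(), train_ds and val_ds are placeholders standing in for my actual hyperparameter grid, model and datasets, not the real code):

    import gc
    import tensorflow as tf

    # Sketch of my model-selection loop; param_grid, build_model(), train_ds
    # and val_ds are placeholders for my actual grid, model and data.
    for params in param_grid:
        model = build_model(**params)                    # fresh Keras model per config
        model.fit(train_ds, validation_data=val_ds, epochs=10)

        # cleanup after every iteration, to release GPU memory
        del model
        tf.keras.backend.clear_session()                 # drop the Keras global graph/state
        gc.collect()                                     # force Python garbage collection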

Running nvidia-smi I get this (my processes are the ones named “python”; the others belong to another user):

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Tesla V100S-PCIE-32GB          Off |   00000000:AF:00.0 Off |                    0 |
    | N/A   38C    P0             38W /  250W |   28282MiB /  32768MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  Tesla V100S-PCIE-32GB          Off |   00000000:D8:00.0 Off |                    0 |
    | N/A   51C    P0             73W /  250W |   26328MiB /  32768MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A   1291782      C   ...8/Desktop/MDThesis/.venv/bin/python      27972MiB |
    |    0   N/A  N/A   1302409      C   python                                        306MiB |
    |    1   N/A  N/A   1291782      C   ...8/Desktop/MDThesis/.venv/bin/python      21868MiB |
    |    1   N/A  N/A   1302409      C   python                                       4456MiB |
    +-----------------------------------------------------------------------------------------+

The memory usage stays constant; the issue is the GPU-Util.

The CPU usage is at about 110%, so I’m also wondering whether this could be a CPU-GPU bottleneck (maybe?).

Note also that when the GPU goes to 100%, the CPU process switches permanently from running (R) to sleeping (S) state.
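To check whether it really is an input-pipeline (CPU-GPU) bottleneck, what I’m planning to try next is profiling a handful of training steps with the TensorFlow profiler, roughly like this (the log directory and train_a_few_steps() are placeholders, not my actual code):

    import tensorflow as tf

    # Planned check: profile a few training steps and inspect the trace in
    # TensorBoard to see whether the GPU is waiting on the input pipeline.
    # "logs/profile" and train_a_few_steps() are placeholders.
    tf.profiler.experimental.start("logs/profile")
    train_a_few_steps()   # e.g. model.fit(train_ds, steps_per_epoch=20, epochs=1)
    tf.profiler.experimental.stop()

Then tensorboard --logdir logs/profile should show how much of each step is spent on the host versus the device.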

I’ve already looked at several similar situations on Google, such as:

Volatile gpu-util is 100% and no progress on PyTorch1.1 after infering lots of images - vision - PyTorch Forums <— PyTorch

Volatile gpu-util is 100 and no progress - PyTorch Forums <— PyTorch

https://www.tencentcloud.com/document/product/560/18151 <— Similar issue, but in that case no processes were loaded on the GPU… here they suggest the problem could be the ECC memory scrubbing mechanism and therefore recommend enabling persistence mode…

1080Ti stuck at idle clock frequency even at 100% GPU utilization —> This seems to be the most similar to my case

Does anyone have any ideas to share? It would be appreciated. Thanks.