I’m connected through SSH to a Linux machine which has two NVIDIA Tesla V100S 32 GB GPUs. I’m doing model selection on a deep learning model using TensorFlow in a virtual environment.
I’m not the only user of the system, but the same problem has also happened when both GPUs were completely free, i.e. all 64 GB available.
I’m also using TF techniques to clear the session (and the GPU memory) after each iteration. Despite that, GPU-util starts around 5%, fluctuates between 6-8% for a while, and then around the 10th/20th iteration it suddenly jumps to 100% and gets stuck, with no chance of making progress.
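For reference, the per-iteration clean-up I mean is roughly along these lines (a simplified sketch; build_model, param_grid, train_ds, val_ds and epochs are placeholders for my actual code):

import gc
import tensorflow as tf

for params in param_grid:                  # hyper-parameter candidates (placeholder)
    model = build_model(params)            # placeholder for my model-building code
    model.fit(train_ds, validation_data=val_ds, epochs=epochs)

    # Release this iteration's graph and objects before the next candidate.
    del model
    tf.keras.backend.clear_session()       # drops the global Keras state
    gc.collect()                           # force Python to free the dangling references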
Running nvidia-smi, I get this (my processes are those named “python”; the others belong to another user):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:AF:00.0 Off |                    0 |
| N/A   38C    P0             38W /  250W |   28282MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100S-PCIE-32GB          Off |   00000000:D8:00.0 Off |                    0 |
| N/A   51C    P0             73W /  250W |   26328MiB /  32768MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   1291782      C   ...8/Desktop/MDThesis/.venv/bin/python      27972MiB |
|    0   N/A  N/A   1302409      C   python                                        306MiB |
|    1   N/A  N/A   1291782      C   ...8/Desktop/MDThesis/.venv/bin/python      21868MiB |
|    1   N/A  N/A   1302409      C   python                                       4456MiB |
+-----------------------------------------------------------------------------------------+
The Memory-Usage stays constant; the issue is the GPU-Util.
The CPU usage is at about 110%, so I’m also wondering whether there is a CPU-GPU bottleneck (maybe?).
Note also that when the GPU goes to 100%, the process on the CPU switches from running (R) to sleeping (S) state permanently.
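In case it helps with the diagnosis, I can register a stack-dump signal handler at the top of the training script and trigger it from another shell once the process hangs, to see where it is blocked (a small sketch using the standard faulthandler module; <PID> is the process id shown by nvidia-smi):

import faulthandler
import signal

# Print the Python stack of every thread to stderr when the process
# receives SIGUSR1 (sent with `kill -USR1 <PID>` from another shell),
# so I can see where the training loop is blocked once it goes to the S state.
faulthandler.register(signal.SIGUSR1)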
I’ve already looked on Google at several similar situations, such as:
Volatile gpu-util is 100 and no progress - PyTorch Forums <— PyTorch
https://www.tencentcloud.com/document/product/560/18151 <— A similar issue, but in that case no processes are loaded on the GPU; here they suggest the problem could be the ECC memory scrubbing mechanism and therefore recommend enabling persistence mode…
1080Ti stuck at idle clock frequency even at 100% GPU utilization —> This seems to be the most similar to my case
Does anyone have any ideas to share with me? It would be appreciated. Thanks.