Nvidia Tesla V100 goes to 100% utilization and gets stuck without any progress

I’m connected through SSH to a Linux machine that has two NVIDIA Tesla V100S 32GB GPUs. I’m doing model selection on a deep learning model using TensorFlow in a virtual environment.

I’m not the only user on the system, but the same problem also happened when both GPUs were completely free, with the full 64 GB available.

I’m also using TensorFlow’s facilities to clear the session (and free GPU memory) after each iteration. Despite that, GPU utilization starts around 5%, fluctuates between 6% and 8% for a while, and then around the 10th–20th iteration it suddenly jumps to 100% and gets stuck with no further progress.
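For context, each iteration of my loop ends with a cleanup that looks roughly like this (a minimal sketch: param_grid, build_model(), train_ds and val_ds are placeholders standing in for my actual hyperparameter grid, model and datasets, not the real code):

    import gc
    import tensorflow as tf

    # Sketch of my model-selection loop; param_grid, build_model(), train_ds
    # and val_ds are placeholders for my actual grid, model and data.
    for params in param_grid:
        model = build_model(**params)                    # fresh Keras model per config
        model.fit(train_ds, validation_data=val_ds, epochs=10)

        # cleanup after every iteration, to release GPU memory
        del model
        tf.keras.backend.clear_session()                 # drop the Keras global graph/state
        gc.collect()                                     # force Python garbage collection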

Running nvidia-smi I get this (my processes are the ones named “python”; the others belong to another user):

    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Tesla V100S-PCIE-32GB          Off |   00000000:AF:00.0 Off |                    0 |
    | N/A   38C    P0             38W /  250W |   28282MiB /  32768MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  Tesla V100S-PCIE-32GB          Off |   00000000:D8:00.0 Off |                    0 |
    | N/A   51C    P0             73W /  250W |   26328MiB /  32768MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A   1291782      C   ...8/Desktop/MDThesis/.venv/bin/python      27972MiB |
    |    0   N/A  N/A   1302409      C   python                                        306MiB |
    |    1   N/A  N/A   1291782      C   ...8/Desktop/MDThesis/.venv/bin/python      21868MiB |
    |    1   N/A  N/A   1302409      C   python                                       4456MiB |
    +-----------------------------------------------------------------------------------------+

The memory usage stays constant; the issue is the GPU-Util.

The CPU usage is at about 110%, so I’m also wondering whether this could be a CPU-GPU bottleneck (maybe?).

Note also that when the GPU goes to 100%, the CPU process switches permanently from running (R) to sleeping (S) state.
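To check whether it really is an input-pipeline (CPU-GPU) bottleneck, what I’m planning to try next is profiling a handful of training steps with the TensorFlow profiler, roughly like this (the log directory and train_a_few_steps() are placeholders, not my actual code):

    import tensorflow as tf

    # Planned check: profile a few training steps and inspect the trace in
    # TensorBoard to see whether the GPU is waiting on the input pipeline.
    # "logs/profile" and train_a_few_steps() are placeholders.
    tf.profiler.experimental.start("logs/profile")
    train_a_few_steps()   # e.g. model.fit(train_ds, steps_per_epoch=20, epochs=1)
    tf.profiler.experimental.stop()

Then tensorboard --logdir logs/profile should show how much of each step is spent on the host versus the device.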

I’ve already looked at several similar situations on Google, such as:

Volatile gpu-util is 100% and no progress on PyTorch1.1 after infering lots of images - vision - PyTorch Forums <— PyTorch

Volatile gpu-util is 100 and no progress - PyTorch Forums <— PyTorch

https://www.tencentcloud.com/document/product/560/18151 <— Similar issue, but in that case no processes were loaded on the GPU… here they suggest the problem could be the ECC memory scrubbing mechanism and therefore recommend enabling persistence mode…

1080Ti stuck at idle clock frequency even at 100% GPU utilization —> This seems to be the most similar to my case

Does anyone have any ideas to share? It would be appreciated. Thanks.