Hello.
I am currently seeing the issue shown in the figure below.
It occurs after inference with PyTorch has finished: the GPU memory is still being held, so when I run inference again I get "cuda out of memory".
We looked for similar issues, and the closest match we found is the page below:
(11 GB of GPU RAM used, and no process listed by nvidia-smi)
What I have tried so far is as follows.
- $ sudo nvidia-smi --gpu-reset -i 0
→ GPU 00000000:02:02.0 is currently in use by another process.
1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.
- for i in $(sudo lsof /dev/nvidia* | grep python | awk '{print $2}' | sort -u); do kill -9 $i; done
→ Nothing changed.
- sudo fuser -v /dev/nvidia*
→ Nothing changed.
- Modify the PyTorch code (see the sketch below)
  - wrap inference in torch.no_grad()
  - call torch.cuda.empty_cache() after inference
I tried the four methods above, but even though no process is listed, the GPU memory stays occupied and I could not free it.
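For reference, after those two code changes my inference roughly follows the pattern below. This is a minimal sketch only: the model, its size, and the input batches are placeholders, not my actual code.

import torch

# Placeholder model and inputs, just to show the structure; the real code loads my own model and data.
model = torch.nn.Linear(1024, 10).cuda()
model.eval()
batches = [torch.randn(32, 1024, device="cuda") for _ in range(4)]

# no_grad() keeps autograd from building a graph, so no extra activations are kept alive.
with torch.no_grad():
    for batch in batches:
        out = model(batch)

# empty_cache() returns cached blocks to the driver, but it can only free memory
# that is no longer referenced by any live tensor; it cannot reclaim memory held
# by another (possibly zombie) process.
del out, batches
torch.cuda.empty_cache()

Even with these changes, the memory shown by nvidia-smi stays occupied after the run finishes.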
The last resort is a reboot, but even after rebooting, the same problem occurs again as soon as I run inference.
How can we solve this problem?
Thank you.
I have the same issue:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:07:00.0 Off | On |
| N/A 29C P0 53W / 400W | 45MiB / 81920MiB | N/A Default |
| | | Disabled* |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:0F:00.0 Off | 0 |
| N/A 29C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:47:00.0 Off | 0 |
| N/A 29C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000000:4E:00.0 Off | 0 |
| N/A 30C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:87:00.0 Off | 0 |
| N/A 34C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... Off | 00000000:90:00.0 Off | 0 |
| N/A 33C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... Off | 00000000:B7:00.0 Off | 0 |
| N/A 32C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... Off | 00000000:BD:00.0 Off | On |
| N/A 31C P0 55W / 400W | 45MiB / 81920MiB | N/A Default |
| | | Disabled* |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 14 0 6 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 7 0 0 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 8 0 1 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 9 0 2 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 10 0 3 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 11 0 4 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 12 0 5 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 13 0 6 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Killing nvidia-persistenced, fabric manager, and DCGM, as well as flushing out the Docker containers, all worked once, but not again after a restart, so this issue is back.