Provide solution for "GPU MEM used by PID but no GPU LOAD"

Hi all,

As a service provider, I manage some DGX-1 and DGX-2 machines for a customer in a multi-user, shared DL/ML environment on Linux. We use nvidia-smi regularly, but in some cases the tool does not provide the information an engineer needs to resolve issues.

There are situations where a user’s code leads to a GPU memory leak. The GPU then reports 0% GPU LOAD while its memory usage eventually reaches 100%.
Please note that, as far as I can tell, any process that is not generating GPU LOAD does not show up with its per-PID memory usage in nvidia-smi.

E.g. card three has almost 6 GB of memory in use but no LOAD:

+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |   5930MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |

so the memory actually used by such processes cannot be identified, as they are not listed in the process table:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   3843136      C   ...custom-pytorch/bin/python    10215MiB |
|    2   N/A  N/A   1998533      C   ...custom-pytorch/bin/python     7777MiB |
|    2   N/A  N/A   3839181      C   ...custom-pytorch/bin/python     3185MiB |
|    5   N/A  N/A   3744013      C   ...envs/flower39/bin/python3     1511MiB |
|    6   N/A  N/A   3744132      C   ...envs/flower39/bin/python3     1581MiB |
+-----------------------------------------------------------------------------+
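
For illustration, the gap can be quantified per GPU by comparing memory.used with the memory attributed to the listed compute processes. A minimal sketch, assuming the nvidia-ml-py (pynvml) bindings to NVML (the library behind nvidia-smi) are installed:

import pynvml  # provided by the nvidia-ml-py package (assumption: installed)

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # total/used/free, in bytes
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        # Memory attributed to the processes nvidia-smi would list for this GPU.
        attributed = sum(p.usedGpuMemory or 0 for p in procs)
        unaccounted = mem.used - attributed
        print(f"GPU {idx}: used={mem.used >> 20} MiB, "
              f"attributed={attributed >> 20} MiB, "
              f"unaccounted={unaccounted >> 20} MiB")
finally:
    pynvml.nvmlShutdown()

On card three above this reports essentially the whole 5930 MiB as unaccounted, since no process is listed for that GPU at all.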

The only option I have found to locate the PIDs that still occupy GPU memory is fuser, which identifies the processes still “attached” to the GPU:

sudo fuser -uv /dev/nvidia3

which outputs things like:

                     userA         1039387 F...m (q525449)python3
                     userB         1039451 F...m (q525449)python3
                     userC         1039515 F...m (q525449)python3

Unfortunately, this does not report the GPU memory used per PID in any form.
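
To at least narrow things down, the fuser PIDs can be cross-checked against the PIDs that NVML (and therefore nvidia-smi) attributes memory to; whatever is left over is holding the device open without appearing in the process table. A rough sketch, assuming nvidia-ml-py (pynvml) is installed, the script runs as root, and the GPU index matches the /dev/nvidiaN minor number:

import re
import subprocess
import pynvml  # nvidia-ml-py (assumption: installed)

GPU_INDEX = 3  # /dev/nvidia3 from the example above

# PIDs that hold the device node open according to fuser (root needed to see all users).
result = subprocess.run(["fuser", "-v", f"/dev/nvidia{GPU_INDEX}"],
                        capture_output=True, text=True)
fuser_pids = {int(p) for p in re.findall(r"\b(\d+)\b", result.stdout + result.stderr)}

# PIDs that NVML (and therefore nvidia-smi) attributes GPU memory to.
pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(GPU_INDEX)
    nvml_pids = {p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)}
finally:
    pynvml.nvmlShutdown()

# Processes that hold the GPU open but are invisible in the nvidia-smi process table.
for pid in sorted(fuser_pids - nvml_pids):
    with open(f"/proc/{pid}/cmdline") as f:
        cmdline = f.read().replace("\0", " ").strip()
    print(pid, cmdline)

This still does not give the GPU memory used per PID, but it at least isolates the suspicious processes automatically.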

Usually, to resolve such an issue, you have to kill the identified processes owned by the customers/users.

But how can I identify the most worthwhile process to kill (the one occupying the most GPU memory) in order to regain free memory?

I would assume that nvidia-smi “calculates” the amount itself, as the man page states:

“memory.used”
Total memory allocated by active contexts.

Please let me know how I can access the GPU memory used per PID even if no GPU load is being reported, a.k.a. “no active context”. This would allow me to quickly identify memory-hogging processes by impact (memory occupied) and to identify the owner of each process.

Please note, none of the affected processes were identified as zombies, and all of them could easily be killed. So it remains strange to me that the OS identifies processes using the GPU (fuser), yet nvidia-smi gives no information on them because they are not considered to have an active context.

I could not find any solution, as most OS tools only report on system RAM, not GPU RAM. Please advise or reach out to me at any time. I am open to any feasible solution.

Best regards,
Ron

Hi @ron.koss ,

What if you enable accounting in DCGM? See Feature Overview — NVIDIA DCGM Documentation.

In theory that would give you some additional per-PID memory usage information, which you could then use to kill jobs that do not seem to be releasing GPU memory.
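
If you want to prototype something before deploying the full DCGM stack, the driver’s accounting data can also be reached through NVML. A rough sketch with the nvidia-ml-py (pynvml) bindings, assuming accounting mode is supported on your GPUs and the enable step runs as root:

import pynvml  # nvidia-ml-py (assumption: installed)

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)

        # Enable accounting once per GPU (root required); it only covers
        # processes started after it has been switched on.
        if pynvml.nvmlDeviceGetAccountingMode(handle) != pynvml.NVML_FEATURE_ENABLED:
            pynvml.nvmlDeviceSetAccountingMode(handle, pynvml.NVML_FEATURE_ENABLED)

        # Per-PID statistics for accounted processes on this GPU.
        for pid in pynvml.nvmlDeviceGetAccountingPids(handle):
            stats = pynvml.nvmlDeviceGetAccountingStats(handle, pid)
            print(f"GPU {idx} PID {pid}: "
                  f"maxMemoryUsage={stats.maxMemoryUsage >> 20} MiB, "
                  f"running={bool(stats.isRunning)}")
finally:
    pynvml.nvmlShutdown()

Since accounting only tracks processes launched after it is enabled, you would want to switch it on permanently (e.g. nvidia-smi -am 1 at boot); the same numbers should then also be visible via nvidia-smi --query-accounted-apps.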

ScottE


Dear ScottE,

I was not aware of that feature until now, and I hope I can test your solution as soon as possible. If it does the trick, I will mark your answer as the solution.

Much appreciated, thank you.
Ron