Provide solution for "GPU MEM used by PID but no GPU LOAD"

Hi all,

As a service provider, I manage some DGX-1 and DGX-2 machines for a customer in a multi-user, shared DL/ML environment on Linux. We use nvidia-smi regularly, but in some cases the tool does not provide the information an engineer needs to resolve issues.

There are situations where a user’s code leads to a GPU memory leak. The GPU then reports 0% GPU LOAD while its memory usage eventually reaches 100%.
Please note that, as far as I can tell, any process that is not generating GPU LOAD does not show up with its per-PID memory usage in nvidia-smi.

E.g. card three has almost 6 GB of memory in use but no LOAD:

+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |   5930MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |

so the memory actually used by such processes cannot be identified, as they are not listed in the process table:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   3843136      C   ...custom-pytorch/bin/python    10215MiB |
|    2   N/A  N/A   1998533      C   ...custom-pytorch/bin/python     7777MiB |
|    2   N/A  N/A   3839181      C   ...custom-pytorch/bin/python     3185MiB |
|    5   N/A  N/A   3744013      C   ...envs/flower39/bin/python3     1511MiB |
|    6   N/A  N/A   3744132      C   ...envs/flower39/bin/python3     1581MiB |
+-----------------------------------------------------------------------------+
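
For illustration, the gap can be quantified per GPU by comparing memory.used with the memory attributed to the listed compute processes. A minimal sketch, assuming the nvidia-ml-py (pynvml) bindings to NVML (the library behind nvidia-smi) are installed:

import pynvml  # provided by the nvidia-ml-py package (assumption: installed)

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # total/used/free, in bytes
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        # Memory attributed to the processes nvidia-smi would list for this GPU.
        attributed = sum(p.usedGpuMemory or 0 for p in procs)
        unaccounted = mem.used - attributed
        print(f"GPU {idx}: used={mem.used >> 20} MiB, "
              f"attributed={attributed >> 20} MiB, "
              f"unaccounted={unaccounted >> 20} MiB")
finally:
    pynvml.nvmlShutdown()

On card three above this reports essentially the whole 5930 MiB as unaccounted, since no process is listed for that GPU at all.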

The only option I have found to locate the PIDs that still occupy GPU memory is fuser, which identifies the processes still “attached” to the GPU:

sudo fuser -uv /dev/nvidia3

which outputs things like:

                     userA         1039387 F...m (q525449)python3
                     userB         1039451 F...m (q525449)python3
                     userC         1039515 F...m (q525449)python3

Unfortunately, this does not report the GPU memory used per PID in any form.
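
To at least narrow things down, the fuser PIDs can be cross-checked against the PIDs that NVML (and therefore nvidia-smi) attributes memory to; whatever is left over is holding the device open without appearing in the process table. A rough sketch, assuming nvidia-ml-py (pynvml) is installed, the script runs as root, and the GPU index matches the /dev/nvidiaN minor number:

import re
import subprocess
import pynvml  # nvidia-ml-py (assumption: installed)

GPU_INDEX = 3  # /dev/nvidia3 from the example above

# PIDs that hold the device node open according to fuser (root needed to see all users).
result = subprocess.run(["fuser", "-v", f"/dev/nvidia{GPU_INDEX}"],
                        capture_output=True, text=True)
fuser_pids = {int(p) for p in re.findall(r"\b(\d+)\b", result.stdout + result.stderr)}

# PIDs that NVML (and therefore nvidia-smi) attributes GPU memory to.
pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(GPU_INDEX)
    nvml_pids = {p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)}
finally:
    pynvml.nvmlShutdown()

# Processes that hold the GPU open but are invisible in the nvidia-smi process table.
for pid in sorted(fuser_pids - nvml_pids):
    with open(f"/proc/{pid}/cmdline") as f:
        cmdline = f.read().replace("\0", " ").strip()
    print(pid, cmdline)

This still does not give the GPU memory used per PID, but it at least isolates the suspicious processes automatically.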

Usually, to resolve such an issue, you have to kill the identified processes owned by the customers/users.

But how can I identify the most worthwhile process to kill (the one occupying the most GPU memory) in order to regain free memory?

I would assume that nvidia-smi “calculates” the amount itself, as the man page states:

“memory.used”
Total memory allocated by active contexts.

Please let me know how I can access the GPU memory used per PID even if no GPU load is being reported, a.k.a. “no active context”. This would allow me to quickly identify memory-hogging processes by impact (memory occupied) and to identify the owner of each process.

Please note, none of the affected processes were identified as zombies, and all of them could easily be killed. So it remains strange to me that the OS identifies processes using the GPU (fuser), yet nvidia-smi gives no information on them because they are not considered to have an active context.

I could not find any solution, as most OS tools only report on system RAM, not GPU RAM. Please advise or reach out to me at any time. I am open to any feasible solution.

Best regards,
Ron

Hi @ron.koss ,

What if you enable accounting in DCGM? See Feature Overview — NVIDIA DCGM Documentation.

In theory that would give you some additional per-PID memory usage information, which you could then use to kill jobs that do not seem to be releasing GPU memory.
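
If you want to prototype something before deploying the full DCGM stack, the driver’s accounting data can also be reached through NVML. A rough sketch with the nvidia-ml-py (pynvml) bindings, assuming accounting mode is supported on your GPUs and the enable step runs as root:

import pynvml  # nvidia-ml-py (assumption: installed)

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(idx)

        # Enable accounting once per GPU (root required); it only covers
        # processes started after it has been switched on.
        if pynvml.nvmlDeviceGetAccountingMode(handle) != pynvml.NVML_FEATURE_ENABLED:
            pynvml.nvmlDeviceSetAccountingMode(handle, pynvml.NVML_FEATURE_ENABLED)

        # Per-PID statistics for accounted processes on this GPU.
        for pid in pynvml.nvmlDeviceGetAccountingPids(handle):
            stats = pynvml.nvmlDeviceGetAccountingStats(handle, pid)
            print(f"GPU {idx} PID {pid}: "
                  f"maxMemoryUsage={stats.maxMemoryUsage >> 20} MiB, "
                  f"running={bool(stats.isRunning)}")
finally:
    pynvml.nvmlShutdown()

Since accounting only tracks processes launched after it is enabled, you would want to switch it on permanently (e.g. nvidia-smi -am 1 at boot); the same numbers should then also be visible via nvidia-smi --query-accounted-apps.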

ScottE


Dear ScottE,

I was not aware of that feature until now, and I hope I can test your solution as soon as possible. If it does the trick, I will mark your answer as the solution.

Much appreciated, thank you.
Ron