Provide solution for "GPU MEM used by PID but no GPU LOAD"

Hi all,

As a service provider, I am managing several DGX-1 and DGX-2 machines for a customer in a multi-user, shared DL/ML environment on Linux. We use nvidia-smi regularly, but in some cases the tool does not provide the information an engineer would need to resolve issues.

There are situations where a user’s code leaks GPU memory. The GPU then reports 0% GPU LOAD while its memory usage eventually reaches 100%.
Please note: as far as I can tell, nvidia-smi does not list per-PID memory usage for processes that generate no GPU LOAD.

E.g. card 3 has almost 6 GB of GPU memory in use but no LOAD:

|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |   5930MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |

so the memory actually used by such processes cannot be identified, as they are not listed:

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    1   N/A  N/A   3843136      C   ...custom-pytorch/bin/python    10215MiB |
|    2   N/A  N/A   1998533      C   ...custom-pytorch/bin/python     7777MiB |
|    2   N/A  N/A   3839181      C   ...custom-pytorch/bin/python     3185MiB |
|    5   N/A  N/A   3744013      C   ...envs/flower39/bin/python3     1511MiB |
|    6   N/A  N/A   3744132      C   ...envs/flower39/bin/python3     1581MiB |

The only option I have found to locate the PIDs that still occupy the memory is fuser, which identifies the processes still “attached” to the GPU:

sudo fuser -uv /dev/nvidia3

which outputs something like:

                     userA         1039387 F...m (q525449)python3
                     userB         1039451 F...m (q525449)python3
                     userC         1039515 F...m (q525449)python3

Unfortunately, this does not report the GPU memory used per PID at all.
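One workaround I can sketch is to diff the PIDs reported by fuser against the PIDs nvidia-smi lists (e.g. via `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader`): any PID attached to the device node but absent from the compute-apps list is a candidate for holding leaked memory. A minimal Python sketch; the sample outputs below are illustrative, not captured from the affected host:

```python
import re

# Illustrative `sudo fuser -uv /dev/nvidia3` output (USER PID ACCESS COMMAND).
FUSER_OUTPUT = """\
/dev/nvidia3:        userA         1039387 F...m (q525449)python3
                     userB         1039451 F...m (q525449)python3
                     userC         1039515 F...m (q525449)python3
"""

# Illustrative output of:
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader
COMPUTE_APPS_CSV = """\
1039387, 5930 MiB
"""

def fuser_pids(text):
    """Extract (user, pid) pairs from `fuser -uv` output."""
    pairs = []
    for line in text.splitlines():
        m = re.search(r"(\S+)\s+(\d+)\s+\S+", line)
        if m:
            pairs.append((m.group(1), int(m.group(2))))
    return pairs

def compute_app_pids(csv_text):
    """Extract the set of PIDs from the compute-apps CSV query."""
    return {int(line.split(",")[0]) for line in csv_text.splitlines() if line.strip()}

attached = fuser_pids(FUSER_OUTPUT)
listed = compute_app_pids(COMPUTE_APPS_CSV)
# PIDs holding the device node open but missing from the process table:
orphans = [(user, pid) for user, pid in attached if pid not in listed]
print(orphans)
```

This still does not tell you how much GPU memory each orphaned PID holds, but it does give you a scripted candidate list (with owners) per device node.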

Usually, to fix such an issue, you would probably be required to kill the identified processes owned by customers/users.

But how can I identify the most valuable process to kill (the one consuming the most GPU memory) in order to regain free memory?

I would assume the nvidia-smi tool “calculates” the amount itself, as the man page states:

Total memory allocated by active contexts.

Please let me know how I can access the GPU memory used per PID even when no GPU load is reported, a.k.a. “no active context”. This would allow me to quickly identify memory-hogging processes by impact (memory occupied) and to identify the owner of each process.

Please note, none of the affected processes were zombies, and all could easily be killed. So it remains strange to me that the OS identifies processes using the GPU (via fuser) while nvidia-smi gives no information on them, because they are not considered to have an active context.

I could not find a solution, as most OS tools only report on system RAM, not GPU RAM. Please advise or reach out to me at any time. I am open to any feasible solution.

Best regards,

Hi @ron.koss ,

What if you enable accounting in DCGM? See: Feature Overview — NVIDIA DCGM Documentation

In theory, that would give you some additional per-PID memory usage information, which you could then use to kill jobs that do not appear to be releasing GPU memory.
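A related angle, if I understand the driver tooling correctly, is the driver’s own accounting mode, which retains per-PID peak memory stats even for terminated processes: enable it with `sudo nvidia-smi -am 1`, then query with `nvidia-smi --query-accounted-apps=pid,max_memory_usage --format=csv,noheader` (only processes started after enabling are tracked). A small sketch that ranks such output to find the biggest consumer; the CSV values below are illustrative, not real measurements:

```python
# Illustrative output of:
#   nvidia-smi --query-accounted-apps=pid,max_memory_usage --format=csv,noheader
# (assumes accounting mode was enabled earlier via: sudo nvidia-smi -am 1)
ACCOUNTED_CSV = """\
1039387, 5930 MiB
1039451, 3185 MiB
1039515, 1511 MiB
"""

def rank_by_memory(csv_text):
    """Return (pid, MiB) tuples sorted by peak GPU memory, largest first."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        pid_s, mem_s = (f.strip() for f in line.split(","))
        rows.append((int(pid_s), int(mem_s.split()[0])))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# The first entry is the best candidate to kill to regain memory.
print(rank_by_memory(ACCOUNTED_CSV))
```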



Dear ScottE,

I was not aware of that feature until now, and I hope to test your solution ASAP. If it does the trick, I will mark your answer as the solution.

Much appreciated, thank you.