Unified Memory: nvidia-smi "Memory Usage" interpretation

I haven’t found any documentation on how to interpret what nvidia-smi reports regarding Memory Usage for processes that use Unified Memory (cudaMallocManaged()).
I understand that the interpretation for the Normal Allocation case (cudaMalloc()) is well-defined and documented.

Below I lay out my understanding of the Unified Memory case.
Take this screenshot as an example:

Let the top-middle value, which is in the format {Used} / Total MiB, be called Overall Memory Usage.
Let the bottom-right value, which is in the format {Used} MiB, be called GPU Memory Usage.
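The same two values can also be read without a screenshot through nvidia-smi's CSV query mode. The Python sketch below assumes that `memory.used`/`memory.total` from `--query-gpu` correspond to Overall Memory Usage and that `used_memory` from `--query-compute-apps` corresponds to the per-process GPU Memory Usage column. It also assumes a single GPU and purely numeric fields. The helper names are hypothetical.

```python
import subprocess


def parse_overall(csv_text):
    """Parse `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader,nounits` output into [(used_mib, total_mib), ...],
    one tuple per GPU line."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def parse_per_process(csv_text):
    """Parse `nvidia-smi --query-compute-apps=pid,used_memory
    --format=csv,noheader,nounits` output into {pid: used_mib}.
    Assumes numeric values; entries such as [N/A] would raise ValueError."""
    usage = {}
    for line in csv_text.strip().splitlines():
        pid, used = (int(field) for field in line.split(","))
        usage[pid] = used
    return usage


def query(args):
    """Run nvidia-smi with the given query arguments and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi", *args, "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a single-GPU machine, `parse_overall(query(["--query-gpu=memory.used,memory.total"]))[0]` would give the Overall Memory Usage pair, and `parse_per_process(query(["--query-compute-apps=pid,used_memory"]))` the per-process values.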

My understanding is that Overall Memory Usage (the top value) represents the total GPU memory occupied by data from all processes (CUDA contexts), and that there is no way to know how much of that data is owned by which process (context).

GPU Memory Usage (the bottom, per-process value) represents the size of the CUDA context (typically a few hundred MiB). The size of the Unified Memory allocations made by the process is not included.

So, my interpretation of the screenshot is as follows:

From the output of nvidia-smi, we can tell that the process with PID 13987 has ~5003 MiB (5900 - 897) of data resident on the GPU, either fetched explicitly (prefetched) or brought in via page faults. Here we disregard the internally reserved memory, which is on the order of tens of MiB. We can't tell how much Unified Memory the process has allocated in total.


I am also interested in the case where more than one application is running, as in the screenshot below:

In this case, my interpretation is as follows:

We can tell that the two processes with PIDs 14229 and 21047 together have ~9867 MiB (11661 - 2*897) of data resident on the GPU, either fetched explicitly (prefetched) or brought in via page faults. We can't tell how much of it belongs to each process, nor how much Unified Memory either process has allocated in total.
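The arithmetic used for both screenshots can be written as one small function. This is a sketch of the hypothesis stated above (per-process value = CUDA context only; overall value = everything resident on the device), not a documented nvidia-smi formula. The function name and the `reserved_mib` parameter are hypothetical.

```python
def estimate_resident_managed_mib(overall_used_mib, per_process_mib, reserved_mib=0):
    """Estimate the MiB of data resident on the GPU that nvidia-smi does not
    attribute to any process, assuming the per-process values cover only the
    CUDA contexts and the overall value covers everything on the device.

    The result is a combined figure for all listed processes; it cannot be
    split per process. `reserved_mib` is the driver's internal reservation
    (roughly tens of MiB), which is disregarded by default.
    """
    estimate = overall_used_mib - sum(per_process_mib) - reserved_mib
    if estimate < 0:
        raise ValueError("per-process values exceed overall usage; inconsistent snapshot")
    return estimate
```

With the numbers from the screenshots, `estimate_resident_managed_mib(5900, [897])` returns 5003 and `estimate_resident_managed_mib(11661, [897, 897])` returns 9867, matching the estimates above.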


If possible, I would like an authoritative answer on this matter from NVIDIA.

It seems other people are having similar questions.
(Nvidia-smi shows high global memory usage, but low in the only process)

In their case, it simply leads to confusion, so I definitely think someone from NVIDIA ought to provide an answer on this.

A large number of applications use Unified Memory, and not being able to interpret the memory usage reported by nvidia-smi with certainty is detrimental.

I think this is rooted in the fact that the driver pages data into GPU memory on demand, depending on whether a page is currently being accessed by the CPU or the GPU. I believe the per-device total accurately shows how much physical memory on the device is currently in use.

However, the per-process summary does not reflect how many of the application's managed memory pages currently reside on the GPU. Based on my observations, I think the per-process figure may not include managed memory at all.

Yes, that is in line with my observations above.

So, we have to wait for an authoritative answer from NVIDIA then.
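The observation that the per-process figure may not include managed memory at all could be checked empirically: take one snapshot of both figures before and one after the application migrates a known amount of managed memory to the GPU (for example with `cudaMemPrefetchAsync()`), then compare how each figure moved. The sketch below only does the comparison. The snapshot layout, the function name and the 16 MiB tolerance are assumptions chosen for illustration.

```python
def compare_snapshots(before, after, pid, tolerance_mib=16):
    """Compare two nvidia-smi snapshots taken around a managed-memory migration.

    Each snapshot is (overall_used_mib, {pid: per_process_mib}).
    Returns (overall_delta, process_delta, includes_managed):
      - includes_managed is True if the process's figure grew by about as much
        as the overall figure (within tolerance_mib), False if it did not,
        and None if too little memory moved to tell.
    """
    overall_delta = after[0] - before[0]
    process_delta = after[1].get(pid, 0) - before[1].get(pid, 0)
    if overall_delta <= tolerance_mib:
        includes_managed = None  # nothing measurable migrated; inconclusive
    else:
        includes_managed = abs(overall_delta - process_delta) <= tolerance_mib
    return overall_delta, process_delta, includes_managed
```

If the overall figure grows by roughly the prefetched size while the process's own figure stays flat, the last element is False, which would support the observation; it is None when too little memory moved to draw a conclusion.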