“Memory-Usage” on the GPU is twice the “GPU Memory Usage” of the process

I’m sorting a std::vector with std::sort and the par_unseq execution policy, compiled with nvc++ -stdpar.

nvidia-smi reports the per-process “GPU Memory Usage” as roughly the size of the vector,
but it also reports the overall “Memory-Usage” on the GPU as roughly twice that.
What is the difference between these memory usages?

The unified memory profiling results show the expected “Host To Device” transfer size.
When the vector is larger than half of the GPU memory, nvprof reports large “Host To Device” and “Device To Host” transfer totals, and the Memory-Usage on the GPU is almost 100%.
The HPC SDK version is 21.1.

The sort uses a radix sort, which does need temporary GPU memory the same size as the original data.
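
In other words, the device needs roughly twice the payload: the managed copy of the vector plus an equally sized scratch buffer. A rough pre-flight check along those lines (just a sketch; the helper name is made up, and it assumes the CUDA runtime API headers are available to the program):

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime_api.h>

// Heuristic only: the managed copy of the data plus the radix sort's
// temporary buffer need roughly 2x the payload in device memory.
bool sort_likely_fits(std::size_t payload_bytes)
{
    std::size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return false;
    std::printf("free: %zu MiB, needed: ~%zu MiB\n",
                free_bytes >> 20, (2 * payload_bytes) >> 20);
    return 2 * payload_bytes < free_bytes;
}
```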

I figured that out from the nvprof output. Is the implementation of the algorithms documented anywhere to check the memory requirements beforehand?

Also the other question still stands: why doesn’t the “GPU Memory Usage” of the process include that additional memory?

PS: I received another reply by email requesting a reproducing example, so here it is. Just adjust the size to better fit the GPU.
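
A minimal sketch along those lines (the element count is illustrative; compiled with nvc++ -stdpar):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <random>
#include <vector>

int main()
{
    // Illustrative size: 2^29 floats = 2 GiB; adjust to the GPU's memory.
    const std::size_t n = std::size_t{1} << 29;

    std::vector<float> v(n);
    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (auto &x : v)
        x = dist(gen);

    // With nvc++ -stdpar this runs on the GPU; the radix sort's temporary
    // buffer is the extra device memory beyond the managed vector itself.
    std::sort(std::execution::par_unseq, v.begin(), v.end());

    return std::is_sorted(v.begin(), v.end()) ? 0 : 1;
}
```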

Is the implementation of the algorithms documented anywhere to check the memory requirements beforehand?

Our C++17 stdpar implementation is built on top of Thrust, but I don’t see anything in their documentation regarding memory usage.
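
One way to check empirically is to call Thrust directly and attach a custom allocator to the execution policy; Thrust routes its temporary allocations through it, so the requested sizes can be logged. A sketch (this uses plain Thrust and CUDA rather than the stdpar path, and the allocator name is made up):

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Thrust asks this allocator for every temporary buffer, so the printed
// sizes show the algorithm's scratch-space requirements.
struct logging_allocator
{
    using value_type = char;

    char *allocate(std::ptrdiff_t num_bytes)
    {
        std::printf("temporary allocation: %td bytes\n", num_bytes);
        char *ptr = nullptr;
        cudaMalloc(&ptr, num_bytes);
        return ptr;
    }

    void deallocate(char *ptr, std::size_t) { cudaFree(ptr); }
};

int main()
{
    // The values don't matter for the allocation sizes, only the element count.
    thrust::device_vector<float> d(std::size_t{1} << 26); // 256 MiB of floats

    logging_allocator alloc;
    thrust::sort(thrust::cuda::par(alloc), d.begin(), d.end());
    return 0;
}
```

Built with nvcc (or nvc++ -cuda), this should print a temporary allocation comparable to the size of the data being sorted.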

Also the other question still stands: why doesn’t the “GPU Memory Usage” of the process include that additional memory?

Unified memory usage isn’t included in what is shown per process via nvidia-smi.

Do I understand correctly that the “GPU Memory Usage” of the process is the temporary memory used by the radix sort, which happens to be equal to the unified memory size in this case?

That’s our understanding, though if you need a more definitive answer, we’d need to double check with the Thrust folks.

Thank you for the clarification.
I suppose this topic can be closed now.