I’m sorting a std::vector with std::sort and the par_unseq execution policy, compiled with nvc++ -stdpar.
nvidia-smi reports the per-process “GPU Memory Usage” as roughly the size of the vector,
but it reports the device-wide “Memory-Usage” as about twice that.
What is the difference between these memory usages?
The unified-memory profiling results show the expected “Host To Device” transfer size.
When the vector is larger than half of the GPU memory, nvprof reports high values for both the “Host To Device” and “Device To Host” metrics, and the device-wide Memory-Usage is close to 100%.
The NVIDIA HPC SDK version is 21.1.
Do I understand correctly that the per-process “GPU Memory Usage” is the temporary buffer used by the radix sort, which in this case happens to equal the size of the unified-memory allocation?