Memory usage doesn't match between device and process after calling cudaMallocManaged + cudaMemPrefetchAsync

I found that after prefetching a block of memory allocated by cudaMallocManaged with cudaMemPrefetchAsync, the memory usage of the device increases but the memory usage of the process stays the same. I can reproduce it with the test code below. Is it designed to be like this?

test code
test.cu (5.6 KB)

system info:
Debian GNU/Linux 9
driver: 450.80.02
cuda: 11.0
nvcc:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
gcc: 8.3.0

I’m not sure why there would be any expectation that process memory usage would change based on a call to cudaMemPrefetchAsync().

Because the memory that was prefetched was allocated by that process?

It confused me because cudaMallocManaged is called in a third-party library, and I can’t tell whether anything is going wrong in my program.

Anyway, thank you for your explanation! Maybe I just need a deeper understanding of Unified Memory!

My expectation is that a process reservation of the (host) memory will take place at the point of the cudaMallocManaged call. I wouldn’t expect that migrating that memory from one place to another would change the process reservation, but I haven’t studied it closely.
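
For anyone who wants to check this on their own system, here is a minimal sketch (not the attached test.cu) that prints device memory usage next to the process's own figures at each stage. It makes several assumptions: Linux, so that /proc/self/status is available; a GPU that supports prefetching of managed memory; and the CUDA 11/12 form of cudaMemPrefetchAsync that takes an int destination device. The helper read_status_kb, the CHECK macro, and the 256 MiB size are made up for illustration. I have deliberately not included expected numbers, since they will depend on the driver and GPU.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: read a "<key>:  <n> kB" field from
// /proc/self/status (Linux only). Returns -1 if the field is not found.
static long read_status_kb(const char *key) {
    FILE *f = std::fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    long kb = -1;
    const size_t klen = std::strlen(key);
    while (std::fgets(line, sizeof line, f)) {
        if (std::strncmp(line, key, klen) == 0 && line[klen] == ':') {
            std::sscanf(line + klen + 1, "%ld", &kb);
            break;
        }
    }
    std::fclose(f);
    return kb;
}

// Print device-wide used memory alongside this process's VmSize and VmRSS.
static void report(const char *stage) {
    size_t free_b = 0, total_b = 0;
    CHECK(cudaMemGetInfo(&free_b, &total_b));
    std::printf("%-26s device used: %6zu MiB | VmSize: %9ld kB | VmRSS: %9ld kB\n",
                stage, (total_b - free_b) >> 20,
                read_status_kb("VmSize"), read_status_kb("VmRSS"));
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB, arbitrary test size
    const int dev = 0;                // assumes device 0 is the GPU under test

    CHECK(cudaSetDevice(dev));
    CHECK(cudaFree(0));               // force context creation before measuring
    report("after context init");

    char *p = nullptr;
    CHECK(cudaMallocManaged(&p, bytes));
    report("after cudaMallocManaged");

    std::memset(p, 1, bytes);         // touch every page from the host
    report("after host touch");

    CHECK(cudaMemPrefetchAsync(p, bytes, dev, 0));
    CHECK(cudaDeviceSynchronize());
    report("after prefetch to device");

    CHECK(cudaMemPrefetchAsync(p, bytes, cudaCpuDeviceId, 0));
    CHECK(cudaDeviceSynchronize());
    report("after prefetch to host");

    CHECK(cudaFree(p));
    return 0;
}
```

Build with something like nvcc -o memcheck memcheck.cu (the file name is arbitrary). Note that cudaMemGetInfo reports free memory for the whole device, so other processes using the GPU will also affect the "device used" column.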