Memory usage does not match between device and process after calling cudaMallocManaged + cudaMemPrefetchAsync

I found that after prefetching a block of memory allocated by cudaMallocManaged with cudaMemPrefetchAsync, the memory usage of the device increases but the memory usage of the process stays the same. I can reproduce it with the test code below. Is it designed to be like this?

system info:
Debian GNU/Linux 9
driver: 450.80.02
cuda: 11.0
gcc: 8.3.0

I’m not sure why there would be any expectation that process memory usage would change based on a call to cudaMemPrefetchAsync().

Because the prefetched memory was allocated by that process?

It confused me because cudaMallocManaged is called in a third-party library, and I can’t tell whether anything is going wrong in my program.

Anyway, thank you for your explanation! Maybe I just need a deeper understanding of Unified Memory!

My expectation is that a process reservation of the (host) memory will take place at the point of the cudaMallocManaged call. I wouldn’t expect migrating that memory from one place to another to change the process reservation, but I haven’t studied it closely.
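
For anyone who wants to check this on their own system, here is a minimal sketch of one way to watch both numbers. It is not the original poster's test code. It assumes Linux (it reads VmRSS from /proc/self/status), a GPU that supports prefetching of managed memory, and the CUDA 11/12 signature of cudaMemPrefetchAsync (pointer, size, destination device, stream). The CHECK macro, the rss_kib and report helpers, and the 256 MiB size are all made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: resident set size of this process in KiB,
// read from /proc/self/status (Linux only). Returns -1 if not found.
static long rss_kib() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {
            return std::strtol(line.c_str() + 6, nullptr, 10);
        }
    }
    return -1;
}

// Hypothetical helper: print free device memory and process RSS side by side.
static void report(const char *label) {
    size_t free_bytes = 0, total_bytes = 0;
    CHECK(cudaMemGetInfo(&free_bytes, &total_bytes));
    std::printf("%-28s device free: %zu MiB, process RSS: %ld KiB\n",
                label, free_bytes >> 20, rss_kib());
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB, an arbitrary size
    int device = 0;
    CHECK(cudaGetDevice(&device));

    report("start");

    char *buf = nullptr;
    CHECK(cudaMallocManaged(&buf, bytes));
    report("after cudaMallocManaged");

    // Migrate the managed allocation to the GPU and wait for it to finish.
    CHECK(cudaMemPrefetchAsync(buf, bytes, device, 0));
    CHECK(cudaDeviceSynchronize());
    report("after prefetch to device");

    // Migrate it back to host memory.
    CHECK(cudaMemPrefetchAsync(buf, bytes, cudaCpuDeviceId, 0));
    CHECK(cudaDeviceSynchronize());
    report("after prefetch to host");

    CHECK(cudaFree(buf));
    return 0;
}
```

Build it with something like nvcc -o prefetch_check prefetch_check.cu (the file name is arbitrary). The numbers it prints depend on the driver and GPU, so treat the output as an observation about one system rather than documented behavior.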