I found that after prefetching a block of memory allocated by cudaMallocManaged with cudaMemPrefetchAsync, the memory usage of the device increases but the memory usage of the process stays the same. I can reproduce this with the test code below. Is it designed to behave like this?
My expectation is that the process reservation of (host) memory takes place at the point of the cudaMallocManaged call. I wouldn't expect migrating that memory from one place to another to change the process reservation, but I haven't studied it closely.
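One way to check this would be to print the process and device numbers at each step. The following is only a rough sketch, not the original poster's test code. It assumes Linux (it reads VmRSS from /proc/self/status), a GPU that supports managed-memory prefetching, and the CUDA 12-style signature cudaMemPrefetchAsync(ptr, bytes, device, stream). The helper names read_vm_rss_kb and report, and the 256 MiB size, are made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical helper: resident set size of this process in KiB (Linux only).
static long read_vm_rss_kb() {
    FILE* f = std::fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    long kb = -1;
    while (std::fgets(line, sizeof line, f)) {
        if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
    }
    std::fclose(f);
    return kb;
}

// Hypothetical helper: print host RSS and free device memory for one step.
static void report(const char* step) {
    size_t free_b = 0, total_b = 0;
    CHECK(cudaMemGetInfo(&free_b, &total_b));
    std::printf("%-28s host RSS = %ld KiB, device free = %zu MiB\n",
                step, read_vm_rss_kb(), free_b >> 20);
}

int main() {
    const size_t bytes = size_t(256) << 20;  // 256 MiB, arbitrary test size
    int dev = 0;
    CHECK(cudaGetDevice(&dev));
    report("start");

    char* p = nullptr;
    CHECK(cudaMallocManaged(&p, bytes));
    report("after cudaMallocManaged");

    std::memset(p, 1, bytes);                // touch every page on the host
    report("after host touch");

    // Assumes the CUDA 12-style signature; newer toolkits may differ.
    CHECK(cudaMemPrefetchAsync(p, bytes, dev, 0));
    CHECK(cudaDeviceSynchronize());
    report("after prefetch to device");

    CHECK(cudaFree(p));
    report("after cudaFree");
    return 0;
}
```

Comparing the lines printed before and after the prefetch should show whether only the device-side number moves on a given system. Note that VmRSS measures resident pages rather than reserved virtual address space, so it may not be the same figure as the "memory usage of process" in the question.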