Since cudaMallocManaged and cudaFree are pretty expensive, I am trying to reuse the memory.
So, taking 2 rounds of processing as an example:
Instead of
cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to Device => use in device => free => cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to Device => use in device => free,
I want to:
cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to Device => use in device => reattach to Host => write data from host => sync to Device => use in device => free
However, for the (reattach to Host) step, changing a managed allocation from cudaMemAttachGlobal to cudaMemAttachHost requires calling cudaStreamAttachMemAsync and then cudaStreamSynchronize, which causes the data to be prefetched from the device to the CPU cache.
But since I am only reusing the memory, that prefetch is unnecessary (the contents of that memory are simply not needed anymore), so it is a waste of time. Is there any way to attach the memory to the host without the cost of prefetching?
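For reference, the reuse flow described above can be sketched roughly as follows. This is a minimal sketch, not my real code: the kernel `process`, the helper `run_rounds`, the `CHECK` macro, and the single-stream setup are all placeholder assumptions, and "sync to Device" is assumed here to mean attaching the allocation with cudaMemAttachGlobal.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Return early on the first CUDA error (resources leak on failure; acceptable for a sketch).
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) return e_; } while (0)

// Hypothetical kernel standing in for the real device-side work.
__global__ void process(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: runs `rounds` rounds on ONE managed buffer of n floats.
cudaError_t run_rounds(size_t n, int rounds) {
    cudaStream_t stream;
    float *buf = nullptr;
    CHECK(cudaStreamCreate(&stream));

    // Allocate once, initially attached to the host.
    CHECK(cudaMallocManaged(&buf, n * sizeof(float), cudaMemAttachHost));

    for (int r = 0; r < rounds; ++r) {
        // Write data from host (the buffer is host-attached at this point).
        for (size_t i = 0; i < n; ++i) buf[i] = (float)(r + 1);

        // "Sync to Device": make the whole allocation accessible to the device
        // (length 0 means the entire allocation).
        CHECK(cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachGlobal));

        // Use in device.
        unsigned int blocks = (unsigned int)((n + 255) / 256);
        process<<<blocks, 256, 0, stream>>>(buf, n);
        CHECK(cudaGetLastError());

        // Reattach to Host: this is the step whose cost I would like to avoid,
        // since the old contents are about to be overwritten anyway.
        CHECK(cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachHost));
        CHECK(cudaStreamSynchronize(stream));
    }

    // Free once, after the last round.
    CHECK(cudaFree(buf));
    CHECK(cudaStreamDestroy(stream));
    return cudaSuccess;
}
```

Calling `run_rounds(1 << 20, 2)` corresponds to the 2-round example above: one allocation, one free, and a reattach at the end of each round.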
Thanks.