cudaStreamAttachMemAsync without the cost of prefetching on Tegra?

Since CUDA allocation and cudaFree are pretty expensive, I am trying to reuse the memory.

So for, say, 2 rounds of processing, instead of:
cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to Device => use in device => free => cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to Device => use in device => free
I want to do:
cudaMallocManaged(size, cudaMemAttachHost) => write data from host => sync to Device => use in device => reattach to Host => write data from host => sync to Device => use in device => free

However, for the (reattach to Host) step, i.e. changing a managed allocation from cudaMemAttachGlobal back to cudaMemAttachHost, I have to call cudaStreamAttachMemAsync and then cudaStreamSynchronize, which causes the data to be prefetched from the device into the CPU cache.
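To make the sequence concrete, here is a minimal sketch of the reuse pattern described above. The kernel `process`, the buffer size, and the `CHECK` macro are placeholders invented for illustration, and the use of a dedicated stream is an assumption; only the CUDA runtime calls themselves come from the question. It shows where the reattach happens; it does not avoid the prefetch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper, not part of the CUDA API.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for "use in device".
__global__ void process(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 20;  // assumed buffer size
    const int rounds = 2;      // "2 rounds" as in the question

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Allocate once, initially attached to the host.
    float* buf = nullptr;
    CHECK(cudaMallocManaged(&buf, n * sizeof(float), cudaMemAttachHost));

    for (int r = 0; r < rounds; ++r) {
        // Write data from host.
        for (size_t i = 0; i < n; ++i) buf[i] = (float)r;

        // "Sync to Device": make the buffer visible to the GPU.
        CHECK(cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachGlobal));

        // Use in device.
        process<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(buf, n);
        CHECK(cudaGetLastError());

        // "Reattach to Host": this attach + synchronize pair is the step
        // the question reports as triggering the unwanted prefetch.
        CHECK(cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachHost));
        CHECK(cudaStreamSynchronize(stream));
    }

    // Free once, after all rounds.
    CHECK(cudaFree(buf));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```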

Since I am only reusing the memory, that prefetch is unnecessary (the old contents of the buffer are simply not needed anymore), so it is wasted time. Is there any way to attach the memory to the host without the cost of prefetching?

Thanks.

Questions regarding NVIDIA’s embedded platforms usually receive better/faster answers in the sub-forums dedicated to them:

https://forums.developer.nvidia.com/c/agx-autonomous-machines/jetson-embedded-systems/70