The impact of cudaMalloc() and cudaFree() on the overlapping of kernel executions and data transfers

Hi, I have a question about the impact of cudaMalloc and cudaFree on asynchronous execution.

According to the CUDA C Programming Guide, both cudaMalloc and cudaFree are synchronous. However, in my experiments, if I only call cudaMalloc, it does not appear to affect the overlapping of kernel executions and data transfers. If I add the corresponding cudaFree, the overlap is lost and the asynchronous execution is affected. Is that correct?

Also, my situation is that I have a large device-to-host data transfer that I want to overlap with a complex function. (The function contains cudaMalloc and cudaFree calls, kernel executions, and data transfers.) To achieve this, do I need to remove all cudaMalloc calls inside the function? That would mean pre-allocating the device memory. Also, because of resource limits, I would need to remove the data transfers from the device to the host as well. Is there a better solution or workaround?
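
To make the pre-allocation idea concrete, here is a minimal sketch of what I have in mind. It is not my real code: the kernel `scale`, the helper `overlap_demo`, the `CHECK` macro, and the buffer names are all made up for illustration. It assumes a pinned host buffer (cudaMallocHost) and two non-default streams, since as far as I understand cudaMemcpyAsync only overlaps with kernels under those conditions. All allocations happen before the overlapped region and all frees happen after it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernels inside the "complex function".
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Illustrative error check: report the failing call and bail out.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return false;                                             \
        }                                                             \
    } while (0)

bool overlap_demo(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_result = nullptr, *d_work = nullptr, *h_result = nullptr;
    cudaStream_t copyStream, computeStream;

    // Allocate everything up front, outside the region that should overlap.
    CHECK(cudaMalloc(&d_result, bytes));
    CHECK(cudaMalloc(&d_work, bytes));
    CHECK(cudaMallocHost(&h_result, bytes));  // pinned host memory
    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaStreamCreate(&computeStream));
    CHECK(cudaMemset(d_result, 0, bytes));
    CHECK(cudaMemset(d_work, 0, bytes));

    // Device-to-host transfer in one stream...
    CHECK(cudaMemcpyAsync(h_result, d_result, bytes,
                          cudaMemcpyDeviceToHost, copyStream));

    // ...while the kernel runs in another stream on a different buffer.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, computeStream>>>(d_work, n, 2.0f);
    CHECK(cudaGetLastError());

    // Wait for both streams before touching the results or freeing memory.
    CHECK(cudaStreamSynchronize(copyStream));
    CHECK(cudaStreamSynchronize(computeStream));

    // Free only after the overlapped region has finished.
    CHECK(cudaFree(d_result));
    CHECK(cudaFree(d_work));
    CHECK(cudaFreeHost(h_result));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaStreamDestroy(computeStream));
    return true;
}

int main() {
    return overlap_demo(1 << 20) ? 0 : 1;
}
```

Whether the copy and the kernel really overlap would still have to be checked in a profiler timeline; the sketch only shows the ordering of allocation, asynchronous work, and deallocation that I am asking about.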

Thanks a lot!