How to overlap execution of kernels in different streams with copy operations

Thank you. In this case, my previous posting about cudaMemcpyAsync explains your observation. The linked document says:

"For transfers from device memory to pageable host memory, the function will return only once the copy has completed."

This means that the second kernel launch is never submitted before the first memcpy has finished, because the host thread is blocked inside the cudaMemcpyAsync call.

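Try replacing malloc with cudaMallocHost, so that the host buffer is page-locked (pinned) and cudaMemcpyAsync can return without waiting for the copy to complete.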
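Since your actual code is not quoted here, the following is only a minimal sketch of what the change could look like. The kernel addOne, the function runOverlapDemo, the CHECK macro, and the use of two streams are all hypothetical names and choices made for illustration; only the CUDA runtime calls are real. It issues one kernel and one device-to-host copy per stream, with the host buffers allocated by cudaMallocHost instead of malloc.

```cuda
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel for illustration: adds 1.0f to every element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Issues a kernel plus a device-to-host copy in each of two streams.
// Returns the number of wrong results (0 on success), or -1 on a CUDA error.
int runOverlapDemo(int n) {
    const int kStreams = 2;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *h[kStreams];
    float *d[kStreams];
    cudaStream_t s[kStreams];

    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaMallocHost((void **)&h[k], bytes));  // pinned; replaces malloc
        CHECK(cudaMalloc((void **)&d[k], bytes));
        CHECK(cudaStreamCreate(&s[k]));
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaMemsetAsync(d[k], 0, bytes, s[k]));  // start from all zeros
        addOne<<<blocks, threads, 0, s[k]>>>(d[k], n);
        // With a pinned destination this call only enqueues the copy, so the
        // loop can go on to launch the next stream's kernel right away.
        CHECK(cudaMemcpyAsync(h[k], d[k], bytes, cudaMemcpyDeviceToHost, s[k]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for both streams to finish

    // Every element should now be 0 + 1 = 1.0f.
    int wrong = 0;
    for (int k = 0; k < kStreams; ++k)
        for (int i = 0; i < n; ++i)
            if (h[k][i] != 1.0f) ++wrong;

    for (int k = 0; k < kStreams; ++k) {
        cudaFreeHost(h[k]);  // pinned memory is released with cudaFreeHost, not free
        cudaFree(d[k]);
        cudaStreamDestroy(s[k]);
    }
    return wrong;
}
```

Calling runOverlapDemo(1 << 20) from main should return 0 if everything worked. Whether the second kernel really overlaps with the first copy still depends on the device, so it is worth confirming in a profiler timeline.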
