I want to overlap host-to-device memory copy with the kernel executions on the device.
Here is what I’ve done:
- I use two separate non-default streams; one for the memory copy, and another one for the kernels.
- I use cudaMemcpyAsync.
- I use cudaHostAlloc for the memory allocation on the host to have the memory pinned.
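
For reference, the setup described in the list above can be sketched roughly as follows. This is a minimal sketch, not my actual code: `busyKernel`, `runOverlapSketch`, the buffer size, and the loop count are placeholders made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Return 1 from the enclosing function on any CUDA error (sketch-level handling,
// resources are not released on the error path).
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Hypothetical placeholder kernel standing in for the real workload.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int runOverlapSketch() {
    const int n = 1 << 20;                  // assumed element count
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Two separate non-default streams: one for copies, one for kernels.
    cudaStream_t copyStream, kernelStream;
    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaStreamCreate(&kernelStream));

    // Pinned host buffer, so cudaMemcpyAsync can run asynchronously.
    float *hostBuf = nullptr;
    CHECK(cudaHostAlloc((void **)&hostBuf, bytes, cudaHostAllocDefault));
    for (int i = 0; i < n; ++i) hostBuf[i] = 1.0f;

    float *devCopyDst = nullptr, *devWork = nullptr;
    CHECK(cudaMalloc((void **)&devCopyDst, bytes));
    CHECK(cudaMalloc((void **)&devWork, bytes));
    CHECK(cudaMemset(devWork, 0, bytes));

    // Issue kernels and host-to-device copies on their own streams.
    for (int iter = 0; iter < 4; ++iter) {  // assumed iteration count
        busyKernel<<<blocks, threads, 0, kernelStream>>>(devWork, n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(devCopyDst, hostBuf, bytes,
                              cudaMemcpyHostToDevice, copyStream));
    }

    CHECK(cudaStreamSynchronize(copyStream));
    CHECK(cudaStreamSynchronize(kernelStream));

    CHECK(cudaFreeHost(hostBuf));
    CHECK(cudaFree(devCopyDst));
    CHECK(cudaFree(devWork));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaStreamDestroy(kernelStream));
    return 0;
}

int main() {
    return runOverlapSketch();
}
```

In this sketch the pinned buffer is allocated once, before any kernel is launched. In my real code cudaHostAlloc is also called while kernels are still in flight, which is where the problem shows up.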
Using the Visual Profiler, I can see that the memory copy does happen on its own stream without blocking the kernels. BUT, the problem is that whenever cudaHostAlloc is called, kernel execution stalls until cudaHostAlloc returns.
As far as I could figure out, this is because cudaHostAlloc uses the default stream. So, one solution might be to create the kernels’ stream with the non-blocking flag. However, that does not work for me because I DO want the kernels’ stream to be blocking with respect to the default stream.
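
To be concrete about the flag I mean: a stream created with `cudaStreamCreate` synchronizes implicitly with the default stream, while one created with `cudaStreamCreateWithFlags(..., cudaStreamNonBlocking)` does not. The sketch below only creates one stream of each kind and reads the flags back; `streamFlagsSketch` is a made-up name, and the sketch does not show whether the non-blocking flag would actually avoid the stall around cudaHostAlloc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 0 if both streams were created and report the expected flags,
// 1 otherwise.
int streamFlagsSketch() {
    cudaStream_t blockingStream, nonBlockingStream;

    // Default behavior: blocking with respect to the default (NULL) stream.
    if (cudaStreamCreate(&blockingStream) != cudaSuccess) return 1;

    // Opt out of implicit synchronization with the default stream.
    if (cudaStreamCreateWithFlags(&nonBlockingStream, cudaStreamNonBlocking)
            != cudaSuccess) {
        cudaStreamDestroy(blockingStream);
        return 1;
    }

    // Read the flags back to confirm how each stream was created.
    unsigned int blockingFlags = 0, nonBlockingFlags = 0;
    bool ok =
        cudaStreamGetFlags(blockingStream, &blockingFlags) == cudaSuccess &&
        cudaStreamGetFlags(nonBlockingStream, &nonBlockingFlags) == cudaSuccess &&
        blockingFlags == cudaStreamDefault &&
        nonBlockingFlags == cudaStreamNonBlocking;

    printf("blocking stream flags: %u, non-blocking stream flags: %u\n",
           blockingFlags, nonBlockingFlags);

    cudaStreamDestroy(blockingStream);
    cudaStreamDestroy(nonBlockingStream);
    return ok ? 0 : 1;
}

int main() {
    return streamFlagsSketch();
}
```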
So, how can I tell CUDA to execute cudaHostAlloc on a specific stream, and not the default stream?