Hi all,
New CUDA programmer here who still has much to learn. I have a piece of CUDA Fortran code that launches several kernels in separate streams:
call kernel1 <<<*, *, *, stream(1)>>>
call kernel2 <<<*, *, *, stream(2)>>>
call kernel3 <<<*, *, *, stream(3)>>>
call kernel4 <<<*, *, *, stream(4)>>>
I then have some asynchronous memory copies from device to host:
istat = cudaMemcpyAsync(*, *, *, stream(1))
istat = cudaMemcpyAsync(*, *, *, stream(2))
istat = cudaMemcpyAsync(*, *, *, stream(3))
istat = cudaMemcpyAsync(*, *, *, stream(4))
Now, I want to launch another kernel immediately after the above statements:
call kernel5 <<<*, *, *, stream(5)>>>
But I want to make sure that kernels 1 to 4 have completed, and that the asynchronous memory copies have been initiated, before kernel5 starts running.
Is there any way to accomplish this without doing a cudaStreamSynchronize on streams 1 to 4?
If I put a cudaStreamSynchronize after kernels 1 to 4, the host blocks there, so the async memory copies cannot be issued until those kernels complete. I want those memory copies to start as soon as possible and, where possible, overlap with the kernels.
Any suggestions?
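
For reference, here is a minimal, untested sketch of one way the ordering described above might be expressed without any cudaStreamSynchronize, assuming the cudaEventRecord and cudaStreamWaitEvent routines from the cudafor module. An event is recorded in each of streams 1 to 4 right after its kernel, and stream 5 is made to wait on all four events before kernel5 is launched. The fill kernel, the array names (a_d, a_h, b_d), the sizes, and the launch configuration are all hypothetical stand-ins for the real kernels and data.

```fortran
module kernels_m
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in for kernel1..kernel5: fills one column with a value.
  attributes(global) subroutine fill(a, n, col, val)
    integer, value :: n, col
    real, value :: val
    real :: a(n, *)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i, col) = val
  end subroutine fill
end module kernels_m

program stream_deps
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n = 1024*1024, nk = 4, tpb = 256
  real, device, allocatable :: a_d(:,:), b_d(:,:)
  real, pinned, allocatable :: a_h(:,:)   ! pinned host memory so the copies can be asynchronous
  integer(kind=cuda_stream_kind) :: stream(nk+1)
  type(cudaEvent) :: done(nk)
  integer :: istat, i

  allocate(a_d(n,nk), b_d(n,1), a_h(n,nk))
  do i = 1, nk+1
    istat = cudaStreamCreate(stream(i))
  end do
  do i = 1, nk
    istat = cudaEventCreate(done(i))
  end do

  do i = 1, nk
    ! Kernel i runs in stream i.
    call fill<<<(n+tpb-1)/tpb, tpb, 0, stream(i)>>>(a_d, n, i, real(i))
    ! Mark the point in stream i where kernel i has finished.
    istat = cudaEventRecord(done(i), stream(i))
    ! The copy is queued behind kernel i in the same stream; the host does not block.
    istat = cudaMemcpyAsync(a_h(1,i), a_d(1,i), n, stream(i))
  end do

  ! Stream 5 waits on the device for all four events; the host does not block here.
  do i = 1, nk
    istat = cudaStreamWaitEvent(stream(nk+1), done(i), 0)
  end do
  ! Stand-in for kernel5: starts only after kernels 1 to 4 are done,
  ! and may overlap with the device-to-host copies.
  call fill<<<(n+tpb-1)/tpb, tpb, 0, stream(nk+1)>>>(b_d, n, 1, 5.0)

  ! Wait for all outstanding work before using a_h on the host.
  istat = cudaDeviceSynchronize()

  do i = 1, nk
    istat = cudaEventDestroy(done(i))
  end do
  do i = 1, nk+1
    istat = cudaStreamDestroy(stream(i))
  end do
  deallocate(a_d, b_d, a_h)
end program stream_deps
```

Because each event is recorded before the copy in its stream, kernel5 depends only on the kernels and not on the copies. If kernel5 also had to wait for the copies to finish, the events would presumably be recorded after each cudaMemcpyAsync instead.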