Question about CUDA streams

Hi all,

New CUDA programmer here that still has much to learn. I have a piece of CUDA Fortran code that launches a number of kernels in separate streams:

call kernel1 <<<*, *, , stream(1)>>>
call kernel2 <<<
, *, , stream(2)>>>
call kernel3 <<<
, *, , stream(3)>>>
call kernel4 <<<
, *, *, stream(4)>>>

I then have some asynchronous memory copies from device to host:

istat = cudaMemcpyAsync(*, , * , stream(1))
istat = cudaMemcpyAsync(
, , * , stream(2))
istat = cudaMemcpyAsync(
, , * , stream(3))
istat = cudaMemcpyAsync(
, *, * , stream(4))

Now, I want to launch another kernel immediately after the above statements:

call kernel5 <<<*, *, *, stream(5)>>>

But I want to make sure kernels 1 to 4 complete first and initiate the asynchronous memory copies, before kernel5 starts running.

Is there any way to accomplish this without doing a cudaStreamSynchronize on streams 1 to 4?

If I put a cudaStreamSynchronize after kernels 1 to 4, then the async memory copies will wait until those kernels complete. I want those memory copies to start ASAP and possibly overlap with the kernels.

Any suggestions?

put the synchronize call after the cudaMemcpyAsync calls

You can put events into streams 1 to 4 after the kernels and before the memcpys, and synchronize stream 5 on those events.

Thanks! It seems events are what I exactly need here.

So, I have a follow-up question:

Let’s say we have a few more kernels, like kernel6, kernel7, kernel8, that can be launched in separate streams such as stream(6), stream(7), stream(8), in parallel with kernel5 running in stream(5).

I guess we could synchronize streams 5 to 8 individually for the events placed in streams 1 to 4. Is there a better alternative or a shortcut for doing that?

what I said in comment 2 would support that scenario as well.

put a cudaDeviceSynchronize() after the async memcpy calls. Then launch your kernel 5-9999 in streams 5-9999 and none of them will start until the async memcpy calls are complete.

If you need fine-grained control (not obvious from your description so far) then use the method indicated by tera. Yes, every one of your streams that has dependencies on previous work will need an event to synchronize on (or multiple events, for multiple previous stream dependencies).

Thanks, I should have mentioned that I want all kernels launched after the async memcpy calls to overlap with the memory copies. Perhaps that was not very clear previously.

Instead of synchronizing individual streams via cudaStreamWaitEvent() you could synchronize host execution via cudaEventSynchronize() before launching any more kernels from the host side. That would be relatively simple, but probably not particularly efficient.

If you only want to work with cudaStreamWaitEvent(), I suppose you can make stream 5 synchronize on streams 1 to 4, record an event in stream 5, and then synchronize streams 6 to 9999 only on the event in stream 5.

Thanks again! I think cudaStreamWaitEvent is preferable over cudaEventSynchronize in this particular case, so your suggestion about recording an event in stream 5 should work well here.