Question about CUDA streams

ali112975 · November 7, 2019, 10:06pm

Hi all,

New CUDA programmer here that still has much to learn. I have a piece of CUDA Fortran code that launches a number of kernels in separate streams:

call kernel1 <<<*, *, , stream(1)>>>
call kernel2 <<<, *, , stream(2)>>>
call kernel3 <<<, *, , stream(3)>>>
call kernel4 <<<, *, *, stream(4)>>>

I then have some asynchronous memory copies from device to host:

istat = cudaMemcpyAsync(*, , * , stream(1))
istat = cudaMemcpyAsync(, , * , stream(2))
istat = cudaMemcpyAsync(, , * , stream(3))
istat = cudaMemcpyAsync(, *, * , stream(4))

Now, I want to launch another kernel immediately after the above statements:

call kernel5 <<<*, *, *, stream(5)>>>

But I want to make sure kernels 1 to 4 complete first and initiate the asynchronous memory copies, before kernel5 starts running.

Is there any way to accomplish this without doing a cudaStreamSynchronize on streams 1 to 4?

If I put a cudaStreamSynchronize after kernels 1 to 4, then the async memory copies will wait until those kernels complete. I want those memory copies to start ASAP and possibly overlap with the kernels.

Any suggestions?

Robert_Crovella · November 7, 2019, 11:03pm

put the synchronize call after the cudaMemcpyAsync calls

tera · November 7, 2019, 11:06pm

You can put events into streams 1 to 4 after the kernels and before the memcpys, and synchronize stream 5 on those events.

ali112975 · November 8, 2019, 3:55pm

Thanks! It seems events are what I exactly need here.

ali112975 · November 8, 2019, 5:39pm

So, I have a follow-up question:

Let’s say we have a few more kernels, like kernel6, kernel7, kernel8, that can be launched in separate streams such as stream(6), stream(7), stream(8), in parallel with kernel5 running in stream(5).

I guess we could synchronize streams 5 to 8 individually for the events placed in streams 1 to 4. Is there a better alternative or a shortcut for doing that?

Robert_Crovella · November 8, 2019, 6:37pm

what I said in comment 2 would support that scenario as well.

put a cudaDeviceSynchronize() after the async memcpy calls. Then launch your kernel 5-9999 in streams 5-9999 and none of them will start until the async memcpy calls are complete.

If you need fine-grained control (not obvious from your description so far) then use the method indicated by tera. Yes, every one of your streams that has dependencies on previous work will need an event to synchronize on (or multiple events, for multiple previous stream dependencies).

ali112975 · November 8, 2019, 7:35pm

Thanks, I should have mentioned that I want all kernels launched after the async memcpy calls to overlap with the memory copies. Perhaps that was not very clear previously.

tera · November 8, 2019, 7:49pm

Instead of synchronizing individual streams via cudaStreamWaitEvent() you could synchronize host execution via cudaEventSynchronize() before launching any more kernels from the host side. That would be relatively simple, but probably not particularly efficient.

If you only want to work with cudaStreamWaitEvent(), I suppose you can make stream 5 synchronize on streams 1 to 4, record an event in stream 5, and then synchronize streams 6 to 9999 only on the event in stream 5.

ali112975 · November 8, 2019, 8:02pm

Thanks again! I think cudaStreamWaitEvent is preferable over cudaEventSynchronize in this particular case, so your suggestion about recording an event in stream 5 should work well here.

Topic		Replies	Views
Syncronization with cuda Streams CUDA Programming and Performance cuda	8	518	October 12, 2021
cudaStreamSynchronize(a_stream) simpleStreams CUDA Programming and Performance	2	2436	December 2, 2010
I want to synchronize CUDA streams CUDA Programming and Performance	5	1002	January 5, 2024
About Stream control CUDA Programming and Performance	1	986	March 26, 2009
Overlap cudaMemcpyAsync and kernel CUDA Programming and Performance	1	533	February 10, 2021
CUDA stream management CUDA Programming and Performance	1	497	December 15, 2016
Kernel Synchronization in CUDA not fully explained in programming guild CUDA Programming and Performance	1	10677	February 25, 2010
sanity check: when do I need to synchronize kernel launches? CUDA Programming and Performance	1	567	February 2, 2018
Fail to sync the cudaMemcpyAsync using the cudaEvent in two streams CUDA Programming and Performance	4	309	April 1, 2024
Kernel executed in non-default CUDA stream waits for other streams to complete cudaMemcpyAsync CUDA Programming and Performance cuda	15	391	November 18, 2024

Question about CUDA streams

Related topics