Hi,
I’m trying to understand how different operations get executed when using streams.
Let’s say I have 5 streams; in each one I have a call to cudaMemcpyAsync that copies the inputs used by the kernel. This is what it looks like:
for (int i = 0; i < nStreams; i++)
{
    // Copy this stream's input, then launch the kernel that consumes it.
    cudaMemcpyAsync( … , streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>();
}
And this is what it looks like on the timeline:
From the 2nd kernel on, it looks like the dataset being used is the same one (the one from the 5th memcpy). What I need instead is for each memcpy to wait until the previous kernel has been launched.
Do I need to enforce this with events, or is it done implicitly inside the stream and I’m just reading the timeline wrong?
So just launch everything in your loop into the same stream.
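Something along these lines (a minimal sketch, not code from the thread; kernel, d_in, h_in, bytes, grid, and block are placeholder names standing in for your own setup):

#include <cuda_runtime.h>

__global__ void kernel(const float *in) { /* ... consume in ... */ }

// d_in: single shared device buffer; h_in[i]: host input for chunk i.
void runSingleStream(float *d_in, float **h_in, size_t bytes, int nChunks,
                     dim3 grid, dim3 block)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < nChunks; i++)
    {
        // Within a single stream, work runs in issue order: this copy
        // cannot begin until kernel i-1 has finished, so the shared
        // buffer is never overwritten while still in use.
        cudaMemcpyAsync(d_in, h_in[i], bytes, cudaMemcpyHostToDevice, stream);
        kernel<<<grid, block, 0, stream>>>(d_in);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}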
But the goal was to hide the various memcpys behind the kernel executions.
That seems to conflict with your requirement that each memcpy wait for the previous kernel.
Oh, I meant that I need memcpy[i+1] to be executed right after kernel[i] gets launched: while kernel i runs, I load the data for the next one.
How could kernel 2 be using the data copied by memcpy 5? Are all the memcpy operations copying data into the same destination buffer?
Yes, I’m using the same device-allocated memory for all the memcpys.
Then you can’t just wait for the previous kernel to start; you must wait for the previous kernel to finish, so that you know all the data has been consumed by that kernel before you overwrite it with new data. Which puts you into the scenario where what I suggested makes sense.
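If you did want to keep the five streams and express that dependency explicitly with events, as asked about above, it could look roughly like this (a hedged sketch assuming a single shared buffer d_in; all names are placeholders):

#include <vector>
#include <cuda_runtime.h>

__global__ void kernel(const float *in) { /* ... consume in ... */ }

void runWithEvents(cudaStream_t *streams, int nStreams, float *d_in,
                   float **h_in, size_t bytes, dim3 grid, dim3 block)
{
    // One "kernel finished" event per stream.
    std::vector<cudaEvent_t> done(nStreams);
    for (int i = 0; i < nStreams; i++)
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);

    for (int i = 0; i < nStreams; i++)
    {
        // Make this stream's copy wait until the previous kernel has
        // FINISHED (not just launched), since it reuses the same buffer.
        if (i > 0)
            cudaStreamWaitEvent(streams[i], done[i - 1], 0);
        cudaMemcpyAsync(d_in, h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        kernel<<<grid, block, 0, streams[i]>>>(d_in);
        // Mark this stream's kernel as complete for the next iteration.
        cudaEventRecord(done[i], streams[i]);
    }
}

Note that with one shared buffer this chains every operation into a strict sequence anyway, so it performs no better than the single-stream version.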
If you create 5 separate destination buffers, one for each kernel/stream, then what you have now should be fine, and you would indeed get the overlap of memcpy and compute that you desire.
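A minimal sketch of that multi-buffer variant (buffer and parameter names are placeholders, not code from the thread):

#include <vector>
#include <cuda_runtime.h>

__global__ void kernel(const float *in) { /* ... consume in ... */ }

void runMultiBuffer(cudaStream_t *streams, int nStreams, float **h_in,
                    size_t bytes, dim3 grid, dim3 block)
{
    // One destination buffer per stream, so no stream ever overwrites
    // data another stream is still consuming.
    std::vector<float*> d_in(nStreams);
    for (int i = 0; i < nStreams; i++)
        cudaMalloc(&d_in[i], bytes);

    for (int i = 0; i < nStreams; i++)
    {
        // Copy i+1 can now run on the copy engine while kernel i computes.
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        kernel<<<grid, block, 0, streams[i]>>>(d_in[i]);
    }
}

One caveat: for the copies to actually overlap with compute, the host source memory generally has to be pinned (allocated with cudaMallocHost or registered with cudaHostRegister); cudaMemcpyAsync from pageable memory can fall back to synchronous behavior.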