I’m trying to understand how the different functions get executed using streams.
Let’s say I have 5 streams, and in each one I call cudaMemcpyAsync to copy the inputs used by the kernel, then launch the kernel. This is what it looks like:
for (int i = 0; i < nStreams; i++) {
    // copy the inputs for stream i, then launch the kernel in the same stream
    cudaMemcpyAsync( … , streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>();
}
And this is what it looks like on the Timeline:
Here it looks as if, from the 2nd kernel on, every kernel uses the same dataset (the one copied by the 5th memcpy). Instead, I need each memcpy to wait until the previous kernel has been launched.
Do I need to specify this explicitly using events, or is it done implicitly inside the stream and I’m just reading the timeline wrong?
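For reference, here is a minimal self-contained sketch of the pattern described above. It is not the original code. The kernel `scaleKernel`, the buffer names, the chunk size, and the `CHECK` macro are all assumptions made up for illustration. Each stream gets its own slice of a pinned host buffer and its own slice of device memory, so a wrong result would show up if a kernel ever read another stream's data. Operations issued to the same stream run in issue order, so the kernel in `streams[i]` should not start before the copy in `streams[i]` has finished.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element of the chunk it is given.
__global__ void scaleKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Returns the number of wrong elements (0 means each kernel saw its own input).
int countMismatches()
{
    const int nStreams = 5;
    const int chunk = 1 << 16;                     // elements per stream (assumed)
    const size_t chunkBytes = chunk * sizeof(float);

    float *hostBuf = nullptr;
    float *devBuf = nullptr;
    // Pinned host memory, so that the async copies can actually overlap.
    CHECK(cudaMallocHost((void **)&hostBuf, nStreams * chunkBytes));
    CHECK(cudaMalloc((void **)&devBuf, nStreams * chunkBytes));

    // Chunk i is filled with the value i + 1, so each chunk is distinguishable.
    for (int i = 0; i < nStreams * chunk; i++) hostBuf[i] = (float)(i / chunk + 1);

    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; i++) CHECK(cudaStreamCreate(&streams[i]));

    for (int i = 0; i < nStreams; i++) {
        float *h = hostBuf + i * chunk;            // this stream's host slice
        float *d = devBuf + i * chunk;             // this stream's device slice
        // Copy in, compute, copy out: all three are ordered within streams[i].
        CHECK(cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[i]));
        scaleKernel<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d, chunk);
        CHECK(cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[i]));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());                // wait for all streams

    int mismatches = 0;
    for (int i = 0; i < nStreams * chunk; i++)
        if (hostBuf[i] != 2.0f * (float)(i / chunk + 1)) mismatches++;

    for (int i = 0; i < nStreams; i++) CHECK(cudaStreamDestroy(streams[i]));
    CHECK(cudaFree(devBuf));
    CHECK(cudaFreeHost(hostBuf));
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", countMismatches());
    return 0;
}
```

The sketch relies on separate buffers per stream. If all streams shared a single device buffer, a copy in one stream could overwrite data that a kernel in another stream is still reading, because there is no implicit ordering between different streams. In that case, recording an event after the kernel with cudaEventRecord and making the other stream wait on it with cudaStreamWaitEvent would be one way to express the dependency.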