Hi,
I’m trying to understand how different operations get executed when using streams.
Let’s say I have 5 streams; in each one I have a call to cudaMemcpyAsync that copies the inputs used by the kernel. This is what it looks like:
for (int i = 0; i < nStreams; i++)
{
    // Copy this stream's input, then launch the kernel that consumes it.
    cudaMemcpyAsync( … , streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>();
}
And this is what it looks like on the timeline:
From the 2nd kernel on, it looks like the dataset being used is the same one (the one from the 5th memcpy). What I need instead is for each memcpy to wait until the previous kernel has been launched.
Do I need to enforce this with events, or is it done implicitly inside the stream and I’m just reading the timeline wrong?
So just launch everything in your loop into the same stream.
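Something along these lines (a minimal sketch, not code from the thread; kernel, d_in, h_in, bytes, grid, and block are placeholder names standing in for your own setup):

#include <cuda_runtime.h>

__global__ void kernel(const float *in) { /* ... consume in ... */ }

// d_in: single shared device buffer; h_in[i]: host input for chunk i.
void runSingleStream(float *d_in, float **h_in, size_t bytes, int nChunks,
                     dim3 grid, dim3 block)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < nChunks; i++)
    {
        // Within a single stream, work runs in issue order: this copy
        // cannot begin until kernel i-1 has finished, so the shared
        // buffer is never overwritten while still in use.
        cudaMemcpyAsync(d_in, h_in[i], bytes, cudaMemcpyHostToDevice, stream);
        kernel<<<grid, block, 0, stream>>>(d_in);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}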
But the goal was to hide the various memcpys behind the kernel executions.
That seems to conflict with your requirement that each memcpy wait for the previous kernel.
Oh, I meant that I need memcpy[i+1] to be executed right after kernel[i] gets launched: while kernel i runs, I load the data for the next one.
How could kernel 2 be using the data copied by memcpy 5? Are all the memcpy operations copying data into the same destination buffer?
Yes, I’m using the same device-allocated memory for all the memcpys.
Then you can’t just wait for the previous kernel to start; you must wait for the previous kernel to finish, so that you know all the data has been consumed by that kernel before you overwrite it with new data. Which puts you into the scenario where what I suggested makes sense.
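If you did want to keep the five streams and express that dependency explicitly with events, as asked about above, it could look roughly like this (a hedged sketch assuming a single shared buffer d_in; all names are placeholders):

#include <vector>
#include <cuda_runtime.h>

__global__ void kernel(const float *in) { /* ... consume in ... */ }

void runWithEvents(cudaStream_t *streams, int nStreams, float *d_in,
                   float **h_in, size_t bytes, dim3 grid, dim3 block)
{
    // One "kernel finished" event per stream.
    std::vector<cudaEvent_t> done(nStreams);
    for (int i = 0; i < nStreams; i++)
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);

    for (int i = 0; i < nStreams; i++)
    {
        // Make this stream's copy wait until the previous kernel has
        // FINISHED (not just launched), since it reuses the same buffer.
        if (i > 0)
            cudaStreamWaitEvent(streams[i], done[i - 1], 0);
        cudaMemcpyAsync(d_in, h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        kernel<<<grid, block, 0, streams[i]>>>(d_in);
        // Mark this stream's kernel as complete for the next iteration.
        cudaEventRecord(done[i], streams[i]);
    }
}

Note that with one shared buffer this chains every operation into a strict sequence anyway, so it performs no better than the single-stream version.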
If you create 5 separate destination buffers, one for each kernel/stream, then what you have now should be fine, and you would indeed get the overlap of memcpy and compute that you desire.
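A minimal sketch of that multi-buffer variant (buffer and parameter names are placeholders, not code from the thread):

#include <vector>
#include <cuda_runtime.h>

__global__ void kernel(const float *in) { /* ... consume in ... */ }

void runMultiBuffer(cudaStream_t *streams, int nStreams, float **h_in,
                    size_t bytes, dim3 grid, dim3 block)
{
    // One destination buffer per stream, so no stream ever overwrites
    // data another stream is still consuming.
    std::vector<float*> d_in(nStreams);
    for (int i = 0; i < nStreams; i++)
        cudaMalloc(&d_in[i], bytes);

    for (int i = 0; i < nStreams; i++)
    {
        // Copy i+1 can now run on the copy engine while kernel i computes.
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
        kernel<<<grid, block, 0, streams[i]>>>(d_in[i]);
    }
}

One caveat: for the copies to actually overlap with compute, the host source memory generally has to be pinned (allocated with cudaMallocHost or registered with cudaHostRegister); cudaMemcpyAsync from pageable memory can fall back to synchronous behavior.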