Hi,
I am writing a piece of code that uses streams to help hide the overhead of host/device memory transfers.
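for (int i = 0; i < MaxStreams; i++)
{
    // Assumes streams[i] is a cudaStream_t created beforehand with cudaStreamCreate
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, streams[i]); // [A]
    kernel<<<blocks, threads, 0, streams[i]>>>(d_a, value);              // [B]
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, streams[i]); // [C]
}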
As far as I understand, this will execute as follows:
[Stream 0] A_B_C
[Stream 1] A__B_C
[Stream 2] A___B_C
…
Is this correct? If so, how do I set up CUDA events to measure the time taken to complete A through C in each stream?
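For reference, here is a minimal sketch of one way per-stream timing with CUDA events could be set up. It is not a definitive answer. It assumes each stream gets its own pinned host buffer and its own device buffer, and the names timeStreams, start, stop and the three-argument kernel are hypothetical stand-ins, not the code from the post. Error checking is left out for brevity. A start event is recorded in each stream before A and a stop event after C, and cudaEventElapsedTime then gives the time between the two.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the post: adds `value` to each element.
__global__ void kernel(int *d_a, int value, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d_a[idx] += value;
}

// Issues A (host-to-device copy), B (kernel) and C (device-to-host copy) in
// each stream and returns, per stream, the milliseconds between the start
// event (recorded before A) and the stop event (recorded after C).
std::vector<float> timeStreams(int numStreams, int n)
{
    const size_t nbytes = n * sizeof(int);
    std::vector<cudaStream_t> streams(numStreams);
    std::vector<cudaEvent_t> start(numStreams), stop(numStreams);
    std::vector<int *> a(numStreams), d_a(numStreams);

    for (int i = 0; i < numStreams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaEventCreate(&start[i]);
        cudaEventCreate(&stop[i]);
        // Pinned host memory, so the async copies can overlap with other work.
        cudaMallocHost((void **)&a[i], nbytes);
        cudaMalloc((void **)&d_a[i], nbytes);
        for (int j = 0; j < n; j++) a[i][j] = j;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    for (int i = 0; i < numStreams; i++) {
        cudaEventRecord(start[i], streams[i]);
        cudaMemcpyAsync(d_a[i], a[i], nbytes, cudaMemcpyHostToDevice, streams[i]); // [A]
        kernel<<<blocks, threads, 0, streams[i]>>>(d_a[i], 1, n);                 // [B]
        cudaMemcpyAsync(a[i], d_a[i], nbytes, cudaMemcpyDeviceToHost, streams[i]); // [C]
        cudaEventRecord(stop[i], streams[i]);
    }

    std::vector<float> ms(numStreams, 0.0f);
    for (int i = 0; i < numStreams; i++) {
        // Block the host until everything up to the stop event has finished.
        cudaEventSynchronize(stop[i]);
        cudaEventElapsedTime(&ms[i], start[i], stop[i]);
    }

    for (int i = 0; i < numStreams; i++) {
        cudaFreeHost(a[i]);
        cudaFree(d_a[i]);
        cudaEventDestroy(start[i]);
        cudaEventDestroy(stop[i]);
        cudaStreamDestroy(streams[i]);
    }
    return ms;
}

int main()
{
    std::vector<float> ms = timeStreams(4, 1 << 20);
    for (size_t i = 0; i < ms.size(); i++)
        printf("[Stream %zu] A-C took %f ms\n", i, ms[i]);
    return 0;
}
```

Because every start event is recorded at the head of its stream, the interval reported for a later stream may include time spent waiting for the copy engine or the GPU while earlier streams are being served, so the per-stream figures are not necessarily the sum of A, B and C in isolation.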