Timing With Streams


I am writing a piece of code that uses streams to help hide the memory-transfer overhead.

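// stream[] is assumed to be an array of cudaStream_t created with cudaStreamCreate()
for (int i = 0; i < MaxStreams; i++) {
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, stream[i]);  // [A]
    kernel<<<blocks, threads, 0, stream[i]>>>(d_a, value);               // [B]
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, stream[i]);  // [C]
}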

As far as I understand, this will do the following:

[Stream 0] A_B_C
[Stream 1] A__B_C
[Stream 2] A___B_C

Is this correct, and if so, how do I set up CUDA events to show the time taken to complete A-C in each stream?
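
For reference, here is a minimal sketch of the kind of event setup I have in mind; it may not be the right approach. It records one start/stop event pair per stream around A-C and reads the elapsed time with cudaEventElapsedTime. The names MAX_STREAMS, timeStreams, CHECK and the stand-in kernel body are hypothetical and made up for illustration. It also assumes each stream gets its own pinned host buffer and its own device buffer, which differs from the shared d_a and a in the loop above.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define MAX_STREAMS 3  // hypothetical stream count

// Hypothetical helper: bail out of the enclosing function on any CUDA error.
// (A sketch only: resources already allocated are not freed on this path.)
#define CHECK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical stand-in for the kernel in the question.
__global__ void kernel(int *d, int value, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d[idx] += value;
}

// Records a start/stop event pair around A-C in each stream and writes the
// elapsed milliseconds for each stream into ms[]. Returns 0 on success.
int timeStreams(float *ms)
{
    const int n = 1 << 20;
    const size_t nbytes = n * sizeof(int);
    int *a[MAX_STREAMS], *d_a[MAX_STREAMS];
    cudaStream_t stream[MAX_STREAMS];
    cudaEvent_t start[MAX_STREAMS], stop[MAX_STREAMS];

    for (int i = 0; i < MAX_STREAMS; i++) {
        // Pinned host memory, so the async copies can overlap with other work.
        CHECK(cudaMallocHost((void **)&a[i], nbytes));
        CHECK(cudaMalloc((void **)&d_a[i], nbytes));
        memset(a[i], 0, nbytes);
        CHECK(cudaStreamCreate(&stream[i]));
        CHECK(cudaEventCreate(&start[i]));
        CHECK(cudaEventCreate(&stop[i]));
    }

    for (int i = 0; i < MAX_STREAMS; i++) {
        // Events are recorded into the same stream as the work they bracket.
        CHECK(cudaEventRecord(start[i], stream[i]));
        CHECK(cudaMemcpyAsync(d_a[i], a[i], nbytes,
                              cudaMemcpyHostToDevice, stream[i]));   // [A]
        kernel<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_a[i], 1, n); // [B]
        CHECK(cudaMemcpyAsync(a[i], d_a[i], nbytes,
                              cudaMemcpyDeviceToHost, stream[i]));   // [C]
        CHECK(cudaEventRecord(stop[i], stream[i]));
    }

    for (int i = 0; i < MAX_STREAMS; i++) {
        // Wait for this stream's stop event, then read the elapsed time.
        CHECK(cudaEventSynchronize(stop[i]));
        CHECK(cudaEventElapsedTime(&ms[i], start[i], stop[i]));
    }
    CHECK(cudaGetLastError());  // surfaces any kernel launch error

    for (int i = 0; i < MAX_STREAMS; i++) {
        cudaEventDestroy(start[i]);
        cudaEventDestroy(stop[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_a[i]);
        cudaFreeHost(a[i]);
    }
    return 0;
}

int main()
{
    float ms[MAX_STREAMS];
    if (timeStreams(ms) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int i = 0; i < MAX_STREAMS; i++)
        printf("[Stream %d] A-C took %f ms\n", i, ms[i]);
    return 0;
}
```

As I understand it, each time covers the span from when the start event is reached in its stream to when the stop event is reached, so it would include any time the stream spends waiting for a copy engine or the GPU while other streams are busy.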