Performance variation when using asynchronous calls

Hi. I’m testing asynchronous behavior using 2 streams.

I set up a test with the following parameters.
The host->device copy, the kernel execution, and the device->host copy each take about 10 ms.
The kernel uses 2^12 blocks, and each block uses 256 threads.

With the 2 streams overlapped, I expected the entire job to take 40 ms
(because kernel execution and copying can run simultaneously, so stream 1 should finish just one 10 ms stage behind stream 0's 30 ms).
However, when I tested on a Tesla C2050 GPU, I found that the order of the function calls changes the total time.
For example:

cudaMemcpyAsync(stream 0, …, HostToDevice)
kernel<<<…,stream 0>>>(…)
cudaMemcpyAsync(stream 0, …, DeviceToHost)
cudaMemcpyAsync(stream 1, …, HostToDevice)
kernel<<<…,stream 1>>>(…)
cudaMemcpyAsync(stream 1, …, DeviceToHost)
cudaStreamSynchronize(stream 0)
cudaStreamSynchronize(stream 1)

This takes 40 ms.

cudaMemcpyAsync(stream 0, …, HostToDevice)
cudaMemcpyAsync(stream 1, …, HostToDevice)
kernel<<<…,stream 0>>>(…)
kernel<<<…,stream 1>>>(…)
cudaMemcpyAsync(stream 0, …, DeviceToHost)
cudaMemcpyAsync(stream 1, …, DeviceToHost)
cudaStreamSynchronize(stream 0)
cudaStreamSynchronize(stream 1)

This takes 50 ms.

By contrast, I ran the same test on a GeForce 9800 and it showed the opposite result:
the former order took 60 ms and the latter took 50 ms.
I would like to understand this behavior.
Any help is welcome :)

For the GeForce 9800 times: H2D and D2H memcopies don't overlap each other on that card, so the 60 ms seems correct to me. You enqueue the first copy on stream 1 after the last copy on stream 0, so it cannot start until stream 0's kernel and download have finished.
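One way to confirm what a given card can overlap is to query its device properties. The snippet below is only a minimal sketch, assuming a CUDA toolkit recent enough to expose the `asyncEngineCount` field of `cudaDeviceProp`. According to the runtime documentation, a value of 1 means a copy can overlap a kernel, and 2 means copies in both directions can run at once alongside a kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA device found\n");
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // asyncEngineCount: 0 = no copy/kernel overlap,
        //                   1 = copy overlaps kernel (one direction at a time),
        //                   2 = H2D and D2H copies can both overlap a kernel.
        std::printf("%s: asyncEngineCount=%d concurrentKernels=%d\n",
                    prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    }
    return 0;
}
```

If the GeForce 9800 reports 1 and the Tesla C2050 reports 2, that would be consistent with the different behavior of the two cards described in this thread.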

But in the 3 other cases, I agree you should obtain 40 ms. I've already reported a bug on overlapping (see details in my post: http://forums.nvidia.com/index.php?showtopic=190347).

In your case, I think your first kernel starts only after the second copy has finished. You can check that with CUDA events (cudaEvent).
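Here is a rough sketch of such an event check for the second (interleaved) call order. It is not the original poster's code: `busyKernel`, the buffer size, and `iters` are placeholders I made up, so the stage times will not be 10 ms unless you tune them. It records an event after each stage in each stream and prints when that stage finished relative to a common start event. Keep in mind that the extra event records may themselves perturb scheduling on older hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::printf("%s: %s\n", #call, cudaGetErrorString(err_)); \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real kernel: a loop of arithmetic per element.
__global__ void busyKernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

int main() {
    const int blocks = 1 << 12, threads = 256;  // launch shape from the post
    const int n = blocks * threads;
    const size_t bytes = n * sizeof(float);
    const int iters = 2000;                     // arbitrary; tune for your GPU

    float *h[2], *d[2];
    cudaStream_t s[2];
    cudaEvent_t start, h2d[2], ker[2], d2h[2];

    CHECK(cudaEventCreate(&start));
    for (int i = 0; i < 2; ++i) {
        // Pinned host memory is needed for async copies to overlap with kernels.
        CHECK(cudaMallocHost((void **)&h[i], bytes));
        CHECK(cudaMalloc((void **)&d[i], bytes));
        CHECK(cudaStreamCreate(&s[i]));
        CHECK(cudaEventCreate(&h2d[i]));
        CHECK(cudaEventCreate(&ker[i]));
        CHECK(cudaEventCreate(&d2h[i]));
        for (int j = 0; j < n; ++j) h[i][j] = 1.0f;
    }

    // Interleaved order: both uploads, then both kernels, then both downloads.
    CHECK(cudaEventRecord(start, s[0]));
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]));
        CHECK(cudaEventRecord(h2d[i], s[i]));
    }
    for (int i = 0; i < 2; ++i) {
        busyKernel<<<blocks, threads, 0, s[i]>>>(d[i], n, iters);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(ker[i], s[i]));
    }
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]));
        CHECK(cudaEventRecord(d2h[i], s[i]));
    }
    for (int i = 0; i < 2; ++i) CHECK(cudaStreamSynchronize(s[i]));

    // Completion time of each stage, measured from the start event.
    for (int i = 0; i < 2; ++i) {
        float a = 0.f, b = 0.f, c = 0.f;
        CHECK(cudaEventElapsedTime(&a, start, h2d[i]));
        CHECK(cudaEventElapsedTime(&b, start, ker[i]));
        CHECK(cudaEventElapsedTime(&c, start, d2h[i]));
        std::printf("stream %d: H2D done %.2f ms, kernel done %.2f ms, "
                    "D2H done %.2f ms\n", i, a, b, c);
    }

    for (int i = 0; i < 2; ++i) {
        cudaEventDestroy(h2d[i]); cudaEventDestroy(ker[i]); cudaEventDestroy(d2h[i]);
        cudaStreamDestroy(s[i]); cudaFree(d[i]); cudaFreeHost(h[i]);
    }
    cudaEventDestroy(start);
    return 0;
}
```

If the "kernel done" time for stream 0 is roughly the "H2D done" time for stream 1 plus one kernel duration, the first kernel did not overlap with the second upload. If it is roughly equal to stream 1's "H2D done" time, the two ran concurrently.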