Hi. I’m testing asynchronous behavior using two streams.
I made up some settings: the host-to-device copy, the kernel execution, and the device-to-host copy each take about 10 ms.
The kernel uses 2^12 blocks of 256 threads each.
When the two streams overlap, I expected the entire job to take 40 ms (since kernel execution and copying can happen simultaneously).
However, when I tested this on a Tesla C2050 GPU, I found that the order in which the calls are issued changes the timing. For example:
cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream0);
kernel<<<..., stream0>>>(...);
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream1);
kernel<<<..., stream1>>>(...);
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);
This ordering takes 40 ms.
But with this ordering:

cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream1);
kernel<<<..., stream0>>>(...);
kernel<<<..., stream1>>>(...);
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);
it takes 50 ms.
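For reference, here is a self-contained harness along the lines of my test (the kernel body, buffer names, and sizes are my own placeholders, not the exact original code) that times both issue orders with CUDA events. Note that the host buffers must be pinned with cudaMallocHost, otherwise cudaMemcpyAsync cannot overlap with the kernels at all:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)   // artificial work to stretch runtime
            d[i] = d[i] * 0.5f + 1.0f;
}

int main() {
    const int n = (1 << 12) * 256;       // 2^12 blocks * 256 threads
    const size_t bytes = n * sizeof(float);
    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&h[i], bytes);    // pinned host memory (required for overlap)
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
    }

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    // Order A: depth-first (all of stream 0's work, then all of stream 1's)
    cudaEventRecord(t0);
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        busy<<<1 << 12, 256, 0, s[i]>>>(d[i], n);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("depth-first:   %.1f ms\n", ms);

    // Order B: breadth-first (interleave by stage across the streams)
    cudaEventRecord(t0);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
    for (int i = 0; i < 2; ++i)
        busy<<<1 << 12, 256, 0, s[i]>>>(d[i], n);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("breadth-first: %.1f ms\n", ms);
    return 0;
}
```

Build with nvcc; the absolute timings will of course differ by GPU and driver.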
On the contrary, the same test on a GeForce 9800 showed the opposite result: the former took 60 ms and the latter took 50 ms.
I would like to understand why this happens.
Any help would be welcome :)