performance variation when using asynchronous calls

boyung · February 11, 2011, 8:36am

Hi. I’m testing asynchronous behavior using 2 streams.

Here is my setup:
The host->device copy, the kernel execution, and the device->host copy each take about 10 ms.
The kernel uses 2^12 blocks, and each block uses 256 threads.

With the 2 streams overlapped, I expected the entire job to take 40 ms
(because kernel execution and copying can run simultaneously, the second stream should add only one extra 10 ms stage to the first stream's 30 ms).
However, when I tested on a Tesla C2050 GPU, I found that the order of the function calls changes the timing.
For example:

cudaMemcpyAsync(stream 0, …, HostToDevice)
kernel<<<…,stream 0>>>(…)
cudaMemcpyAsync(stream 0, …, DeviceToHost)
cudaMemcpyAsync(stream 1, …, HostToDevice)
kernel<<<…,stream 1>>>(…)
cudaMemcpyAsync(stream 1, …, DeviceToHost)
cudaStreamSynchronize(stream 0)
cudaStreamSynchronize(stream 1)

This takes 40 ms.

cudaMemcpyAsync(stream 0, …, HostToDevice)
cudaMemcpyAsync(stream 1, …, HostToDevice)
kernel<<<…,stream 0>>>(…)
kernel<<<…,stream 1>>>(…)
cudaMemcpyAsync(stream 0, …, DeviceToHost)
cudaMemcpyAsync(stream 1, …, DeviceToHost)
cudaStreamSynchronize(stream 0)
cudaStreamSynchronize(stream 1)

This takes 50 ms.
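For anyone who wants to reproduce the comparison, the two issue orders above could be written out roughly as follows. This is only a sketch under assumptions: the original kernel and buffer sizes were not posted, so `kernel`, the `CHECK` macro, the helper names (`depthFirst`, `breadthFirst`, `timeMs`) and the one-float-per-thread buffer size are placeholders, and the timings will not match the 10 ms stages described in the post.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not from the original post).
#define CHECK(call)                                                          \
    do {                                                                     \
        cudaError_t e_ = (call);                                             \
        if (e_ != cudaSuccess) {                                             \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
            std::exit(1);                                                    \
        }                                                                    \
    } while (0)

constexpr int kStreams = 2;
constexpr int kThreads = 256;      // threads per block, as in the post
constexpr int kBlocks = 1 << 12;   // 2^12 blocks, as in the post
constexpr int kN = kBlocks * kThreads;          // assumption: one float per thread
constexpr size_t kBytes = kN * sizeof(float);

// Placeholder kernel: the real kernel was not shown.
__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) x = x * 0.999f + 0.001f;
        data[i] = x;
    }
}

// First order: all work for stream 0, then all work for stream 1.
static void depthFirst(float **h, float **d, cudaStream_t *s)
{
    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaMemcpyAsync(d[i], h[i], kBytes, cudaMemcpyHostToDevice, s[i]));
        kernel<<<kBlocks, kThreads, 0, s[i]>>>(d[i], kN);
        CHECK(cudaMemcpyAsync(h[i], d[i], kBytes, cudaMemcpyDeviceToHost, s[i]));
    }
}

// Second order: both H2D copies, then both kernels, then both D2H copies.
static void breadthFirst(float **h, float **d, cudaStream_t *s)
{
    for (int i = 0; i < kStreams; ++i)
        CHECK(cudaMemcpyAsync(d[i], h[i], kBytes, cudaMemcpyHostToDevice, s[i]));
    for (int i = 0; i < kStreams; ++i)
        kernel<<<kBlocks, kThreads, 0, s[i]>>>(d[i], kN);
    for (int i = 0; i < kStreams; ++i)
        CHECK(cudaMemcpyAsync(h[i], d[i], kBytes, cudaMemcpyDeviceToHost, s[i]));
}

// Wall-clock time from first enqueue until both streams have drained.
static double timeMs(void (*issue)(float **, float **, cudaStream_t *),
                     float **h, float **d, cudaStream_t *s)
{
    CHECK(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    issue(h, d, s);
    for (int i = 0; i < kStreams; ++i) CHECK(cudaStreamSynchronize(s[i]));
    auto t1 = std::chrono::steady_clock::now();
    CHECK(cudaGetLastError());  // surfaces kernel launch errors, if any
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    float *h[kStreams], *d[kStreams];
    cudaStream_t s[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        // Async copies only overlap when the host buffer is page-locked.
        CHECK(cudaMallocHost((void **)&h[i], kBytes));
        CHECK(cudaMalloc((void **)&d[i], kBytes));
        CHECK(cudaStreamCreate(&s[i]));
        for (int j = 0; j < kN; ++j) h[i][j] = 1.0f;
    }

    timeMs(depthFirst, h, d, s);  // warm-up run, result discarded
    std::printf("stream-by-stream order: %.2f ms\n", timeMs(depthFirst, h, d, s));
    std::printf("interleaved order:      %.2f ms\n", timeMs(breadthFirst, h, d, s));

    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamDestroy(s[i]));
        CHECK(cudaFree(d[i]));
        CHECK(cudaFreeHost(h[i]));
    }
    return 0;
}
```

Build with something like `nvcc -o overlap overlap.cu` (the file name is arbitrary). To get closer to the 10 ms stages from the post, the buffer size and the kernel's loop count would need tuning for the GPU at hand.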

In contrast, the same test on a GeForce 9800 showed the opposite result:
the former order took 60 ms and the latter took 50 ms.
I would like to understand this behavior.
Any help is welcome :)

fcs · February 11, 2011, 9:17am

On the GeForce 9800, H2D and D2H memcopies don’t overlap each other, so the 60 ms seems correct to me: you enqueue the first copy on stream 1 after the last copy on stream 0.
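Whether a given card can overlap copies in both directions can be checked by querying the number of asynchronous copy engines. A minimal sketch, assuming a reasonably recent CUDA toolkit (the attribute query below was not available in the toolkits current when this thread was written) and that device 0 is the GPU under test:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int dev = 0;   // assumption: the GPU under test is device 0
    int engines = 0;

    cudaError_t err =
        cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // 0: no copy/kernel overlap
    // 1: a copy can overlap a kernel, but H2D and D2H copies serialize
    // 2: H2D copy, D2H copy and a kernel can all run at the same time
    std::printf("async engine count on device %d: %d\n", dev, engines);
    return 0;
}
```

A value of 1 would be consistent with the H2D and D2H copies not overlapping each other, as described above for the GeForce 9800.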

But in the 3 other cases, I agree you should get 40 ms. I’ve already reported a bug on overlapping (see details in my post: The Official NVIDIA Forums | NVIDIA).

In your cases, I think the first kernel starts only after the second copy. You can check that with CUDA events.
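One way to do that check is to record an event after every stage in each stream and print each completion time relative to a common reference event. The sketch below uses the interleaved issue order from the question; the kernel, the buffer size and the `CHECK` macro are placeholders, since the original code was not posted. Two caveats: events mark when the preceding work in a stream finished, not when a kernel started, and the extra event records may themselves perturb how the hardware schedules the work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not from the original post).
#define CHECK(call)                                                          \
    do {                                                                     \
        cudaError_t e_ = (call);                                             \
        if (e_ != cudaSuccess) {                                             \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
            std::exit(1);                                                    \
        }                                                                    \
    } while (0)

// Placeholder kernel: the real kernel was not shown.
__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) x = x * 0.999f + 0.001f;
        data[i] = x;
    }
}

int main()
{
    const int threads = 256, blocks = 1 << 12, n = blocks * threads;
    const size_t bytes = n * sizeof(float);   // assumption: one float per thread
    const char *stage[3] = {"H2D done", "kernel done", "D2H done"};

    float *h[2], *d[2];
    cudaStream_t s[2];
    cudaEvent_t ref, ev[3][2];   // ev[stage][stream]
    CHECK(cudaEventCreate(&ref));
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaMallocHost((void **)&h[i], bytes));  // pinned host memory
        CHECK(cudaMalloc((void **)&d[i], bytes));
        CHECK(cudaStreamCreate(&s[i]));
        for (int j = 0; j < n; ++j) h[i][j] = 1.0f;
        for (int k = 0; k < 3; ++k) CHECK(cudaEventCreate(&ev[k][i]));
    }

    CHECK(cudaEventRecord(ref, s[0]));   // common time origin
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]));
        CHECK(cudaEventRecord(ev[0][i], s[i]));
    }
    for (int i = 0; i < 2; ++i) {
        kernel<<<blocks, threads, 0, s[i]>>>(d[i], n);
        CHECK(cudaEventRecord(ev[1][i], s[i]));
    }
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]));
        CHECK(cudaEventRecord(ev[2][i], s[i]));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());   // all events have completed after this

    for (int k = 0; k < 3; ++k) {
        for (int i = 0; i < 2; ++i) {
            float ms = 0.0f;
            CHECK(cudaEventElapsedTime(&ms, ref, ev[k][i]));
            std::printf("stream %d %-11s at %8.3f ms\n", i, stage[k], ms);
        }
    }

    for (int i = 0; i < 2; ++i) {
        for (int k = 0; k < 3; ++k) CHECK(cudaEventDestroy(ev[k][i]));
        CHECK(cudaStreamDestroy(s[i]));
        CHECK(cudaFree(d[i]));
        CHECK(cudaFreeHost(h[i]));
    }
    CHECK(cudaEventDestroy(ref));
    return 0;
}
```

If the stream 0 kernel really overlapped the stream 1 copy, "kernel done" for stream 0 should land close to "H2D done" for stream 1 (given equal stage lengths); if it lands roughly one kernel duration later, the kernel waited for the second copy.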
