Bug when overlapping transfer & computation

Hi guys,

It seems that I can’t overlap asynchronous memory transfers with computation on my cards.

I have attached the small program I use to reproduce it (it is inspired by the one found here: http://forums.nvidia.com/index.php?showtopic=173705).
It is just for testing purposes and does nothing interesting.

The results I get look like this:

time launching memcpy h2d =0.0017643
time memcpy h2d =3.72838
time between memcpy h2d and kernel =3.73197
time launching kernel =0.00332499
time kernel =0.436005
time launching memcpy d2h =0.00183749
time memcpy d2h =3.4469
time waiting =7.64547

I read “time between memcpy h2d and kernel =3.73197” as meaning that the kernel in stream “st1” starts only after the cudaMemcpyAsync in stream “st” has finished, and that is also what the profiler shows.
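For anyone who wants to compare, here is a minimal sketch of the pattern being tested: a host-to-device copy in one stream and an unrelated kernel in another. It is not the attached overlap.cu. The kernel `busy_kernel`, the buffer sizes and the event names are made up for illustration, and it is written against a recent CUDA runtime rather than the 3.x toolkits mentioned in this post (for example, `cudaDeviceSynchronize` does not exist there). It assumes two things that are commonly listed as requirements for overlap: the host buffer is page-locked (`cudaMallocHost`), and nothing is issued to the default stream between the copy and the kernel.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Bail out of main() with a message on any CUDA runtime error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            return 1;                                                \
        }                                                            \
    } while (0)

// Hypothetical kernel: burns time on a buffer unrelated to the copy.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main()
{
    const size_t copy_bytes = 64u << 20;  // 64 MiB transfer (arbitrary size)
    const int n = 1 << 20;                // kernel works on 1M floats

    float *h_src = NULL, *d_dst = NULL, *d_work = NULL;
    CHECK(cudaMallocHost((void **)&h_src, copy_bytes));  // page-locked host buffer
    memset(h_src, 0, copy_bytes);
    CHECK(cudaMalloc((void **)&d_dst, copy_bytes));
    CHECK(cudaMalloc((void **)&d_work, n * sizeof(float)));
    CHECK(cudaMemset(d_work, 0, n * sizeof(float)));

    cudaStream_t st, st1;                 // same stream names as in the post
    CHECK(cudaStreamCreate(&st));
    CHECK(cudaStreamCreate(&st1));

    cudaEvent_t copy_start, copy_stop, k_start, k_stop;
    CHECK(cudaEventCreate(&copy_start));
    CHECK(cudaEventCreate(&copy_stop));
    CHECK(cudaEventCreate(&k_start));
    CHECK(cudaEventCreate(&k_stop));

    // Each event is recorded in the stream it measures, never in stream 0.
    CHECK(cudaEventRecord(copy_start, st));
    CHECK(cudaMemcpyAsync(d_dst, h_src, copy_bytes, cudaMemcpyHostToDevice, st));
    CHECK(cudaEventRecord(copy_stop, st));

    CHECK(cudaEventRecord(k_start, st1));
    busy_kernel<<<(n + 255) / 256, 256, 0, st1>>>(d_work, n, 1000);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(k_stop, st1));

    CHECK(cudaDeviceSynchronize());

    float copy_ms = 0.f, kernel_ms = 0.f, gap_ms = 0.f;
    CHECK(cudaEventElapsedTime(&copy_ms, copy_start, copy_stop));
    CHECK(cudaEventElapsedTime(&kernel_ms, k_start, k_stop));
    CHECK(cudaEventElapsedTime(&gap_ms, copy_start, k_start));
    printf("copy h2d            = %f ms\n", copy_ms);
    printf("kernel              = %f ms\n", kernel_ms);
    printf("copy start -> kernel start = %f ms\n", gap_ms);

    CHECK(cudaEventDestroy(copy_start));
    CHECK(cudaEventDestroy(copy_stop));
    CHECK(cudaEventDestroy(k_start));
    CHECK(cudaEventDestroy(k_stop));
    CHECK(cudaStreamDestroy(st));
    CHECK(cudaStreamDestroy(st1));
    CHECK(cudaFree(d_dst));
    CHECK(cudaFree(d_work));
    CHECK(cudaFreeHost(h_src));
    return 0;
}
```

If the last number is close to the copy time, the kernel waited for the copy, which is the behaviour described above. If it is much smaller than the copy time, the two overlapped. Event timings taken across different streams are only approximate, so the profiler timeline remains the better check.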

To me this looks like a serious bug, but maybe I am missing something obvious.

I tested it on a machine with a Tesla S1070 (driver 256.40) and on a Quadro FX5800 with driver 260.19.21. Both are Linux x86_64, and I ran the tests with CUDA 3.0, 3.1 and 3.2 (3.2 only on the Quadro), with similar results.
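To rule out a hardware limitation on a given card, the overlap capability can be queried from the runtime. The sketch below is an illustration only and uses `cudaDeviceGetAttribute`, which comes from toolkits newer than the 3.x versions listed above; on those older toolkits the equivalent information was in the `deviceOverlap` field of `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driver = 0, runtime = 0, count = 0;

    // These are CUDA version numbers (e.g. 3020 for 3.2), not the
    // display driver version such as 260.19.21.
    cudaDriverGetVersion(&driver);
    cudaRuntimeGetVersion(&runtime);
    printf("CUDA driver version %d, runtime version %d\n", driver, runtime);

    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;

        int overlap = 0;   // 1 if a copy can run concurrently with a kernel
        int engines = 0;   // number of asynchronous copy engines
        cudaDeviceGetAttribute(&overlap, cudaDevAttrGpuOverlap, dev);
        cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, dev);

        printf("device %d: %s, compute capability %d.%d, "
               "copy/kernel overlap = %d, async engines = %d\n",
               dev, prop.name, prop.major, prop.minor, overlap, engines);
    }
    return 0;
}
```

A device that reports no overlap support cannot run a copy and a kernel at the same time regardless of how the streams are set up, so this is worth checking before treating serialized execution as a driver bug.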

I would really appreciate any answer.

Thank you very much
overlap.cu (3.38 KB)

Since I have reported the bug on the developer web site, I am posting updated, proper reproducers here. overlap_bugreport.cu shows that the kernel waits for the memcpy to finish, and overlap.cu shows that the overlap happens only at the first step.
I still have no answer from Nvidia about this.
overlap_bugreport.cu (3.57 KB)
overlap.cu (4.05 KB)