It seems that I can’t overlap asynchronous memory transfers with computation on my cards.
I have attached the small program I use to reproduce the problem (it is based on the one found here: http://forums.nvidia.com/index.php?showtopic=173705).
(It is only a test and does nothing interesting.)
The results I get look like this:
time launching memcpy h2d =0.0017643
time memcpy h2d =3.72838
time between memcpy h2d and kernel =3.73197
time launching kernel =0.00332499
time kernel =0.436005
time launching memcpy d2h =0.00183749
time memcpy d2h =3.4469
time waiting =7.64547
I interpret “time between memcpy h2d and kernel =3.73197” as meaning that the kernel in stream “st1” starts only after the asynchronous memcpy in stream “st” has finished: that delay is almost exactly the h2d copy time (3.72838). The profiler shows the same thing.
To me this looks like a serious bug, but maybe I am missing something obvious.
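For anyone reading without the attachment, here is a minimal sketch of the kind of test described above. It is not the attached overlap.cu: `busy_kernel`, `looks_serialized`, `now_ms`, the buffer size, the iteration count and the 10% tolerance are all illustrative assumptions. It uses `cudaDeviceGetAttribute` and `cudaDeviceSynchronize`, which only exist in newer toolkits than the 3.x releases mentioned here; on CUDA 3.x the equivalents would be the `deviceOverlap` field of `cudaDeviceProp` and `cudaThreadSynchronize`. Error checking is mostly omitted to keep it short.

```cuda
#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: enough arithmetic per element to take measurable time.
__global__ void busy_kernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Host-only helper: true when the combined time is within 10% of copy + kernel,
// i.e. the two operations appear to have run back to back, not concurrently.
bool looks_serialized(double copy_ms, double kernel_ms, double both_ms)
{
    double serial = copy_ms + kernel_ms;
    return std::fabs(both_ms - serial) < 0.1 * serial;
}

// Wall-clock time in milliseconds.
static double now_ms()
{
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now().time_since_epoch()).count();
}

int main()
{
    // Precondition 1: the device must be able to copy and compute concurrently.
    int overlap = 0;
    if (cudaDeviceGetAttribute(&overlap, cudaDevAttrGpuOverlap, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }
    std::printf("device can overlap copy and kernel: %d\n", overlap);

    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);
    float *h_buf = NULL, *d_copy = NULL, *d_work = NULL;
    // Precondition 2: the host buffer must be page-locked (pinned),
    // otherwise cudaMemcpyAsync cannot overlap with a kernel.
    cudaMallocHost((void **)&h_buf, bytes);
    cudaMalloc((void **)&d_copy, bytes);
    cudaMalloc((void **)&d_work, bytes);
    std::memset(h_buf, 0, bytes);
    cudaMemset(d_work, 0, bytes);

    // Precondition 3: copy and kernel go into two different non-default streams.
    cudaStream_t st, st1;
    cudaStreamCreate(&st);
    cudaStreamCreate(&st1);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup cost is not counted below.
    busy_kernel<<<blocks, threads, 0, st1>>>(d_work, n, 1);
    cudaDeviceSynchronize();

    // Copy alone.
    double t0 = now_ms();
    cudaMemcpyAsync(d_copy, h_buf, bytes, cudaMemcpyHostToDevice, st);
    cudaDeviceSynchronize();
    double copy_ms = now_ms() - t0;

    // Kernel alone.
    t0 = now_ms();
    busy_kernel<<<blocks, threads, 0, st1>>>(d_work, n, 200);
    cudaDeviceSynchronize();
    double kernel_ms = now_ms() - t0;

    // Copy and kernel issued back to back; nothing is put into the default
    // stream in between, since that could force the two to serialize.
    t0 = now_ms();
    cudaMemcpyAsync(d_copy, h_buf, bytes, cudaMemcpyHostToDevice, st);
    busy_kernel<<<blocks, threads, 0, st1>>>(d_work, n, 200);
    cudaDeviceSynchronize();
    double both_ms = now_ms() - t0;

    std::printf("copy %.3f ms, kernel %.3f ms, both %.3f ms, serialized: %d\n",
                copy_ms, kernel_ms, both_ms,
                (int)looks_serialized(copy_ms, kernel_ms, both_ms));

    cudaStreamDestroy(st);
    cudaStreamDestroy(st1);
    cudaFree(d_copy);
    cudaFree(d_work);
    cudaFreeHost(h_buf);
    return 0;
}
```

If the combined time comes out close to the sum of the two separate times, the copy and the kernel did not overlap, which is the behaviour the numbers above seem to show.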
I tested on a machine with a Tesla S1070 (driver 256.40) and on one with a Quadro FX5800 (driver 260.19.21). Both run Linux x86_64, and I tested with CUDA 3.0, 3.1 and 3.2 (3.2 on the Quadro only), with similar results.
I would really appreciate any answer.
Thank you very much
overlap.cu (3.38 KB)