It appears that CUDA 3.1 has changed the asynchronous behavior of cudaMemcpyAsync. I’ve attached a small program that demonstrates this. The program:
- calls cudaMemcpyAsync on stream 0 to copy Host2Device
- launches a simple device function (also adds one)
- calls cudaMemcpyAsync on stream 0 to copy Device2Host
- calls cudaThreadSynchronize
My understanding is that the first three calls should return quickly, and that cudaThreadSynchronize should be the only point at which the host blocks.
This was the case under nvcc 3.0:
[markb@hedy cudabot]$ nvcc -O3 memcopies.cu && ./a.out
time spent in host routine =0.167148
time launching memcpy h2d =0.00090456
time launching kernel =0.00139213
time launching memcpy d2h =0.000252008
time waiting =0.381643
However, the memory copies block under nvcc 3.1, and the time moves from the final wait into the two cudaMemcpyAsync calls:
[markb@hedy cudabot]$ nvcc -O3 memcopies.cu && ./a.out
time spent in host routine =0.166327
time launching memcpy h2d =0.161125
time launching kernel =0.00172162
time launching memcpy d2h =0.225461
time waiting =0.000165701
memcopies.cu (2.56 KB)
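
For readers without the attachment, here is a minimal sketch of the four-step sequence described above. It is not the attached memcopies.cu: the kernel name add_one, the helper run_async_sequence, the buffer size, and the timing helper are all hypothetical. It also assumes the host buffer is page-locked (allocated with cudaMallocHost), because cudaMemcpyAsync is only expected to return before the copy completes when the host memory is pinned. Whether the original program used pinned memory is not stated. cudaDeviceSynchronize is used here as the current name for cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the post's "simple device function": adds one per element.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Seconds elapsed since t0, measured on the host.
static double since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Issues H2D copy, kernel, D2H copy, then synchronizes, timing each host-side call.
// Returns true if every CUDA call succeeded and each element was incremented once.
bool run_async_sequence(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* h = nullptr;
    float* d = nullptr;
    // Assumption: pinned host memory, so the async copies can overlap with host code.
    if (cudaMallocHost(reinterpret_cast<void**>(&h), bytes) != cudaSuccess) return false;
    if (cudaMalloc(reinterpret_cast<void**>(&d), bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t e1 = cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0);
    double t_h2d = since(t0);

    t0 = std::chrono::steady_clock::now();
    const int threads = 256;
    add_one<<<(n + threads - 1) / threads, threads>>>(d, n);
    cudaError_t e2 = cudaGetLastError();  // reports launch errors only
    double t_kernel = since(t0);

    t0 = std::chrono::steady_clock::now();
    cudaError_t e3 = cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, 0);
    double t_d2h = since(t0);

    t0 = std::chrono::steady_clock::now();
    cudaError_t e4 = cudaDeviceSynchronize();  // cudaThreadSynchronize in CUDA 3.x
    double t_wait = since(t0);

    std::printf("time launching memcpy h2d =%g\n", t_h2d);
    std::printf("time launching kernel =%g\n", t_kernel);
    std::printf("time launching memcpy d2h =%g\n", t_d2h);
    std::printf("time waiting =%g\n", t_wait);

    bool ok = e1 == cudaSuccess && e2 == cudaSuccess &&
              e3 == cudaSuccess && e4 == cudaSuccess;
    // Values stay below 2^24, so the float comparison is exact.
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == static_cast<float>(i) + 1.0f);

    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    return run_async_sequence(1 << 22) ? 0 : 1;
}
```

If the three launch times stay small and nearly all the time shows up under "time waiting", the calls are behaving asynchronously. Replacing cudaMallocHost with plain malloc (and cudaFreeHost with free) is one way to check whether pageable host memory accounts for the blocking copies.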