cudaMemcpyAsync not "async" in CUDA 3.1 (cudaMemcpyAsync blocking in CUDA 3.1)

It appears that CUDA 3.1 has changed the asynchronous nature of cudaMemcpyAsync. I’ve attached a small program that demonstrates this. It:

- calls cudaMemcpyAsync on stream 0 to copy Host2Device

- launches a simple device function (also adds one)

- calls cudaMemcpyAsync on stream 0 to copy Device2Host

- calls cudaThreadSynchronize

My understanding is that the first 3 calls should return quickly, and the cudaThreadSynchronize should be the synchronization point at which execution blocks.

This was the case under nvcc 3.0:

[markb@hedy cudabot]$ nvcc -O3 memcopies.cu && ./a.out
time spent in host routine =0.167148
time launching memcpy h2d =0.00090456
time launching kernel =0.00139213
time launching memcpy d2h =0.000252008
time waiting =0.381643

However, the memory copies are blocking under nvcc 3.1:

[markb@hedy cudabot]$ nvcc -O3 memcopies.cu && ./a.out
time spent in host routine =0.166327
time launching memcpy h2d =0.161125
time launching kernel =0.00172162
time launching memcpy d2h =0.225461
time waiting =0.000165701

memcopies.cu (2.56 KB)
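
The attachment is not reproduced in this thread. The following is only a minimal sketch of the kind of test described above, not the original memcopies.cu: the kernel name `addOne`, the helper `now`, the `CHECK` macro, and the buffer size are all assumptions made for illustration. It uses pinned host memory from `cudaMallocHost`, since copies from pageable memory are not expected to be fully asynchronous, and it calls `cudaDeviceSynchronize`, the later replacement for `cudaThreadSynchronize`, so that it builds with current toolkits.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(e_));                            \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Hypothetical kernel: adds one to every element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Wall-clock time in seconds from a monotonic clock.
static double now() {
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}

int main() {
    const int n = 1 << 22;                  // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    CHECK(cudaMallocHost(&h, bytes));       // pinned host buffer
    CHECK(cudaMalloc(&d, bytes));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // Time how long each call takes to return on the host.
    double t0 = now();
    CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0));
    double t1 = now();
    addOne<<<(n + 255) / 256, 256>>>(d, n);
    double t2 = now();
    CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, 0));
    double t3 = now();
    CHECK(cudaDeviceSynchronize());         // expected blocking point
    double t4 = now();

    printf("time launching memcpy h2d =%g\n", t1 - t0);
    printf("time launching kernel =%g\n", t2 - t1);
    printf("time launching memcpy d2h =%g\n", t3 - t2);
    printf("time waiting =%g\n", t4 - t3);

    // Every element should have been incremented exactly once.
    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (h[i] != 2.0f);
    printf("mismatches = %d\n", bad);

    cudaFreeHost(h);
    cudaFree(d);
    return bad != 0;
}
```

If the copies are truly asynchronous, the first three timings should be small and most of the elapsed time should show up in the final wait.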

What GPU are you using?

If I replace the default stream “0” with an explicitly created stream, the cudaMemcpyAsync calls behave as expected.
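
A rough sketch of that workaround, assuming a hypothetical kernel `addOne` and an arbitrary buffer size (neither comes from the attached program): the copies and the kernel are issued on a stream made with `cudaStreamCreate` rather than on stream 0, and the host waits on that stream before touching the result. Operations within one stream run in issue order, so the kernel still sees the uploaded data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds one to every element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                  // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaStream_t stream;

    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;
    if (cudaMallocHost(&h, bytes) != cudaSuccess) return 1;  // pinned
    if (cudaMalloc(&d, bytes) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // All three operations go to the explicitly created stream.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // The host must wait here before reading h; skipping this wait
    // can produce incorrect answers.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        fprintf(stderr, "stream failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should have been incremented exactly once.
    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (h[i] != 2.0f);
    printf("mismatches = %d\n", bad);

    cudaFreeHost(h);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return bad != 0;
}
```

The explicit wait matters more here than on stream 0: nothing else in the program implicitly orders host reads after work queued on a created stream.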

Need more info on your GPU and driver version; I’m not reproducing this (with a non-public driver and a C1060).

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Tue_Jun__8_18:13:14_PDT_2010
Cuda compilation tools, release 3.1, V0.2.1221

Observed on 2 different GPUs:

name=GeForce 8800 GT
totalGlobalMem=536150016
sharedMemPerBlock=16384
regsPerBlock=8192
warpSize=32
memPitch=2147483647
maxThreadsPerBlock=512
maxThreadsDim={512,512,64}
maxGridSize={65535,65535,1}
clockRate=1500000
totalConstMem=65536
major=1
minor=1
textureAlignment=256
deviceOverlap=1
multiProcessorCount=14

name=GeForce GTX 260
totalGlobalMem=1878327296
sharedMemPerBlock=16384
regsPerBlock=16384
warpSize=32
memPitch=2147483647
maxThreadsPerBlock=512
maxThreadsDim={512,512,64}
maxGridSize={65535,65535,1}
clockRate=1080000
totalConstMem=65536
major=1
minor=3
textureAlignment=256
deviceOverlap=1
multiProcessorCount=24

blergh, this is apparently a known bug in 3.1 that’s already fixed in 3.2.

Edit: I initially reported that my program was giving incorrect answers with the created stream, but the problem was mine: my code was not synchronizing properly in some cases. It looks like the created stream will be a good workaround.

So does this mean we can expect a 3.2 release soon?

Thanks!

Clamport