cudaMemcpyAsync not "async" in CUDA 3.1 (cudaMemcpyAsync blocks)

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Tue_Jun__8_18:13:14_PDT_2010
Cuda compilation tools, release 3.1, V0.2.1221

The blocking behavior was observed on two different GPUs:

name=GeForce 8800 GT
totalGlobalMem=536150016
sharedMemPerBlock=16384
regsPerBlock=8192
warpSize=32
memPitch=2147483647
maxThreadsPerBlock=512
maxThreadsDim={512,512,64}
maxGridSize={65535,65535,1}
clockRate=1500000
totalConstMem=65536
major=1
minor=1
textureAlignment=256
deviceOverlap=1
multiProcessorCount=14

name=GeForce GTX 260
totalGlobalMem=1878327296
sharedMemPerBlock=16384
regsPerBlock=16384
warpSize=32
memPitch=2147483647
maxThreadsPerBlock=512
maxThreadsDim={512,512,64}
maxGridSize={65535,65535,1}
clockRate=1080000
totalConstMem=65536
major=1
minor=3
textureAlignment=256
deviceOverlap=1
multiProcessorCount=24
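
For anyone who wants to check the same thing on their own card, here is a minimal sketch of one way to measure whether the call blocks. It is not the code from the original test. The helper name time_async_copy, the 64 MiB buffer size, and the use of std::chrono and cudaDeviceGetAttribute are my assumptions, and the last two need a much newer toolkit than 3.1. The sketch uses a page-locked host buffer and a non-default stream, then compares the time the call takes to return with the time the copy takes to finish. The attribute cudaDevAttrGpuOverlap is the newer counterpart of the deviceOverlap field listed above.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): measures how long
// cudaMemcpyAsync takes to return (call_ms) and how long the host-to-device
// copy takes to complete (total_ms), both from the same start point.
cudaError_t time_async_copy(size_t bytes, double* call_ms, double* total_ms) {
    typedef std::chrono::steady_clock clock_type;
    void* host = NULL;
    void* dev = NULL;
    cudaStream_t stream = NULL;

    cudaError_t err = cudaMallocHost(&host, bytes);           // page-locked host buffer
    if (err == cudaSuccess) err = cudaMalloc(&dev, bytes);
    if (err == cudaSuccess) err = cudaStreamCreate(&stream);  // non-default stream

    if (err == cudaSuccess) {
        clock_type::time_point t0 = clock_type::now();
        err = cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
        clock_type::time_point t1 = clock_type::now();        // call has returned
        if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
        clock_type::time_point t2 = clock_type::now();        // copy has finished
        *call_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        *total_ms = std::chrono::duration<double, std::milli>(t2 - t0).count();
    }

    if (stream) cudaStreamDestroy(stream);
    if (dev) cudaFree(dev);
    if (host) cudaFreeHost(host);
    return err;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        int overlap = 0;
        if (cudaSetDevice(d) != cudaSuccess) continue;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        // Newer-toolkit equivalent of the deviceOverlap field in the dumps above.
        cudaDeviceGetAttribute(&overlap, cudaDevAttrGpuOverlap, d);

        double call_ms = 0.0, total_ms = 0.0;
        cudaError_t err = time_async_copy(64u << 20, &call_ms, &total_ms);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", d, cudaGetErrorString(err));
            continue;
        }
        printf("name=%s major=%d minor=%d overlap=%d call=%.3f ms total=%.3f ms\n",
               prop.name, prop.major, prop.minor, overlap, call_ms, total_ms);
    }
    return 0;
}
```

If call comes out close to total, the call effectively waited for the transfer. If call is much smaller than total, it returned before the copy finished. A single run can be noisy, so it is probably worth repeating the measurement a few times before drawing conclusions.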