cudaMemcpyAsync not "async" in CUDA 3.1: cudaMemcpyAsync blocking in CUDA 3.1

mborgerd · July 9, 2010, 8:07pm

It appears that CUDA 3.1 has changed the asynchronous behavior of cudaMemcpyAsync. I’ve attached a small program that demonstrates this. The program:

- calls cudaMemcpyAsync on stream 0 to copy Host2Device
- launches a simple device function (also adds one)
- calls cudaMemcpyAsync on stream 0 to copy Device2Host
- calls cudaThreadSynchronize

My understanding is that the first 3 calls should return quickly, and the cudaThreadSynchronize should be the synchronization point at which execution blocks.
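For readers without the attachment, the four steps above can be sketched roughly as follows. This is not the attached memcopies.cu: the kernel name `add_one`, the buffer size, the `CHECK` macro, and the `now()` timer are all hypothetical, and the sketch assumes pinned host memory from cudaMallocHost, since asynchronous copies generally need page-locked host buffers. It calls cudaDeviceSynchronize, which replaced cudaThreadSynchronize from CUDA 4.0 onward.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: adds one to every element.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Host wall-clock time in seconds (arbitrary origin).
static double now()
{
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(
               clock::now().time_since_epoch()).count();
}

int main()
{
    const int n = 1 << 22;                  // assumed size (16 MB of floats)
    const size_t bytes = n * sizeof(float);
    float *h = NULL, *d = NULL;

    CHECK(cudaMallocHost((void **)&h, bytes));  // pinned host buffer (assumption)
    CHECK(cudaMalloc((void **)&d, bytes));
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    double t0 = now();
    CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, 0));
    double t1 = now();
    add_one<<<(n + 255) / 256, 256>>>(d, n);    // launched on stream 0
    double t2 = now();
    CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, 0));
    double t3 = now();
    CHECK(cudaDeviceSynchronize());             // expected blocking point
    double t4 = now();

    printf("time launching memcpy h2d =%g\n", t1 - t0);
    printf("time launching kernel =%g\n", t2 - t1);
    printf("time launching memcpy d2h =%g\n", t3 - t2);
    printf("time waiting =%g\n", t4 - t3);

    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

If the copies are truly asynchronous, the first three timings should be small and most of the elapsed time should land in "time waiting".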

This was the case under nvcc 3.0:

[markb@hedy cudabot]$ nvcc -O3 memcopies.cu && ./a.out

time spent in host routine =0.167148

time launching memcpy h2d =0.00090456

time launching kernel =0.00139213

time launching memcpy d2h =0.000252008

time waiting =0.381643

However, the memory copies are blocking under nvcc 3.1:

[markb@hedy cudabot]$ nvcc -O3 memcopies.cu && ./a.out

time spent in host routine =0.166327

time launching memcpy h2d =0.161125

time launching kernel =0.00172162

time launching memcpy d2h =0.225461

time waiting =0.000165701

memcopies.cu (2.56 KB)

tmurray · July 9, 2010, 8:30pm

What GPU are you using?

mborgerd · July 9, 2010, 8:37pm

If I replace the default stream “0” with an explicitly created stream, the cudaMemcpyAsync calls behave as expected.
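A minimal sketch of that change, assuming the same kind of add-one test: the names `add_one`, `CHECK`, and `stream` are hypothetical and not taken from memcopies.cu. The only differences from the stream 0 version are the cudaStreamCreate call, the stream argument on each copy and on the kernel launch, and the cudaStreamSynchronize before the host reads the result.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: adds one to every element.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;                  // assumed size
    const size_t bytes = n * sizeof(float);
    float *h = NULL, *d = NULL;

    CHECK(cudaMallocHost((void **)&h, bytes));  // pinned host buffer (assumption)
    CHECK(cudaMalloc((void **)&d, bytes));
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    // Explicitly created stream instead of the default stream 0.
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream));
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream));

    // Work in a stream runs in issue order; wait for all of it to finish.
    CHECK(cudaStreamSynchronize(stream));

    // Every element started at 0 and should now be 1.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 1.0f) ++mismatches;
    printf("mismatches = %d\n", mismatches);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```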

tmurray · July 9, 2010, 8:38pm

I need more info on the GPU and driver version; I'm not reproducing this (with a non-public driver and a C1060).

mborgerd · July 9, 2010, 8:38pm

nvcc: NVIDIA (R) Cuda compiler driver

Built on Tue_Jun__8_18:13:14_PDT_2010

Cuda compilation tools, release 3.1, V0.2.1221

observed on 2 different GPUs:

name=GeForce 8800 GT

totalGlobalMem=536150016

sharedMemPerBlock=16384

regsPerBlock=8192

warpSize=32

memPitch=2147483647

maxThreadsPerBlock=512

maxThreadsDim={512,512,64}

maxGridSize={65535,65535,1}

clockRate=1500000

totalConstMem=65536

major=1

minor=1

textureAlignment=256

deviceOverlap=1

multiProcessorCount=14

name=GeForce GTX 260

totalGlobalMem=1878327296

sharedMemPerBlock=16384

regsPerBlock=16384

warpSize=32

memPitch=2147483647

maxThreadsPerBlock=512

maxThreadsDim={512,512,64}

maxGridSize={65535,65535,1}

clockRate=1080000

totalConstMem=65536

major=1

minor=3

textureAlignment=256

deviceOverlap=1

multiProcessorCount=24
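For anyone collecting the same details for a bug report, a listing like the one above, plus the version numbers, could be produced with a small query program. This is a sketch, not the tool used for the output above, and it prints only a subset of the fields. Note that cudaDriverGetVersion reports the CUDA version the installed driver supports, not the display driver's own release number.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driver = 0, runtime = 0, count = 0;

    // CUDA version supported by the driver, and version of the runtime in use.
    cudaDriverGetVersion(&driver);
    cudaRuntimeGetVersion(&runtime);
    printf("driverVersion=%d\n", driver);
    printf("runtimeVersion=%d\n", runtime);

    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        err = cudaGetDeviceProperties(&p, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("name=%s\n", p.name);
        printf("totalGlobalMem=%zu\n", p.totalGlobalMem);
        printf("major=%d\n", p.major);
        printf("minor=%d\n", p.minor);
        printf("multiProcessorCount=%d\n", p.multiProcessorCount);
        // Number of copy engines; nonzero means copies can overlap kernels.
        printf("asyncEngineCount=%d\n", p.asyncEngineCount);
    }
    return 0;
}
```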

tmurray · July 9, 2010, 9:58pm

blergh, this is apparently a known bug in 3.1 that’s already fixed in 3.2.

mborgerd · July 11, 2010, 3:08am

[Retracted] Unfortunately, my program is now giving incorrect answers with the created stream :(

Correction: my code was not synchronizing properly in some cases. It looks like the created stream will be a good workaround.
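The post does not say which cases were missing synchronization, so the following is only a sketch of one common pattern, under that assumption: once work is issued to a created stream, the host must wait for the device-to-host copy to finish before reading the buffer. Here an event recorded after the copy marks that point. The names `add_one`, `CHECK`, `stream`, and `copy_done` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: adds one to every element.
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;                  // assumed size
    const size_t bytes = n * sizeof(float);
    float *h = NULL, *d = NULL;
    cudaStream_t stream;
    cudaEvent_t copy_done;

    CHECK(cudaMallocHost((void **)&h, bytes));  // pinned host buffer (assumption)
    CHECK(cudaMalloc((void **)&d, bytes));
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&copy_done));
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream));
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream));

    // Mark the point in the stream where the result is back on the host.
    CHECK(cudaEventRecord(copy_done, stream));

    // Host work that does not touch h may run here while the GPU is busy.
    // Reading h at this point would be a race with the pending copy.

    // Block until everything issued before the event has completed.
    CHECK(cudaEventSynchronize(copy_done));

    // Now it is safe to read h; each element started at 0 and should be 1.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 1.0f) ++mismatches;
    printf("mismatches = %d\n", mismatches);

    CHECK(cudaEventDestroy(copy_done));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

cudaStreamSynchronize(stream) would serve equally well here; an event is mainly useful when more work is queued behind the copy and the host only needs to wait for the copy itself.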

clamport · July 12, 2010, 10:15am

So does this mean we can expect a 3.2 release soon?

Thanks!

Clamport
