cudaThreadSynchronize() with texture binding

I have a CUDA program like this:

[codebox]
for (int i = 0; i < 100000; i++) {
    if (i % 2 == 0) {
        bind_x(x);                       // bind x to texture
        kernel_code<<<A, B>>>(M, x, y);  // calculate y = M*x
    } else {
        bind_x(y);                       // bind y to texture
        kernel_code<<<A, B>>>(M, y, x);  // calculate x = M*y
    }

    cudaThreadSynchronize();

    if (i % 2 == 0)
        unbind_x(x);  // unbind the input array from the texture
    else
        unbind_x(y);
}
[/codebox]

I heard that if I do not call cudaThreadSynchronize(), the CPU will continue to run without waiting for the kernel to finish.

However, I tried running with and without it, and the result is the same (and in theory it shouldn't be?).

Should I use it in this case?

Thanks in advance

Yes, kernel launches are asynchronous, as is explicitly documented in the programming guide (not just “heard” from somewhere). As the programming guide also states, multiple async launches are queued and executed in sequence on the GPU.

One NEVER needs to call *Synchronize() unless you are making wall-clock timings or are about to read/write values copied with *MemcpyAsync.

Or about to read/write values in zero-copy host memory that is being used by a running kernel.
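
To make the points above concrete, here is a minimal sketch, not the original poster's code. The `add_one` kernel is a hypothetical stand-in for `kernel_code`, the sizes are arbitrary, and `cudaDeviceSynchronize()` is used because it is the current name for the deprecated `cudaThreadSynchronize()`. The loop launches kernels back to back with no synchronization. The only explicit synchronize is the one in front of the wall-clock reading, and the blocking `cudaMemcpy` at the end waits for the queued work by itself.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: out[i] = in[i] + 1.
__global__ void add_one(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;    // arbitrary problem size
    const int iters = 1000;   // even, so the final result lands in x
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));  // x starts as all zeros

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        // No synchronize here: launches on the same stream are queued
        // and executed in order, so each one sees the previous result.
        if (i % 2 == 0) add_one<<<blocks, threads>>>(x, y, n);
        else            add_one<<<blocks, threads>>>(y, x, n);
    }
    // Needed only because a wall clock is read on the next line.
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    // A blocking cudaMemcpy waits for preceding work on the default stream.
    float first = 0.0f;
    cudaMemcpy(&first, x, sizeof(float), cudaMemcpyDeviceToHost);

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("x[0] = %.0f (expected %d), loop took %.3f ms\n", first, iters, ms);

    cudaFree(x);
    cudaFree(y);
    return first == static_cast<float>(iters) ? 0 : 1;
}
```

Removing the `cudaDeviceSynchronize()` call should leave `x[0]` unchanged. Only the measured time becomes meaningless, because the clock would then capture launch overhead rather than kernel execution.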

Thanks for your reply.

However, I still do not understand, since my bind and unbind calls are not kernel code; they are CPU code. If array x is unbound from the texture, the kernel can no longer access it through the texture
(while calculating y = M*x, I have to fetch x from texture memory),
so it may cause some strange behavior. And as far as I know, CPU code is not queued behind the kernel call in this case.

Every async operation (including texture binds) is queued.

Oh yeah, so texture binding will call kernel code anyway.
Thanks.

As an optimization, you might use two textures that you bind to x and y, and keep them for the whole loop. There should be no need to unbind and rebind textures on each iteration.
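
A rough sketch of that idea follows. It is an illustration under assumptions, not the original poster's code: it swaps in the newer texture object API (`cudaCreateTextureObject`) instead of the texture reference bind/unbind calls used in the question, because a texture object can be passed to the kernel as an ordinary argument. The `scale_by_two` kernel and the `make_linear_texture` helper are hypothetical names, and the kernel is only a stand-in for y = M*x.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for y = M*x: reads its input through a texture fetch.
__global__ void scale_by_two(cudaTextureObject_t in_tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * tex1Dfetch<float>(in_tex, i);
}

// Hypothetical helper: wraps a linear device buffer in a texture object.
static cudaTextureObject_t make_linear_texture(float* dev_ptr, int n) {
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dev_ptr;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc tex_desc = {};
    tex_desc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &tex_desc, nullptr);
    return tex;
}

int main() {
    const int n = 1024;    // arbitrary size
    const int iters = 10;  // even, so the final result lands in x
    std::vector<float> host(n, 1.0f);

    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemcpy(x, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Create both textures once, before the loop, and keep them.
    cudaTextureObject_t tex_x = make_linear_texture(x, n);
    cudaTextureObject_t tex_y = make_linear_texture(y, n);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int i = 0; i < iters; ++i) {
        // Each launch reads one buffer via its texture and writes the other.
        if (i % 2 == 0) scale_by_two<<<blocks, threads>>>(tex_x, y, n);
        else            scale_by_two<<<blocks, threads>>>(tex_y, x, n);
    }

    // The blocking copy waits for all queued launches to finish.
    cudaMemcpy(host.data(), x, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %.0f (expected 1024)\n", host[0]);

    cudaDestroyTextureObject(tex_x);
    cudaDestroyTextureObject(tex_y);
    cudaFree(x);
    cudaFree(y);
    return host[0] == 1024.0f ? 0 : 1;
}
```

With the texture reference API from the question, the same pattern would presumably need two separately declared textures and a kernel variant (or a flag) selecting which one to read, since a texture reference cannot be passed as a kernel argument.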
