Synchronising between kernel launches: ensuring memory coherence between kernel launches in a for-loop

Just to confirm: is this the correct way to ensure memory coherence between iterative launches of a kernel in a for-loop, where I overwrite some old device memory with updated values computed in the previous kernel launch?

for (unsigned int i = 0; i < N; ++i) {

	myKernel<<<grid, block>>>(d_newVals, d_oldVals);   // illustrative launch; the actual kernel and arguments were elided in the post

	cudaDeviceSynchronize();

	cudaMemcpy(d_oldVals, d_newVals, ... , cudaMemcpyDeviceToDevice);

	cudaDeviceSynchronize();

}

I believe [font=“Courier New”]cudaDeviceSynchronize()[/font] in CUDA 4.0 replaces the now deprecated [font=“Courier New”]cudaThreadSynchronize()[/font] in CUDA 3.2?

Is this right?



You don’t need either of those cudaDeviceSynchronize() calls in that case. If you are launching into the same stream and not using the asynchronous versions of memcpy, then coherence between operations is implicit.
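A minimal sketch of the simplified loop (the kernel name, launch configuration, and byte count here are illustrative placeholders, not from the thread): operations issued into the same (default) stream execute in issue order, so the copy is guaranteed to see the kernel's results without any explicit synchronisation.

```
// Sketch: same-stream ordering makes the per-iteration syncs unnecessary.
// myKernel, grid, block and nbytes are placeholders.
for (unsigned int i = 0; i < N; ++i) {
    myKernel<<<grid, block>>>(d_newVals, d_oldVals);   // queued in the default stream
    cudaMemcpy(d_oldVals, d_newVals, nbytes,
               cudaMemcpyDeviceToDevice);              // same stream: runs after the kernel completes
}
cudaDeviceSynchronize();  // one sync at the end, only if the host must wait for the final iteration
```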

avidday, thanks for your info.

Following your advice, when I remove the [font=“Courier New”]cudaDeviceSynchronize()[/font]'s, the host code appears to iterate asynchronously through 512 (?) loop cycles, then blocks, and thereafter iterates in step with (what appears to be) kernel execution time. So when the printout from the for-loop reports 512, only kernel 0 has just finished executing? I expect that at the very end it will simply block and wait for all 512 items in the stream to finish? (I haven’t tested this.)

The kernel launches and cudaMemcpy’s appear to asynchronously fill up the stream, which then executes the queue in sequence? Is that what you mean?

If I need the host to print out which kernel iteration it’s up to, I need those [font=“Courier New”]cudaDeviceSynchronize()[/font]'s in there right?

The cudaMemcpy is a blocking call, so you get implicit synchronization after each cudaMemcpy within the loop. So you don’t need any explicit synchronization calls to make that loop block at each loop iteration. If you are using the WDDM driver on Windows, there might be some operation batching going on, but on a sane platform, you don’t need to do anything.

How can I check if I’m using WDDM driver? And how can I determine/set this batching behaviour?


If you are using Windows Vista, 7, or Server 2008, then you are using the WDDM driver, unless you have a Tesla and are using the dedicated TCC compute driver. I can’t help you with the driver runtime batching behaviour, beyond saying I understand it exists in recent versions of the toolkit for reasons of latency management. I don’t use CUDA on Windows, sorry.
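As a practical check, nvidia-smi can report and (on TCC-capable boards) switch the driver model on Windows; the exact flags depend on your driver version, so treat this as a hedged sketch rather than a guaranteed recipe:

```
:: Query device state; look for the "Driver Model" section in the output
nvidia-smi -q

:: Switch a TCC-capable GPU (e.g. a Tesla) to the TCC driver model
:: (0 = WDDM, 1 = TCC); needs administrator rights and typically a reboot
nvidia-smi -dm 1
```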

I am using a Tesla C1060 on Win 7, so I know it’s TCC capable.
How can I check whether it is operating in TCC mode?
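One way to check from code: the runtime’s cudaDeviceProp struct exposes a tccDriver field, which is 1 when the device is running under the TCC driver (and 0 under WDDM, or on non-Windows platforms). A minimal sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0; adjust if you have several GPUs
    printf("%s is using the %s driver\n",
           prop.name, prop.tccDriver ? "TCC" : "WDDM (or a non-Windows)");
    return 0;
}
```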

And what are the benefits of TCC over WDDM, or the other way around?