Synchronising between kernel launches: ensuring memory coherence between kernel launches in a for-loop

Just to confirm: is this the correct way to ensure memory coherence between iterative launches of a kernel in a for-loop, where I overwrite some old device memory with updated values computed in the previous kernel launch?

for (unsigned int i = 0; i < N; ++i) {

	myKernel<<<grid, block>>>(d_newVals, d_oldVals);   // illustrative launch; the actual kernel and arguments were elided in the post

	cudaDeviceSynchronize();

	cudaMemcpy(d_oldVals, d_newVals, ... , cudaMemcpyDeviceToDevice);

	cudaDeviceSynchronize();

}

I believe [font=“Courier New”]cudaDeviceSynchronize()[/font] in CUDA 4.0 replaces the now deprecated [font=“Courier New”]cudaThreadSynchronize()[/font] in CUDA 3.2?

Is this right?



You don’t need either of those cudaDeviceSynchronize() calls in that case. If you are launching into the same stream and not using the asynchronous versions of memcpy, then coherence between operations is implicit.
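A minimal sketch of the simplified loop (the kernel name, launch configuration, and byte count here are illustrative placeholders, not from the thread): operations issued into the same (default) stream execute in issue order, so the copy is guaranteed to see the kernel's results without any explicit synchronisation.

```
// Sketch: same-stream ordering makes the per-iteration syncs unnecessary.
// myKernel, grid, block and nbytes are placeholders.
for (unsigned int i = 0; i < N; ++i) {
    myKernel<<<grid, block>>>(d_newVals, d_oldVals);   // queued in the default stream
    cudaMemcpy(d_oldVals, d_newVals, nbytes,
               cudaMemcpyDeviceToDevice);              // same stream: runs after the kernel completes
}
cudaDeviceSynchronize();  // one sync at the end, only if the host must wait for the final iteration
```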

avidday, thanks for your info.

Following your advice, when I remove the [font=“Courier New”]cudaDeviceSynchronize()[/font]'s, the host code appears to iterate asynchronously through 512 (?) loop cycles, then blocks, and thereafter iterates in step with (what appears to be) kernel execution time. So when the printout from the for-loop reports 512, only kernel 0 has just finished executing? I expect that at the very end it will simply block and wait for all 512 items in the stream to finish? (I haven’t tested this.)

The kernel launches and cudaMemcpy’s appear to asynchronously fill up the stream, which then executes the queue in sequence? Is that what you mean?

If I need the host to print out which kernel iteration it’s up to, I need those [font=“Courier New”]cudaDeviceSynchronize()[/font]'s in there right?

The cudaMemcpy is a blocking call, so you get implicit synchronization after each cudaMemcpy within the loop. So you don’t need any explicit synchronization calls to make that loop block at each loop iteration. If you are using the WDDM driver on Windows, there might be some operation batching going on, but on a sane platform, you don’t need to do anything.

How can I check if I’m using WDDM driver? And how can I determine/set this batching behaviour?


If you are using Windows Vista, 7, or Server 2008, then you are using the WDDM driver, unless you have a Tesla and are using the dedicated TCC compute driver. I can’t help you with the driver runtime batching behaviour, beyond saying I understand it exists in recent versions of the toolkit for reasons of latency management. I don’t use CUDA on Windows, sorry.
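As a practical check, nvidia-smi can report and (on TCC-capable boards) switch the driver model on Windows; the exact flags depend on your driver version, so treat this as a hedged sketch rather than a guaranteed recipe:

```
:: Query device state; look for the "Driver Model" section in the output
nvidia-smi -q

:: Switch a TCC-capable GPU (e.g. a Tesla) to the TCC driver model
:: (0 = WDDM, 1 = TCC); needs administrator rights and typically a reboot
nvidia-smi -dm 1
```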

I am using a Tesla C1060 on Win 7, so I know it’s TCC capable.
How can I check whether it is operating in TCC mode?
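One way to check from code: the runtime’s cudaDeviceProp struct exposes a tccDriver field, which is 1 when the device is running under the TCC driver (and 0 under WDDM, or on non-Windows platforms). A minimal sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0; adjust if you have several GPUs
    printf("%s is using the %s driver\n",
           prop.name, prop.tccDriver ? "TCC" : "WDDM (or a non-Windows)");
    return 0;
}
```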

And what are the benefits of TCC over WDDM, or the other way around?