Kernel Synchronization in CUDA not fully explained in Programming Guide

Ahmed_Tarek · February 25, 2010, 7:18am

Hi all,
I’m almost finished with the CUDA Programming Guide and I realized that it’s missing some important details, which I’m now trying to understand by experimentation :S

I want to know what happens in the following situations

Case 1

Kernel_0<<<… , …>>>(…)
MemCpy(…);

Does MemCpy wait for Kernel_0 to finish execution? The only thing the Programming Guide mentions is that memcpys are synchronous, so control returns only after the memcpy has finished. That doesn’t say much about what I’m asking!

From experimenting, I guess it waits for Kernel_0 to finish, but I’m not sure. This should be added to the Programming Guide.

Case 2

Kernel_0<<<… , …>>>(…)
Kernel_1<<<… , …>>>(…)

Same as above. I also tried to figure this out: Kernel_1 seems to wait for Kernel_0 to finish, but again I’m not sure. Can anyone confirm that?

Case 3

for(int i=0; i < …; i++)
(*Kernels[i])<<<… , … >>>(…)

Assume that it’s possible to use function pointers (which isn’t mentioned in the Programming Guide either). How many asynchronous calls would the device allow before it halts, crashes, or waits for previous calls to finish?
Moreover, assuming the calls are synchronized, does the synchronization happen on the device, or does calling Kernels[i] block the host until Kernels[i-1] finishes execution?

If you’re an NVIDIA employee and you see this, could you please suggest adding these things to the guide?

Thanks

ONeill · February 25, 2010, 10:51am

You should read the chapter about streams in the Programming Guide. To sum it up, you can only overlap kernels with async memcopies (e.g. cudaMemcpyAsync()), and only when using streams (you need both at the same time).
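To make the streams point concrete, here is a minimal sketch of a kernel in one stream and an async copy in another. The kernel `scale`, the buffer names, and the sizes are made up for illustration. It assumes a device that supports concurrent copy and execution, and it uses pinned host memory, which async copies need in order to actually overlap. Whether the two operations really overlap depends on the hardware; the code only makes it possible.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a, *h_b, *d_a, *d_b;
    // Pinned (page-locked) host buffers: required for copies that are truly async.
    cudaMallocHost((void **)&h_a, bytes);
    cudaMallocHost((void **)&h_b, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 3.0f; }

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Blocking copy: the input for the kernel is on the device when this returns.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    // Kernel in stream s0, async copy of an unrelated buffer in stream s1.
    // Different streams plus an async copy is what allows the overlap.
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);

    // Both calls above return to the host immediately; wait for each stream.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    printf("h_a[0] = %f\n", h_a[0]);
    int ok = (h_a[0] == 2.0f);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    return ok ? 0 : 1;
}
```

If both operations were issued without a stream argument, or with a plain cudaMemcpy, the copy would simply be ordered after the kernel.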

To answer your questions (all related to streams):

Case 1: It will wait for the kernel to finish, because a kernel can only overlap with a memcopy if the memcopy function is async AND “executed” by a different stream than the kernel.
Case 2: Kernels can only be executed one after another. With Fermi (the next NVIDIA GPU generation) this will change…
Case 3: Somewhere in the forums there’s a topic about kernels failing when too many of them are queued in a short time, but you may have to test when this happens on your rig. Again, all these kernels will be executed one after another on current hardware. The “sync” happens on the device (it’s not actually a sync, since the kernels are simply queued and have to wait their turn), and your CPU is free for other work right after the call, before the kernel has finished.
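A small experiment along the lines of the three cases, as a sketch rather than anything official. The kernels `addOne` and `timesTen` and the `KernelFn` typedef are hypothetical names. Everything is issued without a stream argument, so all work goes into the same default stream. The launches go through host-side function pointers, as in Case 3.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels whose combined result depends on execution order.
__global__ void addOne(int *x)   { *x += 1; }
__global__ void timesTen(int *x) { *x *= 10; }

// Host-side pointer type for kernels taking an int* argument.
typedef void (*KernelFn)(int *);

int main()
{
    int h_x = 0;
    int *d_x;
    cudaMalloc((void **)&d_x, sizeof(int));
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);

    KernelFn kernels[] = { addOne, timesTen, addOne };

    // Each launch is queued and control returns to the host right away;
    // the kernels themselves run one after another on the device.
    for (int i = 0; i < 3; ++i)
        kernels[i]<<<1, 1>>>(d_x);

    // Blocking copy in the same (default) stream: it starts only after
    // the queued kernels have finished, so no explicit sync is needed here.
    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);

    // In issue order: ((0 + 1) * 10) + 1 = 11
    printf("result = %d\n", h_x);

    cudaFree(d_x);
    return h_x == 11 ? 0 : 1;
}
```

The sketch does not probe how many launches can be queued before the host call starts to block or fail; that would have to be measured on the specific setup.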
