Kernel synchronization in CUDA not fully explained in the Programming Guide

Hi all,
I’m almost finished with the CUDA Programming Guide, and I’ve realized that it leaves out some important details, which I’m now trying to work out by experimentation :S

I want to know what happens in the following situations.

Case 1

Kernel_0<<<… , …>>>(…)
MemCpy(…);

Does MemCpy wait for Kernel_0 to finish execution? The only thing mentioned in the Programming Guide is that memcpys are synchronous, so control returns to the host only after the memcpy has finished. It doesn’t say anything about what I’m asking!

From experimenting, my guess is that it waits for Kernel_0 to finish, but I’m not sure. This should be added to the Programming Guide.

Case 2

Kernel_0<<<… , …>>>(…)
Kernel_1<<<… , …>>>(…)

Same question as above. I tried to figure this out too: Kernel_1 seems to wait for Kernel_0 to finish, but again I’m not sure. Can anyone confirm that?

Case 3

for(int i=0; i < …; i++)
(*Kernels[i])<<<… , … >>>(…)

Assume that it’s possible to launch kernels through function pointers (which isn’t mentioned in the Programming Guide either). How many asynchronous calls will the device accept before it halts, crashes, or waits for previous calls to finish?
Moreover, assuming the calls are synchronized, does the synchronization happen on the device, or does the call to Kernels[i] block on the host until Kernels[i-1] finishes execution?

If you’re an NVIDIA employee and you see this, could you please suggest adding these things to the guide?

Thanks

You should read the chapter about streams in the Programming Guide. To sum it up: a kernel can only overlap with a memcopy, and only if the memcopy is asynchronous (e.g. cudaMemcpyAsync()) and streams are used. You need both at the same time.
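A minimal sketch of what that chapter describes, assuming a device that can actually overlap copies with kernel execution (not every GPU can). The kernel addOne, the function runOverlapDemo, and all buffer names are made up for illustration. Note that asynchronous copies need page-locked (pinned) host memory, hence cudaMallocHost. cudaDeviceSynchronize() is the current runtime call; older toolkits used cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns the number of mismatched elements (0 means the results look right).
int runOverlapDemo()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    // Pinned host buffers: required for cudaMemcpyAsync to be truly asynchronous.
    int *hostA = nullptr, *hostB = nullptr;
    cudaMallocHost((void **)&hostA, bytes);
    cudaMallocHost((void **)&hostB, bytes);
    for (int i = 0; i < n; ++i) { hostA[i] = i; hostB[i] = 1; }

    int *devA = nullptr, *devB = nullptr;
    cudaMalloc((void **)&devA, bytes);
    cudaMalloc((void **)&devB, bytes);
    cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemset(devB, 0, bytes);
    cudaDeviceSynchronize();  // make sure the setup is done before using the streams

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernel in stream s0, unrelated copy in stream s1: these two MAY overlap.
    addOne<<<(n + 255) / 256, 256, 0, s0>>>(devA, n);
    cudaMemcpyAsync(hostB, devB, bytes, cudaMemcpyDeviceToHost, s1);

    // Same stream as the kernel, so this copy is ordered after the kernel.
    cudaMemcpyAsync(hostA, devA, bytes, cudaMemcpyDeviceToHost, s0);

    // The host must wait for both streams before touching the results.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hostA[i] != i + 1) ++mismatches;  // kernel result
        if (hostB[i] != 0) ++mismatches;      // independent copy
    }

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(devA);
    cudaFree(devB);
    cudaFreeHost(hostA);
    cudaFreeHost(hostB);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", runOverlapDemo());
    return 0;
}
```

Whether the copy in s1 really runs while the kernel is executing depends on the hardware; the sketch only shows the arrangement that makes overlap possible. Error checking is left out to keep it short.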

To answer your questions (all related to streams):

  1. It will wait for the kernel to finish, because a kernel can only overlap with a memcopy if the memcopy function is async AND “executed” by a different stream than the kernel.
  2. Kernels can only be executed one after another. With Fermi (the next NVIDIA GPU generation) this will change…
  3. Somewhere in the forums there’s a topic about kernels failing when too many of them are queued in a short time, but you may have to test when that happens on your rig. Again, all these kernels will be executed one after another on current hardware. The “sync” happens on the device (it’s not actually a sync, because the kernels are queued and simply wait their turn), and your CPU is free for other work right after the launch call returns, before the kernel has finished.
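The three answers above can be checked with a small experiment. This is only a sketch under assumptions: the kernels setValue and doubleValue stand in for Kernel_0 and Kernel_1, runOrderingDemo is a made-up helper, everything runs in the default stream, and error checking is omitted. It also assumes that launching through a host-side array of kernel function pointers is accepted by your toolkit version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void setValue(int *x)    { *x = 1; }   // stands in for Kernel_0
__global__ void doubleValue(int *x) { *x *= 2; }  // stands in for Kernel_1

// Returns 0 if every case produced the value that in-order execution implies.
int runOrderingDemo()
{
    int *d = nullptr;
    int h = 0;
    cudaMalloc((void **)&d, sizeof(int));
    cudaMemset(d, 0, sizeof(int));

    // Case 1: kernel followed by a plain cudaMemcpy in the same (default) stream.
    // The copy is queued behind the kernel, so it should see the kernel's write.
    setValue<<<1, 1>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    int case1 = h;  // 1 if the copy waited for the kernel

    // Case 2: two kernels back to back. The second starts after the first ends.
    setValue<<<1, 1>>>(d);
    doubleValue<<<1, 1>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    int case2 = h;  // 2 if doubleValue ran after setValue

    // Case 3: launches through function pointers in a loop. Each launch returns
    // to the host right away; the ordering is enforced by the queue on the device.
    void (*kernels[2])(int *) = { setValue, doubleValue };
    for (int i = 0; i < 2; ++i)
        kernels[i]<<<1, 1>>>(d);
    cudaDeviceSynchronize();  // host waits here, not at the launches
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    int case3 = h;  // 2 if the kernels ran in launch order

    cudaFree(d);
    printf("case1=%d case2=%d case3=%d\n", case1, case2, case3);
    return (case1 == 1 && case2 == 2 && case3 == 2) ? 0 : 1;
}

int main()
{
    return runOrderingDemo();
}
```

The sketch does not probe how many launches can be queued before the host call starts blocking; that limit is not given in this thread and would have to be measured on your own hardware.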