Hi all,
I'm almost finished with the CUDA programming guide and I realized it's missing some important details, which I'm now trying to figure out by experimentation :S
I want to know what happens in the following situations:
Case 1
Kernel_0<<<… , …>>>(…)
MemCpy(…);
Does the MemCpy wait for Kernel_0 to finish execution? The only thing the programming guide mentions is that memcpys are synchronous, so control returns to the host only after the copy has finished. It doesn't say anything about what I'm asking!
From experimenting, I believe the copy does wait for Kernel_0 to finish, but I'm not certain. This should be spelled out in the programming guide.
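For reference, here is roughly the experiment I ran (kernel name, array names, and sizes are all mine, not from the guide). The kernel launch returns to the host immediately, but in my tests the device-to-host copy doesn't start until Kernel_0 is done, presumably because both are issued to the same default stream:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void Kernel_0(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] = 2.0f * d_data[i];   // arbitrary work
}

int main(void)
{
    const int n = 1 << 20;
    float *h_data = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++)
        h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch is asynchronous: control returns to the host here
    // before the kernel has finished.
    Kernel_0<<<(n + 255) / 256, 256>>>(d_data, n);

    // In my experiments this copy only begins after Kernel_0 completes,
    // so h_data always holds the kernel's results afterwards.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```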
Case 2
Kernel_0<<<… , …>>>(…)
Kernel_1<<<… , …>>>(…)
Same question as above. I also tried to figure this one out: Kernel_1 seems to wait for Kernel_0 to finish, but I'm not sure either, so can anyone confirm?
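This is the kind of test I used to convince myself (again, my own made-up kernels): Kernel_1 reads a value that Kernel_0 writes, and if the two launches are serialized, Kernel_1 should always observe the write:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void Kernel_0(int *d_flag)
{
    *d_flag = 1;   // write a marker value
}

__global__ void Kernel_1(const int *d_flag, int *d_result)
{
    // If launches in the same stream are serialized, this always
    // sees the value Kernel_0 wrote.
    *d_result = *d_flag;
}

int main(void)
{
    int *d_flag, *d_result, h_result = -1;
    cudaMalloc((void **)&d_flag, sizeof(int));
    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    Kernel_0<<<1, 1>>>(d_flag);
    Kernel_1<<<1, 1>>>(d_flag, d_result);

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", h_result);   // I always observed 1 here

    cudaFree(d_flag);
    cudaFree(d_result);
    return 0;
}
```

Of course always observing 1 doesn't strictly prove serialization; it's just consistent with it, which is why I'd like confirmation.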
Case 3
for(int i=0; i < …; i++)
(*Kernels[i])<<<… , … >>>(…)
Assume it's possible to launch kernels through function pointers (which the programming guide doesn't mention either). How many asynchronous launches will the device accept before it halts/crashes/blocks waiting for previous calls to finish?
Moreover, does the synchronization happen on the device (assuming the calls are indeed serialized), or does the call to Kernels[i] block on the host until Kernels[i-1] finishes execution?
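Here's a sketch of how I'd probe the launch-queue depth (I haven't verified that launching through a function pointer like this compiles on every toolkit version, so treat it as an assumption). The idea is to time the launch loop itself: if the loop time jumps once enough launches are queued, the host must be blocking when the queue fills:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

__global__ void KernelA(void)
{
    // Spin a bit so each launch takes measurable time on the device.
    for (volatile int i = 0; i < 1000000; i++) { }
}

__global__ void KernelB(void)
{
    for (volatile int i = 0; i < 1000000; i++) { }
}

typedef void (*KernelPtr)(void);

int main(void)
{
    // Host-side array of pointers to __global__ functions.
    KernelPtr Kernels[2] = { KernelA, KernelB };

    clock_t t0 = clock();
    for (int i = 0; i < 1000; i++)
        (*Kernels[i % 2])<<<1, 1>>>();
    clock_t t1 = clock();

    // If launches were purely asynchronous, the loop above would be
    // near-instant regardless of the count; a loop time that grows
    // with the number of launches suggests the host blocks once the
    // launch queue is full.
    printf("launch loop: %ld ticks\n", (long)(t1 - t0));

    cudaThreadSynchronize();   // wait for all queued kernels to finish
    return 0;
}
```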
If any NVIDIA employees see this, could you please suggest adding these things to the guide?
Thanks