Dynamic Parallelism 2

Dear all

See the below example:

*************************Parent Kernel
__global__ void parent(…)
{
int i=threadIdx.x;
int t=blockIdx.x;

if ((i==0) && (t==0))
children<<<1024,128>>>(…);
cudaDeviceSynchronize();

More Code****

}
**************************call parent
parent<<<2,128>>>(…);


My doubt is (perhaps a silly question):

Is it guaranteed that “More Code**” runs after the children kernel finishes, in all threads (and so in all blocks) of the grid?

Thanks

Luis Gonçalves

No

cudaDeviceSynchronize in this case prevents any subsequent code in that thread from running until the kernel launched by that thread completes (if one was launched). No other semantics are implied. It is not a device-wide sync.
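To illustrate that per-thread scope, here is a sketch (the `children` kernel and `data` parameter are placeholders; also note that device-side `cudaDeviceSynchronize()` was deprecated in CUDA 11.6 and removed in CUDA 12.0, so this pattern only applies to older toolkits):

```cuda
__global__ void children(int *data) { /* child work */ }

__global__ void parent(int *data)
{
    int i = threadIdx.x;
    int t = blockIdx.x;

    if ((i == 0) && (t == 0)) {
        children<<<1024, 128>>>(data);
        // Waits only in THIS thread for the child grid launched by
        // this thread. Threads in other warps and blocks do not wait.
        cudaDeviceSynchronize();
    }
    // Any thread other than (0,0) may reach this point before, during,
    // or after the child kernel runs -- there is no device-wide ordering.
    // More Code
}
```

(Compiling dynamic parallelism requires `nvcc -rdc=true -lcudadevrt`.)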

Thanks for the answer

But is there any sync instruction that does a device-wide sync? Or is there any way to guarantee that “**More Code” runs only after the children kernel has finished, in all threads of all blocks?

Thanks

Luis Gonçalves

If you want to do a device-wide sync, the proper method(s) in CUDA are:

  1. Use the kernel launch itself as a device-wide sync, i.e. break up the code into 2 or more kernels.
  2. Use Cooperative Groups. However, the device-wide sync in cooperative groups is only available for devices of cc 6.x or higher. (And cooperative groups requires CUDA 9 or higher.)

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups

(and there are various CUDA sample codes)
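A sketch of option 2 (names here are illustrative, not from the thread). Note that `grid.sync()` synchronizes the threads of the launching grid itself; it does not, by itself, wait for a child grid launched with dynamic parallelism, so the work has to be restructured so the grid itself does it:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void parent(int *data)
{
    cg::grid_group grid = cg::this_grid();

    // ... phase 1: work done by all threads of the grid ...

    grid.sync();   // device-wide barrier: every thread of every block waits

    // ... "More Code": guaranteed to run after phase 1 in ALL blocks ...
}

// Host side: a cooperative kernel must be launched with
// cudaLaunchCooperativeKernel rather than <<<...>>>, and all blocks
// must be able to be resident on the device simultaneously.
void launch(int *d_data)
{
    void *args[] = { &d_data };
    cudaLaunchCooperativeKernel((void *)parent, dim3(2), dim3(128),
                                args, 0, 0);
}
```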

Your question is a bit strange:

“after the terminus of children in all threads of all the blocks”

but your particular code guarantees that the child kernel will only run in one thread, in one block.

Yes, badly explained. Thread zero in block zero launches the kernel, but I want all threads to stop and resume only after the children kernel finishes.

Break into two or more kernels: one per block? Or is there no other way?

Break in two or more kernels:

*************************Parent Kernel
__global__ void parent(.......)
{
int i=threadIdx.x;
int t=blockIdx.x;

if ((i==0) && (t==0))
children<<<1024,128>>>(.................);
cudaDeviceSynchronize();
}

__global__ void parent2(.......)
{
****More Code********

}
**************************call parent
parent<<< 1,  1, 0, streamX>>>(............);
parent2<<<2,128, 0, streamX>>>(............);
******************************************

I’ve now guaranteed that More Code runs (in every thread of every block) only after children completes.
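A self-contained host-side sketch of that two-kernel approach (kernel bodies are placeholders; compile with `nvcc -rdc=true -lcudadevrt`). The ordering guarantee comes from two rules: a parent grid is not considered finished until all of its child grids complete, and kernels issued to the same stream execute in issue order:

```cuda
__global__ void children() { /* child work */ }

__global__ void parent()
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        children<<<1024, 128>>>();
    // The parent grid only completes once its child grid has completed,
    // so parent2 (issued to the same stream) cannot start before then.
}

__global__ void parent2() { /* "More Code" */ }

int main()
{
    cudaStream_t streamX;
    cudaStreamCreate(&streamX);

    parent <<<1,   1, 0, streamX>>>();
    parent2<<<2, 128, 0, streamX>>>();

    cudaStreamSynchronize(streamX);
    cudaStreamDestroy(streamX);
    return 0;
}
```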

Yes, I was evaluating alternatives to that solution, to avoid breaking the work into several kernels on the host (even with different streams): alternatives using Dynamic Parallelism.

Thanks for the answers