Dynamic Parallelism Memory Consistency across Thread Blocks?

Hi all,

My question is basically in the title. I am aware that there is no synchronization across thread blocks.

Still, I am wondering: how can I manage memory consistency while using dynamic parallelism? My sample code looks like this:

__global__ void change_some_global_data(int *data, int index)
{
    int myId = threadIdx.x + blockIdx.x * blockDim.x;
    data[myId + index] = 0;     // some data
}

__global__ void parent_launch(int *data)
{
    // do something
    int myId = threadIdx.x + blockIdx.x * blockDim.x;
    change_some_global_data<<< 128, 256 >>>(data, myId);
    cudaDeviceSynchronize();    // wait for the child kernel?

    __syncthreads();            // make sure memory changes only within the current block?

    // HERE! When I want to access some index of data, can every thread see the change?
    data[some_index];
}

void host_launch(int *data)
{
    parent_launch<<< 128, 256 >>>(data);
}

void host_launch(int *data) 
{
    parent_launch<<< 128, 256 >>>(data);
}

given that it is seemingly global memory, i do not fully comprehend why you are concerned about memory consistency (when using dynamic parallelism)

and the sample code seems flawed:

even if only 1 thread per block launches the child kernel change_some_global_data, you would end up launching it 128 times, each launch with 128 blocks (128 x 128 child blocks in total), if i am not mistaken

if you want to “make sure memory changes only within the current block”, i would think that it much depends on how a block accesses the global memory array, and that you can simply pass the child an updated pointer that points to that block's own section of the array
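a minimal sketch of the updated-pointer idea, assuming each block owns a contiguous 256-element slice of data (the offset arithmetic and the placeholder write are my assumptions):

```cuda
__global__ void change_some_global_data(int *section)
{
    // the child only ever sees the 256-element slice it was handed
    section[threadIdx.x] = 0;   // some data
}

__global__ void parent_launch(int *data)
{
    if (threadIdx.x == 0)
    {
        // pass the child a pointer offset to this block's own slice,
        // so no block ever writes outside its section
        change_some_global_data<<< 1, 256 >>>(data + blockIdx.x * blockDim.x);
    }
}
```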

Actually, I was in a hurry; that's why I posted flawed code. It is fixed now.

My only concern is this: after the data array is changed by the child kernel, how can I make sure that every thread of the parent_launch kernel sees these changes?

I put __syncthreads() at Line 14, but it only synchronizes the threads within a single block.

If you want global synchronization, I think you need a separate kernel in the same stream.

by ensuring that the very first parent kernel block, and only the very first parent kernel block, launches the child kernel, to update the entire array (use an atomic flag stored in global memory)

or by ensuring that the entire array is updated beforehand (the child kernel becomes a parent kernel, issued prior to your current parent kernel, and you do not use dynamic parallelism)
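a rough sketch of the first option (the flag name, the polling loop, and passing 0 as the index are my assumptions; note that polling across blocks is only safe if all parent blocks are resident on the device at the same time):

```cuda
__device__ volatile int done_flag = 0;   // hypothetical completion flag

__global__ void parent_launch(int *data)
{
    if ((blockIdx.x == 0) && (threadIdx.x == 0))
    {
        // only the very first block launches the child, over the entire array
        change_some_global_data<<< 128, 256 >>>(data, 0);
        cudaDeviceSynchronize();    // wait for the child to finish
        __threadfence();            // make the child's writes visible device-wide
        done_flag = 1;              // publish completion
    }

    if (threadIdx.x == 0)
    {
        while (done_flag == 0);     // every other block polls the flag
    }
    __syncthreads();

    // from here on, every thread of every block can read the updated data[]
}
```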

the initial point was: if you have 1 thread per parent kernel block launch the child kernel, you can not have the child kernel’s dimensions as <<<128, 256>>>, it should rather be <<<1, 256>>> in my view
your parent kernel has 128 blocks; if only one block launches the child, as opposed to every block, you may have a child kernel dimension of <<<128, 256>>>

Thank you very much for the quick replies.

@MutantJohn How can I use streams for this example?

@little_jimmy

I know: if I change my parent kernel to a single block, I can ensure it all.

However, my scenario is that I have a parent_launch kernel of <<<128, 256>>>, and the entire data array is changed successfully by the change_some_global_data child kernel. After these changes, how can I ensure that any thread of parent_launch (in any block) sees the same entire data array?

“I know: if I change my parent kernel to a single block, I can ensure it all.”

i am saying this is wrong:

change_some_global_data<<< 128 , 256 >>>(data)

because you would likely have this:

if (threadIdx.x == 0)
{
    change_some_global_data<<< 128, 256 >>>
}

it should be:

if (threadIdx.x == 0)
{
    change_some_global_data<<< 1, 256 >>>
}

but even that will not give you what you want

why do you need dynamic parallelism, what is wrong with:

void host_launch(int *data)
{
    change_some_global_data<<< 128, 256 >>>

    parent_launch<<< 128, 256 >>>(data);
}

the only way you can keep dynamic parallelism in the above is if each thread of parent_launch reads only the element written by its corresponding thread of change_some_global_data, such that synchronization outside of a block does not matter

All kernels run in a stream by default, and each stream executes its tasks in order.

Basically, what I meant was, if you want true global synchronization in CUDA, you need to use a separate kernel.

This example from above is exactly what I’m talking about:

void host_launch(int *data)
{
    change_some_global_data<<< 128, 256 >>>
    parent_launch<<< 128, 256 >>>(data);
}

Dynamic parallelism has its uses, I suppose, but it’s not always the right tool.

Anyway, here’s some info on streams:


And here’s the same blog series, on dynamic parallelism, for some good applications:

Thank you for your quick answers.

@MutantJohn
Actually, what I asked was: how can I use streams inside a kernel in order to synchronize CUDA grids?

@little_jimmy

If I change my code like this, I can ensure that all threads of the parent kernel are aware of all the changes. That is what I meant by “I know: if I change my parent kernel to a single block, I can ensure it all.”

__global__ void parent_launch(int* data)
{
    change_some_global_data<<< 128, 256 >>>(data, myId);
    cudaDeviceSynchronize();
    __syncthreads();
    // at this point every thread of the parent can see the changes to the data array.
}

void host()
{
    parent_launch<<< 1, 256 >>>(data);
}

But I would like to launch the parent kernel with many blocks, like parent_launch<<<128, 256>>>.

Maybe splitting the kernel into two is a good idea. But I am just wondering whether it is possible or not; I am only concerned about memory consistency inside the parent kernel after the child kernel has changed something.

By the way,

// in my case every thread will invoke the child, like this
change_some_global_data<<< 128, 256 >>>
// But why did you say the code below is wrong?
if (threadIdx.x == 0)
{
    change_some_global_data<<< 128, 256 >>>
}

change_some_global_data<<< 128, 256 >>>(data, myId);

passing myId to change_some_global_data is also wrong, as myId is a local variable that is not even declared in this kernel

void host()
{
    parent_launch<<< 1, 256 >>>(data);
}

__global__ void parent_launch(int* data)
{
    change_some_global_data<<< 128, 256 >>>(data);
}

256 x 128 x 256 threads execute change_some_global_data

void host()
{
    parent_launch<<< 128, 256 >>>(data);
}

__global__ void parent_launch(int* data)
{
    if ((threadIdx.x == 0) && (blockIdx.x == 0))
        change_some_global_data<<< 128, 256 >>>(data);
}

128 x 256 threads execute change_some_global_data

void host()
{
    parent_launch<<< 128, 256 >>>(data);
}

__global__ void parent_launch(int* data)
{
    if (threadIdx.x == 0)
        change_some_global_data<<< 1, 256 >>>(data);
}

128 x 256 threads execute change_some_global_data

void host()
{
    parent_launch<<< 128, 256 >>>(data);
}

__global__ void parent_launch(int* data)
{
    change_some_global_data<<< 128, 256 >>>(data);
}

128 x 256 x 128 x 256 threads execute change_some_global_data

@little_jimmy

Thank you. My question is: after the last case you wrote above, can any thread of the parent kernel see the changes to the data array at any time?

For example, let’s say thread 0 in block 0 finishes its job faster than the others. How can I guarantee that thread 0 in block 0 sees the changes made by other threads (such as thread 250 in block 120)?

Or with code like this:

void host()
{
    parent_launch<<< 128, 256 >>>(data, out);
}

__global__ void parent_launch(int* data, int* out)
{
    change_some_global_data<<< 128, 256 >>>(data);

    // Let's assume access_all_data is a device function that computes
    // something with the data array and writes all its results into out.
    access_all_data(data, out);
}

it depends on how the threads access the arrays data/out

if the threads of change_some_global_data and access_all_data access the arrays in the same way (e.g. (blockDim.x * blockIdx.x) + threadIdx.x), you could get away with

void host()
{
    parent_launch<<< 128, 256 >>>(data, out);
}

__global__ void parent_launch(int* data, int* out)
{
    if (threadIdx.x == 0)
    {
        change_some_global_data<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();    // wait for this block's child to finish
    }

    __syncthreads();    // make the child's writes visible to the whole block

    access_all_data(data, out);
}

otherwise, you are best off with 2 separate kernels i think
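for completeness, a sketch of the 2-kernel version (wrapping access_all_data in a kernel is my assumption about your code):

```cuda
// hypothetical kernel wrapper around your access_all_data device function
__global__ void access_all_data_kernel(int *data, int *out)
{
    access_all_data(data, out);
}

void host(int *data, int *out)
{
    // both launches go to the same (default) stream, so the second kernel
    // starts only after every write of the first kernel has completed
    change_some_global_data<<< 128, 256 >>>(data);
    access_all_data_kernel<<< 128, 256 >>>(data, out);
}
```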