Synchronize all threads from all blocks: is cudaThreadSynchronize() the only way?

Hello everyone,

I'm trying to achieve the following:

__global__ void kernel_two_steps_in_one(int* data, int size) {

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < size) {
        // some threads write to data, some do not
    }

    __threadfence(); // I thought this would have worked.
    __syncthreads();

    if (i < size) {
        // Here every thread must read its corresponding data[i]. I get an
        // error at this point; I think some threads get here before the
        // writes from the other threads have completed.
    }
}

I solved this problem by splitting it into two kernels, with a cudaThreadSynchronize() between the calls, as sketched below.
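For reference, the split looks roughly like this (a minimal sketch; the kernel names and the write/read bodies are placeholders):

__global__ void kernel_step_one(int* data, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // some threads write to data, some do not
    }
}

__global__ void kernel_step_two(int* data, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // every thread reads its data[i]; the writes from step one
        // are finished by the time this kernel starts
    }
}

// host side:
kernel_step_one<<<numBlocks, threadsPerBlock>>>(data, size);
cudaThreadSynchronize();
kernel_step_two<<<numBlocks, threadsPerBlock>>>(data, size);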

But I thought I could do it in a single kernel call and that __threadfence() would solve my problem. It didn't: the kernel crashes, I think because some threads still read places in data where other threads haven't written yet.

Is there any solution besides two kernels?

Separate kernel launches are the only way to do inter-block synchronization, because not all blocks are active at the same time.

You do not need the cudaThreadSynchronize() between kernel invocations, though, as kernels in the same stream execute sequentially anyway.
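In other words, something like this is already correct (a sketch reusing the placeholder kernel names from above):

// both launches go into the default stream, so kernel_step_two will not
// start until every block of kernel_step_one has finished
kernel_step_one<<<numBlocks, threadsPerBlock>>>(data, size);
kernel_step_two<<<numBlocks, threadsPerBlock>>>(data, size);

// only synchronize when the host actually needs the result; a blocking
// cudaMemcpy here waits for both kernels implicitly anyway
cudaMemcpy(hostData, data, size * sizeof(int), cudaMemcpyDeviceToHost);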

Thanks, then this is the way to go.

Edit: I'm still confused about __threadfence() and its purpose. I thought that if I put a __threadfence() there, then at that point all threads on the device would wait until the writes to global memory (to data) had been performed, so that from that point on every thread would read correct data; in effect, a synchronization based on memory state.

I am mistaken, I guess. Some explanation beyond the CUDA reference manual would be welcome.

__threadfence() does not block other threads. It forces each thread to wait until its own memory writes have completed and have been pushed back up the memory hierarchy to the appropriate level (depending on whether you use __threadfence(), __threadfence_block(), etc.). This is a much more subtle guarantee than a true synchronization point, because other threads are allowed to progress past the fence instruction at any time.

I'm definitely no expert on when __threadfence() is actually useful; the cases where it helps tend to be non-trivial parallel algorithms where explicit synchronization is not needed, but some kind of memory visibility guarantee is. I do know that if you ever think __threadfence() is the answer to a problem, stop and reevaluate for a moment, because it usually is not. :)
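For what it's worth, the one place I have seen __threadfence() genuinely needed is the "last block done" pattern from the reduction example in the CUDA Programming Guide. This is only a sketch (the per-block reduction is elided and the names are mine): each block writes its partial result, the fence makes that write visible device-wide before the block increments a counter, and whichever block finds itself last can then safely read every partial result.

__device__ unsigned int blocksDone = 0;

__global__ void reduce(const int* data, volatile int* partial, int* total, int size) {
    // ... each block computes its partial sum into partial[blockIdx.x] ...

    __shared__ bool isLastBlock;
    if (threadIdx.x == 0) {
        __threadfence(); // make partial[blockIdx.x] visible device-wide
        // atomicInc wraps at gridDim.x, so 'done' counts 0 .. gridDim.x-1
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        isLastBlock = (done == gridDim.x - 1);
    }
    __syncthreads();

    if (isLastBlock) {
        // only this block gets here, and it can see all partial sums,
        // so it can safely combine partial[0..gridDim.x-1] into *total
        if (threadIdx.x == 0)
            blocksDone = 0; // reset for the next launch
    }
}

Note there is still no blocking anywhere: every block except the last one simply exits, and the correctness comes from the fence plus the atomic, not from any thread waiting on another.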

I'll take that advice, thanks.
