__global__ void kernel_two_steps_in_one(int* data, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < size ){
        //some threads write to data, some do not
    }
    __threadfence(); //I thought this would have worked.
    __syncthreads();
    if( i < size ){
        //however, here every thread must read its own data[i]. I get an error here; I think some threads reach this point before the writes have completed.
    }
}
I solved this problem by splitting it into two kernels and calling cudaThreadSynchronize() between the two launches.
But I thought I could do it with a single kernel call, and that __threadfence() would solve my problem. It didn't: the kernel crashes, I think because some threads still read locations in data that other threads haven't written yet.
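For reference, the two-kernel fix can be sketched like this (kernel names and bodies are hypothetical placeholders; the point is the launch pattern, not the arithmetic):

```cuda
// Sketch of the two-kernel workaround. Splitting the work across two
// launches is what provides the device-wide ordering: a kernel on a
// stream does not start until the previous kernel on that stream has
// finished, so kernel_read sees every write made by kernel_write.
__global__ void kernel_write(int* data, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // some threads write to data[i], some do not
        data[i] = i;
    }
}

__global__ void kernel_read(const int* data, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // every thread reads its own data[i]; all writes from
        // kernel_write are guaranteed to be visible here
        int v = data[i];
        (void)v; // placeholder for the real second step
    }
}

// Host side:
//   kernel_write<<<blocks, threads>>>(d_data, size);
//   kernel_read<<<blocks, threads>>>(d_data, size);
//   cudaDeviceSynchronize(); // modern name; cudaThreadSynchronize() is deprecated
```

Note that the synchronize call is only needed before the host touches the results; the two kernels are already ordered with respect to each other because they run on the same stream.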
Edit: I'm still confused about __threadfence() and its purpose. I thought that if I put __threadfence() there, all threads on the device would wait at that point for the pending writes to global memory (to data) to complete, so that from that point on threads would read correct data; like a synchronization based on memory state.
I guess I am mistaken. Some explanation beyond the CUDA reference manual would be welcome.
__threadfence() does not block other threads. It forces the calling thread to wait until its own memory writes have completed and been pushed back up the memory hierarchy to the appropriate level (device-wide for __threadfence(), block-wide for __threadfence_block(), etc.). This is a much more subtle guarantee than a true synchronization point, because threads are allowed to progress past the fence instruction at any time.
I'm definitely no expert on when __threadfence() is actually useful; the cases tend to be non-trivial parallel algorithms where explicit synchronization is not needed but some kind of memory-ordering guarantee is. I do know that if you ever think __threadfence() is the answer to a problem, stop and re-evaluate for a moment, because it usually is not. :)
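One place where __threadfence() genuinely is the answer is the "last block" pattern shown in the CUDA programming guide: each block publishes a partial result, and the last block to finish combines them. The sketch below assumes a fixed grid size and hypothetical names (blocksDone, partialSums); the per-block reduction itself is elided to a placeholder.

```cuda
#define NUM_BLOCKS 64

__device__ unsigned int blocksDone = 0;     // how many blocks have published
__device__ float partialSums[NUM_BLOCKS];   // one slot per block

__global__ void sumAll(const float* in, int n, float* out) {
    __shared__ bool amLast;

    // ... each block reduces its slice of `in` into blockSum ...
    float blockSum = 0.0f; // placeholder for the real per-block reduction

    if (threadIdx.x == 0) {
        partialSums[blockIdx.x] = blockSum;
        // Make the write to partialSums[] visible device-wide BEFORE
        // bumping the counter. Without the fence, another block could
        // observe the incremented counter yet still read a stale
        // partial sum.
        __threadfence();
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        amLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            total += partialSums[b];
        *out = total;
    }
}
```

Here the fence orders two writes from the same thread as seen by other blocks; it never makes anyone wait for anyone else, which is exactly the distinction the original question tripped over.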