interblock sync without __threadfence() ?

asm · March 30, 2009, 1:02pm

Hi,

after looking through CUDA2.2 beta programming guide i noticed a new functionality, namely threadfence() and threadfence_block()
to enforce shared/global memory consistency…

do it essentially mean that without this functionality (in CUDA 2.1) there is no way to implement “correct” interblock synchronization (that is, only using atomic intrinsics) ?
in other words, no way to ensure that all threads of a block have finished global memory transaction,
such that the block can notify others that its results are available ?

thanks

Smokey · March 31, 2009, 1:41am

As far as I’m aware, no - there’s no other way to ‘correctly’ synchronize warps within a block - or blocks within a kernel - in relation to global memory transactions.

Of course with atomics, you can synchronize warps/blocks, however - there’s no guarantee any prior memory transactions will have completed…

I’m curious however, what (if any) the compute capability requirements for thread fence functionality are. (If this works on ‘all’ hardware, this is probably the best feature I’ve seen in CUDA, ever. as we currently write ‘all’ kernels for 1.0 compute capability, so no atomics…)

Edit: Actually I could be wrong about what I said - I can’t find any definitive documentation regarding gmem transactions and and synchronous behaviour between threads… I’m just ‘assuming’ gmem transactions are asynchronous, and the order of transactions operations being sent isn’t necessarily the order the card will complete them in.

Jamie_K · March 31, 2009, 2:38am

It depends what you mean by “synchronization” and what you mean by “correct”.

Inter-block communication in general could be considered “incorrect” in the sense that blocks should be completely independent. The kernel should give correct behavior regardless of how many blocks run simultaneously, and regardless of what order they run in. Almost any inter-block communication will violate this, regardless of the mechanism used.

Without atomics and without threadfence, it’s pretty much impossible to violate the model and have a program that works reliably. Race conditions will kill any attempt. With atomics, it’s possible to implement a working block synchronization barrier, but for the above reasons, such a thing is still bad. With threadfence, it may be possible to do other things as well, but it doesn’t change the fact that inter-block dependencies are bad.

asm · March 31, 2009, 8:49am

It depends what you mean by “synchronization” and what you mean by “correct”.

Inter-block communication in general could be considered “incorrect” in the sense that blocks should be completely independent. The kernel should give correct behavior regardless of how many blocks run simultaneously, and regardless of what order they run in. Almost any inter-block communication will violate this, regardless of the mechanism used.

Without atomics and without threadfence, it’s pretty much impossible to violate the model and have a program that works reliably. Race conditions will kill any attempt. With atomics, it’s possible to implement a working block synchronization barrier, but for the above reasons, such a thing is still bad. With threadfence, it may be possible to do other things as well, but it doesn’t change the fact that inter-block dependencies are bad.

yes you are right, the idea behind thread blocks is to have them run independent from each other, such that the kernel can run on the hardware with any number of SMs without modification.

although sometimes it might be desireable to have blocks partially synchronized

(after all, cudaThreadSynchronize is basically the same threadfence but called from the host),

say you’re implementing “fork-join” approach or out-of-memory radix2 FFT, that is,

results of every two thread blocks are consumed by succeeding block,

in this case one can effectively decrease the memory bandwidth by the factor of 2 if every two blocks are synced.

In other words, one can avoid unnecessary global memory transfer (back and forth) for every second block.

… but of course it’s not clear what is the trade-off for block synchronisation… maybe it outweights the costs for memory transfer

Jamie_K · March 31, 2009, 7:56pm

I agree that there can be a desire to synchronize blocks, and it is possible to implement kernels which do this. But I believe threadfence() does not help because it does not block. It does no synchronization. It is nothing like cudaThreadSynchronize. It is nothing at all like __syncthreads. It only prevents memory writes becoming visible to other threads in the wrong order.

asm · April 2, 2009, 6:50am

right, threadfence seems to block only the calling thread (or perhaps warp ?), then __syncthreads + atomics do the rest of the job…

so, looks like we need to wait for CUDA2.2 final to have these features officially available

_0x_bitfield · April 10, 2009, 7:16am

[font=“Courier New”]

I used this function to implement the red-black GS method and all iterations in one kernel,but it`s very very slow!

this`s my code:

__global__ 

void solver(float* P,float* D,uint pitch,uint cols,uint loops,float delta)

{

	const uint xloc=__umul24(blockDim.x,blockIdx.x)+threadIdx.x;

	const uint yloc=__umul24(blockDim.y,blockIdx.y)+threadIdx.y;

	const uint gloc=__umul24(pitch,yloc)+xloc;

	

	if((xloc==0u)|(xloc==cols-1u)|(yloc==0u)|(yloc==pitch-1u)) return;

	

	uint cc=((threadIdx.y+blockIdx.x)&1u);

	uint counter=loops;

	

__LOOP__:

	if(cc){

		float

		p  =P[gloc+1u];  //that`s despond the cuda dosen`t support texture array declare so it can`t optimitize with tex cache

		p+=P[gloc-1u];

		p+=P[gloc+pitch];

		p+=P[gloc-pitch];

		p=0.25f*(p+delta*D[gloc]);

		

		__threadfence();

		

		P[gloc]=p;	

		--counter;

	} __threadfence();

	

	if(counter>0){

		cc^=1u;

		goto __LOOP__;

	}

}

[/font]

Jamie_K · April 10, 2009, 5:31pm

From my understanding of threadfence, this will not prevent some threads from storing (P[gloc]=p) before other threads have loaded (p+=P[gloc+pitch]), which it appears is what you’re trying to do. Threadfence is not like syncthreads.

I would recommend trying a version that writes to a separate output buffer and see if it gives you the same results. I am betting it won’t.

gatoatigrado · April 10, 2009, 5:46pm

Jamie - that’s my understanding as well. (in re. “It only prevents memory writes becoming visible to other threads in the wrong order”)

asm - I actually have some kernels in which something similar might be applicable… unfortunately, I don’t know a wait free way to implement it correctly. I’m thinking it could be possible to establish some sort of work array / list, and if a block is left with only tasks that have dependencies, then it would recalculate dependencies (and increment profile counters); all blocks would do this until all work was done.

_0x_bitfield · April 11, 2009, 6:08am

But i used this function to take all of the reduce operation on GPU and the result is ok although it`s slower.

http://blog.csdn.net/cyrosly

erdooom · April 30, 2009, 3:23pm

Hi, im trying to test this out as well but am running into problems,

could you post the code in english please ? i can’t read chinese

Thanks,

_0x_bitfield · May 6, 2009, 12:52pm

define reduce_policy 0

define reduce_block 512

define uint unsigned int

//note:in this simple , the first result element count must be less-equal the size of one block

global

void reduce(uint* out,const uint* in)

{

const 

uint stride=__umul24(gridDim.x,blockDim.x<<1);

uint gloc=__umul24(blockDim.x,blockIdx.x)+threadIdx.x;

extern shared uint smem;

smem[threadIdx.x]=0u;

do{

    smem[threadIdx.x]+=in[gloc]+in[gloc+blockDim.x];

    gloc+=stride;

}while(gloc<(stride<<1));

__syncthreads();

#if reduce_block>=512

if(threadIdx.x<256u){

    smem[threadIdx.x]+=smem[threadIdx.x+256u];

} __syncthreads();

endif

#if reduce_block>=256

if(threadIdx.x<128u){

    smem[threadIdx.x]+=smem[threadIdx.x+128u];

} __syncthreads();

endif

#if reduce_block>=128

if(threadIdx.x< 64u){

    smem[threadIdx.x]+=smem[threadIdx.x+ 64u];

} __syncthreads();

endif

#if reduce_block>= 64

if(threadIdx.x< 32u)

{

    smem[threadIdx.x]+=smem[threadIdx.x+32u];

endif

#if reduce_block>=32

    smem[threadIdx.x]+=smem[threadIdx.x+16u];

endif

#if reduce_block>=16

    smem[threadIdx.x]+=smem[threadIdx.x+ 8u];

endif

#if reduce_block>= 8

    smem[threadIdx.x]+=smem[threadIdx.x+ 4u];

endif

#if reduce_block>= 4

    smem[threadIdx.x]+=smem[threadIdx.x+ 2u];

endif

#if reduce_block>=2

    smem[threadIdx.x]+=smem[threadIdx.x+ 1u];

endif

#if reduce_block>=64

endif

#if reduce_policy>0

__threadfence();

endif

#if reduce_policy==1

if(blockIdx.x==0u)

{

    if(threadIdx.x<256u){ out[threadIdx.x]+=out[threadIdx.x+256u]; } __syncthreads(); //the result is no right if no use this function 

    if(threadIdx.x<128u){ out[threadIdx.x]+=out[threadIdx.x+128u]; } __syncthreads();

    if(threadIdx.x< 64u){ out[threadIdx.x]+=out[threadIdx.x+ 64u]; } __syncthreads();

    if(threadIdx.x< 32u){

        out[threadIdx.x]+=out[threadIdx.x+32u];

        out[threadIdx.x]+=out[threadIdx.x+16u];

        out[threadIdx.x]+=out[threadIdx.x+ 8u];

        out[threadIdx.x]+=out[threadIdx.x+ 4u];

        out[threadIdx.x]+=out[threadIdx.x+ 2u];

        out[threadIdx.x]+=out[threadIdx.x+ 1u];

    } 

}

#elif reduce_policy ==2

if((threadIdx.x==0u)&(blockIdx.x==0u))

{ 

    uint record=1u;

    while(record<gridDim.x){

        out[0]+=out[record];

        ++record;

    }

}

else

if(threadIdx.x==0u){

    out[blockIdx.x]=smem[0];

}

endif

}

erdooom · May 6, 2009, 2:13pm

this works ? were in the code do you make sure that all the blocks have finished there work ? in the new cuda docs there is example code for reduce using __threadfence but they also use some attomics to make sure that the block that sums the final sum is the last one working.

becuase of the needed atomics this is actually slower then launching 2 kernels.

Jamie_K · May 6, 2009, 2:35pm

With #define reduce_policy 0, this will successfully produce 512 partial sums, but it will not perform a full reduce. In particular, the if(blockIdx.x==0u) section has no guarantee of occurring after the other blocks have produced their output values, as __threadfence() does not synchronize blocks. This code is wrong.

_0x_bitfield · May 6, 2009, 2:47pm

But the result of this program in my here is right.

E.D_Riedijk · May 6, 2009, 2:54pm

That does not mean that the program is correct ;) Other hardware might not work.

_0x_bitfield · May 6, 2009, 4:44pm

This`s my project
SIMTreduce.rar (639 KB)

_0x_bitfield · May 7, 2009, 12:37am

May be need to recompile if your platform is`nt VS2005,anyone can test in different hardware and out the result
(Although it can be reduced completely in GPU but is very slower because the large cost due to thread barrier).

Topic		Replies	Views
syncronize all threads from all blocks cudaThreadSynchronize() the only way ? CUDA Programming and Performance	11	8412	November 15, 2010
difference between __threadfence_block and __syncthreads CUDA Programming and Performance	17	29824	April 22, 2015
Question related __threadfence CUDA Programming and Performance	13	5292	January 12, 2016
GPU Synchronization and __threadfence() CUDA Programming and Performance	3	3982	October 18, 2010
Doubt on __threadfence() require a detail description of this function. CUDA Programming and Performance	5	3071	January 25, 2010
Synchronization, threadfence, random memory access beginner questions CUDA Programming and Performance	7	2812	April 9, 2012
Problems with __threadfence CUDA Programming and Performance	2	3199	November 11, 2009
__syncblocks 101 Primitives for Interblock syncronization CUDA Programming and Performance	16	10239	February 29, 2008
__syncthreads and __threadfence together in a loop CUDA Programming and Performance	5	3709	October 15, 2010
Global thread barrier CUDA Programming and Performance	78	86819	December 23, 2011

interblock sync without __threadfence() ?

Related topics