interblock sync without __threadfence() ?


After looking through the CUDA 2.2 beta programming guide I noticed some new functionality, namely __threadfence() and __threadfence_block(),
to enforce shared/global memory consistency…

Does it essentially mean that without this functionality (i.e. in CUDA 2.1) there is no way to implement “correct” inter-block synchronization (that is, using only the atomic intrinsics)?
In other words, no way to ensure that all threads of a block have finished their global memory transactions,
so that the block can notify the others that its results are available?


As far as I’m aware, no - there’s no other way to ‘correctly’ synchronize warps within a block - or blocks within a kernel - in relation to global memory transactions.

Of course with atomics, you can synchronize warps/blocks, however - there’s no guarantee any prior memory transactions will have completed…

I’m curious, however, what (if any) the compute capability requirements for the thread fence functionality are. (If this works on ‘all’ hardware, this is probably the best feature I’ve seen in CUDA, ever, as we currently write ‘all’ our kernels for compute capability 1.0, so no atomics…)

Edit: Actually I could be wrong about what I said - I can’t find any definitive documentation regarding gmem transactions and synchronous behaviour between threads… I’m just ‘assuming’ gmem transactions are asynchronous, and that the order in which transactions are issued isn’t necessarily the order the card will complete them in.

It depends what you mean by “synchronization” and what you mean by “correct”.

Inter-block communication in general could be considered “incorrect” in the sense that blocks should be completely independent. The kernel should give correct behavior regardless of how many blocks run simultaneously, and regardless of what order they run in. Almost any inter-block communication will violate this, regardless of the mechanism used.

Without atomics and without threadfence, it’s pretty much impossible to violate the model and have a program that works reliably. Race conditions will kill any attempt. With atomics, it’s possible to implement a working block synchronization barrier, but for the above reasons, such a thing is still bad. With threadfence, it may be possible to do other things as well, but it doesn’t change the fact that inter-block dependencies are bad.
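For what it’s worth, here is a sketch of the kind of atomics-based block barrier mentioned above, written as a host-side C++ analogue (std::atomic and std::thread standing in for CUDA atomics and blocks; the `SpinBarrier` struct and all names here are made up for illustration). Note that on a GPU this same pattern deadlocks unless every block is resident on the chip at once, which is exactly why such barriers violate the programming model:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical counter-based barrier: each arrival bumps a counter; the
// last arrival resets it and flips a phase flag that releases the spinners.
struct SpinBarrier {
    std::atomic<int> arrived{0};
    std::atomic<int> phase{0};
    const int count;
    explicit SpinBarrier(int n) : count(n) {}
    void wait() {
        const int p = phase.load();
        if (arrived.fetch_add(1) + 1 == count) { // last thread to arrive
            arrived.store(0);
            phase.fetch_add(1);                  // release the others
        } else {
            while (phase.load() == p) {}         // spin until released
        }
    }
};

// Returns true iff every thread sees all pre-barrier writes after waiting.
bool barrier_demo(int n) {
    SpinBarrier bar(n);
    std::vector<int> stage(n, 0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> ts;
    for (int i = 0; i < n; ++i)
        ts.emplace_back([&, i] {
            stage[i] = 1;   // "this block's result"
            bar.wait();     // every thread has finished the stage
            for (int j = 0; j < n; ++j)
                if (stage[j] != 1) ok.store(false);
        });
    for (auto& t : ts) t.join();
    return ok.load();
}
```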

Yes, you are right; the idea behind thread blocks is to have them run independently of each other, so that the kernel can run on hardware with any number of SMs without modification.

Although sometimes it might be desirable to have blocks partially synchronized

(after all, cudaThreadSynchronize is basically the same fence as __threadfence, just called from the host),

say you’re implementing a “fork-join” approach or an out-of-memory radix-2 FFT, that is,

the results of every two thread blocks are consumed by a succeeding block;

in this case one can effectively decrease the memory bandwidth by a factor of 2 if every two blocks are synced.

In other words, one can avoid an unnecessary global memory round-trip for every second block.

… but of course it’s not clear what the trade-off for block synchronisation is… maybe it outweighs the cost of the memory transfer

I agree that there can be a desire to synchronize blocks, and it is possible to implement kernels which do this. But I believe threadfence() does not help because it does not block. It does no synchronization. It is nothing like cudaThreadSynchronize. It is nothing at all like __syncthreads. It only prevents memory writes becoming visible to other threads in the wrong order.
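The distinction between ordering and blocking can be seen in a host-side C++ analogue (std::atomic_thread_fence standing in for __threadfence(); the flag protocol and the 42 payload are made-up illustrations). The fence only orders the data write before the flag write; the actual waiting is done entirely by the consumer spinning on the flag:

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<int> flag{0};

void producer() {
    data = 42;                                           // write the result
    std::atomic_thread_fence(std::memory_order_release); // order it before the flag (~ __threadfence)
    flag.store(1, std::memory_order_relaxed);            // publish: "result is ready"
}

int consume() {
    while (flag.load(std::memory_order_relaxed) == 0) {} // this spin is the actual synchronization
    std::atomic_thread_fence(std::memory_order_acquire); // pair with the release fence
    return data;                                         // now guaranteed to read 42
}

int run_demo() {
    std::thread t(producer);
    const int v = consume();
    t.join();
    return v;
}
```

Remove the spin loop and the consumer may read stale data no matter how many fences are present; the fence never waits for anyone.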

right, __threadfence seems to stall only the calling thread (or perhaps its warp?) until its writes are visible; __syncthreads + atomics then do the rest of the job…

so it looks like we need to wait for the CUDA 2.2 final release to have these features officially available


I used this function to implement the red-black Gauss-Seidel method with all iterations in one kernel, but it’s very, very slow!

This is my code:

__global__ void solver(float* P,const float* D,uint pitch,uint cols,uint loops,float delta)
{
	const uint xloc=__umul24(blockDim.x,blockIdx.x)+threadIdx.x;
	const uint yloc=__umul24(blockDim.y,blockIdx.y)+threadIdx.y;
	const uint gloc=__umul24(pitch,yloc)+xloc;

	if((xloc==0u)|(xloc==cols-1u)|(yloc==0u)|(yloc==pitch-1u)) return;

	uint cc=((threadIdx.y+blockIdx.x)&1u); //red/black colour of this thread's cell
	uint counter=loops;
	float p;

__LOOP__:
	if(cc){ //update only one colour per sweep
		p  =P[gloc+1u];  //unfortunately CUDA doesn't support texture arrays, so this can't be optimized with the tex cache
		p +=P[gloc-1u];
		p +=P[gloc+pitch];
		p +=P[gloc-pitch];
		P[gloc]=(p+D[gloc])*delta;
	} __threadfence();
	cc^=1u; //the other colour updates on the next sweep
	if(--counter)
		goto __LOOP__;
}




From my understanding of threadfence, this will not prevent some threads from storing (P[gloc]=p) before other threads have loaded (p+=P[gloc+pitch]), which it appears is what you’re trying to do. Threadfence is not like syncthreads.

I would recommend trying a version that writes to a separate output buffer and see if it gives you the same results. I am betting it won’t.
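The separate-output-buffer version suggested above can be sketched in plain C++ (the 1-D averaging stencil and the `sweep` name are made-up stand-ins for the real kernel). Because each pass reads only the source buffer and writes only the destination buffer, no ordering between updates within a pass can change the result:

```cpp
#include <vector>
#include <utility>

// Double-buffered (Jacobi-style) sweep: read `grid`, write `next`,
// then swap the buffers for the next pass.
std::vector<float> sweep(std::vector<float> grid, int passes) {
    std::vector<float> next(grid.size());
    for (int p = 0; p < passes; ++p) {
        next.front() = grid.front();   // fixed boundary values
        next.back()  = grid.back();
        for (std::size_t i = 1; i + 1 < grid.size(); ++i)
            next[i] = 0.5f * (grid[i - 1] + grid[i + 1]);
        std::swap(grid, next);         // ping-pong the buffers
    }
    return grid;
}
```

On the GPU this costs a second buffer and one extra write per cell, but it removes the inter-thread read/write race entirely.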

Jamie - that’s my understanding as well. (in re. “It only prevents memory writes becoming visible to other threads in the wrong order”)

asm - I actually have some kernels in which something similar might be applicable… unfortunately, I don’t know a wait free way to implement it correctly. I’m thinking it could be possible to establish some sort of work array / list, and if a block is left with only tasks that have dependencies, then it would recalculate dependencies (and increment profile counters); all blocks would do this until all work was done.

But I used this function to do the whole reduce operation on the GPU, and the result is OK, although it’s slower.

Hi, I’m trying to test this out as well but am running into problems.

Could you post the code in English please? I can’t read Chinese.


#define reduce_policy 0
#define reduce_block 512
#define uint unsigned int

//note: in this sample, the first-pass result element count (gridDim.x) must be less than or equal to the size of one block

__device__ uint finished=0u; //count of completed blocks, used by reduce_policy 2

__global__ void reduce(uint* out,const uint* in)
{
	uint stride=__umul24(gridDim.x,blockDim.x<<1); //grid stride, two elements per thread per pass
	uint gloc=__umul24(blockDim.x,blockIdx.x)+threadIdx.x;
	extern __shared__ uint smem[];

	smem[threadIdx.x]=in[gloc]+in[gloc+blockDim.x]; //each thread loads and adds two elements
	__syncthreads();

#if reduce_block>=512
	if(threadIdx.x<256u){
		smem[threadIdx.x]+=smem[threadIdx.x+256u];
	} __syncthreads();
#endif
#if reduce_block>=256
	if(threadIdx.x<128u){
		smem[threadIdx.x]+=smem[threadIdx.x+128u];
	} __syncthreads();
#endif
#if reduce_block>=128
	if(threadIdx.x< 64u){
		smem[threadIdx.x]+=smem[threadIdx.x+ 64u];
	} __syncthreads();
#endif
#if reduce_block>= 64
	if(threadIdx.x< 32u)
	{ //a single warp runs in lockstep, so no __syncthreads below
		smem[threadIdx.x]+=smem[threadIdx.x+32u];
#if reduce_block>=32
		smem[threadIdx.x]+=smem[threadIdx.x+16u];
#endif
#if reduce_block>=16
		smem[threadIdx.x]+=smem[threadIdx.x+ 8u];
#endif
#if reduce_block>= 8
		smem[threadIdx.x]+=smem[threadIdx.x+ 4u];
#endif
#if reduce_block>= 4
		smem[threadIdx.x]+=smem[threadIdx.x+ 2u];
#endif
#if reduce_block>=2
		smem[threadIdx.x]+=smem[threadIdx.x+ 1u];
#endif
	}
#endif
	if(threadIdx.x==0u) out[blockIdx.x]=smem[0u]; //one partial sum per block

#if reduce_policy>0
	__threadfence(); //push this block's partial sum out to global memory

#if reduce_policy==1
	if(blockIdx.x==0u){ //block 0 reduces the partial sums in place
		if(threadIdx.x<256u){ out[threadIdx.x]+=out[threadIdx.x+256u]; } __syncthreads(); //the result is not right without this
		if(threadIdx.x<128u){ out[threadIdx.x]+=out[threadIdx.x+128u]; } __syncthreads();
		if(threadIdx.x< 64u){ out[threadIdx.x]+=out[threadIdx.x+ 64u]; } __syncthreads();
		if(threadIdx.x< 32u){
			out[threadIdx.x]+=out[threadIdx.x+32u];
			out[threadIdx.x]+=out[threadIdx.x+16u];
			out[threadIdx.x]+=out[threadIdx.x+ 8u];
			out[threadIdx.x]+=out[threadIdx.x+ 4u];
			out[threadIdx.x]+=out[threadIdx.x+ 2u];
			out[threadIdx.x]+=out[threadIdx.x+ 1u];
		}
	}
#elif reduce_policy==2
	__shared__ uint record;
	if(threadIdx.x==0u) record=atomicInc(&finished,gridDim.x-1u); //count finished blocks
	__syncthreads();
	if(record==gridDim.x-1u && threadIdx.x==0u){ //the last block to finish does the final sum
		uint sum=0u;
		for(uint i=0u;i<gridDim.x;++i) sum+=out[i];
		out[0u]=sum;
	}
#endif
#endif
}


This works? Where in the code do you make sure that all the blocks have finished their work? In the new CUDA docs there is example code for a reduce using __threadfence, but they also use some atomics to make sure that the block that computes the final sum is the last one still working.

Because of the needed atomics, this is actually slower than launching 2 kernels.
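The two-launch approach can be sketched in host C++ (the kernel-launch boundary plays the part of the global barrier; the `reduce_two_pass` name and block size are illustrative, not from the original code):

```cpp
#include <vector>
#include <numeric>
#include <algorithm>

// "Launch 1" produces one partial sum per block; the implicit barrier between
// the two launches guarantees all partials are written before "launch 2" runs,
// so no fences or atomics are needed.
unsigned reduce_two_pass(const std::vector<unsigned>& in, std::size_t block) {
    std::vector<unsigned> partial;                     // one entry per block
    for (std::size_t b = 0; b < in.size(); b += block) // launch 1: partial sums
        partial.push_back(std::accumulate(
            in.begin() + b,
            in.begin() + std::min(b + block, in.size()), 0u));
    return std::accumulate(partial.begin(), partial.end(), 0u); // launch 2: final sum
}
```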

With #define reduce_policy 0, this will successfully produce 512 partial sums, but it will not perform a full reduce. In particular, the if(blockIdx.x==0u) section has no guarantee of occurring after the other blocks have produced their output values, as __threadfence() does not synchronize blocks. This code is wrong.

But the result of this program is right on my machine.

That does not mean that the program is correct ;) Other hardware might not work.

This is my project:
SIMTreduce.rar (639 KB)

You may need to recompile if your platform isn’t VS2005; anyone can test it on different hardware and post the result.
(Although the reduce can be done completely on the GPU, it is much slower because of the large cost of the thread barrier.)