difference between __threadfence_block and __syncthreads

Hi all,

Forgive my ignorance, but could somebody tell me the difference between __threadfence_block() and __syncthreads()?
According to the CUDA Programming Guide 2.2.1, they both wait until all writes to global and shared memory are finished…

In a block, of course.

__syncthreads() must be executed by all threads in the block; otherwise, your kernel will hang.

But, I think, that is NOT the case with __threadfence().

This is correct, I think.
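A toy way to see the difference (my own illustration, not from the guide): __syncthreads() is a barrier that every thread of the block must reach, while __threadfence_block() is only a per-thread memory fence with no such requirement.

[codebox]__global__ void barrierVsFence(int *data)
{
    if (threadIdx.x < 16) {
        data[threadIdx.x] = threadIdx.x;
        __threadfence_block();   // fine: only orders this thread's own writes
        // __syncthreads();      // NOT fine: the other threads never reach it
    }
    __syncthreads();             // fine: every thread in the block reaches it
}[/codebox]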

As an aside, has anyone ever run into a problem that required __threadfence()? I have tried on numerous occasions to make memory transactions from other threads in the same CTA not yet visible (and thereby produce incorrect results), without success. I have even tried removing the __threadfence() intrinsics from the threadFenceReduction SDK example, with no effect. I have tried this with CUDA 2.3 on a GTX 280 and a GTX 285, with similar results…

EDIT: Is it possible that on current hardware, all memory transactions become visible in a consistent order to other threads in the same CTA and to all threads in other concurrently running CTAs?

thanks, guys!

For threads in the same CTA, I think “yes”…

But there is a compiler bug or something like that… Google “shared mem atomics sarnath cvnguyen sylvain site:http://forums.nvidia.com” and follow the first result link…

Oh yes. :)

Thanks for the link; this was very informative. I found this section particularly humorous:

So my understanding at this point is that current-generation hardware enforces a global order on memory operations, which is why I cannot get code without a __threadfence() to generate an incorrect result. However, future hardware might drop this global-ordering property for performance reasons, making __threadfence() necessary for writing portable, correct code. Is this right?

Glad to see that someone understands my humor! ;)

__threadfence() and __threadfence_block() are two different issues.

I think it is possible to write code doing inter-block communication that needs __threadfence() to work properly. (Plus, I believe Tim when he says so…)

A way to achieve this in a synthetic benchmark would be to generate heavy contention on one memory partition (refer to http://forums.nvidia.com/index.php?showtopic=96423 for a discussion on partition camping).

For example, make blocks 2 to n continuously read and write memory partition 0, to flood it with requests. In parallel, make block 1 write a word to partition 0, then another word to partition 1. And, at the same time, make block 0 read both words written by block 1.

If my intuition is correct (and the whole choreography is timed properly), you will encounter a case where block 0 reads the older value from partition 0 and the newer value from partition 1, that is, in the opposite order from which they were written.
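Roughly, in code, something like this (a sketch only; the kernel name, the PARTITION_STRIDE value, and the assumption that words[0] and words[PARTITION_STRIDE] land in different partitions are all mine — you cannot really control the partition mapping from CUDA):

[codebox]#define PARTITION_STRIDE 512                       // words between partitions (assumed)

__device__ volatile unsigned int words[2 * PARTITION_STRIDE];  // words[0]: "partition 0", words[PARTITION_STRIDE]: "partition 1"

__global__ void reorderProbe(volatile unsigned int *flood, unsigned int *caught)
{
    if (blockIdx.x >= 2) {                         // blocks 2..n: flood partition 0
        for (int i = 0; i < 100000; i++)
            flood[threadIdx.x & 63] += 1;          // traffic assumed to hit partition 0
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        words[0] = 1;                              // first write: congested partition
        // deliberately no __threadfence() here -- that is what we are probing
        words[PARTITION_STRIDE] = 1;               // second write: idle partition
    } else if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < 100000; i++) {         // block 0: poll both words
            unsigned int b = words[PARTITION_STRIDE];
            unsigned int a = words[0];
            if (b == 1 && a == 0) { *caught = 1; break; }  // saw the newer value first
        }
    }
}[/codebox]

If *caught ever ends up set after launching this with a few dozen blocks, block 0 observed block 1’s two writes in the opposite order; whether that actually happens depends entirely on timing and on whether the addresses really map to the partitions assumed.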

Now, about __threadfence_block(). I wrote previously that it is currently implemented as a NOP because memory ordering is always correct inside one block, but that might change in the future.

Actually, another explanation is that __threadfence_block() tells the compiler to avoid optimizations that may change the order of reads or writes around this boundary. Just like volatile, but more localized.
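For instance (a made-up example of mine, nothing official), if one thread publishes a payload and then a flag, the fence keeps the two stores from being reordered as seen by the rest of the block:

[codebox]__global__ void publish(volatile int *data, volatile int *flag)
{
    if (threadIdx.x == 0) {
        data[0] = 42;            // write the payload first...
        __threadfence_block();   // ...and keep the flag store from being moved
                                 // ahead of it, for threads in this block
        flag[0] = 1;
    }
}[/codebox]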

Just wondering, does anyone have an example of this? I’d be interested in seeing an example where not using __threadfence() results in this problem.

Inter-block communication during persistent CTA launches. Besides the raytracing guys and me, I don’t know of anyone who does this.

I am working on an application that uses inter-block communication, so it’s actually interesting for me to check this out. I tried writing a small testbench, as Sylvain described, with 2 blocks communicating data.

Block 1 spin-waits for Block 0 to unlock it via atomic ops. Block 0 attempts to queue up a bunch of writes in a single memory partition, then performs one last write of a specific value (12345) to a single memory location in that same partition. Next, Block 0 uses the atomic ops to let Block 1 exit the spin-wait. The first thing Block 1 does is read that single memory location (which may not have received the 12345 yet, since the many memory writes ahead of it are still queued up).

However, I am always able to see the 12345 without needing the __threadfence(). The code is below; does anyone have any suggestions on how I can modify it to demonstrate the need for __threadfence()?

[codebox]#include <stdio.h>
#include <stdlib.h>

#define DATA_INDEX 128          // Pick an index where we try and catch an old value
#define NUMTHREADS 256
#define PARTITION_STRIDE 384    // In words for 8/9-series, 512 for 200-series GPUs
#define STRESS 10000
#define ARRAYSIZE 1000

__device__ unsigned int g_lock = 0;

__global__ void feedforwardQ(float* g_idata, float* g_odata)
{
    unsigned int i;
    unsigned int old = 0;
    unsigned int temp;                   // Try and catch an older write
    unsigned int tid = threadIdx.x;

    if (blockIdx.x == 1) {               // Block 1 waits for Block 0
        if (threadIdx.x == 0) {
            while (old != 1) {           // Spin-wait for Block 0 to set the lock
                old = atomicCAS(&g_lock, 1, 0);
            }
            temp = g_idata[DATA_INDEX];  // Get indexed data immediately
        }
        __syncthreads();
        if (threadIdx.x == 0) {
            g_odata[DATA_INDEX] = temp;  // Copy indexed data to output
        } else {
            g_odata[threadIdx.x] = g_idata[threadIdx.x];  // Copy inputs to outputs
        }
    } else {                             // Block 0 queues many writes (memory camping)
        for (i = 0; i < STRESS; i++) {
            g_idata[DATA_INDEX + PARTITION_STRIDE + i % 64] = tid;
        }
        //__threadfence();
        //__syncthreads();
        if (threadIdx.x == 0) {
            g_idata[DATA_INDEX] = 12345;
        }
        //__threadfence();  ??
        //__syncthreads();
        if (threadIdx.x == 0) {
            atomicInc(&g_lock, 1);
        }
    }
    __syncthreads();
}

int main()
{
    unsigned int i;
    cudaError_t err;
    float h_data[ARRAYSIZE];
    float* d_data;
    float* d_temp;

    cudaSetDevice(0);

    for (i = 0; i < ARRAYSIZE; i++) {
        h_data[i] = i;
    }

    cudaMalloc((void**) &d_data, sizeof(float) * ARRAYSIZE);
    cudaMalloc((void**) &d_temp, sizeof(float) * ARRAYSIZE);
    cudaMemcpy(d_temp, h_data, sizeof(float) * ARRAYSIZE, cudaMemcpyHostToDevice);

    feedforwardQ<<< 2, NUMTHREADS >>>(d_temp, d_data);

    err = cudaThreadSynchronize();
    if (err != 0) {
        return -1;
    }

    cudaMemcpy(h_data, d_data, sizeof(float) * ARRAYSIZE, cudaMemcpyDeviceToHost);
    printf("data = %f\n", h_data[DATA_INDEX]);
}[/codebox]

A single variable won’t need __threadfence on current hardware because of how memory is physically laid out, but if you write to much larger chunks of memory, there is no guarantee that you won’t need __threadfence.

(this gets into arcane details about CTA scheduling, partitions, and things like that)

Thanks for the responses. What you say makes sense, and I believe I may have seen instances where I am reading/writing huge chunks of memory and using atomic ops to sync between blocks when the writing is “done”. However, when I don’t use __threadfence, it’s possible for a waiting block to get signaled by the atomic-op variable while the huge chunk of writes isn’t yet complete by the time I begin reading it.

I guess I was just trying to find/create a much smaller program that exhibits the same type of behavior.
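For reference, the pattern being described looks roughly like this (an untested sketch with names of my own): the __threadfence() sits between the bulk writes and the atomic “done” flag, so a block released from the spin-wait cannot see the flag before the data.

[codebox]__device__ unsigned int done = 0;

__global__ void producerConsumer(float *chunk, float *out, int n)
{
    if (blockIdx.x == 0) {                        // producer block
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            chunk[i] = 1.0f;                      // the "huge chunk" of writes
        __syncthreads();                          // all threads of block 0 finished writing
        __threadfence();                          // make those writes visible device-wide
        if (threadIdx.x == 0)
            atomicExch(&done, 1);                 // only now publish the "done" flag
    } else {                                      // consumer block
        if (threadIdx.x == 0)
            while (atomicAdd(&done, 0) == 0);     // spin until the flag is set
        __syncthreads();                          // release the rest of the block
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            out[i] = chunk[i];                    // safe: the chunk is visible by now
    }
}[/codebox]

Launched as producerConsumer<<< 2, NUMTHREADS >>>(…), this only works while both blocks are resident on the GPU at the same time, which is the usual caveat with spin-waiting between blocks.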

The real fun starts when you get into interactions between atomic operations and __threadfence.

How do atomic operations interact with other memory loads/stores? Do they make other memory requests visible to other threads before/after the atomic operation?

Theoretically, if I want to broadcast a float via shared memory to all threads in a block, do I have to call __threadfence_block() after the write to shared memory and before the read?

__shared__ float val;
if (threadIdx.x == 0) val = 55;  /* set value */
__threadfence_block();
val;  // available to all threads in the block

Is this correct?

You want to use __syncthreads()

__threadfence_block() guarantees that the individual thread in question will not proceed beyond the fence until the value is visible. However, it makes no claims about other threads.
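In other words, something like this (my own sketch of that suggestion): the barrier, not a fence, is what makes the broadcast safe, because it holds every thread until thread 0’s write to shared memory has completed.

[codebox]__global__ void broadcast(float *out)
{
    __shared__ float val;
    if (threadIdx.x == 0)
        val = 55.0f;           // one thread writes the value
    __syncthreads();           // barrier + fence: everyone waits for the write
    out[threadIdx.x] = val;    // all threads in the block now read 55.0f
}[/codebox]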