difference between __threadfence_block and __syncthreads

Hi all,

Forgive my ignorance, but could somebody tell me the difference between __threadfence_block() and __syncthreads()?
According to the CUDA Programming Guide 2.2.1, they both wait until all writes to global and shared memory are finished…

In a block, of course.

__syncthreads() must be executed by all threads in the block; otherwise, your kernel will hang.

But, I think, that is NOT the case with __threadfence().

This is correct, I think.
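A toy way to see the difference (my own illustration, not from the guide): __syncthreads() is a barrier that every thread of the block must reach, while __threadfence_block() is only a per-thread memory fence with no such requirement.

[codebox]__global__ void barrierVsFence(int *data)
{
    if (threadIdx.x < 16) {
        data[threadIdx.x] = threadIdx.x;
        __threadfence_block();   // fine: only orders this thread's own writes
        // __syncthreads();      // NOT fine: the other threads never reach it
    }
    __syncthreads();             // fine: every thread in the block reaches it
}[/codebox]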

As an aside, has anyone ever run into a problem that required __threadfence()? I have tried on numerous occasions to make memory transactions from other threads in the same CTA not yet visible (and thereby produce incorrect results), without success. I have even tried removing the __threadfence() intrinsics from the threadFenceReduction SDK example, with no effect. I have tried this with CUDA 2.3 on a GTX 280 and a GTX 285, with similar results…

EDIT: Is it possible that on current hardware, all memory transactions become visible in a consistent order to other threads in the same CTA and to all threads in other concurrently running CTAs?

thanks, guys!

For threads in the same CTA, I think “yes”…

But there is a compiler bug or something like that… Google “shared mem atomics sarnath cvnguyen sylvain site:http://forums.nvidia.com” and follow the first result link…

Oh yes. :)

Thanks for the link; this was very informative. I found this section particularly humorous:

So my understanding at this point is that current-generation hardware enforces a global order on memory operations, which is why I cannot get code without a __threadfence() to generate an incorrect result. However, future hardware might drop this global-ordering property for performance reasons, making __threadfence() necessary for writing portable, correct code. Is this right?

Glad to see that someone understands my humor! ;)

__threadfence() and __threadfence_block() are two different issues.

I think it is possible to write code doing inter-block communication that needs __threadfence() to work properly. (Plus, I believe Tim when he says so…)

A way to achieve this in a synthetic benchmark would be to generate heavy contention on one memory partition (refer to http://forums.nvidia.com/index.php?showtopic=96423 for a discussion on partition camping).

For example, make blocks 2 to n continuously read and write memory partition 0, to flood it with requests. In parallel, make block 1 write a word to partition 0, then another word to partition 1. And, at the same time, make block 0 read both words written by block 1.

If my intuition is correct (and the whole choreography is timed properly), you will encounter a case where block 0 reads the older value from partition 0 and the newer value from partition 1, that is, in the opposite order from which they were written.
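Roughly, in code, something like this (a sketch only; the kernel name, the PARTITION_STRIDE value, and the assumption that words[0] and words[PARTITION_STRIDE] land in different partitions are all mine — you cannot really control the partition mapping from CUDA):

[codebox]#define PARTITION_STRIDE 512                       // words between partitions (assumed)

__device__ volatile unsigned int words[2 * PARTITION_STRIDE];  // words[0]: "partition 0", words[PARTITION_STRIDE]: "partition 1"

__global__ void reorderProbe(volatile unsigned int *flood, unsigned int *caught)
{
    if (blockIdx.x >= 2) {                         // blocks 2..n: flood partition 0
        for (int i = 0; i < 100000; i++)
            flood[threadIdx.x & 63] += 1;          // traffic assumed to hit partition 0
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        words[0] = 1;                              // first write: congested partition
        // deliberately no __threadfence() here -- that is what we are probing
        words[PARTITION_STRIDE] = 1;               // second write: idle partition
    } else if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < 100000; i++) {         // block 0: poll both words
            unsigned int b = words[PARTITION_STRIDE];
            unsigned int a = words[0];
            if (b == 1 && a == 0) { *caught = 1; break; }  // saw the newer value first
        }
    }
}[/codebox]

If *caught ever ends up set after launching this with a few dozen blocks, block 0 observed block 1’s two writes in the opposite order; whether that actually happens depends entirely on timing and on whether the addresses really map to the partitions assumed.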

Now, about __threadfence_block(). I wrote previously that it is currently implemented as a NOP because memory ordering is always correct inside one block, but that might change in the future.

Actually, another explanation is that __threadfence_block() tells the compiler to avoid optimizations that may change the order of reads or writes around this boundary. Just like volatile, but more localized.
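For instance (a made-up example of mine, nothing official), if one thread publishes a payload and then a flag, the fence keeps the two stores from being reordered as seen by the rest of the block:

[codebox]__global__ void publish(volatile int *data, volatile int *flag)
{
    if (threadIdx.x == 0) {
        data[0] = 42;            // write the payload first...
        __threadfence_block();   // ...and keep the flag store from being moved
                                 // ahead of it, for threads in this block
        flag[0] = 1;
    }
}[/codebox]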

Just wondering, does anyone have an example of this? I’d be interested in seeing an example where not using __threadfence() results in this problem.

Inter-block communication during persistent CTA launches. Besides the raytracing guys and me, I don’t know of anyone who does this.

I am working on an application that uses inter-block communication, so it’s actually interesting for me to check this out. I tried writing a small testbench, as Sylvain described, with 2 blocks communicating data.

Block 1 spin-waits for Block 0 to unlock it via atomic ops. Block 0 attempts to queue up a bunch of writes in a single memory partition, then performs one last write of a specific value (12345) to a single memory location in that same partition. Next, Block 0 uses the atomic ops to let Block 1 exit the spin-wait. The first thing Block 1 does is read that single memory location (which may not have received the 12345 yet, since the many memory writes ahead of it are still queued up).

However, I am always able to see the 12345 without needing the __threadfence(). The code is below; does anyone have any suggestions on how I can modify it to demonstrate the need for __threadfence()?

[codebox]#include <stdio.h>
#include <stdlib.h>

#define DATA_INDEX 128          // Pick an index where we try and catch an old value
#define NUMTHREADS 256
#define PARTITION_STRIDE 384    // In words for 8/9-series, 512 for 200-series GPUs
#define STRESS 10000
#define ARRAYSIZE 1000

__device__ unsigned int g_lock = 0;

__global__ void feedforwardQ(float* g_idata, float* g_odata)
{
    unsigned int i;
    unsigned int old = 0;
    unsigned int temp;                   // Try and catch an older write
    unsigned int tid = threadIdx.x;

    if (blockIdx.x == 1) {               // Block 1 waits for Block 0
        if (threadIdx.x == 0) {
            while (old != 1) {           // Spin-wait for Block 0 to set the lock
                old = atomicCAS(&g_lock, 1, 0);
            }
            temp = g_idata[DATA_INDEX];  // Get indexed data immediately
        }
        __syncthreads();
        if (threadIdx.x == 0) {
            g_odata[DATA_INDEX] = temp;  // Copy indexed data to output
        } else {
            g_odata[threadIdx.x] = g_idata[threadIdx.x];  // Copy inputs to outputs
        }
    } else {                             // Block 0 queues many writes (memory camping)
        for (i = 0; i < STRESS; i++) {
            g_idata[DATA_INDEX + PARTITION_STRIDE + i % 64] = tid;
        }
        //__threadfence();
        //__syncthreads();
        if (threadIdx.x == 0) {
            g_idata[DATA_INDEX] = 12345;
        }
        //__threadfence();  ??
        //__syncthreads();
        if (threadIdx.x == 0) {
            atomicInc(&g_lock, 1);
        }
    }
    __syncthreads();
}

int main()
{
    unsigned int i;
    cudaError_t err;
    float h_data[ARRAYSIZE];
    float* d_data;
    float* d_temp;

    cudaSetDevice(0);

    for (i = 0; i < ARRAYSIZE; i++) {
        h_data[i] = i;
    }

    cudaMalloc((void**) &d_data, sizeof(float) * ARRAYSIZE);
    cudaMalloc((void**) &d_temp, sizeof(float) * ARRAYSIZE);
    cudaMemcpy(d_temp, h_data, sizeof(float) * ARRAYSIZE, cudaMemcpyHostToDevice);

    feedforwardQ<<< 2, NUMTHREADS >>>(d_temp, d_data);

    err = cudaThreadSynchronize();
    if (err != 0) {
        return -1;
    }

    cudaMemcpy(h_data, d_data, sizeof(float) * ARRAYSIZE, cudaMemcpyDeviceToHost);
    printf("data = %f\n", h_data[DATA_INDEX]);
}[/codebox]

A single variable won’t need __threadfence on current hardware because of how memory is physically laid out, but if you write to much larger chunks of memory, there is no guarantee that you won’t need __threadfence.

(this gets into arcane details about CTA scheduling, partitions, and things like that)

Thanks for the responses. What you say makes sense, and I believe I may have seen instances where I am reading/writing huge chunks of memory and using atomic ops to sync between blocks when the writing is “done”. However, when I don’t use __threadfence, it’s possible for a waiting block to get signaled by the atomic-op variable while the huge chunk of writes isn’t yet complete by the time I begin reading it.

I guess I was just trying to find/create a much smaller program that exhibits the same type of behavior.
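For reference, the pattern being described looks roughly like this (an untested sketch with names of my own): the __threadfence() sits between the bulk writes and the atomic “done” flag, so a block released from the spin-wait cannot see the flag before the data.

[codebox]__device__ unsigned int done = 0;

__global__ void producerConsumer(float *chunk, float *out, int n)
{
    if (blockIdx.x == 0) {                        // producer block
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            chunk[i] = 1.0f;                      // the "huge chunk" of writes
        __syncthreads();                          // all threads of block 0 finished writing
        __threadfence();                          // make those writes visible device-wide
        if (threadIdx.x == 0)
            atomicExch(&done, 1);                 // only now publish the "done" flag
    } else {                                      // consumer block
        if (threadIdx.x == 0)
            while (atomicAdd(&done, 0) == 0);     // spin until the flag is set
        __syncthreads();                          // release the rest of the block
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            out[i] = chunk[i];                    // safe: the chunk is visible by now
    }
}[/codebox]

Launched as producerConsumer<<< 2, NUMTHREADS >>>(…), this only works while both blocks are resident on the GPU at the same time, which is the usual caveat with spin-waiting between blocks.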

The real fun starts when you get into interactions between atomic operations and __threadfence.

How do atomic operations interact with other memory loads/stores? Do they make other memory requests visible to other threads before/after the atomic operation?

Theoretically, if I want to broadcast a float via shared memory to all threads in a block, do I have to call __threadfence_block() after the write to shared memory and before the read?

__shared__ float val;
if (threadIdx.x == 0) val = 55;  /* set value */
__threadfence_block();
val;  // available to all threads in the block

Is this correct?

You want to use __syncthreads()

__threadfence_block() guarantees that the individual thread in question will not proceed beyond the fence until the value is visible. However, it makes no claims about other threads.
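In other words, something like this (my own sketch of that suggestion): the barrier, not a fence, is what makes the broadcast safe, because it holds every thread until thread 0’s write to shared memory has completed.

[codebox]__global__ void broadcast(float *out)
{
    __shared__ float val;
    if (threadIdx.x == 0)
        val = 55.0f;           // one thread writes the value
    __syncthreads();           // barrier + fence: everyone waits for the write
    out[threadIdx.x] = val;    // all threads in the block now read 55.0f
}[/codebox]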