Forgive my ignorance, but could somebody tell me the difference between __threadfence_block() and __syncthreads()?
According to the CUDA Programming Guide 2.2.1, they both wait until all writes to global and shared memory are finished…
As an aside, has anyone ever run into a problem that actually required __threadfence()? I have tried on numerous occasions to construct a case where memory transactions from other threads in the same CTA are not yet visible (which would produce incorrect results), without success. I have even tried removing the __threadfence() intrinsics from the threadFenceReduction SDK example, with no effect. I have tried this with CUDA 2.3 on a GTX 280 and a GTX 285 with similar results…
EDIT: is it possible that on current hardware, all memory transactions become visible in a consistent order to other threads in the same CTA and to all threads in other concurrently running CTAs?
Thanks for the link; this was very informative. I found this section particularly humorous:
So my understanding as of this point is that current-generation hardware enforces a global order on memory operations, which is why I cannot get code without a __threadfence() to produce an incorrect result. However, future hardware might drop this global-ordering property for performance reasons, making __threadfence() necessary for writing portable, correct code. Is this right?
__threadfence() and __threadfence_block() are two different issues.
I think it is possible to write code doing inter-block communication that needs __threadfence() to work properly. (Plus, I believe Tim when he says so…)
A way to achieve this in a synthetic benchmark would be to generate heavy contention on one memory partition (refer to http://forums.nvidia.com/index.php?showtopic=96423 for a discussion on partition camping).
For example, make blocks 2 to n continuously read and write the same memory partition 0, flooding it with requests. In parallel, make block 1 write a word in partition 0, then another word in partition 1. Meanwhile, make block 0 read both words written by block 1.
If my intuition is correct (and the whole choreography is timed properly), you will hit a case where block 0 reads the older value from partition 0 and the newer value from partition 1, that is, in the opposite order from which they were written.
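If it helps, here is a rough, untested sketch of that choreography. The names (orderingProbe, P0_STRIDE, …) are made up, the partition constants assume a GT200-style layout (256-byte-wide partitions, 8 partitions), buf must hold at least ~17K words, and all blocks are assumed to be resident simultaneously; the timing would need tuning per GPU:

#define FLOOD_ITERS 100000
// Assumed layout: 64-word chunks at 512-word strides all map to
// partition 0, and buf[64] lands in partition 1.
#define P0_STRIDE 512
#define P1_OFFSET 64

__global__ void orderingProbe(volatile unsigned int *buf, unsigned int *reordered)
{
    if (blockIdx.x >= 2) {
        // Blocks 2..n: flood partition 0 with traffic.
        for (unsigned int i = 0; i < FLOOD_ITERS; ++i)
            buf[P0_STRIDE * (1 + i % 32) + (threadIdx.x % 64)] = i;
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        // Block 1: word A into the congested partition 0,
        // then word B into the idle partition 1.
        buf[0] = 1;           // word A
        // an uncommented __threadfence() here should kill the effect
        buf[P1_OFFSET] = 1;   // word B
    } else if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Block 0: poll word B first, then read word A. If B is
        // visible while A is not, the writes became visible in the
        // opposite order from which they were issued.
        while (buf[P1_OFFSET] == 0)
            ;
        if (buf[0] == 0)
            *reordered = 1;
    }
}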
Now, about __threadfence_block(). I wrote previously that it is currently implemented as a NOP because memory ordering is always correct inside one block, but that might change in the future.
Actually, another explanation is that __threadfence_block() tells the compiler to avoid optimizations that could reorder reads or writes across that point. Just like volatile, but more localized.
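In other words, in a producer pattern like this minimal sketch, the fence keeps the payload store from being moved past the flag store (the payload value 42 is just a stand-in):

__shared__ int s_data;
__shared__ int s_flag;

if (threadIdx.x == 0) {
    s_data = 42;             // payload (stand-in value)
    __threadfence_block();   // payload store may not sink below the flag store
    s_flag = 1;              // signal to consumers in the same block
}

A consumer thread spinning on s_flag would additionally need the shared variables declared volatile (or a fence of its own) so the compiler does not cache them in registers.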
I am working on an application that uses inter-block communication, so it's actually interesting for me to check this out. I tried writing a small testbench, as Sylvain described, with 2 blocks communicating data.
Block 1 spin-waits for Block 0 to unlock it via atomic ops. Block 0 attempts to queue up a bunch of writes in a single memory partition, then performs one last write of a specific value (12345) to a single memory location in that same partition. Next, Block 0 uses the atomic ops to let Block 1 exit the spin-wait. The first thing Block 1 does is read that single memory location (which may not have seen the 12345 yet, as the many queued writes are ahead of it).
However, I am always able to see the 12345 without needing the __threadfence(). The code is below; does anyone have any suggestions on how I can modify it to catch the need for __threadfence()?
// Values below are placeholders; pick them to suit the GPU under test.
#define STRESS 1024              // number of flooding writes (assumed)
#define DATA_INDEX 0             // flagged location (assumed)
#define PARTITION_STRIDE 16384   // offset staying in the same partition (assumed)

__device__ unsigned int g_lock;  // inter-block lock, zeroed by the host

__global__ void fenceTest(volatile unsigned int *g_idata, unsigned int *g_odata)
{
    unsigned int old = 0;
    unsigned int temp = 0;           // try to catch an older write
    unsigned int tid = threadIdx.x;

    if (blockIdx.x == 1) {           // Block 1 waits for Block 0
        if (threadIdx.x == 0) {
            while (old != 1) {       // spin-wait for Block 0 to set the lock
                old = atomicCAS(&g_lock, 1, 0);
            }
            temp = g_idata[DATA_INDEX];  // get indexed data immediately
        }
        __syncthreads();
        if (threadIdx.x == 0) {
            g_odata[DATA_INDEX] = temp;  // copy indexed data to output
        } else {
            g_odata[threadIdx.x] = g_idata[threadIdx.x];  // copy inputs to outputs
        }
    } else {                         // Block 0 queues many writes (memory camping)
        for (unsigned int i = 0; i < STRESS; i++) {
            g_idata[DATA_INDEX + PARTITION_STRIDE + i % 64] = tid;
        }
        // Final write and release, reconstructed from the description above:
        if (threadIdx.x == 0) {
            g_idata[DATA_INDEX] = 12345; // the value Block 1 is watching for
        }
        //__threadfence();
        //__syncthreads();
        if (threadIdx.x == 0) {
            atomicExch(&g_lock, 1);      // let Block 1 exit its spin-wait
        }
    }
}
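For reference, the host side looks roughly like this (buffer size and launch shape are placeholders, not my exact harness):

unsigned int *d_idata, *d_odata;
unsigned int zero = 0, result = 0;
const int N = 65536;  // placeholder; must cover DATA_INDEX + PARTITION_STRIDE + 64

cudaMalloc((void**)&d_idata, N * sizeof(unsigned int));
cudaMalloc((void**)&d_odata, N * sizeof(unsigned int));
cudaMemset(d_idata, 0, N * sizeof(unsigned int));
cudaMemcpyToSymbol(g_lock, &zero, sizeof(unsigned int));  // lock starts unset

fenceTest<<<2, 64>>>(d_idata, d_odata);  // block 0 writes, block 1 reads

cudaMemcpy(&result, d_odata + DATA_INDEX, sizeof(unsigned int),
           cudaMemcpyDeviceToHost);
// result != 12345 would mean Block 1 caught a stale value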
One variable won't need __threadfence on current hardware because of how memory is physically laid out, but if you write much larger chunks of memory, there is no guarantee that you won't need __threadfence.
(this gets into arcane details about CTA scheduling, partitions, and things like that)
Thanks for the responses. What you say makes sense, and I believe I may have seen instances of this where I read/write huge chunks of memory and use the atomic ops to sync between blocks when the writing is "done". However, when I don't use __threadfence, it's possible for a waiting block to get signaled by the atomic-op variable before the huge chunk of writes has completed, so the data isn't all there by the time I begin reading it.
I guess I was just trying to find/create a much smaller program that exhibits the same type of behavior.
Theoretically, if I want to broadcast a float via shared memory to all threads in a block, do I have to call __threadfence_block() after the write to shared memory and before the read?
__shared__ float val;
if (threadIdx.x == 0) val = 55.0f; /* set value */
__threadfence_block();
float x = val; // available to all threads in the block?
__threadfence_block() guarantees that the individual thread in question will not proceed beyond the fence until its writes are visible to the rest of the block. However, it makes no claims about other threads.
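For the broadcast in the question above, the tool you actually want is __syncthreads(), which is both an execution barrier and a memory fence for the block. A minimal corrected version:

__shared__ float val;
if (threadIdx.x == 0) val = 55.0f;  // one thread writes the value
__syncthreads();                    // all threads wait; the write becomes visible block-wide
float mine = val;                   // every thread now reads 55.0f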