Hello,
I have some questions in understanding how __threadfence() works. I have noticed that __threadfence() will be compiled to membar.gl in ptx. From what I have read, membar.gl is able to make all the memory requests (like store) visible to all the threads able to observe the effect by the thread calling __threadfence().
However, while I continue reading, I have noticed that fence.sc is a synonym for membar.gl.
In the introduction of fence.sc, I have noticed that the document described fence-SC order.
- A
fence.scoperation X synchronizes with afence.scoperation Y if X precedes Y in the Fence-SC order.
I have a problem understanding the term “synchronizes with”.
Is this synchronization uni-directional?
Meaning, if X precedes Y in the Fence-SC order, the memory effect from X’s thread will be visible in Y, but Y’s thread’s memory operation will not be visible in X’s thread?
Or, this synchronization means the all the threads in this scope should agree an order, that for all the threads in this scope, the memory effect in X’s thread will always happen before Y’s thread. However, all the threads in this scope should be able to see the effect by both threads if fence.sc has been called by both of them.
Could you also please describe how fence-fence synchronization work here? If referring to __threadfence()or membar.gl explanation in the documents, they should only be called by the calling thread that they prevent calling thread memory operation reordering, and they make the memory modification visible. However, while I was looking at the Inter-thread Synchronization example under this link. I can’t fully understand why the __threadfence() after the atomic load.
__global__ void kernel(int* flag, int* data) {
if (threadIdx.x == 0) {
// Consumer: blocks until flag is set by producer, then reads data
while(atomicAdd(flag, 0) == 0); // Load with Relaxed Read-Modify-Write
__threadfence(); // SequentiallyConsistent fence
if (*data != 42) __trap(); // Errors if wrong data read
} else if (threadIdx.x == 1) {
// Producer: writes data then sets flag
*data = 42;
__threadfence(); // SequentiallyConsistent fence
atomicExch(flag, 1); // Store with Relaxed Read-Modify-Write
}
}
From my understanding based on the programming guide, when __threadfence() after *data = 42 has been called, we already guaranteed that *data = 42 is visible to all the threads happening before atomicExch(flag, 1). Thus, when thread 1 gets flag = 1, *data = 42 should already be there, why we need another threadfence() after that?
Thanks very much