I also tested this in my environment, setting the compiler flags to -gencode arch=compute_80,code=sm_80.
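For reference, a minimal probe kernel like the sketch below, compiled with those flags and run through cuobjdump, is enough to inspect the lowering (the kernel and variable names are placeholders, not my exact test code):

__global__ void fence_probe(volatile int *flag)
{
    *flag = 1;               // a store before the fence
    __threadfence_system();  // the fence whose SASS we want to see
    *flag = 2;               // a store after the fence
}

// nvcc -gencode arch=compute_80,code=sm_80 -cubin fence_probe.cu -o fence_probe.cubin
// cuobjdump --dump-sass fence_probe.cubin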
With that, I get the following SASS for __threadfence_system().
The underlying implementations are the same, so the overhead of MEMBAR.SC.SYS and MEMBAR.ALL.SYS should be similar. The acquire-release fence is semantically weaker than the sequentially-consistent fence that NVSHMEM currently uses, and using the weaker fence could potentially enable some compiler optimizations. The NVSHMEM barrier is meant to synchronize application-issued ld/st operations to symmetric buffers, so we would need to evaluate whether an acquire-release fence is strong enough for this broad use case.
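As an illustration of the two fence strengths (not NVSHMEM's actual code), __threadfence_system() is the sequentially-consistent system-scope fence, while an acquire-release system-scope fence can be expressed through libcu++ or inline PTX:

#include <cuda/atomic>   // libcu++

__device__ void seq_cst_sys_fence()
{
    // Sequentially-consistent system-scope fence, what NVSHMEM issues today;
    // this is the fence whose SASS lowering is discussed above.
    __threadfence_system();
}

__device__ void acq_rel_sys_fence()
{
    // Semantically weaker acquire-release system-scope fence.
    cuda::atomic_thread_fence(cuda::std::memory_order_acq_rel,
                              cuda::thread_scope_system);
    // Equivalent inline-PTX form (sm_70+):
    //   asm volatile("fence.acq_rel.sys;" ::: "memory");
}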
OK, I see your point that acq_rel may not be strong enough.
But is there a real situation where we must use the sequentially-consistent fence rather than acq_rel for the barrier call?
Besides, I found something on cppreference. It says:
Sequential ordering may be necessary for multiple producer-multiple consumer situations where all consumers must observe the actions of all producers occurring in the same order
But I don't see how to translate this multiple producer-multiple consumer situation into the NVSHMEM barrier situation.
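As far as I can tell, the situation that sentence describes is the classic independent-reads-of-independent-writes pattern, something like the sketch below (my own schematic illustration with plain volatile accesses, not NVSHMEM code):

// Two producers write two independent flags; two consumers each read both
// flags in opposite orders. With only acquire/release ordering, consumer1 may
// see x==1, y==0 while consumer2 sees y==1, x==0, i.e. the consumers disagree
// on the order of the two writes. Sequentially-consistent fences between the
// two reads forbid that outcome.
__device__ volatile int x = 0, y = 0;

__device__ void producer1() { x = 1; }
__device__ void producer2() { y = 1; }

__device__ void consumer1(int *r1, int *r2)
{
    *r1 = x;
    __threadfence_system();   // needs seq_cst strength for the guarantee
    *r2 = y;
}

__device__ void consumer2(int *r3, int *r4)
{
    *r3 = y;
    __threadfence_system();
    *r4 = x;
}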
The red line here is actually the nvshmemi_threadgroup_sync call in nvshmem_barrier; it makes sure that all threads wait at this point at the same time. At that moment, we know that both fence.sc.sys fences (__threadfence_system()) in T1 and T2 have executed, in some order. Assume F1 precedes F2. Then we have W1 -> F1 -> F2 -> R2, which makes R2 see T1's W1, and we also have W2 -> F2 -> R1 (the sync orders F2 before R1), which makes R1 see T2's W2.
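To make the notation concrete, the pattern I have in mind is roughly the following (a schematic sketch only; the kernel and parameter names are placeholders, launch setup is omitted, and the peer is assumed to be P2P-reachable so nvshmem_ptr returns a usable pointer):

#include <nvshmem.h>

// Each PE writes its own symmetric flag (W1 / W2), calls the barrier, which
// internally issues __threadfence_system() (F1 / F2) and
// nvshmemi_threadgroup_sync, and then reads the peer's flag (R1 / R2).
__global__ void litmus(int *sym_flag, int *result, int peer_pe)
{
    volatile int *peer_flag = (volatile int *) nvshmem_ptr(sym_flag, peer_pe);

    *sym_flag = 1;            // W1 on PE 0, W2 on PE 1
    nvshmem_barrier_all();    // F1/F2 + threadgroup sync happen in here
    *result = *peer_flag;     // R1/R2: expected to observe the peer's write
}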
Store buffering is the canonical litmus test for sequential consistency. However, NVSHMEM is adding a barrier synchronization (i.e., updating and polling on a flag variable) that synchronizes threads after F1/F2 and prior to R1/R2, which may change the strength of the memory barrier needed. It would require some careful consideration to determine whether acq_rel can be used. As I mentioned earlier, there hasn't been a performance argument for this, so we've used the stronger memory fence. Please let us know if you're observing better performance with the acq_rel fence.
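For concreteness, "updating and polling on a flag variable" refers to signaling of roughly this shape (an illustrative sketch using libcu++ atomics, not NVSHMEM's actual barrier code):

#include <cuda/atomic>

using sys_flag = cuda::atomic<int, cuda::thread_scope_system>;

__device__ void signal_peer(sys_flag *flag)
{
    // Release store: the signaler's earlier writes are ordered before the flag update.
    flag->store(1, cuda::std::memory_order_release);
}

__device__ void wait_for_peer(sys_flag *flag)
{
    // Acquire poll: once the flag is observed, the signaler's earlier writes are visible.
    while (flag->load(cuda::std::memory_order_acquire) != 1) { }
}

Whether this acquire/release signaling, combined with a weaker fence, covers every pattern the barrier has to support is exactly the open question above.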