Can __threadfence_system in function nvshmemi_barrier_threadgroup be changed to fence.acq_rel.sys?

Hi, Team:

I found that in the NVSHMEM source code, the device barrier function uses __threadfence_system:

template <threadgroup_t SCOPE>
__device__ NVSHMEMI_DEVICE_ALWAYS_INLINE void nvshmemi_barrier_threadgroup(nvshmem_team_t team) {
    int myIdx = nvshmemi_thread_id_in_threadgroup<SCOPE>();
    nvshmemi_threadgroup_sync<SCOPE>();
    if ((nvshmemi_device_state_d.job_connectivity > NVSHMEMI_JOB_GPU_LDST)) {
        nvshmemi_transfer_quiet<SCOPE>(true);
    } else if (!myIdx) {
        __threadfence_system();
    }
    nvshmemi_threadgroup_sync<SCOPE>();

    nvshmemi_sync_algo_threadgroup<SCOPE>(team);

    if (!myIdx) {
        if (nvshmemi_device_state_d.job_connectivity > NVSHMEMI_JOB_GPU_PROXY)
            nvshmemi_transfer_enforce_consistency_at_target(false);
    }
    nvshmemi_threadgroup_sync<SCOPE>();
}

Can the __threadfence_system be changed to the fence.acq_rel.sys PTX instruction, which is what NCCL actually uses, and which is less expensive?
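
For concreteness, the change I have in mind is roughly the following sketch (the helper name is mine, not from the NVSHMEM source):

// Sketch only: an acquire-release system-scope fence via inline PTX,
// as a possible drop-in for __threadfence_system() in the branch above.
__device__ __forceinline__ void system_fence_acq_rel() {
    asm volatile("fence.acq_rel.sys;" ::: "memory");
}

// Where the barrier currently does:
//     __threadfence_system();      // fence.sc.sys
// the proposal is:
//     system_fence_acq_rel();      // fence.acq_rel.sys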

Looking forward to your reply!

Yes, these are both system-scope fences. They’re equivalent in the current implementation and should have the same overhead.

Thanks for your reply.

But from the CUDA C++ Programming Guide's point of view,

I think fence.acq_rel.sys and __threadfence_system() (which is actually fence.sc.sys) are different.

I also tested this in my environment, compiling with -gencode arch=compute_80,code=sm_80.
I get the following SASS for __threadfence_system():

        /*0040*/                   MEMBAR.SC.SYS ;         /* 0x0000000000007992 */
                                                           /* 0x000fec0000003000 */
        /*0050*/                   ERRBAR;                 /* 0x00000000000079ab */
                                                           /* 0x000fc00000000000 */
        /*0060*/                   CCTL.IVALL ;            /* 0x00000000ff00798f */


and the following SASS for asm volatile("fence.acq_rel.sys;" ::: "memory"):

        /*0010*/                   MEMBAR.ALL.SYS ;        /* 0x0000000000007992 */
                                                           /* 0x000fec000000b000 */
        /*0020*/                   ERRBAR;                 /* 0x00000000000079ab */
                                                           /* 0x000fc00000000000 */
        /*0030*/                   CCTL.IVALL ;            /* 0x00000000ff00798f */
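
For reference, this is roughly the test I compiled (kernel names and the flag buffer are just for illustration):

// Minimal kernels used to compare the generated SASS for the two fences.
__global__ void fence_sc_kernel(volatile int *flag) {
    __threadfence_system();                           // fence.sc.sys -> MEMBAR.SC.SYS
    *flag = 1;
}

__global__ void fence_acq_rel_kernel(volatile int *flag) {
    asm volatile("fence.acq_rel.sys;" ::: "memory");  // fence.acq_rel.sys -> MEMBAR.ALL.SYS
    *flag = 1;
}

// Built and inspected with:
//   nvcc -gencode arch=compute_80,code=sm_80 -cubin fence_test.cu -o fence_test.cubin
//   cuobjdump -sass fence_test.cubin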

The underlying implementations are the same, so the overhead of MEMBAR.SC.SYS and MEMBAR.ALL.SYS should be similar. The acquire-release fence is semantically weaker than the sequentially-consistent fence that NVSHMEM currently uses, and using the weaker fence could potentially enable some compiler optimizations. The NVSHMEM barrier is meant to synchronize application-issued ld/st operations to symmetric buffers, so we would need to evaluate whether an acquire-release fence is strong enough for this broad use case.
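
For example, a simplified sketch (not taken from NVSHMEM itself or any application; launch shape, buffer, and names are illustrative) of the kind of pattern the barrier has to cover:

#include <nvshmem.h>

// A PE stores directly into a peer's symmetric buffer over the P2P LD/ST path
// and relies on the barrier for visibility before the peer loads it.
__global__ void exchange(int *sym_buf) {               // sym_buf: symmetric allocation
    int me   = nvshmem_my_pe();
    int peer = (me + 1) % nvshmem_n_pes();

    int *remote = (int *)nvshmem_ptr(sym_buf, peer);   // direct LD/ST path if reachable
    if (remote) {
        remote[0] = me;                                // application-issued store (W)
    }
    nvshmem_barrier_all();                             // device barrier; this path reaches
                                                       // the fence discussed above
    int observed = sym_buf[0];                         // application-issued load (R)
    (void)observed;
}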

OK, I see your point that acq_rel may not be strong enough.

But is there a real situation where we must use a sequentially-consistent fence rather than acq_rel for the barrier call?

Besides, I found something on cppreference, which says:

Sequential ordering may be necessary for multiple producer-multiple consumer situations where all consumers must observe the actions of all producers occurring in the same order

but I don't get how this multiple-producer/multiple-consumer situation maps onto the NVSHMEM barrier situation.
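
If I try to make the cppreference scenario concrete, I think it looks something like this (my own illustration, not from any doc):

// Two producers write independent flags; two consumers read them in opposite
// order. With sequentially consistent fences all consumers agree on a single
// order of the two writes, so (r1==1 && r2==0) and (r3==1 && r4==0) cannot
// both be observed; acquire-release fences alone do not forbid that outcome.
__device__ volatile int x = 0, y = 0;

__device__ void producer1() { x = 1; }
__device__ void producer2() { y = 1; }

__device__ void consumer1(int *r1, int *r2) {
    *r1 = x; __threadfence_system(); *r2 = y;
}
__device__ void consumer2(int *r3, int *r4) {
    *r3 = y; __threadfence_system(); *r4 = x;
}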

Hi, I found a case in the CUDA PTX doc that maybe can explain it.

The red line in that figure is effectively the nvshmemi_threadgroup_sync call in nvshmem_barrier: it makes sure all threads wait there at the same time. At that exact point we know that the fence.sc.sys (__threadfence_system()) in T1 and the one in T2 have both executed, in some order. Assume F1 precedes F2. Then we have W1 -> F1 -> F2 -> R2, which makes R2 see T1's W1; and, since the sync also puts R1 after F2, we have W2 -> F2 -> R1, which makes R1 see T2's W2.

Isn't that right?
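
In code, my understanding of that figure is roughly this (my own sketch; labels match the text above, and a and b stand for two flags visible to both threads, e.g. symmetric buffers):

// Store-buffering sketch with SC fences.
__device__ volatile int a = 0, b = 0;

__device__ void t1(int *r1) {
    a = 1;                          // W1
    __threadfence_system();         // F1 (fence.sc.sys)
    // "red line": the barrier's threadgroup sync / flag exchange happens here
    *r1 = b;                        // R1
}

__device__ void t2(int *r2) {
    b = 1;                          // W2
    __threadfence_system();         // F2 (fence.sc.sys)
    // "red line": the barrier's threadgroup sync / flag exchange happens here
    *r2 = a;                        // R2
}

// With the SC fences, r1 == 0 and r2 == 0 cannot both be observed;
// with only fence.acq_rel the fences by themselves do not forbid that outcome.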

Store buffering is the canonical litmus test for sequential consistency. However, NVSHMEM also performs a barrier synchronization (i.e. updating and polling on a flag variable) that synchronizes threads after F1/F2 and prior to R1/R2, which may change the strength of the memory fence needed. It would require some careful consideration to determine whether acq_rel can be used. As I mentioned earlier, there hasn't been a performance argument for this, so we've used the stronger memory fence. Please let us know if you observe better performance with the acq_rel fence.

OK, the analysis above is only theoretical. For now I think fence.sc.sys is essential here, and I will let you know if acq_rel turns out to be better.