The LL protocol in NCCL relies on the st instruction performing an atomic store:
asm volatile("st.volatile.global.v4.u32 [%0], {%1,%2,%3,%4};" :: "l"(&dst->i4), "r"((uint32_t)val),
    "r"(flag), "r"((uint32_t)(val >> 32)), "r"(flag) : "memory");
The instruction stores four 32-bit values to global memory. Is the whole store of all four values performed atomically?
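To make the layout in question concrete: the four operands are the low 32 bits of val, the flag, the high 32 bits of val, and the flag again. Below is a minimal host-side sketch in plain C++ (not device code) of that 16-byte line. The names LLLine, packLL and tryUnpackLL are hypothetical, and the reader-side check of both flags is an assumption about how a consumer would validate the line, not something stated in the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one 16-byte line, in the operand order of the
// st.v4.u32 above: {low 32 bits of val, flag, high 32 bits of val, flag}.
struct LLLine {
  uint32_t data1;
  uint32_t flag1;
  uint32_t data2;
  uint32_t flag2;
};

// Build the four 32-bit operands from a 64-bit value and a flag.
LLLine packLL(uint64_t val, uint32_t flag) {
  return LLLine{(uint32_t)val, flag, (uint32_t)(val >> 32), flag};
}

// Assumed reader-side check: accept the line only if BOTH flags match the
// expected value, then reassemble the 64-bit value from its two halves.
bool tryUnpackLL(const LLLine& line, uint32_t expectedFlag, uint64_t* out) {
  if (line.flag1 != expectedFlag || line.flag2 != expectedFlag) return false;
  *out = (uint64_t)line.data1 | ((uint64_t)line.data2 << 32);
  return true;
}
```

With this layout, a reader that saw only one 8-byte half updated would find the two flags disagreeing and could retry, provided each (data, flag) pair becomes visible as a unit. The sketch says nothing about whether the hardware guarantees that, or whether the full 16 bytes are written atomically.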