Are the st.v4 and multimem.v4 instructions atomic?

The LL protocol in NCCL depends on the st instruction storing its data atomically:

asm volatile("st.volatile.global.v4.u32 [%0], {%1,%2,%3,%4};" :: "l"(&dst->i4), "r"((uint32_t)val),
"r"(flag), "r"((uint32_t)(val >> 32)), "r"(flag) : "memory");

The instruction stores four 32-bit values to global memory. Are all four values written atomically, as a single 128-bit store?

In general (not answering your specific question), CUDA can perform 128-bit atomic accesses on devices of compute capability 9.0 and higher.
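For context on the store in the question, here is a minimal sketch (not NCCL's actual code) of how a reader could consume a line written that way. The names LLLine, storeLL, readLL, llDemo and runLLRoundTrip are hypothetical and exist only for this illustration. The point it shows is that the flag is written twice, once next to each 32-bit half of the value, and the reader only accepts the data once both copies match. Under that scheme the reader would only need each data/flag pair to arrive together, not the whole 16 bytes; whether the hardware guarantees even that is an assumption here, not something the sketch proves.

```cuda
#include <cassert>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical 16-byte line, laid out as the store in the question implies:
// {low 32 bits of data, flag, high 32 bits of data, flag}.
union LLLine {
  struct { uint32_t data1, flag1, data2, flag2; };
  int4 i4;  // forces 16-byte alignment, required by the v4 access
};

// Writer: the same vectorized store as in the question.
__device__ void storeLL(LLLine* dst, uint64_t val, uint32_t flag) {
  asm volatile("st.volatile.global.v4.u32 [%0], {%1,%2,%3,%4};"
               :: "l"(&dst->i4), "r"((uint32_t)val), "r"(flag),
                  "r"((uint32_t)(val >> 32)), "r"(flag) : "memory");
}

// Reader: reload the line until BOTH flag copies match the expected flag,
// then reassemble the 64-bit value from the two data words.
__device__ uint64_t readLL(LLLine* src, uint32_t flag) {
  uint32_t data1, flag1, data2, flag2;
  do {
    asm volatile("ld.volatile.global.v4.u32 {%0,%1,%2,%3}, [%4];"
                 : "=r"(data1), "=r"(flag1), "=r"(data2), "=r"(flag2)
                 : "l"(&src->i4) : "memory");
  } while (flag1 != flag || flag2 != flag);
  return (uint64_t)data1 | ((uint64_t)data2 << 32);
}

// Block 0 writes the line, block 1 spins until it can read it.
// Assumes both single-thread blocks are resident at the same time.
__global__ void llDemo(LLLine* line, uint64_t val, uint32_t flag, uint64_t* out) {
  if (blockIdx.x == 0) storeLL(line, val, flag);
  else *out = readLL(line, flag);
}

// Host helper: returns the value the reader block observed.
uint64_t runLLRoundTrip(uint64_t val, uint32_t flag) {
  assert(flag != 0);  // the line is zero-initialized, so flag 0 would mean "empty"
  LLLine* line = nullptr;
  uint64_t* out = nullptr;
  uint64_t result = 0;
  cudaMalloc(&line, sizeof(LLLine));
  cudaMalloc(&out, sizeof(uint64_t));
  cudaMemset(line, 0, sizeof(LLLine));
  llDemo<<<2, 1>>>(line, val, flag, out);
  cudaMemcpy(&result, out, sizeof(result), cudaMemcpyDeviceToHost);  // waits for the kernel
  cudaFree(line);
  cudaFree(out);
  return result;
}
```

Calling runLLRoundTrip(0x1122334455667788ULL, 1) from main exercises one write and one read. A round trip like this cannot demonstrate atomicity; it only shows the double-flag check the reader side would perform.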