Threadfence and cache levels

I have 2 A100 GPUs that communicate with each other in the same kernel via buffers shared through cudaIPC.

Suppose there is a producer-consumer relationship between the two: GPU A stages some data, does a __threadfence(), and signals GPU B via a flag (which is also shared through cudaIPC). When GPU B sees that the flag is set, is it guaranteed that the data will be fully visible?

What I am really trying to understand is: does __threadfence() guarantee L1-level cache coherence between two SMs on separate GPUs? Or do we need to bypass L1 on both the producer and consumer side to make this work?

__threadfence() doesn’t say anything explicit about caches. It is a visibility statement concerning global memory (i.e. the logical global space).

This is what __threadfence() says:

write_to_global(A);
__threadfence();
write_to_global(B);

No other thread in the device will observe B and not A. They may observe neither A nor B. They may observe A only (i.e. A and not B). They may also observe A and B. They will never observe B and not A.

Threadfence applies to writes made by the thread that issued the threadfence.

__threadfence_system() extends “the device” in the above description to:

"the device, host threads, and all threads in peer devices"

A peer device is a device that has been explicitly placed into a peer relationship by the CUDA programmer.
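
To make the ordering concrete, here is a minimal sketch of the flag pattern from the question using __threadfence_system(). It is an illustration under assumptions, not a confirmed recipe for the cudaIPC case: it uses in-process peer access (cudaDeviceEnablePeerAccess) instead of cudaIPC, falls back to a single GPU if no peer is available, and the names producer, consumer, run_demo, N and CHECK are made up for the example. The volatile accesses and the consumer-side fence are conservative choices on my part; nothing above states that they are required or sufficient.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: bail out of the calling function on error.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); return -1; } } while (0)

constexpr int N = 256;  // payload size, arbitrary for the sketch

// One thread does all the writes, because the fence orders the writes
// made by the thread that issues it.
__global__ void producer(volatile int* data, volatile int* flag) {
    if (threadIdx.x == 0) {
        for (int i = 0; i < N; ++i) data[i] = i + 1;  // "A": the payload
        __threadfence_system();  // A is ordered before B, also for peer devices
        *flag = 1;               // "B": the signal
    }
}

// Polls the flag, then reads the payload through a volatile pointer
// (conservative assumption: avoid reusing previously loaded values).
__global__ void consumer(const volatile int* data, volatile int* flag, int* mismatches) {
    if (threadIdx.x == 0) {
        while (*flag == 0) { }   // wait until B is observed
        __threadfence_system();  // conservative: keep the data reads after the flag read
        int bad = 0;
        for (int i = 0; i < N; ++i) bad += (data[i] != i + 1);
        *mismatches = bad;
    }
}

// Returns the number of payload mismatches seen by the consumer, or -1 on error.
int run_demo() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 1) return -1;

    int prodDev = 0, consDev = 0;  // single-GPU fallback: both kernels on device 0
    if (count >= 2) {
        int canAccess = 0;
        CHECK(cudaDeviceCanAccessPeer(&canAccess, 0, 1));
        if (canAccess) consDev = 1;
    }

    // Buffers live on the consumer's device; the producer writes them as a peer.
    CHECK(cudaSetDevice(consDev));
    int *data = nullptr, *flag = nullptr, *mismatches = nullptr;
    CHECK(cudaMalloc(&data, N * sizeof(int)));
    CHECK(cudaMalloc(&flag, sizeof(int)));
    CHECK(cudaMalloc(&mismatches, sizeof(int)));
    CHECK(cudaMemset(data, 0, N * sizeof(int)));
    CHECK(cudaMemset(flag, 0, sizeof(int)));
    CHECK(cudaDeviceSynchronize());

    // Launch the producer first, so the consumer cannot spin forever
    // even if the two kernels end up running one after the other.
    CHECK(cudaSetDevice(prodDev));
    if (prodDev != consDev) {
        CHECK(cudaDeviceEnablePeerAccess(consDev, 0));
    }
    producer<<<1, 32>>>(data, flag);
    CHECK(cudaGetLastError());

    CHECK(cudaSetDevice(consDev));
    consumer<<<1, 32>>>(data, flag, mismatches);
    CHECK(cudaGetLastError());

    int hostMismatches = -1;
    CHECK(cudaMemcpy(&hostMismatches, mismatches, sizeof(int), cudaMemcpyDeviceToHost));

    CHECK(cudaSetDevice(prodDev));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(consDev));
    CHECK(cudaFree(data));
    CHECK(cudaFree(flag));
    CHECK(cudaFree(mismatches));
    return hostMismatches;
}

int main() {
    int result = run_demo();
    printf("mismatches seen by consumer: %d\n", result);
    return result == 0 ? 0 : 1;
}
```

A zero mismatch count on one run is consistent with the "never B and not A" statement but does not prove it; the guarantee comes from the fence semantics described above, not from the test.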