Is __threadfence() strong enough to guarantee that CUDA IPC via NVLink accesses the latest data?

I have a distributed program that runs on multiple GPUs and communicates through CUDA IPC/NVLink. Here’s a simplified version of what it does:

Process 0 runs with GPU 0, and Process 1 runs with GPU 1. GPU 0 and GPU 1 are both A100s and are connected via NVLink. Buffer 0 resides on GPU 0, and Buffer 1 resides on GPU 1. Process 0 has an IPC handle to Buffer 1, and similarly Process 1 has an IPC handle to Buffer 0.
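For context, here is roughly what the setup looks like (not my exact code). The send_handle()/recv_handle() helpers are placeholders for whatever out-of-band mechanism (socket, pipe, shared file) passes the cudaIpcMemHandle_t between the two processes, and the buffer size is arbitrary:

```
#include <cuda_runtime.h>

// Placeholders for the out-of-band handle exchange between processes.
void send_handle(const cudaIpcMemHandle_t *h);
void recv_handle(cudaIpcMemHandle_t *h);

// Process 0: allocate Buffer 0 on GPU 0 and export an IPC handle for it.
void process0_setup(float **buf0, size_t n) {
    cudaSetDevice(0);
    cudaMalloc(buf0, n * sizeof(float));
    cudaIpcMemHandle_t handle0;
    cudaIpcGetMemHandle(&handle0, *buf0);
    send_handle(&handle0);                        // placeholder
}

// Process 1: open Process 0's handle to get a device pointer to Buffer 0
// that is reachable over NVLink via peer access.
void process1_setup(float **buf0_remote) {
    cudaSetDevice(1);
    cudaIpcMemHandle_t handle0;
    recv_handle(&handle0);                        // placeholder
    cudaIpcOpenMemHandle((void **)buf0_remote, handle0,
                         cudaIpcMemLazyEnablePeerAccess);
}
// (Buffer 1's handle is exchanged in the opposite direction the same way.)
```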

Now Process 0 writes to Buffer 0. After the write finishes, Process 0 calls __threadfence_system() so that the newly written values become visible system-wide. Then Process 0 flips a flag in Buffer 1. Process 1 spins on this flag in Buffer 1 until it’s flipped, and then reads the contents of Buffer 0.
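A stripped-down sketch of the two kernels involved is below. The single-block launch, the payload values, and the flag layout are simplified placeholders, not my real code:

```
// Process 0 / GPU 0: fill Buffer 0, fence, then flip the flag that lives
// inside Buffer 1 on GPU 1 (reached through the IPC-mapped pointer).
__global__ void producer(float *buf0, volatile int *flag_in_buf1, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf0[i] = (float)i;                  // placeholder payload
    __syncthreads();

    if (threadIdx.x == 0) {
        __threadfence_system();              // the fence in question
        *flag_in_buf1 = 1;                   // signal Process 1
    }
}

// Process 1 / GPU 1: spin on the flag in Buffer 1, then read Buffer 0
// across NVLink through the IPC-mapped pointer.
__global__ void consumer(volatile int *flag_in_buf1,
                         const float *buf0_remote, float *out, int n)
{
    if (threadIdx.x == 0)
        while (*flag_in_buf1 == 0) { }       // busy-wait for the signal
    __syncthreads();

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = buf0_remote[i];
}
```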

__threadfence_system() cannot be omitted; otherwise, Process 1 risks reading Buffer 0’s old values. My understanding of why __threadfence_system() guarantees cross-GPU correctness is that it forces a flush of the new values to the L2 cache, which is exactly what NVLink traffic goes through as well. Then I came across this thread, What the relationship between __syncthreads() and __threadfence() - #3 by LibAndLab, which indicates that __threadfence() also flushes to the L2 cache. I replaced __threadfence_system() with __threadfence() in my program and it still functioned correctly.

Could someone help confirm that my way of using __threadfence() is actually safe? It would be preferable to __threadfence_system(), since the performance hit is much smaller with the former. Thanks!