I have a distributed program that runs on multiple GPUs and communicates through CUDA IPC/NVLink. Here's a simplified version of what it does:
Process 0 runs with GPU 0 and Process 1 runs with GPU 1. GPU 0 and GPU 1 are connected via NVLink; both are A100s. Buffer 0 resides on GPU 0 and Buffer 1 resides on GPU 1. Process 0 has an IPC handle to Buffer 1, and similarly Process 1 has an IPC handle to Buffer 0.
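For reference, the host-side setup looks roughly like this from Process 0's point of view (a minimal sketch: the helper name `setup_process0` is made up, the handle exchange between the two processes is assumed and not shown, and error checking is omitted):

```
#include <cuda_runtime.h>

// Sketch of the IPC setup as seen from Process 0; Process 1 mirrors it with
// device 1, Buffer 1, and the other handle. How the handles are exchanged
// between the processes (pipe, socket, shared memory, ...) is not shown.
void setup_process0(size_t n,
                    cudaIpcMemHandle_t handle_of_buf1,   // received from Process 1
                    int **buf0,                          // out: local Buffer 0
                    int **buf1_remote,                   // out: Buffer 1 mapped via IPC
                    cudaIpcMemHandle_t *handle_of_buf0)  // out: sent to Process 1
{
    cudaSetDevice(0);

    // Allocate Buffer 0 on GPU 0 and export an IPC handle for it.
    cudaMalloc((void **)buf0, n * sizeof(int));
    cudaIpcGetMemHandle(handle_of_buf0, *buf0);

    // Map Buffer 1 (owned by Process 1 on GPU 1) into this process's address space.
    cudaIpcOpenMemHandle((void **)buf1_remote, handle_of_buf1,
                         cudaIpcMemLazyEnablePeerAccess);
}
```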
Now Process 0 writes to Buffer 0. After the write finishes, Process 0 calls `__threadfence_system()` so that the newly written values become visible system-wide. Then Process 0 flips a flag in Buffer 1. Process 1 spins on this flag in Buffer 1 until it's flipped, and then reads the contents of Buffer 0.
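For concreteness, here is a stripped-down sketch of the two kernels I have in mind (hypothetical names, a single-block launch per kernel, dummy payload, no error checking). I also put a fence on the consumer side to keep its data reads ordered after the flag read, which the description above glosses over:

```
// Producer, launched by Process 0 on GPU 0 with a single block.
// buf0 is the local Buffer 0; flag_in_buf1 points into Buffer 1 on GPU 1,
// reached through the IPC mapping opened in Process 0.
__global__ void producer(volatile int *buf0, volatile int *flag_in_buf1, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf0[i] = i;                 // step 1: write the payload into Buffer 0

    __syncthreads();                 // whole block has finished writing

    if (threadIdx.x == 0) {
        __threadfence_system();      // step 2: the fence in question
        *flag_in_buf1 = 1;           // step 3: flip the flag in Buffer 1 over NVLink
    }
}

// Consumer, launched by Process 1 on GPU 1 with a single block.
// flag_in_buf1 is in the local Buffer 1; buf0_remote is Buffer 0 on GPU 0,
// reached through the IPC mapping opened in Process 1.
__global__ void consumer(volatile int *buf0_remote, volatile int *flag_in_buf1,
                         int *out, int n)
{
    if (threadIdx.x == 0) {
        while (*flag_in_buf1 == 0) { }   // spin until Process 0 flips the flag
        __threadfence_system();          // conservative: keep the data reads after the flag read
    }
    __syncthreads();

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = buf0_remote[i];     // step 4: read Buffer 0's new contents
}
```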
`__threadfence_system()` cannot be omitted; otherwise, Process 1 risks reading Buffer 0's old values. My understanding of why `__threadfence_system()` guarantees cross-GPU correctness is that it forces a flush of the new values to the L2 cache, and L2 is exactly what NVLink traffic goes through too. Then I came across this thread, What the relationship between __syncthreads() and __threadfence() - #3 by LibAndLab, which indicates that `__threadfence()` also flushes to L2. I replaced `__threadfence_system()` with `__threadfence()` in my program and it still functioned correctly.
Could someone help confirm that my way of using `__threadfence()` is actually safe? It would be preferable to `__threadfence_system()` since the performance hit is much smaller with the former. Thanks!