I have a distributed program that runs on multiple GPUs and communicates through CUDA IPC/NVLink. Here’s a simplified version of what it does:
Process 0 runs on GPU 0, and Process 1 runs on GPU 1. GPU 0 and GPU 1 are connected via NVLink; both are A100s. Buffer 0 resides on GPU 0, and Buffer 1 resides on GPU 1. Process 0 holds an IPC handle to Buffer 1, and likewise Process 1 holds an IPC handle to Buffer 0.
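For context, the handle exchange I’m describing looks roughly like this on the host side (variable names are mine; the handle is shipped between processes out of band, e.g. over a Unix socket):

```cuda
#include <cuda_runtime.h>

// Process 0: export Buffer 0 so Process 1 can map it.
void export_buffer(int *buf0, cudaIpcMemHandle_t *h0)
{
    // buf0 was allocated with cudaMalloc on GPU 0.
    cudaIpcGetMemHandle(h0, buf0);
    // *h0 is then sent to Process 1 through some IPC channel.
}

// Process 1: import Buffer 0 from the received handle.
void import_buffer(cudaIpcMemHandle_t h0, void **buf0_remote)
{
    // buf0_remote becomes a device pointer in Process 1's context
    // that refers to the same physical memory on GPU 0.
    cudaIpcOpenMemHandle(buf0_remote, h0, cudaIpcMemLazyEnablePeerAccess);
}
```

The symmetric exchange happens for Buffer 1 in the other direction.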
Now Process 0 writes to Buffer 0. After the write finishes, Process 0 calls
__threadfence_system() so that the newly written values become visible system-wide. Then Process 0 flips a flag in Buffer 1. Process 1 spins on this flag in Buffer 1 until it flips, and then Process 1 reads the contents of Buffer 0.
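To make the protocol concrete, here is a minimal single-thread sketch of the two kernels, under my assumptions (names are mine; flag1 points into Buffer 1 via the IPC mapping in Process 0, and buf0_remote is Buffer 0 mapped into Process 1):

```cuda
// Process 0, launched on GPU 0: write the payload, fence, then publish.
__global__ void producer(int *buf0, volatile int *flag1, int n)
{
    // Single-thread kernel to keep the ordering argument simple.
    for (int i = 0; i < n; ++i)
        buf0[i] = i;              // step 1: write Buffer 0

    __threadfence_system();       // step 2: order the payload writes
                                  // before the flag write, system-wide

    *flag1 = 1;                   // step 3: flip the flag in Buffer 1
}

// Process 1, launched on GPU 1: spin on the flag, then read the payload.
__global__ void consumer(volatile int *buf0_remote, volatile int *flag1,
                         int n, int *out)
{
    while (*flag1 == 0) { }       // spin until Process 0 publishes

    __threadfence_system();       // consumer-side fence before the reads

    for (int i = 0; i < n; ++i)
        out[i] = buf0_remote[i];  // read Buffer 0 over NVLink
}
```

The volatile qualifiers are there so the spin loop and the remote reads are not cached in registers.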
__threadfence_system() cannot be omitted; otherwise, Process 1 risks reading Buffer 0’s old values. My understanding of why
__threadfence_system() guarantees cross-GPU correctness is that it forces a flush of the new values to the L2 cache, which is exactly what NVLink traffic goes through as well. Then I came across this thread What the relationship between __syncthreads() and __threadfence() - #3 by LibAndLab, which indicates that
__threadfence() also flushes to the L2 cache. I replaced
__threadfence_system() with __threadfence() in my program, and it still functioned correctly.
Could someone help confirm whether my use of
__threadfence() is actually safe? It would be preferable to
__threadfence_system(), since its performance hit is much smaller. Thanks!