On Ampere, suppose two GPUs talk to each other in a long-running kernel (only one thread block is launched) through cudaIPC, and GPU B reads from a buffer of GPU A that was shared via cudaIPC:
- Is it correct that this traffic goes over NVLink?
- Will this read fetch data from GPU A's L1 cache, L2 cache, or global memory?
- Regarding `__threadfence`: is its ordering enforced for this kind of peer read as well, just as it is when GPU B reads from its own memory?
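
To make the question concrete, here is a minimal sketch of the pattern I have in mind (the kernel names, values, and flag protocol are just illustrative, not my actual code; host-side `cudaIpcGetMemHandle`/`cudaIpcOpenMemHandle` plumbing is omitted). The producer runs on GPU A, the consumer on GPU B, and the consumer's pointers come from opening GPU A's IPC handle:

```cuda
#include <cuda_runtime.h>

// Runs on GPU A: write the payload, fence, then publish the flag.
__global__ void producer(volatile int *data, volatile int *flag) {
    *data = 42;              // payload write to GPU A's global memory
    __threadfence_system();  // order the payload write before the flag,
                             // as observed system-wide (peer GPUs, host)
    *flag = 1;               // publish
}

// Runs on GPU B: peer_data/peer_flag were obtained via cudaIpcOpenMemHandle,
// so these loads target GPU A's memory (the traffic the question is about).
__global__ void consumer(volatile int *peer_data, volatile int *peer_flag,
                         int *out) {
    while (*peer_flag == 0) { }  // spin on the remote flag
    __threadfence_system();      // order the flag read before the data read
    *out = *peer_data;           // expect 42 if ordering holds across the link
}
```

The question is essentially whether the fence in `producer` guarantees that `consumer` can never observe `*peer_flag == 1` while still reading a stale `*peer_data` over the interconnect.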