NVLink and Cache Levels

On Ampere, suppose two GPUs communicate with each other through cudaIPC from inside a long-running kernel (only one thread block is launched).

Now suppose GPU B reads from a buffer in GPU A's memory that has been shared via cudaIPC:

  1. Is it correct that this traffic goes through NVLink?
  2. Will this read load data from GPU A’s L1 cache, L2 cache, or global memory?
  3. With respect to __threadfence(), does its ordering guarantee also apply to this type of read, just as it does when GPU B reads from its own memory?
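
For concreteness, here is a minimal sketch of the access pattern being asked about. It rests on several assumptions that are not part of the setup above: it uses a single process with peer access (cudaDeviceEnablePeerAccess) as a stand-in for cudaIPC, which would need two processes; device 0 plays GPU A and device 1 plays GPU B; and the kernel names producer and consumer and the payload/flag layout are hypothetical. It uses __threadfence_system() because the CUDA programming guide documents that fence as extending to peer devices; whether plain __threadfence() would be enough is exactly question 3.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA error and leave main().
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    std::printf("CUDA error: %s (line %d)\n", cudaGetErrorString(e_), __LINE__); \
    return 1; } } while (0)

// Runs on GPU A: write a payload, then raise a flag. Both live in GPU A's memory.
__global__ void producer(volatile int *payload, volatile int *flag) {
    *payload = 42;
    __threadfence_system();  // order the payload write before the flag write
    *flag = 1;
}

// Runs on GPU B as a single thread: poll GPU A's flag, then read the payload.
__global__ void consumer(const volatile int *payload,
                         const volatile int *flag, int *out) {
    while (*flag == 0) { }   // volatile, so every poll is a fresh load
    __threadfence_system();  // order the flag read before the payload read
    *out = *payload;
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) { std::printf("this sketch needs two GPUs\n"); return 0; }

    // Can device 1 (GPU B) map device 0's (GPU A's) memory? Whether the
    // traffic then travels over NVLink or PCIe depends on the topology.
    int canAccess = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess, 1, 0));
    if (!canAccess) { std::printf("no peer access from GPU B to GPU A\n"); return 0; }

    // Buffer on GPU A: buf[0] is the payload, buf[1] is the flag.
    int *buf = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf, 2 * sizeof(int)));
    CHECK(cudaMemset(buf, 0, 2 * sizeof(int)));
    CHECK(cudaDeviceSynchronize());

    // GPU B: enable access to GPU A's memory and start the polling kernel.
    int *out = nullptr;
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));
    CHECK(cudaMalloc(&out, sizeof(int)));
    consumer<<<1, 1>>>(buf, buf + 1, out);
    CHECK(cudaGetLastError());

    // GPU A: publish the data while the consumer is still spinning.
    CHECK(cudaSetDevice(0));
    producer<<<1, 1>>>(buf, buf + 1);
    CHECK(cudaGetLastError());

    // This copy on GPU B's default stream waits for the consumer to finish.
    int result = 0;
    CHECK(cudaSetDevice(1));
    CHECK(cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("GPU B read %d from GPU A's buffer\n", result);

    CHECK(cudaFree(out));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(buf));
    return 0;
}
```

With cudaIPC, the pointer that GPU B dereferences would instead come from cudaIpcOpenMemHandle in a second process, but the loads issued by the consumer kernel would look the same.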