Atomic operation to a peer GPU’s memory?


My question is about atomic operations. It is mentioned at this link that compute capability 6.x allows widening or narrowing the scope of an atomic operation. Using atomicAdd_system and atomicAdd is also quite straightforward.
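For reference, here is a minimal sketch of how I am using the two scopes. It assumes a single GPU of compute capability 6.0 or higher, managed memory, and compilation with something like nvcc -arch=sm_60; the kernel and function names (add_device_scope, add_system_scope, run_scoped_atomics) are just illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device scope: atomic with respect to other threads on the same GPU.
__global__ void add_device_scope(int *counter) {
    atomicAdd(counter, 1);
}

// System scope: also atomic with respect to the CPU and other GPUs.
// Requires compute capability 6.0 or higher.
__global__ void add_system_scope(int *counter) {
    atomicAdd_system(counter, 1);
}

// Hypothetical helper: returns the final counter value, or -1 on a CUDA error.
int run_scoped_atomics() {
    int *counter = nullptr;
    if (cudaMallocManaged(&counter, sizeof(int)) != cudaSuccess) return -1;
    *counter = 0;

    // Each launch runs 4 blocks of 64 threads, so each adds 256 in total.
    add_device_scope<<<4, 64>>>(counter);
    add_system_scope<<<4, 64>>>(counter);

    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(counter);
        return -1;
    }
    int result = *counter;  // safe to read on the host after synchronizing
    cudaFree(counter);
    return result;
}

int main() {
    printf("counter = %d\n", run_scoped_atomics());
    return 0;
}
```

On a supported GPU I would expect the counter to end at 512 (two launches of 256 threads each).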

My confusion is about the following line (especially the highlighted section): “If the GPU attempts an atomic operation to a peer GPU’s memory, the operation appears as a regular read followed by a write to the peer GPU, and the two operations are not done as one single atomic operation”. Unified memory is accessible to both the host and the device(s), and atomic operations can be performed on it using the above-mentioned routines. My question is: how can a GPU access its peer’s memory to perform an atomic operation? I went through the CUDA toolkit documentation and a couple of sample programs (e.g. p2pBandwidthLatencyTest), but I still could not find a routine or other way to perform an atomic operation on a peer’s memory.
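To make the question concrete, this is my best guess at what "an atomic operation to a peer GPU's memory" might look like in code. It is only a sketch under assumptions: two GPUs with device IDs 0 and 1, peer access between them reported as available by cudaDeviceCanAccessPeer, and compute capability 6.0 or higher. The names add_to_peer and run_peer_atomic are hypothetical, and I have not confirmed that this is the intended approach.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Runs on device 0, but the pointer refers to memory allocated on device 1.
__global__ void add_to_peer(int *peer_counter) {
    atomicAdd_system(peer_counter, 1);
}

// Hypothetical helper: returns the counter value read back from device 1,
// or -1 if two peer-capable GPUs are not available or a call fails.
int run_peer_atomic() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count < 2) return -1;

    int can_access = 0;
    if (cudaDeviceCanAccessPeer(&can_access, 0, 1) != cudaSuccess || !can_access) return -1;

    // Allocate and zero the counter in device 1's memory.
    int *counter_on_dev1 = nullptr;
    cudaSetDevice(1);
    if (cudaMalloc(&counter_on_dev1, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(counter_on_dev1, 0, sizeof(int));
    cudaDeviceSynchronize();  // make sure the memset finished before device 0 starts

    // Let device 0 dereference device 1's allocations, then launch on device 0.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // the flags argument must be 0
    add_to_peer<<<4, 64>>>(counter_on_dev1);
    cudaDeviceSynchronize();

    int result = 0;
    cudaMemcpy(&result, counter_on_dev1, sizeof(int), cudaMemcpyDefault);

    cudaSetDevice(1);
    cudaFree(counter_on_dev1);
    return result;
}

int main() {
    printf("peer counter = %d\n", run_peer_atomic());
    return 0;
}
```

If the quoted sentence applies here, I assume the result could come out lower than the 256 increments issued, since each update would be a separate read and write. Is this the kind of access the documentation is referring to?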

Thank you in advance.