How is “cudaMemcpyPeer” implemented?

How is “cudaMemcpyPeer” implemented? Does the data go device1 mem → host mem → device2 mem? If NVLINK is present, does this API use NVLINK or GPUDirect?

The normal usage of this API is to first check for Peer support (cudaDeviceCanAccessPeer) and then enable Peer access (cudaDeviceEnablePeerAccess) before calling it. See the simpleP2P CUDA sample code for an example.
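As a rough illustration of that check/enable/copy sequence, here is a minimal sketch. It is not taken from simpleP2P. The device ordinals 0 and 1, the 64 MiB buffer size, and the CHECK macro are assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): abort on any runtime API error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const int dev0 = 0, dev1 = 1;      // assumed device ordinals
    const size_t bytes = 64u << 20;    // 64 MiB, arbitrary size

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    // Step 1: check Peer support in both directions.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, dev0, dev1));
    CHECK(cudaDeviceCanAccessPeer(&can10, dev1, dev0));

    // Step 2: enable Peer access from each device to the other.
    if (can01 && can10) {
        CHECK(cudaSetDevice(dev0));
        CHECK(cudaDeviceEnablePeerAccess(dev1, 0));  // flags must be 0
        CHECK(cudaSetDevice(dev1));
        CHECK(cudaDeviceEnablePeerAccess(dev0, 0));
        printf("Peer access enabled between %d and %d.\n", dev0, dev1);
    } else {
        printf("Peer access not supported between %d and %d.\n", dev0, dev1);
    }

    // Allocate one buffer on each device.
    void *src = nullptr, *dst = nullptr;
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMemset(src, 0x5A, bytes));
    CHECK(cudaSetDevice(dev1));
    CHECK(cudaMalloc(&dst, bytes));

    // Step 3: copy dev0 -> dev1. Argument order is
    // (dst, dstDevice, src, srcDevice, byteCount).
    CHECK(cudaMemcpyPeer(dst, dev1, src, dev0, bytes));
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaFree(src));
    return 0;
}
```

Built with something like nvcc -o p2p_copy p2p_copy.cu (the file name is arbitrary). As far as I know the copy still succeeds when Peer access is not supported, but the transfer is then staged through host memory rather than going directly between the GPUs.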

If Peer support has been enabled, then the flow of data is directly from device1 mem → device2 mem, using the fabric (PCIE or NVLINK). NVLINK is used if it is available between the two GPUs; otherwise PCIE is used.

In the above scenario, the data will not touch CPU/host memory, and depending on the system topology, may not even enter the CPU socket (if there are PCIE switches in the topology).
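To see what the runtime reports about the link between two GPUs, one option is cudaDeviceGetP2PAttribute. The sketch below assumes device ordinals 0 and 1. Note that these attributes do not name the link type (PCIE vs NVLINK); running nvidia-smi topo -m is the usual way to see the actual topology.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;   // assumed device ordinals

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    int supported = 0, rank = 0, atomics = 0;

    // 1 if Peer access from src to dst is supported, else 0.
    cudaDeviceGetP2PAttribute(&supported, cudaDevP2PAttrAccessSupported,
                              src, dst);
    // Relative value describing the performance of the link between
    // the two devices; it does not identify the link type.
    cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank,
                              src, dst);
    // 1 if native atomic operations over the link are supported.
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported,
                              src, dst);

    printf("GPU %d -> GPU %d: access=%d perfRank=%d nativeAtomics=%d\n",
           src, dst, supported, rank, atomics);
    return 0;
}
```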