Peer-to-peer copy using cuMemcpy rather than cuMemcpyPeer

On Fermi GPUs that support UVA but don’t support direct peer-to-peer memory access (e.g., a GF100 and a GF104) with CUDA 4.0.17, it appears possible to perform peer-to-peer copies using a plain cuMemcpy when the source and destination pointers refer to memory on different GPUs. Does cuMemcpyPeer effectively perform the same operation as cuMemcpy, or is there some performance difference? I can’t run any comparisons at present because I don’t have access to multiple Teslas on a system with CUDA 4.0.17. (The system I have contains a GTX 460 and a GTX 470 and runs version 280.13 of the Linux NVIDIA driver.)
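For concreteness, here is a minimal sketch (untested, since it needs two UVA-capable GPUs) of the two copy paths being compared. The buffer size and device ordinals are arbitrary assumptions; error checking is omitted for brevity:

```c
#include <cuda.h>   /* CUDA driver API */

int main(void)
{
    CUdevice dev0, dev1;
    CUcontext ctx0, ctx1;
    CUdeviceptr src, dst;
    size_t bytes = 1 << 20;  /* arbitrary 1 MiB buffer */

    cuInit(0);
    cuDeviceGet(&dev0, 0);
    cuDeviceGet(&dev1, 1);

    /* One context (and one allocation) per device. */
    cuCtxCreate(&ctx0, 0, dev0);
    cuMemAlloc(&src, bytes);
    cuCtxCreate(&ctx1, 0, dev1);
    cuMemAlloc(&dst, bytes);

    /* With UVA the driver can infer which device each pointer
     * belongs to, so a plain cuMemcpy works across devices: */
    cuMemcpy(dst, src, bytes);

    /* The explicit peer copy names both contexts instead: */
    cuMemcpyPeer(dst, ctx1, src, ctx0, bytes);

    cuMemFree(dst);
    cuCtxDestroy(ctx1);
    cuMemFree(src);
    cuCtxDestroy(ctx0);
    return 0;
}
```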

This is mentioned in section 3.2.6.5 of the Programming Guide:

Peer-to-Peer Memory Copy

Memory copies can be performed between the memories of two different devices. When a unified address space is used for both devices (see Section 3.2.7), this is done using the regular memory copy functions mentioned in Section 3.2.2. Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() […]

If your devices do not support peer-to-peer memory access, or if it is not enabled with cudaDeviceEnablePeerAccess(), the peer-to-peer copies are staged through the host, which entails a performance penalty.
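The check-and-enable step looks roughly like this with the runtime API (an untested sketch; device ordinals 0 and 1 are assumed, and the flags argument to cudaDeviceEnablePeerAccess() must currently be 0):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int canAccess = 0;

    /* Can device 0 directly access device 1's memory? */
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    if (canAccess) {
        /* Enable direct access from the current device (0) to peer 1,
         * so subsequent copies avoid staging through the host. */
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
    } else {
        printf("No direct peer access; copies will be staged via host.\n");
    }
    return 0;
}
```

Note that access is directional: to let device 1 also read device 0's memory, you would repeat the call with the devices swapped.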